2015 UK General Election Normalising Data
The unnormalised data from the CSV file looks like this:
Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over.
Pick a primary key
Our first job is to identify a primary key. There are several options to consider:
- First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
- (firstname,surname) would be good
- The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
- Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
SELECT firstname,surname COUNT(1) FROM ge HAVING COUNT(1)>1
- The combination (ons_id,party) is also tempting
- No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
- Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
- It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
SELECT firstname,surname,ons_id FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;
Identifying dependencies
The columns headins are:
ons_id ons_region_id constituency_name county_name region_name country_name constituency_type party_name party_abbreviation firstname surname gender sitting_mp former_mp votes share change
Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:
ons_id -> ons_region_id ons_id -> consitituency_name ons_id -> county_name ons_id -> constituency_type county_name -> ons_region_id ons_region_id -> region_name ons_region_id -> country_name party_abbreviation -> party_name ons_id,firstname,surname -> gender ons_id,firstname,surname -> party_abbreviation ons_id,firstname,surname -> sitting_mp ons_id,firstname,surname -> former_mp ons_id,firstname,surname -> votes ons_id,firstname,surname -> share ons_id,firstname,surname -> change
Decide on tables
Each distinct determiner (the left hand side of the ->
above) will be a table. The determiner will be the primary key in each case.
Bold indicates a primary key, italics indicates a foreign key.
- constituency(ons_id, constituency, county_name, constituency_type)
- county(county_name, ons_region_id)
- region(ons_region_id, region_name, country_name)
- party(party_id, party_name)
- candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)
The Entity Relationship Diagram for this database is:
Implement the tables
We need to start with the tables that do not have out-going foreign keys.
party and region are simple tables
Party:
CREATE TABLE party( party_id VARCHAR(50) PRIMARY KEY, party_name VARCHAR(50) )
Region:
CREATE TABLE region( ons_region_id VARCHAR(10) PRIMARY KEY, region_name VARCHAR(50), country_name VARCHAR(50) )
county has a reference to region
Now that we have the foreign key target in place we can introduce county which refers to ons_region_id
CREATE TABLE county( county_name VARCHAR(50) PRIMARY KEY, ons_region_id VARCHAR(10) NOT NULL, FOREIGN KEY (ons_region_id) REFERENCES region(ons_region_id) )
Notice
- we take care to indicate NOT NULL on the foreign key - this is because the relationship is not optional. Every county MUST HAVE a region.
constituency references county
And now constituency which references county_name:
CREATE TABLE constituency( ons_id VARCHAR(10) PRIMARY KEY, constituency VARCHAR(50) NOT NULL UNIQUE, county_name VARCHAR(50) NOT NULL, constituency_type VARCHAR(10) NOT NULL CHECK (constituency_type IN ('county','borough')), FOREIGN KEY (county_name) REFERENCES county(county_name) )
Notice:
- There are exactly two constituency types, roughly county is in the countryside; borough is in the city.
- constituency names must be unique
candidate is the complicated table
And finally candidate:
CREATE TABLE candidate( ons_id VARCHAR(10), firstname VARCHAR(50), surname VARCHAR(50), gender VARCHAR(10) NOT NULL, party_id VARCHAR(50) NULL, sitting_mp VARCHAR(3) NOT NULL CHECK (sitting_mp IN ('Yes','No')), former_mp VARCHAR(3) NOT NULL CHECK (former_mp IN ('Yes','No')), votes INT NOT NULL, share FLOAT NOT NULL, `change` FLOAT NULL, PRIMARY KEY (ons_id,firstname,surname), FOREIGN KEY (ons_id) REFERENCES constituency(ons_id), FOREIGN KEY (party_id) REFERENCES party(party_id) )
Note
- change is a reserved word and must be enclosed in back-ticks
- party may be NULL. The relationship between candidate and party is optional
- change may be NULL - it gives the change in the vote for a party and only applies if that party had a candidate in the most recent general election before 2015.
DROP TABLES
If you need to start again you must drop the tables in the reverse order. Last created is the first dropped.
DON'T DROP THE TABLES!
DROP TABLE candidate; DROP TABLE constituency; DROP TABLE county; DROP TABLE region; DROP TABLE party;
Add the region data
INSERT INTO region SELECT DISTINCT ons_region_id, region_name, country_name FROM ge
Add the party data
Annoyingly the abbreviation Lab has been used for two different parties "Labour" and "Labour and Co-operative". We are going to ignore the distinction between these parties and treat them as one.
In doing this we are losing data and in general that should not be done. There is some justification in this case. When is comes to calculating who has a majority in parliament the "Labour and Co-operative" party seats count towards "Labour". There are other alliances (mostly in Northern Ireland) where parties are automatically aligned to mainland parties but these are a little more complicated.
That means we cannot use the DISTINCT method. Using MIN will arbitrarily assign the first value (alphabetically) to the pesky "Labour and Co-operative" party members.
INSERT INTO party SELECT party_abbreviation, MIN(party_name) FROM ge GROUP BY party_abbreviation;
INSERT county data
INSERT INTO county SELECT DISTINCT county_name,ons_region_id FROM ge;
INSERT constituencies
INSERT INTO constituency SELECT DISTINCT ons_id, constituency_name, county_name, constituency_type FROM ge;
INSERT candidates
INSERT INTO candidate SELECT ons_id, firstname, surname, gender, NULLIF(party_abbreviation,'Ind'), sitting_mp,former_mp,votes,share,NULLIF(`change`,'') FROM ge;
The function NULLIF gives NULL if the arguments match.
That means that the party abbreviation will be put in place unless the party abbreviation is 'Ind'. Note that this is an important distinction; there may be many Ind candidates in one constituency - they are not in the same party.
DROP the ge table
With reckless abandon we can now drop the flat, unnormalised table ge. It has served its purpose.
DROP TABLE ge;
What could possibly go wrong?