Open Source For Geeks: 05/01/16

Sunday, 1 May 2016

Database Normalization

Background

Database Normalization is a technique to store data in a database in most efficient and organized manner possible. The main aim is to eliminate data redundancies and undesirable characteristics like

Insertion Anomaly
Update Anomaly
Deletion Anomaly

Database normalization is step by step process which we will look into shortly.

Problems faced without database normalization

Lets say you have following table -

Employee_ID	Name	Skill level	Team
1	John	A	Team C
2	Sam	A+	Team B
3	Margaret	B	Team A
4	John	A	Team A

Lets analyze above table data and see whats wrong with it .

Lets saw a new Employee is hired. When you make entry for him in this table his skill level will be null (as his skill sets are not yet tested ) and also his team (which may be determined by his skill set and level). This will result into Insertion Anomaly.
Now lets say John has become proficient in work and his skill level need to be updated. We will need to update all rows corresponding to John (He may be part of different teams) to keep the data consistent. This is called Update Anomaly.
Lets say Sam drops out of a team temporarily then we will have to delete the entire row (which inturn will delete employee details with it). This is called Deletion Anomaly.

Now lets see how we can resolve these anomalies.

Normalization Steps are as follows -

First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
BCNF

Now lets see these in detail -

First Normal Form (1NF)

A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain. (Wiki)

So basically a column can have set of atomic (cannot be broken down) values and each row can only have a single value from this set. So something like below is not allowed in 1 NF -

Name	Skill level	Team
John	A	Team C, Team C
Sam	A+	Team B
Margaret	B	Team A

Instead following table is in 1 NF -

Name	Skill level	Team
John	A	Team C
Sam	A+	Team B
Margaret	B	Team A
John	A	Team A

Second Normal Form (1NF)

A table that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table.

A non-prime attribute of a table is an attribute that is not a part of any candidate key of the table. (Wiki)

In above table which is in 1 NF you can see that the candidate key (primary key consisting of multiple columns) is {Name,Team} However Name -> Skill level. So it violated 2 NF which states a non prime key (Skill level) should not depend on any proper subset of candidate key (Name).

So we split the table with the subset and dependent column to new table. So that the tables now become -

Name	Skill level
John	A
Sam	A+
Margaret	B

and

Name	Team
John	Team C
Sam	Team B
Margaret	Team A
John	Team A

Third Normal Form (3NF)

3NF ensures the entity is in second normal form, and all the attributes in a table are determined only by the candidate keys of that table and not by any non-prime attributes. (Wiki)

Consider following table

Name	Skill level	Address	Zip Code
John	A	ADDR 1	123
Sam	A+	ADDR 2	345
Margaret	B	ADDR 3	156

This is in 2 NF but not in 3 NF as Zip Code (non prime key) -> Address (non prime key) (Candidate key - Name )So we need to move this dependency (also called transitive dependency) to new table with Zip code as primary key.

Resultant tables will be -

Name	Skill level	Zip Code
John	A	123
Sam	A+	345
Margaret	B	156

and

Zip Code	Asdress
123	ADDR 1
345	ADDR 2
156	ADDR 3

Boyce-Codd Normal Form (BCNF)

For table to be BCNF compliant it should satisfy -

For any non-trivial functional dependency, X → A, X must be a super-key. OR
If X → Y is a trivial functional dependency then (Y ⊆ X)

NOTE : A superkey is a combination of columns that uniquely identifies any row within a relational database management system (RDBMS) table. A candidate key is a closely related concept where the superkey is reduced to the minimum number of columns required to uniquely identify each row.

To Sum it up -