Background
Database Normalization is a technique to store data in a database in most efficient and organized manner possible. The main aim is to eliminate data redundancies and undesirable characteristics like
- Insertion Anomaly
- Update Anomaly
- Deletion Anomaly
Problems faced without database normalization
Lets say you have following table -
Employee_ID | Name | Skill level | Team |
---|---|---|---|
1 | John | A | Team C |
2 | Sam | A+ | Team B |
3 | Margaret | B | Team A |
4 | John | A | Team A |
Lets analyze above table data and see whats wrong with it .
- Lets saw a new Employee is hired. When you make entry for him in this table his skill level will be null (as his skill sets are not yet tested ) and also his team (which may be determined by his skill set and level). This will result into Insertion Anomaly.
- Now lets say John has become proficient in work and his skill level need to be updated. We will need to update all rows corresponding to John (He may be part of different teams) to keep the data consistent. This is called Update Anomaly.
- Lets say Sam drops out of a team temporarily then we will have to delete the entire row (which inturn will delete employee details with it). This is called Deletion Anomaly.
Now lets see how we can resolve these anomalies.
Normalization Steps are as follows -
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- BCNF
First Normal Form (1NF)
A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain. (Wiki)
So basically a column can have set of atomic (cannot be broken down) values and each row can only have a single value from this set. So something like below is not allowed in 1 NF -
Name | Skill level | Team |
---|---|---|
John | A | Team C, Team C |
Sam | A+ | Team B |
Margaret | B | Team A |
Instead following table is in 1 NF -
Name | Skill level | Team |
---|---|---|
John | A | Team C |
Sam | A+ | Team B |
Margaret | B | Team A |
John | A | Team A |
Second Normal Form (1NF)
A table that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table.A non-prime attribute of a table is an attribute that is not a part of any candidate key of the table. (Wiki)
In above table which is in 1 NF you can see that the candidate key (primary key consisting of multiple columns) is {Name,Team} However Name -> Skill level. So it violated 2 NF which states a non prime key (Skill level) should not depend on any proper subset of candidate key (Name).
So we split the table with the subset and dependent column to new table. So that the tables now become -
Name | Skill level |
---|---|
John | A |
Sam | A+ |
Margaret | B |
Name | Team |
---|---|
John | Team C |
Sam | Team B |
Margaret | Team A |
John | Team A |
Third Normal Form (3NF)
3NF ensures the entity is in second normal form, and all the attributes in a table are determined only by the candidate keys of that table and not by any non-prime attributes. (Wiki)Consider following table
Name | Skill level | Address | Zip Code |
---|---|---|---|
John | A | ADDR 1 | 123 |
Sam | A+ | ADDR 2 | 345 |
Margaret | B | ADDR 3 | 156 |
This is in 2 NF but not in 3 NF as Zip Code (non prime key) -> Address (non prime key) (Candidate key - Name )So we need to move this dependency (also called transitive dependency) to new table with Zip code as primary key.
Resultant tables will be -
Name | Skill level | Zip Code |
---|---|---|
John | A | 123 |
Sam | A+ | 345 |
Margaret | B | 156 |
and
Zip Code | Asdress |
---|---|
123 | ADDR 1 |
345 | ADDR 2 |
156 | ADDR 3 |
Boyce-Codd Normal Form (BCNF)
For table to be BCNF compliant it should satisfy -- For any non-trivial functional dependency, X → A, X must be a super-key. OR
- If X → Y is a trivial functional dependency then (Y ⊆ X)
NOTE : A superkey is a combination of columns that uniquely identifies any row within a relational database management system (RDBMS) table. A candidate key is a closely related concept where the superkey is reduced to the minimum number of columns required to uniquely identify each row.
To Sum it up -