What is Normalization?
Normalization in a database is the process of organizing the data to reduce redundancy. The main idea is to split a larger table into smaller ones and connect them through keys.
But why should an end-user like you or me be concerned about Data Normalization?
To answer that question, we first need to understand what could happen if our data is not normalized.
Why is Database Normalization important?
Let’s assume a company stores all its data, such as employee details, personal information, etc., in a single table. This data is accessible to end-users such as developers, database admins (DBAs), etc. But what happens when multiple end-users interact with the database at the same time?
For instance, imagine the same fact, such as an employee's address, is repeated across several rows. A DBA updates it in some of those rows while a developer, completely unaware, reads or edits another copy. These users now have a different view of the same data.
Such a mismatch, where copies of the same data disagree, is called an anomaly. Anomalies typically surface when rows are inserted, updated, or deleted.
This is where Data Normalization steps in. It helps in avoiding data inconsistencies and provides a more organized way to store your data. So, whether you’re a database administrator or just an end-user, you need to be aware of how your data is getting stored and what it means for the company.
Now that we have a clear idea of what can happen without Data Normalization, let’s look at the normal forms used in database management systems (DBMS).
Normal Forms
First Normal Form (1NF)
A relation is said to be in First Normal Form (1NF) if it does not contain any multi-valued or composite attribute. In other words, every single attribute in a table has to hold an atomic value; otherwise, it defies the rules of the First Normal Form.
Let’s look at an example:
First Name | Last Name | Movies rented |
Sam | Holland | The Notebook, A walk to remember |
Joe | Dunphy | Harry Potter and the Goblet of Fire |
Harry | Williams | Interstellar, Inception, Gravity |
The last column in the table, ‘Movies rented’, holds several values in a single cell. The 1NF version of this table will look like this:
First Name | Last Name | Movies rented |
Sam | Holland | The Notebook |
Sam | Holland | A walk to remember |
Joe | Dunphy | Harry Potter and the Goblet of Fire |
Harry | Williams | Interstellar |
Harry | Williams | Inception |
Harry | Williams | Gravity |
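The conversion above can be sketched in a few lines of Python. This is only an illustration: the function name `to_first_normal_form` is made up for this example, and it assumes that movie titles are separated by commas and never contain a comma themselves.

```python
# Rows as they appear in the unnormalized table: the last field packs
# several movie titles into one comma-separated string.
unnormalized = [
    ("Sam", "Holland", "The Notebook, A walk to remember"),
    ("Joe", "Dunphy", "Harry Potter and the Goblet of Fire"),
    ("Harry", "Williams", "Interstellar, Inception, Gravity"),
]


def to_first_normal_form(rows):
    """Emit one row per (first name, last name, single movie title).

    Assumption for this sketch: titles never contain a comma.
    """
    result = []
    for first_name, last_name, movies in rows:
        for title in movies.split(","):
            result.append((first_name, last_name, title.strip()))
    return result


rentals_1nf = to_first_normal_form(unnormalized)
for row in rentals_1nf:
    print(row)
```

Each printed row now holds exactly one movie, which matches the six-row table above.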
Now you may wonder, how does Data Normalization reduce redundancy if the above conversion doubled the number of existing rows?
This is because we have taken a trivial scenario where we considered only a single table. In the next section, we will discuss the second normal form, and it’ll make more sense as to how Data Normalization reduces the overall redundancy.
Second Normal Form (2NF)
A relation is in Second Normal Form (2NF) if:
- it is in First Normal Form (1NF) and,
- it has no partial dependency
If a non-key attribute can be determined from a proper subset of a candidate key, then the relation is said to have a partial dependency.
Let’s see an example to understand this better -
Subject Taught | Teacher ID | Teacher Age |
Mathematics | 181 | 37 |
Social Sciences | 11 | 29 |
English | 181 | 37 |
Physics | 27 | 45 |
Chemistry | 27 | 45 |
The above table follows 1NF as each attribute holds a single value. Here, the candidate key is the combination of ‘Subject Taught’ and ‘Teacher ID.’ However, ‘Teacher Age,’ which is a non-key attribute (as it cannot be used as an identifier - two people can have the same age), depends only on ‘Teacher ID,’ which is a proper subset of that candidate key. Therefore, this table exhibits a partial dependency and is not in Second Normal Form.
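One way to make the dependency concrete is to test it against the sample rows. The helper below, `fd_holds`, is a hypothetical name introduced for this sketch, as are the lowercase column names. Note that sample data can only show that a dependency is not violated; whether it holds in general is a design decision.

```python
from collections import defaultdict

COLUMNS = ("subject", "teacher_id", "teacher_age")
rows = [
    ("Mathematics", 181, 37),
    ("Social Sciences", 11, 29),
    ("English", 181, 37),
    ("Physics", 27, 45),
    ("Chemistry", 27, 45),
]


def fd_holds(rows, columns, lhs, rhs):
    """Return True if every lhs value maps to exactly one rhs value in rows."""
    lhs_idx = [columns.index(c) for c in lhs]
    rhs_idx = [columns.index(c) for c in rhs]
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in lhs_idx)
        seen[key].add(tuple(row[i] for i in rhs_idx))
    return all(len(values) == 1 for values in seen.values())


# Teacher Age is fixed by Teacher ID alone: a partial dependency.
age_depends_on_id = fd_holds(rows, COLUMNS, ["teacher_id"], ["teacher_age"])
# Teacher ID alone does not fix the subject (teacher 181 teaches two subjects).
subject_depends_on_id = fd_holds(rows, COLUMNS, ["teacher_id"], ["subject"])

print(age_depends_on_id, subject_depends_on_id)
```

In this data the first check succeeds and the second fails, which is the pattern described above: the age is determined by only part of the key.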
The 2NF conversion of the above table will look like this:
Table 1:
Teacher ID | Teacher Age |
181 | 37 |
11 | 29 |
27 | 45 |
Table 2:
Subject Taught | Teacher ID |
Mathematics | 181 |
Social Sciences | 11 |
English | 181 |
Physics | 27 |
Chemistry | 27 |
Now, we no longer need to store Teacher Age every time we add a new subject. This breakdown reduces the overall redundancy when you are dealing with a large number of rows.
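As a rough sketch of this decomposition, the two tables can be produced by projecting the original rows onto the relevant columns, and the original table can be rebuilt by joining on Teacher ID. The variable names here are illustrative only.

```python
original = [
    ("Mathematics", 181, 37),
    ("Social Sciences", 11, 29),
    ("English", 181, 37),
    ("Physics", 27, 45),
    ("Chemistry", 27, 45),
]

# Table 1: one row per teacher. The set removes the repeated (id, age) pairs.
teachers = sorted({(teacher_id, age) for _, teacher_id, age in original})

# Table 2: which teacher teaches which subject.
subjects = [(subject, teacher_id) for subject, teacher_id, _ in original]

# Joining the two tables on teacher_id reconstructs the original rows,
# so no information was lost by splitting the table.
age_by_teacher = dict(teachers)
rejoined = [
    (subject, teacher_id, age_by_teacher[teacher_id])
    for subject, teacher_id in subjects
]

print(teachers)
print(rejoined == original)
```

The age of each teacher is now stored once, in `teachers`, instead of once per subject.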
Let’s have a look at the next normal form in the chain.
Third Normal Form (3NF)
A relation is in Third Normal Form (3NF) if -
- it is in 2NF and,
- it has no transitive dependency for any non-key attribute.
If A->B and B->C are two functional dependencies, then A->C is a transitive dependency. If a table has such indirect dependencies, then it does not follow the Third Normal Form.
Alternatively, a relation is in 3NF if, for every non-trivial functional dependency A->B, at least one of these conditions is true -
- A is a superkey
- B is a prime attribute, i.e., B is a part of some candidate key
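The two conditions can be checked mechanically once the functional dependencies are written down. The sketch below uses the abstract attributes A, B, and C from the definition of transitive dependency. The function names `closure` and `violates_3nf` are hypothetical, and the set of prime attributes is passed in by hand rather than derived, which is a simplification.

```python
def closure(attributes, fds):
    """Return every attribute determined by `attributes` under `fds`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violates_3nf(all_attributes, fds, prime_attributes):
    """Return the dependencies that satisfy neither 3NF condition."""
    violations = []
    for lhs, rhs in fds:
        # Condition 1: the left side determines every attribute (superkey).
        is_superkey = closure(lhs, fds) == set(all_attributes)
        # Condition 2: everything newly determined is a prime attribute.
        only_prime = (rhs - lhs) <= set(prime_attributes)
        if not is_superkey and not only_prime:
            violations.append((lhs, rhs))
    return violations


# A -> B and B -> C, with A as the only candidate key (assumed).
fds = [
    (frozenset({"A"}), frozenset({"B"})),
    (frozenset({"B"}), frozenset({"C"})),
]
bad = violates_3nf({"A", "B", "C"}, fds, prime_attributes={"A"})
print(bad)
```

Here A->B passes because A is a superkey, while B->C is reported: B is not a superkey and C is not a prime attribute.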
Consider the following table -
Employee ID | Employee Name | Employee State | Employee Country | Employee ZIP |
1267 | Sam Holland | California | USA | 421005 |
4582 | Joe Dunphy | Texas | USA | 560051 |
2362 | Harry Williams | Florida | USA | 690087 |
1260 | Alexa Stewart | Alaska | USA | 798423 |
Primary Key: Employee ID
Non-key attributes: Employee Name, Employee State, Employee Country, Employee ZIP
- ‘Employee ZIP’ is dependent on ‘Employee ID.’
- ‘Employee State’ and ‘Employee Country’ are dependent on ‘Employee ZIP.’
Thus, by the definition of transitive dependency, ‘Employee State’ and ‘Employee Country’ depend on ‘Employee ID’ only transitively, through ‘Employee ZIP.’ This table is, therefore, not in 3NF. We need to break down the table into two, and the final conversion looks like this -
Table 1:
Employee ID | Employee Name | Employee ZIP |
1267 | Sam Holland | 421005 |
4582 | Joe Dunphy | 560051 |
2362 | Harry Williams | 690087 |
1260 | Alexa Stewart | 798423 |
Table 2:
Employee ZIP | Employee State | Employee Country |
421005 | California | USA |
560051 | Texas | USA |
690087 | Florida | USA |
798423 | Alaska | USA |
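To see the two tables working together, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table and column names (`employee`, `zip_location`, and so on) are assumptions made for this example, not names from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE zip_location (
    zip     TEXT PRIMARY KEY,
    state   TEXT NOT NULL,
    country TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    zip         TEXT NOT NULL REFERENCES zip_location(zip)
);
""")

conn.executemany("INSERT INTO zip_location VALUES (?, ?, ?)", [
    ("421005", "California", "USA"),
    ("560051", "Texas", "USA"),
    ("690087", "Florida", "USA"),
    ("798423", "Alaska", "USA"),
])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1267, "Sam Holland", "421005"),
    (4582, "Joe Dunphy", "560051"),
    (2362, "Harry Williams", "690087"),
    (1260, "Alexa Stewart", "798423"),
])

# Joining on the ZIP column rebuilds the original five-column view.
employees_full = conn.execute("""
    SELECT e.employee_id, e.name, z.state, z.country, e.zip
    FROM employee AS e
    JOIN zip_location AS z ON z.zip = e.zip
    ORDER BY e.employee_id
""").fetchall()

for row in employees_full:
    print(row)
```

State and country are stored once per ZIP code, so correcting a state name means changing a single row in `zip_location` rather than every matching employee row.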
The Difference between Second Normal Form (2NF) and Third Normal Form (3NF)
The overview of the three normal forms tells us one thing for sure - that each normal form is stricter than its predecessor. For instance, in 2NF, non-prime attributes cannot depend on only a part of a candidate key, but a non-prime attribute can still depend on another non-prime attribute. 3NF eliminates this possibility, as non-prime attributes depend only on superkeys of the relation.
Moreover, 2NF tackles partial dependency, whereas 3NF focuses on avoiding transitive dependency. With 2NF, we saw that the redundancy caused by partial dependencies was removed from the table, whereas 3NF also removed the redundancy caused by transitive dependencies. Thus, 3NF is a stronger normalization form.
A direct comparison between 2NF and 3NF is somewhat misleading as it is not an apples-to-apples comparison. 3NF builds on 2NF, so every relation in 3NF is also in 2NF. The choice of normalization depends on your data and end goal. If you only aim to remove the redundancy caused by partial dependencies, 2NF is enough. However, if you also want to remove the redundancy caused by transitive dependencies and avoid the update anomalies that come with it, 3NF is a better choice.