Nor­m­al­isa­tion is one of the basic concepts of re­la­tion­al data modelling. In the re­la­tion­al database model, good database design is char­ac­ter­ised by minimal re­dund­ancy. The reason for this is that redundant data leads to semantic anomalies, which, in turn, make automatic data pro­cessing and database main­ten­ance difficult. Nor­m­al­isa­tion is a strategy to eliminate re­dund­an­cies in re­la­tion­al databases. We'll show you how to implement database normal forms.

What is database nor­m­al­isa­tion? A defin­i­tion

Nor­m­al­isa­tion is an approach to database design used in re­la­tion­al databases to avoid re­dund­ancy.

The re­la­tion­al database model is the most widely used concept in com­pu­ter­ised data man­age­ment. In re­la­tion­al databases, in­form­a­tion is stored as records in tables related by keys. A data record consists of several value ranges that are assigned to specific at­trib­utes using table columns.

The following table shows the stored invoice data of a fic­ti­tious office equipment supplier. John Public has ordered 10 monitors, 12 mouse pads, and 1 office chair for his company. The order from Jane Doe includes 2 laptops and 2 headsets.

In the online shop database, the invoice data is assigned to the at­trib­uted invoice number (“Inv. no.”), date, customer, customer number (“Cust. no.”), address, invoice item number (“Inv. item no.”), product, product number (“Prod. no.”), quantity (“No.”), and price. Each row on the table rep­res­ents one data record. This kind of data set is called a tuple.

The database section shown above is an example of poor database design. At first glance, it’s obvious that the table shows numerous re­dund­an­cies. In addition, the values in the customer and address columns contain multi-value data. This is called a de­nor­m­al­ised database. In other words, it doesn’t follow database nor­m­al­isa­tion rules.

The primary dis­ad­vant­age of de­nor­mal­ized databases is the increased memory re­quire­ment due to redundant values. In addition, at­trib­utes that contain multi-value data are difficult to read and don’t relate easily to one another.

Example: Both customers in the database section listed above are located in Spring­field, Maine. However, since this in­form­a­tion isn’t separated out, the database can’t easily be filtered by customers from the same location.

To avoid duplicate and multi-valued value ranges, three se­quen­tial database normal forms have been developed for re­la­tion­al database models.

A database normal form is a defined target state. Special re­quire­ments have been defined for each normal form, which must be met if this target state is to occur. A database cor­res­ponds exactly to the first, second, or third normal form if all re­quire­ments for the re­spect­ive normal form are fulfilled.

Fact

Nor­m­al­isa­tion is the con­ver­sion of a database table to a higher degree of normal form. Con­ver­sion to a lesser degree of normal form is called de­nor­m­al­isa­tion.

Database nor­m­al­isa­tion: examples of how to re­con­fig­ure a database

To il­lus­trate the con­ver­sion of a re­la­tion­al database into the first, second, and third normal forms, we’ll walk through the in­di­vidu­al stages of re­la­tion­al database nor­m­al­isa­tion using the data in the table above as an example.

First normal form (1NF)

A table in a re­la­tion­al database complies with the first normal form (1NF) when it fulfils the following criteria:

  • All data is atomic
  • All table columns contain identical values

A data set is con­sidered atomic if each item of in­form­a­tion is assigned to a separate data field.

In the below table of billing data, all value ranges that are either non-atomic or don’t contain equi­val­ent data have been high­lighted in red.

As demon­strated in the high­lighted cells, the data in the example table fails to meet either re­quire­ment for first normal form com­pli­ance.

The following procedure should be im­ple­men­ted to normalize these sections:

  1. Divide all multi-value data into separate columns
  2. Check the values in each column for sim­il­ar­ity

To convert the data records in the example table into atomic form, the customer and address fields have to be divided into the more specific at­trib­utes of first name and last name, and street address, city, ZIP code, and country re­spect­ively.

Note

A value is con­sidered atomic depending on the context of its use. If it isn’t necessary to separate first and last names, a person’s full name can be con­sidered atomic. But in practice, it’s con­sidered best to divide multi-part values into the smallest possible units.

Both pounds and pence are currently listed in the price column. Decide on one format for currency to create similar value ranges.

The result is a table that complies with the first normal form but still won’t result in efficient pro­cessing due to double values. Con­vert­ing the table to the second normal form is then re­com­men­ded to eliminate the re­dund­an­cies.

Tip

The first normal form pre­scribes atomic value ranges and enables database queries. Data that is part of a non-atomic value range cannot be queried sep­ar­ately.

Second normal form (2NF)

A table in line with the second normal form must fulfil all the re­quire­ments of the first normal form in addition to the following:

  • Each non-key attribute must be fully func­tion­al, dependent on the primary key

In the in­tro­duc­tion, a re­la­tion­al database is defined as a system of in­di­vidu­al tables that are related to each other by means of keys.

Keys are used in re­la­tion­al databases to uniquely identify data records (tuples). A key that allows you to uniquely name the in­di­vidu­al lines of a database table is called a super key. Such a key can represent the values of a single column or the combined values of several columns.

In the given example, a possible super key results from the invoice number (“Inv. no.”), customer number (“Cust. no.”), and invoice item number (“Inv. item no.”) at­trib­utes, as high­lighted in the table below.

A key con­sist­ing of invoice number, customer number, and invoice item number with the values {124, 12, 1} makes it possible, for example, to clearly designate the data record that rep­res­ents Jane Doe’s laptop purchase:

However, not all in­form­a­tion in the selected super key is required for a unique iden­ti­fic­a­tion. A com­bin­a­tion of invoice number and invoice item number – that is, a subset of the super key – would suffice to identify in­di­vidu­al data records. Such keys with a minimum number of at­trib­utes are called key can­did­ates or alternate keys.

As a rule, one key candidate per table is selected to represent the table. Se­quen­tial numbering is ideal for this. Such a key is called a primary key and specifies the sequence of the data records.

Like any key candidate, the primary key can be a one-part key or – as in the given example – a composite key. The sample table uses a composite primary key that is comprised of the invoice number and invoice item number.

To convert a database table to the second normal form, you not only need to determine the primary key and all non-key at­trib­utes, but also their re­la­tion­ship to one another. Follow these steps:

  1. Check whether all non-key at­trib­utes are entirely func­tion­ally dependent on the primary key. Such de­pend­ency only exists if all primary key at­trib­utes are necessary to uniquely identify the non-key attribute. This also means that tables with one-part primary keys auto­mat­ic­ally cor­res­pond to the second normal form if all pre­requis­ites for the first normal form are fulfilled.
  2. Move all non-key at­trib­utes that are not entirely func­tion­ally dependent on the full primary key to separate tables.

Taking a close look at the example table, note that the pre­requis­ites for the second normal form are not fulfilled because the date column is only dependent on the invoice number (“Inv. no.”), not on the invoice item number (“Inv. item no.”). The same applies to the first name, last name, street address, city, state, and ZIP code.

To convert the data table to the second normal form, all at­trib­utes entirely dependent on the invoice number have been moved to a separate table called “Invoice.”

The table with the balance of the data has been named “Invoice item.”

Following nor­m­al­isa­tion, the invoice number (“Inv. no.”) is found in both tables and links them together. While the attribute functions as the primary key in the “Invoice” table, it is used as a foreign key in the “Invoice item” table and is also part of the table’s composite primary key.

Note

The link via the foreign key enables both tables to be queried together. This is called a join.

The example data now complies with the second normal form. However, it hasn’t yet been possible to com­pletely eliminate re­dund­an­cies. The goal of nor­m­al­isa­tion is usually then the third normal form.

Third normal form (3NF)

If a table is to be converted to the third normal form, all pre­requis­ites of the first and second normal form must be fulfilled as well as the following:

  • No non-key attribute may be trans­it­ively dependent on a key candidate

A trans­it­ive de­pend­ency occurs when a non-key attribute is dependent on another non-key attribute and so in­dir­ectly on its key candidate.

The given database template violates the con­di­tions of the third normal form in several places:

In the “Invoice” table, the first and last name, and the street address, city, state, and ZIP code, depend not only on the primary key (the invoice number), but also on the customer number.

In the “Invoice Item” table, the product and price at­trib­utes depend not only on the primary key, derived from the invoice number and invoice item number, but also on the product number. This specific condition also violates the third normal form.

To remove all de­pend­en­cies between non-key at­trib­utes, the relevant at­trib­utes have been moved to separate tables, linked to each other by foreign keys. This results in the four nor­m­al­ised tables: “Invoice,” “Customer,” “Invoice item” and “Product.”

The primary key of the “Invoice” table is a se­quen­tial invoice number. Each invoice number is assigned a date of invoicing and a customer number.

More detailed in­form­a­tion on each customer is stored in the “Customer” table. The “Invoice” and “Customer” tables are linked via the customer number. This is used as the primary key in the “Customer” table and as a foreign key in the “Invoice” table.

The “Invoice item” table is a central table in the sample database, con­tain­ing in­form­a­tion about which products should appear on which invoice, as well as how many of the items were ordered. The se­quen­tial primary key on the “Invoice item” table is derived from the invoice number and the invoice item number. The re­spect­ive products are only listed as product numbers that act as foreign keys, and link the “Invoice item” table with the “Products” table.

Finally, the “Products” table contains detailed in­form­a­tion on the re­spect­ive products, such as the product de­scrip­tion and price. The primary key is the product serial number.

In the example, splitting two tables into four may not seem very efficient. And indeed, re­dund­an­cies in the data of only two customers are of little im­port­ance. But imagine you want to con­sist­ently process several hundred thousand customer or product records in a re­la­tion­al database without con­tra­dic­tions. This is usually only possible with a database formula that cor­res­ponds to the third normal form.

Note

Duplicate values in re­la­tion­al databases are often un­avoid­able. In reviewing the example as its con­ver­sion unfolds, it’s evident that the linking of database tables by foreign keys can be connected with re­dund­an­cies. These are known as key re­dund­an­cies.

Even if database nor­m­al­isa­tion requires greater pro­gram­ming effort, 3NF – the third normal form – is generally regarded as the standard for re­la­tion­al database formulas, and is only deviated from in ex­cep­tion­al cases. For example, databases complying with the third normal form are sometimes de­nor­m­al­ised to the second normal form. This is because joins across multiple tables are time-consuming for very large databases. De­nor­m­al­isa­tion reduces the number of tables and with it the query time.

Other normal forms

In practice, nor­m­al­isa­tion usually ends with the third normal form. The following normal forms refer to database schemata with special con­di­tions and are then used only in ex­cep­tion­al cases.

Boyce Codd normal form (3.5NF)

Boyce Codd normal form is a tight­en­ing of the third normal form. For 3NF:

  • No non-key attribute may be trans­it­ively dependent on a key candidate

In Boyce Codd normal form, however:

  • No attribute may be trans­it­ively dependent on a key candidate unless it is a trivial de­pend­ency

Boyce Codd normal form is only relevant for database tables with several compound key can­did­ates in which the keys overlap, i.e. if one and the same attribute is a subset of two key can­did­ates.

Database tables complying with the third normal form without multiple key can­did­ates auto­mat­ic­ally then represent Boyce Codd normal form.

The table below shows two key can­did­ates, each composed of two at­trib­utes.

  • Vendor number and product number
  • Vendor and product number

Both keys make it possible to identify each in­di­vidu­al data record. The only non-key attribute is the number. Since the number attribute is not trans­it­ively dependent on any of the key can­did­ates, the table is 3NF compliant.

On the other hand, it is not compliant with Boyce Codd normal form, because there is a de­pend­ency between the vendor number (“V no.”) and vendor (“Vendor”) at­trib­utes. The vendor number attribute is trans­it­ively dependent on the key candidate that combines the vendor and product number; con­versely, the vendor attribute results from the key candidate that combines the vendor number (“V no.”) and product number (“Prod. no.”).

Trans­it­ive de­pend­en­cies can be avoided by dividing the output table into “Number” and “Vendors” tables, which elim­in­ates over­lap­ping key can­did­ates.

Boyce Codd normal form prevents re­dund­an­cies by identi­fy­ing key at­trib­utes listed multiple times by over­lap­ping key can­did­ates. In the above example, con­ver­sion to 3.5NF prevents duplicate values in the vendor column.

Note

A trivial de­pend­ency occurs when an attribute is com­pletely func­tion­ally dependent on itself. Since this is always the case for each attribute in all database con­di­tions, trivial de­pend­en­cies cor­res­pond to the logic of a tautology.

Fourth normal form (4NF)

A database table complies with the fourth normal form if the re­quire­ments of Boyce Codd normal form are fulfilled in addition to the following:

  • There are no mul­ti­val­ued de­pend­en­cies unless they are trivial

A mul­ti­val­ued de­pend­ency always exists if two unrelated at­trib­utes are dependent on the same attribute, as il­lus­trated in the example below:

The following table shows which products have been ordered per customer and to which ZIP code they must be delivered.

For example, the customer with the customer number 234 ordered articles 1-0023-D and 2-0023-D, which are to be delivered to his address at ZIP code 12345. For customer 567, articles 1-0023-D, 3-0023-D, 4-0023-D, and 5-0023-D will be delivered to the ZIP code 56789.

The data records can only be iden­ti­fied with a super key resulting from all three at­trib­utes – customer number, product number, and ZIP code. Since there is no non-key attribute, the database is 3NF compliant. Fur­ther­more, as there are no non-trivial trans­it­ive de­pend­en­cies, it is also 3.5NF compliant. However, there are mul­ti­val­ued de­pend­en­cies: both the product number attribute and the ZIP code attribute are dependent on the customer number attribute, but are not related to each other.

The dis­ad­vant­age of such a database design is that every time a new product is added to the customer’s record, the ZIP code must also be added, which results in redundant data.

These re­dund­an­cies can be elim­in­ated by con­vert­ing the table to 4NF. To do this, you have to divide the table in such a way that there are no or only trivial mul­ti­val­ued de­pend­en­cies. This is possible because the product number and ZIP code are in no way related.

As shown in the example, the fourth normal form elim­in­ates re­dund­ancy caused by mul­ti­val­ued de­pend­en­cies, in this case spe­cific­ally in the ZIP code column.

Note

In this (ad­mit­tedly somewhat contrived) example, the as­sump­tion has been made that only one ZIP code applies for each customer. However, if customers have the option of ordering products for delivery to multiple locations, there would be a de­pend­ence between the product number and the ZIP code, in which case the output table would already be 4NF compliant.

Fifth normal form (5NF)

A database table is compliant with the fifth normal form if it satisfies the con­di­tions of the fourth normal form in addition to the following:

  • The table cannot be split further without losing in­form­a­tion

Below is an example demon­strat­ing such a case in which a company operates a TYPO3-based website and a Magento web shop. Three employees are re­spons­ible for the software projects: Mary Smith, George Miller, and Joe Davis, each with different qual­i­fic­a­tions.

The table shows which employee’s qual­i­fic­a­tion applies to which software project’s re­quire­ments.

Mary Smith uses her knowledge of PHP and SQL on the Magento project and uses SQL and JavaS­cript for the TYPO3 website. George Miller also works with PHP for Magento and in JavaS­cript for TYPO3. Joe Davis is only involved in the TYPO3 project, working as the sole pro­gram­mer with PHP. The table also shows that Magento requires knowledge of PHP and SQL, while the TYPO3 project requires knowledge of PHP, SQL, and JavaS­cript.

The table has only one key composed of all three at­trib­utes, meaning that it at least complies with 3NF and the Boyce Codd normal forms. Since there are no de­pend­en­cies between all three at­trib­utes, the table also complies with the fourth normal form.

To check whether the table is also 5NF compliant, divide the output table “Employee qual­i­fic­a­tion for project de­ploy­ment” into the three tables: “Project de­ploy­ment,” “Employee qual­i­fic­a­tion,” and “Project re­quire­ments.”

The “Project de­ploy­ment” table shows which employee is involved in which software project.

The “Employee qual­i­fic­a­tion” table shows which employee is pro­fi­cient in which pro­gram­ming or database language.

The “Project re­quire­ments” table indicates which pro­gram­ming qual­i­fic­a­tion is required for which project.

At first glance, the database section appears much clearer after being de­com­part­ment­al­ised. But do the tables created during nor­mal­iz­a­tion have the same in­form­a­tion­al content as the initial table?

A joined database query across all three tables holds the answer. The result is sur­pris­ing.

Re­con­struct­ing the output table, you can assume that each employee involved in the project will use each of his or her qual­i­fic­a­tions, provided that these are required by the re­spect­ive project. However, in doing so, the in­form­a­tion that Joe Davis worked alone in PHP pro­gram­ming for the TYPO3 project has been lost. This means that the output table can’t be broken down without in­form­a­tion loss, making it compliant with the fifth normal form.

In practice, you’ll rarely come across database formulas that meet the re­quire­ments for 4NF but aren’t compliant with the fifth normal form. However, 5NF is in­ter­est­ing for ap­plic­a­tions in which new in­form­a­tion is obtained from existing data.

In the example, both Mary Smith and George Miller are pro­fi­cient in PHP, which they could also con­trib­ute to the TYPO3 project in the future. The company could use this in­form­a­tion to make software de­vel­op­ment in this project more efficient.

Ad­vant­ages and dis­ad­vant­ages of nor­m­al­isa­tion

The aim of nor­m­al­isa­tion is to reduce instances of double values. By trans­fer­ring a database to one of the listed normal forms, the target schema benefits from less re­dund­ancy than the source schema. Nor­m­al­isa­tion also makes database main­ten­ance easier.

On the other hand, database nor­m­al­isa­tion always involves storing at­trib­utes in separate tables. This may require the in­teg­ra­tion of foreign keys, which can lead to key re­dund­an­cies. The primary dis­ad­vant­age, however, is that in a nor­m­al­ised database, logically related data is no longer stored together. A join is required to merge data that has been split into different tables.

Complex in­form­a­tion can be filtered out via database queries using joins. However, joins are more complex to implement than simple queries. This also takes much more time if joins are made using a large number of database tables.

Go to Main Menu