Healthcare Databases & Information Systems Course
 

Normalization of Databases

The objective of this lecture is to learn about rules of good design.  We have already referred to these rules in our previous lectures, but here you can see them listed in one place in an organized fashion.

The purpose of design is to:

  • Increase efficiency.  Information should be organized in such a manner as to foster efficiency:
    1. Reduce redundancy.  Information that gets repeated in each record should be separated and put into a different table so that we reduce repetition.  For example in a table of encounters, patient ID replaces patient's name, contact information and demographic so that we do not need to repeat this information in each record of a visit.
    2. Reduce missing data entries.  Information that are logically impossible are transferred to another table so that we are not forced to leave it blank.  For example, since a pregnant male is not possible, we would like to have pregnancy information kept in a different table than gender information so that we do not need to enter banks for pregnancy field for males.  
  • Allow users access to data without knowing location of data.  It should be possible for a user to search for a piece of information in one field and not have to guess what other fields may contain the same information.

Design Principles

To achieve these objectives, mathematicians have developed a set of rules that if they are followed result in more efficient and easier to use databases.  These rules are not a matter of design preference but a requirement for effective and efficient database. 

 

Rule 1:  Each table should correspond to a single entity

The first rule we would like you to remember is that each table correspond to a single entity or a single relationship.  You cannot have a table containing information about both the patients and the providers.  Doing so unnecessary adds confusion about where data might be found.  It also forces us to add provider information when adding patient information, increasing the redundancy of the data.  Always ask yourself what is the table about.  If you can tell it is about one entity and one entity alone, then you are on a good start to design of databases.

Rule 2:  Rows in the table should correspond to individual occurrences of the entity

The second rule makes sure that each row in the table is an occurrence of the entity and nothing else.  . A row in a table often is referred to as a record.   Every record should be an occurrence of the entity.  So for example, in the patient table, each record should be one patient.  In the provider table, each record should be a provider.  This seems very simple rule but I have seen it violated.  Consider a table of visits. Each record should be about one visit.  I have seen occasions where multiple records are given to the same visit or some records in the visit table are about the provider's characteristics.  These are not reasonable ways to construct a table.  Always check the rows of the table to see each is an example of the entity.

Rule 3:  Primary key should uniquely identify the individual occurrences of the entity or relationship

The third rule makes sure that the primary key uniquely identifies the occurrences of the entity.  If we know the primary key, we should know which row in the table is involved.  A primary key should not be used for two rows in the same table.  For example, first name and last name are often insufficient for uniquely identify a patient as many patients have same names.  But a combination of first, middle, last name and date of birth might uniquely identify the patient.  A social security number uniquely identifies the patient. 

Rule 4:  Non-key fields should be facts about the occurrence  identified by the primary key

Rule 4 is to make sure that all fields in the table are facts about the primary key.  If the fact is about something else, it does not belong to the table.  For example, a name belongs to a table about patients.  It is a fact about the primary key that reflects individual patients.  Likewise, date of birth is another fact about the patient and belongs to the table.  But is diagnosis a fact about the patient?  At some level yes, but in reality it is not just about the patient.  A patient may have many different diagnosis at different times, so in reality diagnosis is a feature of the patient’s visit and not really the patient.  It is not a persistent fact about the patient.  Therefore, diagnosis belongs to the table of visits and not to the table of patients. 

Rule 5:  Each fact should be presented only once in the database

Rule 5 is to make sure that we do not present a fact more than once anywhere in the database.  Many students think that data should be presented in the sequence found or reported so that it is convenient to do use the data.  But Standard Query Language makes all data in the database available at all times.  There is no need to duplicate the information.  For example, it is not reasonable to list the name of the patient's provider in the visit table, when the name is available in the provider table.  We collect and often report who took care of which patient, but the information gets stored in different places.  It is comfortable to have everything we want in a table -- it feels that the data is accessible.  But in reality we are talking of a machine and all data from all tables are accessible.  Therefore, there is no need to duplicate the data.  . 

First Normal Form

Normalization is the process of applying principles of design to data structures so that they conform to our expectations. In this lecture we discuss three additional rules that are used as part of putting a database in Normal form.


Figure 1:  Violation of first Normal form

Take a look at the table in Figure 1 titled households. It has four fields, street address, zip code, first person in the household, second person in the house hold. Is this a reasonable table structure? Think about it what could go wrong. I see two problems in this table structure. First, user needs to know how occupants are listed. How would anyone know if they are the first person or the second person to be listed. Second, the listing is not extensible, what if the household had more than two people? Where would we put their name.

We can avoid the error we just talked about by putting tables in the first Normal form. A table is said to be in the first Normal form if and only if all fields contain only atomic values and there are no repeating fields in a row. By atomic we mean that it is not composite of two facts, e.g. total charges is a composite of hospital and outpatient charges and cannot be a field in a table in first Normal form. By non-repeating fields I mean that the same information should not be stored in two different fields. In the previous table household residents were listed in two fields. The first Normal form does not allow this. For another example, if the patient has 5 diagnosis during a hospital visit, it is not reasonable to list them as five fields in the same table. In these circumstance, the recommendation is to list a code that points to a different table that lists the diagnosis as 5 separate records in the table. The point is that the first Normal form does not allow any repeating fields.


Figure 2:  Solution for violation of first Normal form

Figure 2 provides the revision of the household table. We have put this table in first Normal form. The composite field address has been separated into atomic fields street number and street name. The repeating fields name have been listed as one field, requiring the need to add an additional record for each resident of the household. You can of course avoid the row duplication by separating the Household table into two tables, one containing the resident names and the other containing the address.

Second Normal Form

The discussion of the second Normal form requires us to introduce the concept of dependency. An Attribute Y is Functionally Dependent on an Attribute X, if a Value for X Determines a Unique Value for Y. X may be a Set of Attributes or a single attribute. We show this as an arrow going from X to Y and read it as X determines Y.

For example, we could say that an employee number functionally determines the employee name. This is the same as to say if we know the employee number we will know his name. The reverse is not always true. Knowing an employee name is not sufficient for guessing the employee number. So “Employee name” is functionally dependent on employee number but not vice versa.

We need one more definition before we can state the second Normal form. We say that an attribute is fully functionally dependent on another set of attributes, if and only if it is not dependent on any smaller subset of attributes. An employee number may be fully functionally dependent on employee name and date of birth as it is not dependent on either the name or date of birth by themselves.

Here is another example: employee name is not fully functionally dependent on the attributes employee number and department as it is functionally dependent on just employee number. Just knowing the employee number by itself is sufficient to know employee name. We do not need the additional information about the department. In essence, a fully functional dependency is the smallest subset need to establish functional dependency.

Now we are ready to state the second Normal form. A Table is in Second Normal Form if and only if facts in the table are fully functional dependent on the primary key. This is a very important principle. It says that the primary key is the minimum subset necessary to determine each fact in the table.


Figure 3:  Violation of second Normal form

Consider the table of Households in Figure 3. Is it in violation of the second Normal form? In the table there are five fields, street number, street name, zip code, name and employment. This is a typical table that might be constructed from a survey of residents of a household regarding their employment.


Figure 4:  Solution to violation of second Normal form

Note the problem in Figure 3, employment is not fully functionally dependent on the household address. Employment is a fact about the residents of the house. Therefore it should be separated into a different table. Each fact in the table should be about the primary key and fully functionally dependent on the primary key (see Figure 4). The example we have used often by now, whether diagnosis belongs to a table of patients can now be addressed more formally. No diagnosis cannot be in the patient table because it is fully functionally dependent on the patient as well as date of visit. The primary key in the patient table does not have the additional date of visit and therefore diagnosis cannot be put in this table.

Third Normal Form

A Table is in Third Normal form if and only if there are no combination of strictly informational fields (not primary key fields) that determines the value of another field.  


Figure 5:  Violation of third Normal form

Consider the table about clinic visit charges in Figure 5. Is the table in violation of third Normal form? For simplicity consider that what is not paid by employer is paid by employee.  Think it through. Is there a combination of facts in the table that functionally determines another fact? Yes, three of the four fields (charges, co-pay, whether patient has met deductible and employee contributions) will tell us the fourth. So one piece of information is not needed. Depending on the use we want to put the table to, we could drop the last two fields and keep only employers contribution and charges. We can deduce that the rest is paid by employee. If a piece of information can be computed from other facts in the table, it should not be part of the table.

Summary

Database design is more than listing of fields and includes judicious decisions about tables and their relationships. Tables that are properly normalized can be exploited by a wide variety of end users, who do not have to be familiar with the structure of the data. We provide five rules of design and 3 principles of Normalization that we encourage you to follow. These were:

  1. Each table should correspond to a single entity
  2. Each row should correspond to occurrences of the entity
  3. Facts in the table should describe the primary key
  4. Each fact should be represented only once in the database
  5. No composite or repeating fields should be used
  6. No combination of facts should determine the value of another
  7. The primary key should uniquely identify the entity
  8. All facts should be fully functional dependent on the primary key

Here you have it, eight simple rules to live by. The eight commandments of good database design.

What Do You Know?

Advanced learners like you, often need different ways of understanding a topic. Reading is just one way of understanding. Another way is through doing and practicing the concepts learned in this section.  The following assessment is designed to get you to think more about the concepts taught in this session.

Ask Us Policy.  You can complete these assignments in class, when the teacher is available.  You can also ask questions about this assignment from the teacher online by clicking on the Get Answers option.  If you see something that you are not sure about, ask and we would be delighted to answer it.

Please review your work from previous assignments on Mental Health Court Database and find the Normalization errors in the work.  Write a report on addressing the following:

  1. Create a summary table of the design and list the following:


  2. Entity name Primary keys Foreign keys List of other related entities
           
           
           


  3. Are there two entities with same primary key.  Identify them and indicate which rule of Normalization is violated.
  4. Are there any entities without a primary key.   Identify them and indicate which rule of Normalization is violated.
  5. Are there any composite facts in any single table.  Identify them and indicate which rule of Normalization is violated.
  6. Are there any composite facts or redundant fields across tables.   Identify them and indicate which rule of Normalization is violated.
  7. Is there any fact in a table that is not entirely about the primary key of the table.   Identify them and indicate which rule of Normalization is violated.
  8. Are there too many relationships specified and is it possible to drop some specified relationships and yet be able to provide the same information across tables
  9. What suggestions you have for additional improvement.  Be concise and do not use any adjectives.

Complete this assignment individually and not in group.  Organize the review with above listed headings.  Send your review to the the instructor.

Presentations

There are three sets of presentations for this lecture:

  1. Normalization of Databases (Slides, updated Slides, or Narrated  Slides)

Narrated slides and videos require Flash.

Recently Asked Questions

Ask a question and we will answer it within the next 48 hours.  If you have no questions, please review the answer to the questions asked by others: 

Question: Is it a redundant field if you are using foreign keys across more than one entity? Example: ClientID is a foreign key in both the Client_CaseWorker and Client_Provider entities; same with the FAST_Encounter. Does this mean that we can not have a foreign key populating more than one entitiy? Answer: No. Foreign keys are needed to maintain relationships between entities and are not counted as redundant.  This question was left on 11/14/2008 6:12:03 PM and answered on 11/15/2008 10:52:04 PM.

Question: If the ER diagram submitted by another student to us for evaluation does not include scenarios and use cases, should we just document the absence of those elements for this assignment? Answer: Yes, please do so. But pay special attention to Normalization issues  This question was left on 10/12/2006 10:18:48 PM and answered on 10/13/2006 8:13:47 AM.

Question: Does the assignment for HSCI 709 have to be done on the ER diagram for the mental health court database, or may we do it on another ER diagram assignment from this course like the EMR project? Answer: You can do it on a different topic, if you have access to the relevant information and if it is useful for your career or work.  This question was left on 10/11/2006 9:56:50 PM and answered on 10/13/2006 8:03:37 AM.

Question: Week 6 my name Victoria Doyon DoyonVA@hotmail.com does NOT appear on the list for current 709 students. Hence, I'm not confident that the list is current for Fall 2006. Please advise as we need to ask another classmate for their ER diagram for the mental health court database. Each student must solicit the work of one other student and each student should provide their work to only one other reviewer. Will you confirm to the class list via e-mail either that the list has been update AND when my name has been added, please. Kindest regards, Answer: The list is current. You need to add your Instant Messenger to the list. Look at the syllabus for the course in the first week and you will see the option there.  This question was left on 10/6/2006 9:59:49 AM and answered on 10/6/2006 12:16:37 PM.

Question: I am having difficulty opening full text articles in PubMed. Do you have suggestions as to the best way to get the articles to read? I think I will try to access them through the GMU library site. But wonder if there is a more direct route. Answer: If the cite has been set up to open as full text article, it should ask for your G number and provide you with full text. If you are having difficulty please let me know, perhaps the library protocol has changed and we need to correct the problem. Just email me what were you trying to do and on what page.  This question was left on 10/3/2006 10:42:51 PM and answered on 10/4/2006 11:00:47 PM.

Question: in the assignment#2 do we provide the primary keys as indicated from our own evaluation if they are not provided in the work we are assessing, also can it be applicable to the other fields needed in the table?Thank you Answer: Yes you can suggest the primary key or leave it blank indicating lack of the key. Of course you can also suggest additional fields, but keep in mind that the designer may not face the same situation that you face  This question was left on 10/27/2005 3:23:23 PM and answered on 10/27/2005 10:06:21 PM.

Question: In the assignment, when you ask for which rules of normalization are violated, are you referring to the Design rules or a violation of a specific normalization form? Answer: My preference is if you give a specific Normalization form rules  This question was left on 10/23/2005 12:51:34 PM and answered on 10/24/2005 8:00:07 PM.

Question: what is the purpose of functional dependencies?what are the three types of decomposition? Answer: Functional dependencies create inefficiencies in the database and should be avoided to the extent possible. If a fact in a table is functionally dependent on other facts in the table, then it should not be stored in the database saving space. In addition, facts in tables should be fully functionally dependent on the table's primary key.  This question was left on 4/26/2005 4:34:15 AM and answered on 4/30/2005 9:29:19 AM.

Question: I created the tables and the fields, but it seems as though alot of the tables are tied together with a field entitled client ID. Am I only suppose to use this field once in the entire data base or can I use it more than once to tie several tables together??? Answer: Each table can have a field called ID as primary key. But the meaning of this field should be different in each table. For example, in the patient table ID refers to the patient ID (e.g. medical record number). In the visit table the ID refers to the visit number. You should not have the same ID in two tables as primary keys as that will make the two tables about the same entity.   This question was left on 3/20/2005 12:27:28 PM and answered on 3/20/2005 5:43:59 PM.

Suggested Changes

In this section, you can review suggestions made by you and others regarding how to improve this lecture. 
 

Comment: Needs more updating. Very confusing.  This comment was left on 10/25/2009 11:37:14 PM.

Comment: good videos  This comment was left on 10/11/2009 7:14:00 PM.

Comment: I liked how this lecture was quite detailed and easy to understand.  This comment was left on 11/26/2008 5:52:50 PM.

Comment: This was a more complex topic and some additional examples might have been helpful. Some of the definitions actually made things more confusing when you reviewing some of the previous work submitted.  This comment was left on 11/18/2008 3:16:20 PM.

Comment: This was by far the hardest class so far. The normal forms are still confusing but still working at it.   This comment was left on 10/9/2008 10:59:18 PM.

Comment: just tell me  This comment was left on 11/23/2006 11:56:15 PM.

Comment: Does the assignment for HSCI 709 Normalization of Databases have to be on an ER diagram for the mental health court database or may it be done on one of the other ER diagrams from assignments?  This comment was left on 10/11/2006 9:54:30 PM.

Comment: Please inform students that questions submitted via this site should include HSCI 709 so the instructor can determine to which course the questions pertain.   This comment was left on 10/11/2006 9:49:34 PM.

Comment: Please make it clear that there is no on campus class for HSCI 709 Fall 2006 semester. Therefore students can not "meet with the instructor" or work on the project during class, because there is no HSCI 709 live class this semester. Thank you.  This comment was left on 10/11/2006 9:46:39 PM.

Comment: The more times I listen to the narrated slides and read through the lecture, the better I understand the concepts. I pick up additional points and understanding with each repetition. There are many important points to absorb. The narration is very clear and easy to listen to.   This comment was left on 10/3/2006 10:12:11 PM.

Comment: Better examples in the lecture would help more with the homework. It seems to vague and less specific to help with the homework  This comment was left on 3/20/2005 12:29:44 PM.

Comment: This was an extremely useful lecture for me. It answered a lot of my questions. It directed my focus on the data base design. I feel I understand it better now.   This comment was left on 3/7/2005 8:02:32 AM.

More

 


This page is part of the course on Healthcare Databases, the lecture on Normalization of Databases.  For assistance in use of this lecture write to Farrokh Alemi, Ph.D.  Created on January 9 2005.  Most recent revision 11/27/2007.