Data Modelling

From Jonathan Gardner's Tech Wiki
Jump to: navigation, search

Introduction

Data Modelling is the practice of taking a real-life situation and modelling it in a database.

If the model is well-built, then building applications on top of the model is all but trivial to do. If the model is not well-built, then applications will never work right. Think of data modelling as the foundation for all the rest of the work you will do.

Architecture

The first thing to consider when data modelling is the architecture.

In my experience, if you are building a web application, you are going to have the following architecture as your website becomes more and more popular.

  • Database backend. The database is likely not a single SQL database, but a collection of databases and also a collection of data sources that are not databases.
  • Service layer. No application directly interfaces with the databases. Instead, they must work through a service layer. The job of this layer is two-fold: Apply application rules and manage the data among the various data sources.
  • Application layer. On top of the services is the application layer. These are the various applications that require access to the data. As time goes on, the number of applications will grow from one to ten to hundreds to thousands. Applications may include other services.

In the beginning, all of these layers may exist within the same web server running on a single host. However, as the website grows in popularity, you will begin adding more and more web servers, and each web server will have a more and more specialized role. Eventually, you will begin to separate out the layers from each others.

Steps in Data Modelling

The steps in data modelling are as follows.

  1. Define the problem scope. What are the things that are doing stuff and what do they do? How do they relate to each other?
  2. Specify the things and relations formally. Sit down and write down what exactly the things are and how they relate and behave.
  3. Specify a schema to handle the specification. Translate the specification into database tables and such.
  4. Normalize the schema. Apply the basic normalization patterns.
  5. Specify access and modification patterns. Determine how the data will be used.
  6. Implement the application. This step must come before optimization.
  7. Optimize data access and modification.

Defining the Problem Scope

This is the hardest part to do, and it has to happen inside someone's head. All the various problems can be expressed in terms of bits of data and how they relate to each other. Focus on these questions.

  • What types of data are there?
  • What attributes do these types have?
  • How can we identify a specific instance of a type?
  • How does one instance of a type relate to another instance of the same or different type?

If it helps, draw out a picture. Make boxes for each type. Write down the attributes they have. (Ignore for now what type the attributes are.) Draw lines from the different types to each other, and label the lines with how they relate. It's okay to have lines loop back on to the same type. It's also okay to have a line split and branch and merge.

With this definition, you should be able to see how things will work. To solve the various problems in your problem scope, begin talking about data operations. These are:

  • Create a new instance of a type.
  • Modify an existing instance of a type.
  • Delete an existing instance of a type.
  • Query for certain instances of a type.

These are the most basic operations you can do.

Example: Let's say you want to model a comment system. Visitors will be able to leave comments. Organizing your thoughts, you see that there are visitors and there are comments. The visitor is just a name and an email address. The comment is a subject, the comment itself, the time it was entered, and who entered it. Comments may also be in reply to existing comments. You draw a line from the comment to the visitor and label it "comment author". You draw a line from the comment to itself and label it "in reply to".
Thinking of how the system will work, you imagine a visitor placing a comment. He enters his name, email address, the subject of the comment, and the comment. This would create a new comment that points back to the visitor. If the visitor wants to write a reply, he can select a comment to reply to, and the new comment will be marked as in reply to the old comment. Perhaps a visitor may want to read all the comments. In that case, you would grab all the comments, sort them by date, nest the ones that are a reply under the ones that they are a reply to, and display them.

Specify the Types and Relations Formally

This step should be pretty easy if you have thought through the design well.

First, write down a list of all the types you came up with. Under that, list the attributes of each.

Next, write down the relations. You have to answer the following questions:

  • How many things on one side can be related to how many things on the other? Is it a one-to-one, one-to-many, or many-to-many relation?
  • Is it okay if there is nothing on the other side? That is, can you have zero-to-one or zero-to-many relation? If zero is okay, and one is okay, then you are probably really thinking of a many relation.

With this, you are very close to being able to specify the tables. But before we do, let's clean up our design a little bit.

If you have a one-to-one relation somewhere, that suggests that the two types should be merged into one, larger type. Think if this makes sense. It probably does.

Example: We may have a thing called a 'visitor' and a thing called a 'member'. There is only one visitor per member, and one member per visitor. Since this is always going to be true, it really suggests that visitors and members are one and the same thing.

Specify a Schema

Now, this is the fun step.

For each type, create a table. For each attribute of the type, you need a column. You need to think, "What is the type of this column?" Is it a number, string, date, or something else?

Primary Keys

Every table needs a primary key. This is a column, usually just called "id" or "tablename_id" that identifies that row uniquely. This should always be a number.

Primary keys need to have two properties: One, that they are unique. That is, no two rows in a table have the same primary key. Two, that they are not null. That is, no row can have a null value for the primary key.

To make sure that we get unique values for our primary keys, we use something called a sequence. This is an object that returns the next number in a sequence. We will use the next value of this sequence for the primary key value of the next new row in a table. Note that sequences are a bottleneck in extremely busy systems. It may come to the point where you are not able to generate new objects, such as orders, fast enough if you have a bottleneck at the sequence. There are ways to work around this, however.

Next we need to add our relations.

One-to-Many Relations

For one-to-many relations, create a foreign key in the table that is "many" that references the table that is the "one".

What is a foreign key? It is a column that references the primary key of a table. That is, if a column that is a foreign key to another table has the value of 7, then that means we are talking about the row in the other table with a primary key value of 7.

Example: A visitor can post several comments. That is a one-to-many relation. So we need a column in the comment table called "author". This will refer to a row in the visitor table.
Example: A comment can be in reply to only one comment, but a single comment can have multiple comments that are in reply to it. So in the comment table, you'll need a "in_reply_to" column that refers to the comment it is in reply to.

Many-to-Many Relations

Many-to-many relations require a proxy table. This is a third table that keeps track of how the previous two relate. The two columns of this table refer to rows in the two tables.

Example: Let's say we allow a comment to be in reply to multiple comments. In this case, we'd need a third table, 'reply_comments'. The two columns are "comment" and "in_reply_to". If comment C and D are in reply to comments A and B, then there would be four rows in that table:
comment in_reply_to
C A
C B
D A
D A
Note that these proxy tables probably don't need a primary key. The primary key would be the comment/in_reply_to combination.

Handling Null

You may encounter a situation where it is okay to leave a foreign key blank (such as a comment that is in reply to nothing) or a value blank (such as a visitor without an email address.)

I suggest you avoid using NULL whenever possible, however. When you do have a NULL, write down what it means. Does it mean the value is missing? Does it mean that we don't know what the value is? Or does it mean something else?

Normalize the Schema

Oftentimes, people end up putting tables together that really should be put together in a better way. To ensure that you aren't doing this, we go through a process called normalization.

First Normal Form (1NF)

The first normal form eliminates any repeating column. This is done by allowing multiple rows with different values to represent each repeated item.

Example: A visitor can have several comments. So one might be tempted to create a table called 'visitor' that has columns 'comment1', 'comment2', 'comment3', etc... Instead, the comment table should have a column called 'author' that refers back to the visitor.

Second Normal Form (2NF)

Second normal form eliminates data that is duplicated needlessly based on a primary key.

Example: We have a table "comments" with the columns "id", "author", "author_name", and "comment". We also have a table "visitors" that has the columns "id" and "name". The "author" column of the "comments" table is a foreign key to the "visitors" table.
Now, assume that the visitor's name may never, every be changed. Then that would mean that, given the same author, comments would always have the same author name. That means that we shouldn't store the author_name in "comments", but instead only store it in the "visitors" table.

Third Normal Form (3NF)

Finally, third normal form eliminates duplicate data that depends on some other column that is not based on the primary key.

Example: We have a column "comment_length" in the comments table. This keeps track of the length of the comment. This is pointless and a waste of storage space and time because we can easily calculate the length of the comment.

Final Remarks

Normalizing data is an art the same way chess and karate are arts. It takes a lot of practice to get it right, and it helps to have someone make suggestions on how you can do better.

Specify the Access and Modification Patterns

Recall that the four basic operations are:

  • Find rows that match a criteria and look at the values of those rows. (SELECT)
  • Create new rows. (INSERT)
  • Modify existing rows. (UPDATE)
  • Remove rows. (DELETE)

Now, given the database schema we have put together, start writing queries that will do the job required by the problem space. This isn't wasted work. These queries will show how work is done, and it will expose weaknesses in your design or even missing types or attributes.

Pay attention to the SELECT queries. These are going to be the most complicated, especially when you are searching for a needle in a haystack. You may need additional attributes to help you find bits of data.

Avoid the temptation to optimize. Don't do it! Don't have indexes for anything but primary keys. You'll thank me later.

Implement the Application

This step is left as an exercise for the reader.

Optimize the Queries

Now, once the application is running (or partially running), you can start optimizing the data. First start with the slowest queries. These are both the queries that take the longest time to run and the queries that are run the most and take the longest total time. Start from the top, and rework it until it is no longer the slowest query. Then move on to the next slowest query.

You'll quickly notice that optimization is an engineering problem. You are trading things around. By making one query faster, you are making other queries slower. In the end, you'll have to find the happy medium.

Also, you'll quickly notice that as your dataset grows, the performance changes dramatically. You have to optimize according to the size of your dataset.

Don't forget that sometimes the best optimization is to move the work off the database completely and into a specialized system.