Database/Facts

From Jonathan Gardner's Tech Wiki
Jump to: navigation, search

Abstract

Some thoughts on facts, sources, and how facts that may contradict can be represented in a relational database.

Introduction

What is your birth date? You may be one of those people with a birth date recorded differently on your birth certificate or other birth records. Or your grandmother may insist you were born on a different day. Simple facts like these aren't so simple.

What is a fact? It's really a statement made by a source. If multiple reliable sources agree on something, then we consider that a universal fact, even if a few reliable sources disagree. Sometimes we want to represent each source's idea of what the truth is and work with that within our database.

By adding a few fields to each table in the database, and by relaxing primary key constraints, we can build a system of facts that consider how sources may not always agree with each other.

We can also decide what to do when the objects two sources were talking about turn out to be the same or different objects.

How to represent a fact

In the database world, we might have a table that looks like this:

Person
------
Person ID (primary key)
Name
Birthday

We take each entry to represent a different person. We also consider that the name and birthday associated with that person are true.

But this really isn't a correct statement. After all, truths only exist in logic games and mathematics. Really, the data we can collect from the real world is:

There is a birth certificate for a guy named Sam Smith who was born on July 5, 1984.
There is a social security record for a guy named Sam Smith who was born on July 8, 1984.
There is a birth certificate for a guy named Bill Jones who was born on September 8, 1963.
There is a tombstone in a particular cemetary for Bill Jones which says he was born on September 9, 1963.

We can't represent these facts in the database without creating four unique people.

Really, we have to abandon the idea that we can ever create a table with universal facts about identities, names, and birthdays. We instead have to create a table that represents the facts as we can discover them.

Person Records
------
Record ID (primary key)
Person ID (matching person IDs means we think that they are the same person.)
Name
Birthday
Source (Some kind of reference to the tombstone, birth certificate, or social security record.)

Note that the data we store in this table means something completely different from the data we stored in the previous table. In this table, we will have four entries. The two records concerning Sam Smith will relate to the same person (assuming we believe they are the same person.) The two records concerning Bill Jones will relate to the same person as well (assuming we believe they are of the same person.)

Notice the transformation we have made.

  1. Rename the table. (We don't want to confuse ourselves.)
  2. Remove the primary key constraint from the ID column.
  3. Add a Record ID column as the new primary key.
  4. Add a source column.

Now, when you search for the name and birthday of a particular person ID, you will get back multiple records. You'll have to write some kind of rules to pick and choose which ones to show, or how you will show all of them. Perhaps you can assign a reliability rating to the source and create some algorithm that will determine for you how reliable each value for each column is. From there, you can filter out the least reliable, or choose the most reliable records.

Note that this transformation applies to any table in your database. If you have some sort of table that represents data collected from potentially unreliable sources, you can make this transformation.

How to bring to objects into one, or split records of one object into two

In some cases, you may realize that two records actually reference the same person. In this case, all you have to do is choose one of the Person IDs and update all the tables with person IDs so that all of the similar persons have the same ID.

In other cases, you may realize that two records that refer to the same person are really referring to two different people. In that case, you have to create a new record, and gently update all the records that refer to the different person with that ID.

This may be a difficult task to do, but it will help to show the user which records are for which person and allow them to sort it out themselves.