Database/My Best Practices

From Jonathan Gardner's Tech Wiki

Abstract

These are some thoughts on my best practices for database modelling.

IDs

IDs are always a number. They always come from a sequence and are not randomly generated. You do not reuse old IDs from deleted records.

I usually do not have multiple sequences, except in cases where I have to generate a ton of IDs very quickly across multiple databases. I try to partition the IDs so that an ID is unique across all tables. In some cases, this is difficult or impossible, and I don't try to force it. If I can't do it, I can't do it, and I don't mind too much.
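As a sketch of what a single shared sequence looks like in practice (using SQLite here; the `id_seq` table and the helper function are my own illustration, not a standard mechanism):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one global sequence, so an ID is unique across all tables
    CREATE TABLE id_seq (next_id INTEGER NOT NULL);
    INSERT INTO id_seq VALUES (1);

    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (id INTEGER PRIMARY KEY, label TEXT);
""")

def next_id(conn):
    """Hand out the next ID from the sequence; IDs are never reused,
    even after the record they belonged to is deleted."""
    (nid,) = conn.execute("SELECT next_id FROM id_seq").fetchone()
    conn.execute("UPDATE id_seq SET next_id = next_id + 1")
    return nid

pid = next_id(conn)
conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (pid, "Alice"))
aid = next_id(conn)
conn.execute("INSERT INTO account (id, label) VALUES (?, ?)", (aid, "checking"))
# pid and aid come from the same sequence, so they never collide
```

A server database would use a real SEQUENCE object instead of a counter table, but the idea is the same: the numbers only ever go up.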

Primary Keys

Primary keys are always a single number, one of the generated IDs above.

Some people like to segregate data in a single table across universes. That is, one universe has its own set of IDs, and another universe has its own set, and they might overlap. So they use a universe ID plus the regular ID as a composite primary key. I don't like this. If you have this situation, I believe it is better to translate the external ID into an internal one. That is, the universe ID + object ID on the outside of the database is mapped to a single ID on the inside with some kind of mapping table.
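One way such a mapping table might look (a SQLite sketch; the `id_map` name and its columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE id_map (
        id          INTEGER PRIMARY KEY,  -- the single internal ID
        universe_id INTEGER NOT NULL,
        object_id   INTEGER NOT NULL,
        UNIQUE (universe_id, object_id)   -- each external pair maps to one row
    );
""")

def internal_id(conn, universe_id, object_id):
    """Translate an external (universe, object) pair into one internal ID,
    creating the mapping on first sight."""
    conn.execute(
        "INSERT OR IGNORE INTO id_map (universe_id, object_id) VALUES (?, ?)",
        (universe_id, object_id))
    (iid,) = conn.execute(
        "SELECT id FROM id_map WHERE universe_id = ? AND object_id = ?",
        (universe_id, object_id)).fetchone()
    return iid

a = internal_id(conn, 1, 42)  # universe 1, object 42
b = internal_id(conn, 2, 42)  # universe 2, same object number, different ID
```

Everything inside the database then keys off the single internal ID; only the edge of the system knows about universes.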

Foreign Keys

Although I consider columns foreign keys, I don't let the database know. That is, I'll put in my notes that a certain column is a foreign key, but I won't put a foreign key constraint on the column.

Named Relations

Oftentimes you'll have a one-to-many or many-to-many relationship. In these cases, I try to give each relation some kind of user-defined name. In some cases, the ideal name is the array index representing the order of the objects, but usually it's an actual name.

For example, a person might have multiple addresses. This is a one-to-many relationship. Why not give each address a name? One can be "home", another "work", another "Uncle Bob's Garage", etc...
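A sketch of the person/address example with a named relation (SQLite; the `label` column and the uniqueness rule are my own choices):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL,  -- FK -> person.id (by convention)
        label     TEXT NOT NULL,     -- the user-defined name of the relation
        street    TEXT,
        UNIQUE (person_id, label)    -- at most one "home" per person
    );
""")
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO address (person_id, label, street) VALUES (1, ?, ?)",
    [("home", "12 Oak Ln"),
     ("work", "500 Main St"),
     ("Uncle Bob's Garage", "7 Back Alley")])

home = conn.execute(
    "SELECT street FROM address WHERE person_id = 1 AND label = 'home'"
).fetchone()[0]
```

Whether the label must be unique per person is a design choice; here it is, so "home" always means exactly one address.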

Facts

Facts from potentially unreliable sources can be a difficult thing to deal with. See Database/Facts for my thoughts on that subject.

Entry Time

I like to keep track of when stuff was entered and by whom. I add the column "created" which is a datetime defaulting to now. I also add "created_by" which defaults to the current user. If I had to choose, I would keep "created". High-velocity tables don't need this, and it would only get in the way. Otherwise, it will help you diagnose problems since you can correlate records with reported problems or log messages.
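A minimal version of this in SQLite (which has no notion of a current database user, so here the application fills in "created_by"; a server database could default it to the session user instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE note (
        id         INTEGER PRIMARY KEY,
        body       TEXT,
        created    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        created_by TEXT
    );
""")

current_user = "jdoe"  # in a real app, the authenticated user
conn.execute("INSERT INTO note (body, created_by) VALUES (?, ?)",
             ("first entry", current_user))

row = conn.execute("SELECT created, created_by FROM note").fetchone()
# created was filled in by the database: "YYYY-MM-DD HH:MM:SS"
```

With these two columns in place, correlating a suspicious record with a log file is a simple timestamp comparison.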

Limited Varchar

I usually don't like limits on a varchar field. I hate to have to write client code that is aware of limits on string length. When my users find out about the limit, it is usually not a pleasant experience. Why limit strings when our databases can handle unlimited strings?

Localization: Money, Addresses, Phone Numbers

Money isn't as easy to represent as you might think. If you're dealing with US dollars (are you really? You'll want to go international some day, right?), you can get away with NUMERIC(18,2). If you're dealing with some other currency, that's not going to work.

I wish there were some sort of Decimal type in databases that allowed for arbitrary-precision decimal floating point numbers. But that is not life.
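One workaround, sketched here under my own assumptions (store the amount as an exact decimal string next to an ISO 4217 currency code, and do arithmetic in the application with an arbitrary-precision type like Python's decimal):

```python
import sqlite3
from decimal import Decimal

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE price (
        id       INTEGER PRIMARY KEY,
        amount   TEXT NOT NULL,  -- exact decimal string, any scale
        currency TEXT NOT NULL   -- ISO 4217 code: 'USD', 'JPY', 'KWD', ...
    );
""")
conn.executemany(
    "INSERT INTO price (amount, currency) VALUES (?, ?)",
    [("19.99", "USD"),   # 2 decimal places
     ("1500",  "JPY"),   # 0 decimal places
     ("3.141", "KWD")])  # 3 decimal places

# Arithmetic happens in the application, exactly, per currency.
total_usd = sum(Decimal(a) for (a,) in conn.execute(
    "SELECT amount FROM price WHERE currency = 'USD'"))
```

You lose in-database arithmetic on the amount, but you never lose a fraction of a fils to binary floating point, and every currency's scale fits.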

For addresses, the same thing goes. You can't count on address1-address2-city-state-zip to work internationally.

Generally, I'd use some sort of system relying on ISO codes and a generic format. That generic format can be free text if need be. If you need elements of the address or phone number, you'll need to write a parser. It does make sense to store the results of the parser. It may make even more sense to refuse data that doesn't parse.

Let me explain. If someone enters a US address, then you expect to be able to extract city, state and ZIP code. So write your parser accordingly. Or better yet, when you ask them for their address, ask them for city, state and ZIP code. When you store it in the database, store it in a way that makes sense. Or you may want to store the parsed bits in nullable fields.
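A toy version of that idea (my own sketch: an ISO country code, the raw free text, and nullable parsed fields filled in by a deliberately simple US-only parser — a real parser would be far more forgiving):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        id      INTEGER PRIMARY KEY,
        country TEXT NOT NULL,  -- ISO 3166 code
        raw     TEXT NOT NULL,  -- the address exactly as entered
        city    TEXT,           -- parsed bits: nullable, filled when parseable
        state   TEXT,
        zip     TEXT
    );
""")

# Matches a trailing "City, ST 12345" on a US address.
US_TAIL = re.compile(r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$")

def store_address(conn, country, raw):
    """Always store the raw text; for US addresses, also store parsed bits.
    A stricter system might refuse US data that doesn't parse here."""
    city = state = zip_ = None
    if country == "US":
        m = US_TAIL.search(raw)
        if m:
            city = m.group("city").strip()
            state, zip_ = m.group("state", "zip")
    conn.execute(
        "INSERT INTO address (country, raw, city, state, zip) "
        "VALUES (?, ?, ?, ?, ?)", (country, raw, city, state, zip_))

store_address(conn, "US", "123 Main St, Springfield, IL 62704")
store_address(conn, "JP", "1-2-3 Ginza, Chuo-ku, Tokyo 104-0061")
us_row = conn.execute(
    "SELECT city, state, zip FROM address WHERE country = 'US'").fetchone()
```

The Japanese address survives untouched in the raw column; the US address additionally gets queryable city/state/ZIP fields.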

Don't Design a DB in a DB

If you have something like this:

Attribute Table
---------------
Object ID
Attribute
Value

Then you may be implementing a DB in a DB. Think hard about what you are doing. If you are in a situation where you don't know what attributes will be coming up, why is that? Are you just trying to future-proof your data, or is the nature of the data such that it cannot be nailed down to a specific set of attributes?

I'm not saying you shouldn't go ahead. I can think of a million good reasons to have certain parts of the schema indefinite like this. And sometimes future proofing is a really good idea. For instance, you may have objects that, down the road, people will want to add attributes to. Just create an attribute map and be done with it.
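An attribute map of that sort might look like this (SQLite sketch; the `widget` tables and sample attributes are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- the well-understood part stays ordinary columns...
    CREATE TABLE widget (id INTEGER PRIMARY KEY, name TEXT);
    -- ...and the open-ended part lives in its own attribute map
    CREATE TABLE widget_attr (
        widget_id INTEGER NOT NULL,  -- FK -> widget.id (by convention)
        attribute TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (widget_id, attribute)
    );
""")
conn.execute("INSERT INTO widget (id, name) VALUES (1, 'sprocket')")
conn.executemany(
    "INSERT INTO widget_attr (widget_id, attribute, value) VALUES (1, ?, ?)",
    [("color", "red"), ("weight_g", "42")])

attrs = dict(conn.execute(
    "SELECT attribute, value FROM widget_attr WHERE widget_id = 1"))
```

Keeping the map on the side, rather than making everything an attribute row, is what separates "future-proofed schema" from "a DB in a DB."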

(Personally, I think there may be a future relational DB that allows arbitrary attributes. Such a database would simply have objects, and those objects would have IDs, attributes, and values. The attributes need not even be unique. This sounds remarkably like the type systems of Lisp, Python and Perl, which is why I think there will be a database like this one day. In fact, there already is. The Semantic Web uses a data format of subject-attribute-value triples that gives you this kind of database.)