How to Partition Your Data

From Jonathan Gardner's Tech Wiki
Jump to: navigation, search

There are really two ways to try and partition your data:

  1. Get a better database, something that can do replication. I don't like this recommendation because it makes the entire system more complicated. Replication is hard, and doing it right is really, really hard. Better to avoid it if you can.
  2. Design your data in a way that it can be easily partitioned by the application. This is much better. Not only does your data design become a lot simpler, but you are building in resilience to failures in your app. Bonus points if you find a way to organize your data without any partitioning at all.

How do you make partitionable data? Basically, you "dumb down" your data.

  1. Stop using foreign key constraints.
  2. Better yet, stop using constraints in general, except NULL/NOT NULL.
  3. Stop relying so much on UNIQUE. You can use it for your PRIMARY KEYs, but nothing else.
  4. Move your primary keys all to a single value. If you have some other key, that's nice, but don't count on it.

Now that you have all your data as above, you can start partitioning it. What you do is get several databases running in parallel, each with the same schema. You'll spread the data in each table among the different databases. That is, some records will go in one and some in another.

How do you decide which database a row goes into? Well, that's tough. Usually, you can break it down to a single key on one of the tables. For instance, something like CUSTOMER_ID. Then, you just stick all the data connected to that one row in the same database. So all the purchases for that customer, all the orders, all the settings, all the history, go into the same database, while other customers go into other databases.

Make sense? Sounds simple? It isn't. As you add more databases, the chance that one will not be working increases. Eventually, you'll need failover. This isn't bad. It's actually good, because if you have failover working (that is, if you can't find the data in one database, look in the next one), then you can repartition.

What do I mean by this? Well, when you repartition, you are adding new, clean databases with no data. Now all the customers have to be instantly transported to a different database. Obviously, this can't happen all at once. So what do you do? Well, you look first for the data where it should be, and then where it should have used to be. That's kind of like a failover.

In the end, you are basically building a replication/failover/partitioning system using databases as the building block, rather than using a replication/failover/partitioning database. Since your data needs aren't as generic as what databases can provide, that's fine. Your system may actually be better, faster, and cheaper to run.

I've given some very broad pointers. The devil is in the details. Depending on your data model, you will find some pretty good answers out there or in your head.

Your system is going to look like, in the end:

client -> service -> data abstraction system -> fleet of databases

Good luck. If you can do this successfully, you deserve to be making more than $100k a year, by the way.