How Databases Work

From Jonathan Gardner's Tech Wiki
Jump to: navigation, search

Introduction

What you think you know about databases is probably not correct. What you don't know, you probably need to know.

Database scaling

Data storage is limited, in the end, by how fast you can write to your disks. If you really want your database to go faster, you have to (a) minimize all the stuff that goes on the disk, and (b)... minimize all the stuff that goes on disk.

Some databases get limited by how much they can read. This again is fundamentally due to how much data you can read off of the disks at a time.

Sure, there are other limitations such as bandwidth, CPU, and memory, but most of the time it comes down to the disk.

If you want to make your database "go faster" after you have made it as fast as possible, then you need to do one of these things:

  • If your reads are limited, setup up a master-slave database. The slave databases need not be a "real" database. It could be flat files or BDBs. These are valid alternatives to databases! It could also be a memory cache.
  • If your writes are limited, you need to partition your database. This is really, really hard and no easy solution exists out there, no matter how much money you have. There are two ways to partition:
    • Move some tables to one database and other tables to another. Note that this will kill the join performance, so choose wisely.
    • Move some rows from one table to one database and other rows to another.

Preparing for scaling

My advice is to move as much business logic off of the database as possible as early as possible.

Avoid stored procedures at all costs, except those that deal only with one row in one table.

Avoid sorting and paginating in the database.

Avoid foreign key dependencies.

Yes, this makes no sense from a DB design perspective, but it does if you have to eventually scale the data. Of course, it would be nice if you had a system that didn't have to scale, but that isn't always an option.

When you start partitioning (not IF but WHEN), then you will be glad you had already prepared yourself for it.

Availability

Machines crash. All the time. No matter how much money you spent on the hardware, some idiot is going to install a bad A/C unit in your data center, and your hardware is going to shut down due to overheating. Life sucks. Get over it.

How do you make databases available? You don't. When you have a situation where you need some machine to coordinate transactions for one object, there is a chance that machine will go down.

Instead, deal with it! Build a system that tolerates temporary (or extended) failure. For instance, go ahead and show the data from the cache, but don't allow anyone to update it. Or allow your data to migrate between nodes, so that stuff that needs to get updated even when the primary database isn't available can get updated on the secondary database.

Multi-Master Replication

Forget it. It won't work in practice. See, databases can't sit around all day waiting for all of their peers to agree that their change was propagated. This especially doesn't work when the peers are spread across the world. (Or one day, spread across worlds.)

Instead, deal with availability issues by making your system robust to failure.

Also, partition your data.

See Also