Stabilizing Server Software
Software that breaks all the time can be made stable. Here is what I would do if I had a team of developers and system operators and wanted to make server software stable.
The first step is documentation. People need to understand how things work.
This documentation provides a starting point to keep everyone on the same page.
Here are the kinds of documentation I would ask for.
Where is the software deployed? How is it configured? What servers does it communicate with?
You really can't document too much here. When the poor sap who has to diagnose an issue needs to look into what is going on, being able to see how things are put together is critical.
How do you do a deployment?
How are bugs filed and tracked?
Everything the team does needs rudimentary documentation. This is not because the processes should be rigid -- they should not -- but because we need to have a common understanding of how things are done.
What projects are being worked on? What will they do?
Documentation on projects is helpful because, when you go to diagnose an issue, it will often help people understand what is going on.
Meetings serve these purposes:
- Distribute information
- Allow discussion
- Make decisions
- Make and keep assignments
I can't speak to everything on what makes a good meeting. Here are some tips:
- One person is the chair. He decides who gets to speak when and about what. This does NOT have to be the most senior person. The senior person should designate a chair if it is not going to be him.
- ANOTHER person is the secretary. He records what decisions were made, and records what assignments were made.
- Every meeting has an agenda. If you don't write it down, then other people are guessing what it is.
- Keep the tone civil. That means let people finish what they are saying. That means keeping the Golden Rule.
- Honesty is more important than civility, though. If something needs to be said, then it should be clear that it should be said. Kudos to whoever says it. Bonus points if it is said civilly.
A Council is when two or more minds come together. The idea is that there are no superiors. The group needs to make a decision, and it needs all the relevant information and expertise. In a council, people should be allowed to speak freely and frankly. Everyone should be given a chance to speak their mind. Council members should think carefully about what they say, however, and not waste people's time. It's a difficult balance. It should be clear to members when they are in a council and when they are not.
Decisions are made by whoever owns the decision. Once it is made, it is not revisited unless the person is willing to revisit it. Due respect should be shown to the person who makes the decision, even if it is the wrong one. Up until the decision is made, he should be clear about what decision he is thinking about making and why. This gives people an opportunity to persuade the individual in an open, frank manner, and avoid discussing things that simply aren't relevant.
If someone is willing to put things up to a vote, it should be clear whether the results are binding or not.
Kinds of Meetings
- Regular review meetings: These keep leaders informed about the current status of things and give everyone a chance to inform each other. They might include:
  - A regular (weekly) bug review meeting. What bugs are in the queue? Which ones are we going to prioritize? What bugs have been solved? What are the next steps for particular bugs?
  - A regular review for individual projects.
  - Team meetings to review what each person is working on and coordinate efforts. (Perhaps daily)
- Informative meetings
- Policy review meetings
You can make software stable by closing the bug loop. Bugs are found, reported, investigated, and resolved.
It is important that bug fixes are not applied without much thought. Some bugs are really bugs in the design, and must be addressed by the relevant designers and architects. Others are bugs in code, and need to be resolved with code changes. Others are caused by configuration or system problems.
Getting the diagnosis right is critical.
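To make the loop concrete, here is a rough sketch of a bug record that can only move forward one stage at a time and cannot be marked as investigated until someone has written down a diagnosis. The names `Bug`, `Stage`, and `Diagnosis` are made up for illustration; a real tracker will have its own fields and workflow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    FOUND = 1
    REPORTED = 2
    INVESTIGATED = 3
    RESOLVED = 4


class Diagnosis(Enum):
    DESIGN = "design"                # belongs to the designers and architects
    CODE = "code"                    # needs a code change
    CONFIGURATION = "configuration"  # configuration or system problem


@dataclass
class Bug:
    title: str
    stage: Stage = Stage.FOUND
    diagnosis: Optional[Diagnosis] = None

    def advance(self, diagnosis: Optional[Diagnosis] = None) -> None:
        """Move the bug one stage forward; stages cannot be skipped."""
        if self.stage is Stage.RESOLVED:
            raise ValueError("bug is already resolved")
        if diagnosis is not None:
            self.diagnosis = diagnosis
        next_stage = Stage(self.stage.value + 1)
        # A bug cannot count as investigated without a recorded diagnosis.
        if next_stage is Stage.INVESTIGATED and self.diagnosis is None:
            raise ValueError("record a diagnosis before marking as investigated")
        self.stage = next_stage
```

The point of the guard is only that the diagnosis step cannot be skipped on the way to a fix.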
A regular (weekly) bug meeting will help "tame" the queue.
Developers or some other individual with a very technical understanding of the system could guide the bug meeting. People can rotate from bug duty to regular duty. "Big" bugs can turn into projects.
A project has many details that need to be implemented. Each of these details should either be completed or come with a stated reason why it was not. Introducing half-baked features and projects into an unstable platform causes problems.
The variety of testing that needs to be done cannot be overstated. If a particular test cannot be done, then alternatives must be fully explored.
- Unit testing: These are small, fast programs that test that the code behaves as expected. Ideally, unit tests should exercise each line of code written and confirm that it does what we think it is supposed to do. Practically, every function should have a unit test. You'll find that if you try to write "big" unit tests, you'll end up wishing you wrote small ones. So write small ones, and avoid the big ones.
- Functional testing: These are large, slow, perhaps automated tests that see if the system behaves as expected. These should be run by the developer before code is checked in.
- Stress testing: These are tests to see where the limits of the software lie. Although relatively few, good stress tests can help people see whether code is improving or hurting performance.
- Integration testing: These are tests where the system is put into a real-life simulation or where real data is fed into the system. The results should either be compared to live systems or be carefully reviewed to ensure that they are expected and reasonable.
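As an illustration of what "small" means for unit tests, here is a minimal sketch using Python's standard `unittest` module. The function `parse_port` is a made-up example, not something from a real system; each test checks exactly one behavior.

```python
import unittest


def parse_port(value: str) -> int:
    """Parse a TCP port from a config string; reject out-of-range values."""
    port = int(value)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_accepts_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_zero(self):
        with self.assertRaises(ValueError):
            parse_port("0")

    def test_rejects_port_above_range(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_rejects_non_numeric(self):
        with self.assertRaises(ValueError):
            parse_port("http")
```

Saved in a file, these can be run with `python -m unittest`. When one fails, its name already tells you which behavior broke, which is the main reason to prefer small tests.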
Monitors provide a way to see where the system is broken.
Monitors should be attached to EVERY feature that is expected to work. Start with high-level things, but do not overlook the details. For instance, monitor the overall database size, but also, monitor the size of each table for unexpected values.
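A per-table monitor could be sketched like this. It assumes something else has already collected the row counts; the table names and the expected ranges are invented for the example.

```python
from typing import Dict, List, Tuple


def check_table_sizes(
    row_counts: Dict[str, int],
    expected: Dict[str, Tuple[int, int]],
) -> List[str]:
    """Return one message per table whose row count is missing or out of range."""
    problems = []
    for table, (low, high) in sorted(expected.items()):
        count = row_counts.get(table)
        if count is None:
            # A missing measurement is itself worth reporting.
            problems.append(f"{table}: no measurement")
        elif not low <= count <= high:
            problems.append(f"{table}: {count} rows, expected {low} to {high}")
    return problems


# Hypothetical ranges; real ones come from watching the system over time.
EXPECTED = {"orders": (1_000, 5_000_000), "sessions": (0, 200_000)}
```

For example, `check_table_sizes({"orders": 120_000, "sessions": 950_000}, EXPECTED)` reports only the `sessions` table, since `orders` is within its range.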
When a monitor is triggered, it should either create a notification that can be addressed later, or send out an immediate alarm that will wake up the people who are on call to investigate. Monitors should never be ignored, even if they are low priority. Sometimes low-level, unimportant alarms will give valuable insight into the system.
The overall health of the system should be tracked by key metrics: How much money is being made? How many transactions are occurring? These should be surfaced on a single page where anyone can see at a glance whether the system is healthy. An overview of the monitoring condition should also appear there.
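Here is one way the two paths might look in code. This is only a sketch: `Alert` and `AlertRouter` are hypothetical names, and the two lists stand in for whatever paging and ticketing systems are actually in use. The thing to notice is that every alert lands somewhere; nothing is dropped.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    monitor: str
    message: str
    urgent: bool


@dataclass
class AlertRouter:
    pages: List[Alert] = field(default_factory=list)    # wakes up whoever is on call
    tickets: List[Alert] = field(default_factory=list)  # reviewed later

    def route(self, alert: Alert) -> str:
        """Send urgent alerts to the on-call queue and everything else to review."""
        if alert.urgent:
            self.pages.append(alert)
            return "page"
        self.tickets.append(alert)
        return "ticket"
```

The low-priority queue is a natural input to a regular review meeting, so the unimportant alarms still get looked at.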
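A single-glance overview can be as simple as a few lines of text. The sketch below assumes each key metric has a minimum healthy value; the metric names and thresholds are placeholders, and a real page would pull the numbers from wherever they are recorded.

```python
from typing import Dict


def health_overview(
    metrics: Dict[str, float],
    minimums: Dict[str, float],
    triggered_monitors: int,
) -> str:
    """Build a short text summary: one verdict line, then one line per metric."""
    lines = []
    healthy = triggered_monitors == 0
    for name, minimum in sorted(minimums.items()):
        value = metrics.get(name)
        ok = value is not None and value >= minimum
        healthy = healthy and ok
        shown = "missing" if value is None else f"{value:g}"
        lines.append(f"{'OK ' if ok else 'LOW'} {name}: {shown} (minimum {minimum:g})")
    lines.append(f"triggered monitors: {triggered_monitors}")
    header = "SYSTEM HEALTHY" if healthy else "SYSTEM NEEDS ATTENTION"
    return "\n".join([header] + lines)
```

The first line is the verdict; everything under it explains why.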
Logs should report every interesting or relevant bit of information. Errors should be monitored. Turning up the log level should provide detailed information about the process. Logs should be archived for a long period of time.
Think: If I were investigating a problem, what information would I like to see? Put that in a log.
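With Python's standard `logging` module, that might look roughly like the following. The logger name `payments` and the function `charge` are invented for the example. The detail goes in at DEBUG, the normal outcome at INFO, and the failure at ERROR, each with the identifiers an investigator would want.

```python
import logging

logger = logging.getLogger("payments")  # hypothetical subsystem name


def charge(order_id: str, amount_cents: int) -> bool:
    """Pretend to charge an order, logging what an investigator would need."""
    logger.debug("charge start order=%s amount_cents=%d", order_id, amount_cents)
    if amount_cents <= 0:
        logger.error(
            "charge rejected order=%s reason=non_positive_amount amount_cents=%d",
            order_id,
            amount_cents,
        )
        return False
    logger.info("charge ok order=%s amount_cents=%d", order_id, amount_cents)
    return True


def configure_logging(level: int = logging.INFO) -> None:
    """Call once at startup; pass logging.DEBUG to turn up the detail."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
```

For archiving, one option in the standard library is `logging.handlers.TimedRotatingFileHandler`, which rolls log files over on a schedule; how long to keep them is a policy decision this sketch does not make.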
A triage document should help people trace bugs, or monitors that have been triggered, back to their root cause. It is impossible to write this document in advance. Over time, however, it should be adjusted. Every time a new issue is found, how it was diagnosed should be added to the triage document.
In the end, it should look like a decision tree. The starting points are alarms or general problems. Steps describe things to check, and what to do next depending on how things look.
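The decision-tree shape is easy to sketch as data. Below, each `Check` is a yes/no question that leads either to another check or to a piece of advice. The alarm, the questions, and the advice are all invented for illustration; a real triage document grows its own from the issues actually seen.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Check:
    question: str
    if_yes: Union["Check", str]  # another check, or advice text at a leaf
    if_no: Union["Check", str]


def walk(node: Union[Check, str], answers: Dict[str, bool]) -> str:
    """Follow the tree using the answers so far; return advice or the next question."""
    while isinstance(node, Check):
        if node.question not in answers:
            return f"Next step: {node.question}"
        node = node.if_yes if answers[node.question] else node.if_no
    return node


# Hypothetical starting point: an alarm that checkout errors are high.
CHECKOUT_ERRORS = Check(
    question="Is the database reachable from the app server?",
    if_yes=Check(
        question="Did a deployment finish in the last hour?",
        if_yes="Compare error logs before and after the deployment; consider rolling back.",
        if_no="Check the payment provider's status and recent configuration changes.",
    ),
    if_no="Check database host health and network configuration.",
)
```

Calling `walk(CHECKOUT_ERRORS, {})` returns the first thing to check, and each recorded answer moves one step further down the tree. Adding a newly diagnosed issue means adding a branch.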