So Lambda: Bad Engineering Idea of the Week

I'm struggling to remember to post new stuff here, or maybe I'm struggling to find something new to say. Either way, I'm going to try a different strategy: post about software engineering things from my job--lessons learned, handy tips, interesting bugs, hard problems, etc.

Today's tidbit is a bad idea for fixing a bug. Our architecture has a primary server and a secondary server for backup purposes, both of which must be kept in sync to guarantee correct backup behavior. One new feature attempts to provide better error detection and feedback, a key part of which is determining whether the backup process is running.

For a little more context, the primary server already refuses client connections until it has completed a handshake with the backup process and verified a synchronized starting point. A simple socket connection and protocol determine whether the backup process is listening and carry out the handshake. If the backup process is not available, the primary server polls the socket periodically and waits indefinitely.
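To make the "is the backup listening" part concrete, here is a minimal sketch in Python. It is not our actual code: the function name, host, port and timeout are all made up for illustration, and it only covers the listening check, not the handshake protocol itself.

```python
import socket


def backup_is_listening(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if something accepts a TCP connection at host:port.

    Hypothetical helper: the name and parameters are assumptions for
    illustration, not part of the real product.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing is accepting connections at that address.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

The point is that the probe asks the question we actually care about: can the primary connect to the backup right now?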

The objective of the new feature is to watch the startup of the primary server and send status to a separate application that monitors the status of all servers and clients in the network. How would you solve this problem?

Well, one of our engineers decided to modify the startup batch script to log in to the secondary server, get a task list, and check whether the backup process was running. If it isn't, the script fails immediately and aborts the startup. Why is this a bad idea? Let me count the ways...

1. It introduces a new interface between the two servers that didn't exist before, which adds complexity to the model.
2. It introduces technology not used elsewhere in the product, namely logging in to other servers and using non-portable command-line tools.
3. It adds its own failure mode: an incorrect login or password on the secondary server.
4. It couples the two servers in the startup sequence, as opposed to letting both start independently.
5. It requires starting over completely to recover from the identified failure mode.
6. It relies on the primary server configuration knowing the name of the backup process on a different machine.
7. It detects the wrong condition. It checks whether a process is running, not whether the backup process is accepting connections and ready to handshake. The process might be running but broken in some subtler way, which this check will not catch.

I'm sure there are more, but you get the idea. So what is the better solution? How about using the existing socket connection information instead of adding a redundant channel? Report the status to the monitoring application and wait for the backup to start up. An admin can easily diagnose the situation and get the backup process running if needed. You could add an optional timeout that gives up if the backup hasn't started within 20 minutes or so, but that's not really necessary.
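Here is roughly what that could look like, as a sketch in Python rather than our real code. Everything named here is hypothetical: `probe` stands in for the existing socket check, `report` stands in for whatever sends status to the monitoring application, and the status strings are invented for the example.

```python
import time
from typing import Callable, Optional


def wait_for_backup(
    probe: Callable[[], bool],
    report: Callable[[str], None],
    poll_interval_s: float = 5.0,
    timeout_s: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the existing channel until the backup is ready.

    Returns True once probe() succeeds. Returns False only if an optional
    timeout_s is set and has elapsed; with timeout_s=None it waits forever.
    """
    if probe():
        report("BACKUP_READY")  # hypothetical status string
        return True

    start = clock()
    report("WAITING_FOR_BACKUP")  # lets an admin see why startup is paused
    while True:
        if timeout_s is not None and clock() - start >= timeout_s:
            report("BACKUP_TIMEOUT")
            return False
        sleep(poll_interval_s)
        if probe():
            report("BACKUP_READY")
            return True
```

Note that nothing here knows the backup's process name or needs credentials for the other machine. The primary just keeps doing what it already does and tells someone about it.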

There are a couple of key lessons/principles at work here. One is loose coupling, which is almost always a win, generally without introducing much complexity. The other is reuse: don't add new stuff unless you really need it.

Jul 31, 2008
