Sunday, 31 August 2008

Building the terminator

Before we go any further, please take a few minutes to read this blog from Google

http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

We are all trying to build best-in-class systems that take huge throughputs without complaining, do everything that we ask of them, give a unified view of the whole system using BPM and all the latest technologies, and scale superbly as we add more resources. We have the tools to make this possible and the resources to execute these implementations. In fact, we have succeeded in achieving a lot.

But all this is based on a program – a set of instructions, if-else blocks or loops that have been written by someone, if not us, to provide the tools for our grand architecture. What if these tools are flawed? And what if we don’t realize it till it’s very late?

I was introduced to programming in my seventh class by my cousin, who was a C expert. I learnt C at that age and tried a few things. I used to badger him with questions and get answers from him. One of the first BIG problems he told me about was the Y2K problem – how using 2 digits for the year had created a mess. Countless programmers around the world did a lot of work to avoid a catastrophe that could have happened as we stepped from 1999 to 2000. It was a great effort that succeeded. I used to read all that I could lay my hands on about this problem. Later, as I went through engineering, I realized how these small and seemingly insignificant things can bring down big systems. I blew up ICs and motors in the lab by doing something that seemed perfect to me. I was corrected almost instantly when I saw the IC blow or the lab fuse burn out. I learnt. I watch the Discovery Channel and see how sometimes big buildings come down because of a small flaw in a steel bar used on some floor. The thing here is – systems are prone to failure, and sooner or later they will face the boundary condition that fails them; hopefully very, very late, maybe never.
The big question now is – did we belt and brace our system for that boundary condition?

In the IT world, we have moved past the Y2K bug that was the BIG thing and are moving towards the NEXT BIG things like Portals, SOA, Grid computing, million-message-per-hour systems and all the high-speed and high-availability designs. We use tools and sometimes program our own tools. We do thousands of mappings and transformations or write thousands of lines of code. But do we belt and brace our programs? Or, for that matter, our XML or mappings or transformations?

In October 2007 I remember doing a release for our client. It was a CR and we did it at double speed. It was released over a weekend and we came in on Monday to take over the monitoring from the guys in the US. I was still sleepy, but I noticed that something was wrong. Digging around, I found a few hundred orders not processed in the system. The reason – some field that we mapped was null, it threw an exception, and so a few hundred orders got stalled. It took us an hour to wake the person responsible and get him to do the magic fix. It took me two hours to explain to my lead why that silly mistake happened, and he was not happy. I don’t blame him. And I did not try to explain my way out of the situation.

This was a small case. Think of a messaging layer in a bank handling millions of messages a day: what will happen if the variable used to hold the sequence number overflows? We could use a long, but where a long is only 32 bits that too will overflow in a few years, or sooner if the message traffic really hits the fan.
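
To put rough numbers on this, here is a small back-of-the-envelope sketch in Python. The traffic figure of 2 million messages a day is only an assumed example, and the function name is mine. It estimates how long a signed sequence counter of a given width lasts before it hits its maximum value.

```python
# Rough arithmetic: how long until a signed counter of a given width overflows?
# The traffic figure below is an assumed example, not a measurement.

def days_until_overflow(bits: int, messages_per_day: int) -> int:
    """Whole days before a signed counter of `bits` width, starting at 0, reaches its maximum."""
    max_value = 2 ** (bits - 1) - 1
    return max_value // messages_per_day


MESSAGES_PER_DAY = 2_000_000  # assumed: "millions of messages in a day"

days_32 = days_until_overflow(32, MESSAGES_PER_DAY)
days_64 = days_until_overflow(64, MESSAGES_PER_DAY)

print(f"32-bit counter: about {days_32} days ({days_32 / 365:.1f} years)")
print(f"64-bit counter: about {days_64 / 365:.2e} years")
```

Under these assumptions a 32-bit counter lasts only a few years, while a 64-bit one is effectively safe at this rate. Either way, the point is to do this arithmetic before the system goes live, not after.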

We need to build airtight, fail-proof systems that will last longer than our lives. As engineers, that is what we have all been taught to do. We can do that by doing simple things that secure the lowest components of the system – the code level if we are coding, or the mapping, XML or transformation level if we are using a tool.

This is just a small list of possible things:

Implement the simplest possible solution to the problem: All of us, at some point, are tempted to use all our knowledge to solve a problem. I once received a mail that showed about 10 ways to print “hello world”, going from the obvious – print “hello world!” – to a cool solution using pointers. The simplest solution would have sufficed. Solving a problem does not have to satisfy the geek in us; it's fine as long as the problem is solved without getting complicated.

Documentation:
It is often said that developers never comment their code. That's not quite correct: the good ones do. We don't comment our XML either. But at the end of the day, this is what will save us. Imagine trying to understand and fix a piece of code with a gun to your head and the clock ticking away, when you can't tell whether the code does this or that. We should document every piece of the solution that we implement and keep the documentation as close to the code as possible, not in some source control on the other side of the world. Document the code, the XML and the mapping (if you can). When there is a gun to your head, this is what will help you understand what you are doing or what you are supposed to do.


Defensive coding:
See the code below:

if (a < 5)
    printf("%d\n", a);

Let’s say someone decides that after printing “a” you need to say hello, just to be courteous. You are not the person making the change, and this is what he does:

if (a < 5)
    printf("%d\n", a);
    printf("hello!\n");  /* despite the indentation, this is outside the if */

As you can see, it’s broken now: “hello!” is printed whatever the value of a. The correct code is

if (a < 5) {
    printf("%d\n", a);
    printf("hello!\n");
}

This may seem trivial, but it is exactly what we need to avoid: had the original used braces, the change would have been safe. Also, we should validate all the parameters that are passed into any method – in fact, don't use any input without validating it first. The list goes on; the bottom line is – check stuff and play safe.
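
As an illustration of validating before use, here is a minimal sketch in Python. The order structure and the field names (order_id, quantity) are hypothetical, made up for the example. The idea is to reject a bad record at the door, with a message that names every problem, instead of letting a null field blow up somewhere deep inside a mapping.

```python
REQUIRED_FIELDS = ("order_id", "quantity")  # hypothetical field names


def validate_order(order):
    """Return the order unchanged if it is clean; raise ValueError listing every problem otherwise."""
    if not isinstance(order, dict):
        raise ValueError(f"order must be a dict, got {type(order).__name__}")

    problems = []
    for field in REQUIRED_FIELDS:
        if order.get(field) is None:
            problems.append(f"missing or null field: {field}")

    quantity = order.get("quantity")
    if quantity is not None:
        # bool is a subclass of int in Python, so exclude it explicitly
        is_whole_number = isinstance(quantity, int) and not isinstance(quantity, bool)
        if not is_whole_number or quantity <= 0:
            problems.append(f"quantity must be a positive integer, got {quantity!r}")

    if problems:
        raise ValueError("; ".join(problems))
    return order


good = validate_order({"order_id": "A1", "quantity": 2})
print(good)
```

Collecting all the problems before raising is a design choice: whoever gets woken up to fix the data sees the whole picture in one message.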

Error handling and Exception handling: I once met a person who told me that once in a while their application got stuck and would correct itself if restarted. It seemed strange because we did not see any exceptions; however, on investigating the code we found that an exception was being caught and not reported! Catching errors in a program or a workflow is one thing; taking the correct remedial action or bubbling them back up to the top level is another. It might not always be correct to ignore them or to go on retrying forever. Sometimes the wisest thing is to just throw the error back up and stall any processing that might be going on.
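
The difference between swallowing an exception and reporting it can be sketched as below. This is an illustrative Python example, not the code from that application; the function names, the logger name and the "amount" field are all made up.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def process(order):
    """Stand-in for real work; fails with TypeError when the amount is null."""
    return order["amount"] * 2


def process_silently(order):
    # Anti-pattern: the failure is caught and never reported,
    # so the caller cannot tell "no result" from "it broke".
    try:
        return process(order)
    except Exception:
        return None


def process_loudly(order):
    # Report the failure with context, then bubble it back up
    # so the top level can decide whether to stall or carry on.
    try:
        return process(order)
    except Exception:
        logger.exception("failed to process order %r", order)
        raise


print(process_silently({"amount": None}))  # the failure is invisible here
print(process_loudly({"amount": 21}))
```

With the second version, the stuck application would at least have left a stack trace in the log for someone to find.
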
Testing: We all test to check that the damn thing works. But do we test to see what happens if it comes across something unexpected? Do we test for all the invalid inputs? Do we test what happens when the connection or something else fails and we need to retry? If we test all these, we can be far more confident that we have eliminated most of the bugs.
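
A sketch of what testing the unexpected might look like, using Python's standard unittest module. The function under test, parse_quantity, is a hypothetical example; the point is that the invalid inputs get at least as much attention as the happy path.

```python
import unittest


def parse_quantity(text):
    """Parse a positive whole number from text (hypothetical function under test)."""
    if not isinstance(text, str):
        raise TypeError("quantity must be given as text")
    value = int(text.strip())  # raises ValueError for non-numeric text
    if value <= 0:
        raise ValueError("quantity must be positive")
    return value


class ParseQuantityTests(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(parse_quantity(" 12 "), 12)

    def test_invalid_inputs_are_rejected(self):
        for bad in ["", "abc", "-3", "0", "1.5"]:
            with self.subTest(bad=bad):
                with self.assertRaises(ValueError):
                    parse_quantity(bad)

    def test_wrong_type_is_rejected(self):
        with self.assertRaises(TypeError):
            parse_quantity(None)


# Run the tests in-process instead of calling unittest.main(), which would exit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseQuantityTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```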

Will to live: We all love Arnold in The Terminator, how he keeps going on and on and never fails. That's the will to live, and the systems we build should have it. A connection gets closed unexpectedly, a disk gets unmounted, a LAN cord gets pulled, something else goes wrong: no matter what the situation, there should be a retry and recovery strategy. And if the system cannot recover, it should take a break, wait and then try again, and keep trying until it recovers, be it a week or a month later.
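
One common way to "take a break, wait and then try again" is a retry loop whose wait grows after each failure, up to a cap. The Python sketch below is just one possible shape for it; the function and parameter names are mine, and the flaky operation is a stand-in for a real connection.

```python
import time


def call_with_retry(operation, max_attempts=None, base_delay=1.0,
                    max_delay=300.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the wait after each failure.

    max_attempts=None means keep trying forever. Only OSError (which includes
    ConnectionError) is retried; anything else is raised straight away.
    """
    attempt = 0
    delay = base_delay
    while True:
        attempt += 1
        try:
            return operation()
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and bubble the failure to the caller
            sleep(delay)
            delay = min(max_delay, delay * 2)  # back off, but never beyond the cap


# Usage: an operation that fails twice and then recovers.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("link down")
    return "connected"


waits = []  # record the waits instead of really sleeping
outcome = call_with_retry(flaky, base_delay=1.0, sleep=waits.append)
print(outcome, waits)
```

Retrying only known-recoverable errors is deliberate: a bug such as a null field will not fix itself with time, so it is better thrown back up than retried forever.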


This is just a small list of things that we can do and this needs to be supplemented by reviews, good practices and processes and by reuse of things we know work perfectly well.

At the end of the day it's for us to decide whether we want to build a terminator-style system that outlasts us and runs for years to come, or build another Y2K problem that will blow up sometime in the future.