Saturday, October 25, 2008

Network Problem Tonight

We had a network problem tonight which affected some of our servers and applications.

It was a strange problem. I could get to our applications from home. In fact, eLearning, Skylight and the Wiki farm were all working properly. Our legacy applications, however, showed an error that pointed to an AD authentication failure.

My colleague went into the office and said he could not get to any of our applications at all. So, for a while, we suspected a DNS resolution problem instead. I called central ITS. They checked and confirmed that everything was working properly on their side.

We performed further tests on various parts of the network. Finally we concluded that our firewall cluster was not routing traffic properly. We rebooted the firewalls one by one, and services resumed right away.

Fortunately, eLearning (which is by far our highest usage application), Skylight Matrix Survey System, and the Wiki farm were not affected. But the rest of our legacy applications were down for about 1 hour and 20 minutes. Had it happened during finals week, with students unable to access their materials in their last hours of revising, the impact would have been much more severe.

The firewalls had been performing reliably all along. I had almost come to believe that they were infallible. Nothing is! We need to prepare ourselves better and have a better procedure to diagnose network problems quickly.
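
As a starting point for such a procedure, here is a minimal sketch in Python of a check that separates "the name does not resolve" from "the name resolves but the connection fails", which is the distinction we spent time on tonight. It is only an illustration, not what we actually ran: the hostnames and ports are hypothetical placeholders, and the function names are my own.

```python
import socket


def check_endpoint(host, port, timeout=3.0,
                   resolve=socket.gethostbyname,
                   connect=socket.create_connection):
    """Classify one host:port as 'dns_failure', 'connect_failure', or 'ok'.

    `resolve` and `connect` default to the standard socket functions and
    can be swapped out for testing.
    """
    try:
        address = resolve(host)
    except OSError:
        # Name lookup failed: points toward DNS rather than routing.
        return "dns_failure"
    try:
        conn = connect((address, port), timeout)
    except OSError:
        # Name resolved but no TCP connection: points toward routing,
        # a firewall, or the service itself being down.
        return "connect_failure"
    conn.close()
    return "ok"


def run_checks(endpoints, **kwargs):
    """Run check_endpoint over (host, port) pairs and return a dict of results."""
    return {f"{host}:{port}": check_endpoint(host, port, **kwargs)
            for host, port in endpoints}


if __name__ == "__main__":
    # Hypothetical endpoints; replace with the real application hosts.
    ENDPOINTS = [
        ("elearning.example.edu", 443),
        ("legacy-apps.example.edu", 443),
    ]
    for name, status in run_checks(ENDPOINTS).items():
        print(name, status)
```

Running the same check from more than one vantage point (for example, from home and from inside the office) should help show which segment of the network is at fault.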

Or, looking at it from a different angle, the firewall is just one example of many possible single points of failure. We need to ask ourselves what our operational strategy (or institutional strategy?) should be to prevent single points of failure as we move forward.
