I've been a bit irate about the mail server lately. This weekend has proven to be a further illustration of why mail services must go. The major reason for handling our own email in CIS is "flexibility." The CIS department likes to have flexibility in the management of its services. This is quite appropriate. However, there are certain services where the need for stability outweighs the need for flexibility. We have a few major services that must always work (in this order):
- Authentication
- File services
- Printing
- DHCP
- Backups
- DNS
- Web
There are probably others. The last four are difficult to order. File services, printing, and web are the three services that make CIS separate from CNS. These are core services that we offer, and I don't know that we can maintain any flexibility without offering them. Authentication supports those core features. Everything else could be offered by CNS, in my opinion, without undermining our independence and flexibility.
Email and backups are probably the two services that can be best handled by CNS rather than CIS, in my opinion. Email, in particular, is important at this point because it is the most difficult to support. Email doesn't function correctly if any of the above (except printing and backups) isn't working. Email bounces if authentication stops. Email isn't delivered if file services stop. Certain services can't be found if DHCP fails for very long. Email stops if DNS breaks. Webmail isn't accessible if web breaks.
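To make that chain of dependencies concrete, here's a rough sketch in Python. The `DEPENDS_ON` map and the `affected_by` function are made up for illustration; the map is just my reading of the paragraphs above, not an actual inventory of how our machines are wired together.

```python
# Hypothetical dependency map, based only on the description above.
# Each key is a service; the value is the set of services it needs in order to work.
DEPENDS_ON = {
    "email": {"authentication", "file services", "dhcp", "dns", "web"},
    "file services": {"authentication"},
    "printing": {"authentication"},
    "web": {"authentication"},
}


def affected_by(failed, depends_on=DEPENDS_ON):
    """Return every service that breaks, directly or indirectly, when `failed` goes down."""
    broken = {failed}
    changed = True
    # Keep sweeping until a full pass adds nothing new.
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in broken and deps & broken:
                broken.add(service)
                changed = True
    broken.discard(failed)  # report only the knock-on failures
    return broken


if __name__ == "__main__":
    for service in ["authentication", "dns", "printing", "backups"]:
        print(service, "->", sorted(affected_by(service)))
```

Under these assumptions, email shows up in the fallout of almost every failure, while nothing at all depends on printing or backups.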
Of course, this is still an over-simplification. There are a lot of interdependencies and the "file services" heading contains multiple layers of complexity within itself. Our two most common failures are authentication and parts of file services. These would often go unnoticed, except for the email loss. Most workstations can still function if file services and authentication are a little flaky or slow. Email stops. And when email stops, I get phone calls at home on the weekend.
However, I don't work on the weekends unless there's an emergency. It's one of the boundaries I've set to keep my sanity. I put out fires all week long, so I need some time to unwind and catch up on my honey-do list. Saturday and especially Sunday are generally off time, unless someone calls me. Faculty and staff are always welcome to give me a phone call to ask for help, and I will fix whatever needs fixing.
Of course, the last few bouts of downtime have had nothing to do with other services failing, but with processes on the mail server maxing out the CPU. As a safety precaution, Sendmail stops working when the CPU maxes out (mail processes can get very memory/CPU intensive and can lock up a machine if you're not careful). Anyway, I caught the machine with a load average of 296! If you don't believe me, here's a screenshot.

I believe my sister said it best, "'Nuf sed."
