July 2005 Archives

Okay, for those of you who read my blog (a number that, to my usual astonishment, is above single digits), I thought you might be interested in getting a personal take on the mail migration project I'm working on for the CIS department.

So, back to the beginning. When I first started with CIS I wasn't actually that pleased to be running mail services. Mail is one of those services that has to just work, but also requires more babysitting than would seem necessary. Let me use a diagram to demonstrate why:

Now, this is vastly simplified, but it gets the gist of what happens when a foreign user sends an email to an account in CIS. I start with the first step of the process that involves us (i.e., I don't really care what kind of setup the other guy has ;). Anyway, the first step is for their server to execute a DNS lookup to find the MX (Mail eXchange) record for "cis.ksu.edu", since that's the last part of the email address. You can do the same on any of our Linux boxen (or just about any Unix box, for that matter) via:

$ host -t MX cis.ksu.edu
cis.ksu.edu mail is handled by 10 mustang.cis.ksu.edu.

That is, they contact their local DNS server, which goes through the whole complicated process of finding our DNS server and asking it for the MX record. Our server responds with mustang, which is the "real" name of the machine we call "smtp.cis.ksu.edu." The 10 is the priority of the mail server, so that if we had multiple mail servers we could list them by priority (lower numbers are tried first).
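To make the priority part concrete, here is a rough shell sketch that picks the preferred server out of host-style output. The function name preferred_mx and the second server (backup.example.org) are made up for illustration, since we only publish the one MX record. It assumes the output format shown above and that the lowest number wins.

```shell
# Read `host -t MX`-style lines on stdin and print the preferred server.
# Expected line shape: <domain> mail is handled by <preference> <server>
preferred_mx() {
    awk '/mail is handled by/ { print $(NF-1), $NF }' |
        sort -n |            # lowest preference number first
        head -n 1 |
        awk '{ print $2 }'
}

# Sample input; backup.example.org is a hypothetical second MX record.
printf '%s\n' \
    'cis.ksu.edu mail is handled by 20 backup.example.org.' \
    'cis.ksu.edu mail is handled by 10 mustang.cis.ksu.edu.' |
    preferred_mx

# Against live DNS it would be used as:
#   host -t MX cis.ksu.edu | preferred_mx
```

A real sending server does this selection internally and falls back to the next record only if the preferred one can't be reached.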

Once the foreign host knows the mail server, it initiates a direct connection to that mail server via a communication protocol called SMTP. We use a program called Sendmail to provide a server that communicates this protocol. Among other checks, it makes sure that the email being delivered belongs to a local user. To do this, it contacts our authentication server (through another convoluted process) to see if the user exists and performs delivery if she or he does.
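For the curious, the client half of that SMTP conversation looks roughly like the lines printed by the sketch below. The command names are the standard ones from the SMTP specification; the function name smtp_client_lines and the addresses in the usage line are placeholders, and nothing here actually talks to a server.

```shell
# Print the client side of a minimal SMTP exchange, one command per line.
# A real client waits for the server's numeric reply after each command
# before sending the next; this only shows the order of the commands.
smtp_client_lines() {
    helo_name=$1
    sender=$2
    recipient=$3
    printf 'HELO %s\r\n' "$helo_name"
    printf 'MAIL FROM:<%s>\r\n' "$sender"
    printf 'RCPT TO:<%s>\r\n' "$recipient"   # the local-user check happens here
    printf 'DATA\r\n'
    printf 'Subject: test\r\n\r\nHello.\r\n.\r\n'   # a lone "." ends the message
    printf 'QUIT\r\n'
}

# Placeholder host and addresses, purely for illustration.
smtp_client_lines client.example.org alice@example.org bob@cis.ksu.edu
```

The RCPT TO step is where the server gets its chance to refuse mail for a user who doesn't exist.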

Mail delivery continues when Sendmail starts another program called Procmail. In the typical case, this program runs another program called SpamAssassin to decide whether the message is spam or not. Then, Procmail either finds the user's mailbox (a directory named "Maildir" in the user's home directory) or executes a script named ".procmailrc" and performs the actions requested by that script. Again, I'm simplifying for brevity.

Procmail writes to the file server (using the NFS protocol) to finally deliver your mail to your mailbox. For you to get your mail, you start up your favorite mail program (be it Microsoft Outlook, Mozilla Thunderbird, Netscape Mail, Pine, or Mutt). Your mail program logs into the IMAP server (though Pine and Mutt can read the mail straight out of your home directory, but we'll not cover that case, again, brevity!), which contacts the Active Directory server with your given username and password (did I mention this is convoluted?) to verify your identity. Then the IMAP server uses NFS to read the contents of your mailboxes from the file server.

The problem is that if any one of these systems (or their subsystems) fails, part or all of the process stops. If the DNS server goes down, gets misconfigured when we make configuration changes, or becomes unreachable because we lose network connectivity, email stops being delivered and you probably can't check or send email anymore either.

If the Sendmail server goes down, gets misconfigured (rarely), or the computer the server runs on gets too busy, mail stops being sent and stops being delivered. Other problems can occur when SSL keys expire, get accidentally changed during updates, or your local Anti-Virus program decides to interfere with communication.

If the Microsoft Windows Server running Active Directory goes down, becomes inaccessible, or has a critical update fail, you can't log in and you can't send or check email because none of the machines are able to verify your identity.

If you modify your ".procmailrc" file and bugger it up, you won't have your mail delivered (or, in the case of the mistake I made this week, all your mail could be delivered to the wrong place). Generally, we don't have to mess with Procmail often, so this isn't something we break very often.

SpamAssassin, on the other hand, has proven to be a pretty fragile program. We actually have it set to restart on an extremely frequent basis because it would occasionally crash. It takes a lot of processing power to do the checks it does, so it puts a pretty heavy load on the mail server.

If some part of file services goes down, obviously, nothing works again.

I could also mention PAM, NSS, LDAP libraries, SASL libraries, libc libraries, OpenSSL, various hosts running various services, etc. If one goes down, half the system goes.

We could probably keep things up and running for a long time if we could keep ourselves from touching configuration files and installing updates all the time. However, that's assuming the scriptkiddies and the hackers would leave our boxen alone if we didn't keep them up to date, so we either risk breaking services ourselves in predictable and generally quickly reparable ways, or risk damage that requires complete rebuilds.

Furthermore, our user base is small enough that I don't have staff working in Nichols 24 hours a day. CNS has staff on call 24 hours a day and a night crew that spends part of their time monitoring systems. In fact, I'm really the only employee at this time who understands most of the inner workings of our mail system. It's not that it's extraordinarily complicated as mail systems go (it's actually pretty simple), but even this simple mail system is more complicated than I would expect an hourly student to take the time to understand. CNS has had some notable downtime in the last year or so, but our downtime has been considerably worse.

The benefit of doing this for ourselves is the ability to be flexible. We can set our own policies that best suit our small department. On the other hand, we pay for that flexibility in greater amounts of down time. As such, I'm for the move to CNS despite the fact that there will be a number of inflexible policies that we will have to cope with because it will mean a better overall uptime. It will also mean that I have more time to work on improving the systems that are really more important to what CIS does, even if not as critical (if that makes sense).

This has been a fairly politically charged move as email is one of those things that has to just work. CIS faculty have some particularly difficult needs to accommodate in the way of mail quotas, attachment sizes, pre-filtering, spam protection, etc. CIS faculty are also very picky because most of them are pretty knowledgeable about how these things can and should work. Some of our faculty also have some pretty eccentric work environments configured and aren't interested in relearning to do things in a different environment. I won't argue whether these things are good or bad; that's not my business. But I can say that accommodating all of these disparate desires has been quite a task. (And I haven't even mentioned the students, but then, they come and go so

However, the major obstacles to this move have been overcome. There will still be parts of this that some will not like, including me. However, I am convinced that the benefits outweigh the problems.

I hope this "little" blog helps explain things a bit more clearly. Cheers.

I was reading a Slashdot article today about the online music licensing system the EU is considering. My mind followed a bunny trail onto a completely different subject about independent artists and how the Internet presents a very different opportunity for dissemination of music and such. This is nothing new, but as I thought about it, I had a thought that I had never really addressed in my mind full on.

What can this mean for the world of fame? For my parents, most information about the world at large was delivered through newspapers, magazines, radio, and television. Their sources were limited to (at most) 3 channels of television and then whatever local radio stations there were and probably only a single newspaper. (Though, for my dad, living in Kansas City, I imagine there were some other minor presses to get information from.) Because of this, if ABC, CBS, or NBC wanted to make someone famous, they just put them on TV and voilà, instant celebrity. This has been changing over many years, though. As television got to be cheaper to produce, more stations arose and with the advent of cable, most folks suddenly had 10 to 13 channels to choose from. As the techno-geeks graduated from BBSes to the Internet, we started gathering information from more and more sources.

Now, most newspapers have to publish online just to compete. I can gather news from any number of news sources now. ABC, CBS, and NBC still have quite a bit of leverage to create celebrity, but not anything like they used to. Now, if I don't like any of my local radio stations, I can generally find online ones for free or to pay for. By satellite I can now get hundreds of channels of video or radio. Not only this but the Internet is starting to become a more personal, cozy place. I can now find out much about my friends and organizations I follow through the web and email in a much more direct route than waiting for USPS mailed letters and newsletters. Anyone can publish video, music, and information at almost no cost.

Because the web is organized by connectivity driven through links, the world can flow around the webs of people around you. If someone comes to my web site, they can find a post by one of my friends, which leads them to another of my friends' web sites, and, perhaps, my friend's friend can become my friend because we share interests that we might not have realized before.

Anyway, again, none of this is new stuff, but this led me to realize that as the power of the big central sources of news fades, we could see the end of an era very interested in fame and celebrity. We'll never be rid of the famous or celebrities, but I think we've already seen a waning. In their place, I think a better environment is rising where we don't care so much about the latest celeb and their pathetic lives, but we can become more interested in the lives of our friends and friends' friends. Thus, in a way, this has the effect of making our little worlds larger with respect to the former world which made it seem smaller.

Anyway, it was an interesting thought to me...

One of the most annoying features of CVS and Subversion (but especially Subversion) is the extra directories they create in your working copy of a repository. In the case of CVS, this is nearly tolerable because the directory is named "CVS" and generally only contains three files. Irritating, but not terrible. The Subversion system creates directories named ".svn", each of which holds a half-dozen bookkeeping files plus a pristine copy of every file in the directory.

The annoyance comes in when I execute my favorite commands, find and grep -r. For the Unix-uninitiated, find searches the current directory and all subdirectories for files according to some criteria (generally based on file names, file owners, dates, or file permissions). The grep command performs a textual search for some string of text (actually a regex, but we'll leave that out of it for now). Add the -r option to grep and you have a version that searches the files in all subdirectories as well.

When programming, I use these commands frequently for common tasks. For example, occasionally I'll name a method something and then change the name later. This generally requires that I find every place in my code I used that method and rename the calls. Let's say I renamed "foo()" to "foobar()":

# -e marks what follows as the pattern, so its leading "-" isn't parsed as an option
grep -r -e "->foo(" lib

Now, I can find each file and have vim search and replace every occurrence or more carefully edit the files (as sometimes I have to be careful because another method named "foo()" might need to stay as is).

There are similar scenarios when I need to use find such as trying to locate all my library files to document or whatever.
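A minimal sketch of that kind of find search, run against a throwaway directory so it can be tried anywhere. The lib/ layout and the ".pm" extension are assumptions for the sake of the example, not a description of my actual code.

```shell
# Build a scratch tree to search (all names here are made up).
demo=$(mktemp -d)
mkdir -p "$demo/lib/Foo"
touch "$demo/lib/Foo.pm" "$demo/lib/Foo/Bar.pm" "$demo/lib/README"

# List every regular file under lib/ whose name ends in .pm.
# LC_ALL=C keeps the sort order the same regardless of locale.
modules=$(cd "$demo" && find lib -type f -name '*.pm' | LC_ALL=C sort)
printf '%s\n' "$modules"

# Clean up the scratch tree.
rm -rf "$demo"
```

The README is skipped because it doesn't match the name pattern, and the same command with other tests (-newer, -user, and so on) covers the other criteria mentioned earlier.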

Anyway, the problem with using the above command or find is that I end up with every grep match twice and every file from find twice, plus Subversion's own bookkeeping files. I really would like to pretend the files in the ".svn" directories just didn't exist. Thus, I created some quick commands to do just this in my shell today.

Initially, I did the quickest and dirtiest thing and added the following to my bash resource file (".bashrc"), which is read whenever I start up a terminal:

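# find from the current directory, skipping anything inside .svn directories
sfind() { find . -name .svn -prune -o "$@" -print; }
# grep as usual, then drop any result line that mentions a .svn path
sgrep() { grep "$@" | grep -v "/\.svn/"; }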

These work pretty well. Now I can run "sfind" with all the regular "find" arguments and have it search the current directory, or "sgrep" with all the regular "grep" arguments. Either way, the results (for most queries) leave out anything under ".svn".
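Here is a quick way to see the difference, using a scratch directory that imitates a working copy. The two helpers are repeated so the block runs on its own, and the ".svn/text-base" layout is only an approximation of what Subversion writes; the file names and contents are made up.

```shell
# The two helpers from above, repeated so this block is self-contained.
sfind() { find . -name .svn -prune -o "$@" -print; }
sgrep() { grep "$@" | grep -v "/\.svn/"; }

# Scratch working copy: one real file plus a fake pristine copy under .svn.
demo=$(mktemp -d)
mkdir -p "$demo/lib/.svn/text-base"
echo 'call_foo();' > "$demo/lib/Foo.pm"
cp "$demo/lib/Foo.pm" "$demo/lib/.svn/text-base/Foo.pm.svn-base"

cd "$demo" || exit 1
plain_hits=$(grep -r "call_foo" lib | wc -l | tr -d ' ')    # both copies match
sgrep_hits=$(sgrep -r "call_foo" lib | wc -l | tr -d ' ')   # .svn copy filtered out
sfind_files=$(sfind -type f)                                # .svn never descended into
echo "grep: $plain_hits  sgrep: $sgrep_hits  sfind: $sfind_files"
cd - >/dev/null && rm -rf "$demo"
```

Note that sgrep filters on the text of each result line, so a match whose own content happens to contain "/.svn/" would be dropped too, which is part of why I say "for most queries".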

However, sfind isn't quite as nifty as I'd like because it would be nice to be able to use a directory as the first argument to specify a different directory to search. I prefer the Linux-style find that doesn't require the first argument (i.e. it will use "." (current directory) if no directory is given), and handling this in the function also gives me the Linux-like behavior when I'm on my Mac or another system whose find doesn't implement it. I augmented my first revision of "sfind" as follows:

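sfind() {
   # Quote $1: unquoted, "[ -e ]" with no arguments is true and DIR ends up empty
   if [ -e "$1" ]; then DIR=$1; shift; else DIR="."; fi
   # Prune .svn directories; print whatever else matches the remaining arguments
   find "$DIR" -name .svn -prune -o "$@" -print
}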

Now, I can use a directory as the first argument or not. If not, then it uses the current directory as the default. This, of course, would fall apart if I had a directory named the same as some option I might use ("-name", for example). However, naming directories starting with dashes isn't something I think I've ever done, so I'm not too worried.
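One more wrinkle worth a sketch: if the arguments themselves contain "-o", find's precedence rules attach the trailing "-print" only to the last test, so earlier alternatives match silently and print nothing. The variant below (sfind_grouped is a hypothetical name, not something in my ".bashrc") wraps the caller's expression in parentheses to avoid that. The scratch files are made up for the demonstration.

```shell
# Variant of sfind that groups the caller's expression with \( ... \)
# so an -o inside it cannot detach the final -print.
sfind_grouped() {
    dir=.
    if [ -e "$1" ]; then dir=$1; shift; fi
    if [ "$#" -gt 0 ]; then
        find "$dir" -name .svn -prune -o \( "$@" \) -print
    else
        # No expression given: empty parentheses would be an error.
        find "$dir" -name .svn -prune -o -print
    fi
}

# Scratch tree to try it on.
demo=$(mktemp -d)
mkdir -p "$demo/lib/.svn/text-base"
touch "$demo/lib/a.pm" "$demo/lib/b.pl" "$demo/lib/.svn/text-base/a.pm.svn-base"

cd "$demo" || exit 1
found=$(sfind_grouped lib -name '*.pm' -o -name '*.pl' | LC_ALL=C sort)
printf '%s\n' "$found"
cd - >/dev/null && rm -rf "$demo"
```

With the grouping in place both the ".pm" and the ".pl" file are listed; without it, only the last alternative would be printed.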

Hope this is interesting to y'all. Cheers.

About this Archive

This page is an archive of entries from July 2005 listed from newest to oldest.
