I've just about had it with Microsoft Active Directory, and the latest problem isn't even their fault. The saga that has been the CIS authentication system hasn't been a very fun one. However, in case someone else out there is fighting the same problems, let me elaborate.
When I started in as a systems administrator we had two parallel authentication systems: a Windows/Linux system and a Solaris one. The Windows/Linux one used Active Directory as the user database. Windows clients used the internal Windows authentication voodoo (mostly Kerberos) to do their thing. The Linux boxen used the SMB protocol to authenticate users. The latter had one major problem: case-insensitive passwords (FooBar == foobar == FOOBAR == fOoBaR). It's a "feature" of the LANManager protocol dating back to Windows for Workgroups 3.11.
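For anyone wondering where the case-insensitivity comes from, here is a minimal sketch of the preprocessing LAN Manager applies to a password before hashing it. The function name `lm_normalize` is made up for illustration, the ASCII encoding is a simplification (the real protocol uses the OEM code page), and the DES step that follows is left out entirely.

```python
from typing import Tuple


def lm_normalize(password: str) -> Tuple[bytes, bytes]:
    """Return the two 7-byte halves that LAN Manager hashes separately.

    Illustrative only: the real protocol goes on to use each half as a
    DES key, which is omitted here.
    """
    # LM upper-cases the password first, which is where case is lost.
    upper = password.upper().encode("ascii", errors="replace")
    # It then truncates or NUL-pads the result to exactly 14 bytes...
    padded = upper[:14].ljust(14, b"\x00")
    # ...and splits it into two independent 7-byte halves.
    return padded[:7], padded[7:]


if __name__ == "__main__":
    for candidate in ("FooBar", "foobar", "FOOBAR", "fOoBaR"):
        print(candidate, lm_normalize(candidate))
```

All four spellings produce the same pair of halves, so whatever is computed from them afterwards cannot tell them apart.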
The Solaris authentication system was the real nightmare. Originally, we'd tried to make the Sun Solaris 9 boxen use Active Directory via LDAP or SMB to authenticate. Problem: Solaris 9 will not allow you to log in to CDE (their front-end to X-Windows) without having an entry in /etc/passwd. (Barf.) That's actually a simplification; it'll work if you use Sun's own alternative to Active Directory, either their directory server or NIS+. Right or wrong, my predecessors decided that having two parallel authentication systems was superior to implementing NIS. (I'm told NIS is a nightmare, but I have zippo experience with it to know one way or the other.) At this point, nightmares of holocaust are starting to sound better than the holocaust I've been experiencing.
Anyway, the solution was that a central server would keep a master copy of /etc/passwd, /etc/group, and /etc/shadow. Then, every hour, that master database would be munged (based on access permissions of different machines) and distributed to all the other Solaris boxen. This means that any changes to account information had to be made in both Active Directory and the master files. This, inevitably, led to inconsistencies between accounts on Solaris and Active Directory.
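To give a feel for the hourly munging step, here is a rough sketch of filtering a master passwd file down to the accounts each machine is allowed to have. This is not the original script (which I'm not reproducing here); the host names, user names, and the `host_access` mapping are all hypothetical, and the real thing also had to handle /etc/group and /etc/shadow.

```python
from typing import Dict, Iterable, List, Set


def filter_passwd(master_lines: Iterable[str], allowed_users: Set[str]) -> List[str]:
    """Keep only the passwd entries whose username is in allowed_users."""
    kept = []
    for line in master_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        username = line.split(":", 1)[0]  # first colon-separated field
        if username in allowed_users:
            kept.append(line)
    return kept


def build_host_files(master_lines: Iterable[str],
                     host_access: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Produce one filtered passwd file (as a list of lines) per host."""
    master = list(master_lines)  # read once, reuse for every host
    return {host: filter_passwd(master, users)
            for host, users in host_access.items()}


if __name__ == "__main__":
    # Hypothetical sample data, purely for illustration.
    master = [
        "root:x:0:0:root:/root:/bin/sh",
        "alice:x:1001:100:Alice:/home/alice:/bin/bash",
        "bob:x:1002:100:Bob:/home/bob:/bin/bash",
    ]
    access = {"sun1": {"root", "alice"}, "sun2": {"root", "bob"}}
    for host, lines in build_host_files(master, access).items():
        print(host, lines)
```

Note that nothing in a scheme like this looks at Active Directory at all, which is exactly why the two sides drift apart.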
To further aggravate the problem, Microsoft Windows Server 2003, which we started upgrading to around the time I joined the staff as a student in October 2003, increased security on its servers. This is a good thing, but it eliminated our ability to run an SSH server on one of our 2003 servers. We created, modified, and deleted accounts in both systems with shell scripts on Solaris, which in turn ran VB scripts on Windows 2000. With the move to 2003, this was becoming infeasible, and the presence of 2003 on our network was destabilizing the abilities of the Windows 2000 domain controller. (Actually, it was probably the other way around, as the Windows 2000 DC had a nasty habit of claiming ultimate authority over the 2003 boxen, so probably the 2003 boxen were doing all they could to keep Mr. Clinton from taking over. An apropos 2000 server name, I think.)
Okay, so all of this put together led me to complete my Master's Degree with fixing this mess as my project. My solution replaced the slowly dying scripts with a new system built of independent agents that each performed the tasks of maintaining the accounting systems. In the process, we decided to just retire the Suns rather than try to get them to work (especially since we could keep the Sun servers, as SSH and other services function without the need for entries in /etc/passwd). Once all the Sun boxen were replaced with Linux boxen, I switched everything over to LDAP over TLS to perform authentication, authorization, and account profiling. My agents then created the accounts, and the whole thing is, for the most part, working fine.
Unfortunately, Microsoft has a problem. Using LDAP via TLS appears to cause serious problems with the LSASS module. After switching the Linux fleet (and the remaining Solaris fleet) to the new setup, Jefferson, our only Windows domain controller at the time, started crashing, frequently. This was a mess, so we hurried and grabbed a machine that had been slated for another purpose and made it a second DC to act as the new primary authentication server. Mostly, this just moved the crashes to the new server, Ford. Furthermore, sometimes both would still bite the dust if Ford went down first and the same problem then hit Jefferson before Ford recovered. This combo wreaked all sorts of havoc: users couldn't log in during these times, Linux users already logged in couldn't do some things because of the infamous "You don't exist, go away!" error, and the biggest problem was that any email sent to our mail exchanger during these times bounced, "User unknown." That's a bad message. It tells the sender something like, "yep, we fired him."
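The reason "User unknown" hurts so much is that it is a permanent SMTP failure (a 5xx reply), so the sending server gives up and bounces, whereas a temporary failure (a 4xx reply) tells it to retry later. Here is a small sketch of that distinction. It is an illustration only, not a description of our actual mail exchanger configuration; `DirectoryUnavailable`, `recipient_reply`, and the lookup callables are made-up names.

```python
from typing import Callable, Tuple


class DirectoryUnavailable(Exception):
    """Hypothetical error: the user directory could not be reached."""


def recipient_reply(user: str, lookup: Callable[[str], bool]) -> Tuple[int, str]:
    """Choose an SMTP reply code for a recipient check.

    `lookup` returns True/False for known/unknown users and raises
    DirectoryUnavailable when the directory itself is down.
    """
    try:
        exists = lookup(user)
    except DirectoryUnavailable:
        # Temporary failure: the sender should queue the message and retry.
        return 451, "Requested action aborted: local error in processing"
    if exists:
        return 250, "OK"
    # Permanent failure: the sender bounces the message immediately.
    return 550, "User unknown"


if __name__ == "__main__":
    def directory_down(user: str) -> bool:
        raise DirectoryUnavailable("domain controllers unreachable")

    print(recipient_reply("someone", lambda u: True))
    print(recipient_reply("nobody", lambda u: False))
    print(recipient_reply("someone", directory_down))
```

The point is simply that "I can't reach the directory" and "the directory says no such user" deserve different answers.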
A solution was needed. So, we moved ahead to try and fix this problem by switching authentication to use Kerberos instead of LDAP over TLS. I figured that this solution had a better chance of succeeding because, technically, Kerberos is the protocol Windows uses internally and it follows a standard, RFC 1510. Being an IETF standard and one set in pretty solid language, it would be a difficult one to "embrace and extend," and doing so probably isn't in Microsoft's best interest anyway. So, we switched to that. And, like magic, the Ford/Jefferson cycle of death has ceased.
However, a new problem surfaced. At first, it was pretty mild, but it grew steadily until yesterday, Thursday, January 27, 2005, when it became pretty unbearable. Every dozen or so times a person authenticated, authentication would fail and they would be asked for their password again. This became more frequent under heavy loads.
Having had loads of network problems and especially NFS issues, I thought it might just be that the communication load between the IMAP server and NFS server was too heavy. About this time, we discovered that the mail server was only on a 100 Mbit link (actually, worse, it was connected to a 4-port 100 Mbit switch shared with three other computers!). I don't know who in the ancient past decided that was acceptable, but as Garfield would say, "that person should be drug out into the street and shot." It was not a good idea.
However, fixing this didn't solve the problem. Actually, following the reboot we performed after fixing this, it seemed to get much worse. It wouldn't let anybody in for nearly an hour yesterday. And yet, there were no problems in the logs except a TEMPFAIL from Courier authlib, and the Event Logs on Jefferson showed "Pre-authentication failure" for each user.
After several hours of staring at logs, searching Google, etc., I discovered that "pam_krb5" (the Linux Kerberos authentication module) is explicitly listed as not supported via the authpam module for the Courier IMAP server. DOH!
So, after all that work, I must, ironically, switch the mail server back to LDAP over TLS. It appears to be working to this point, but I've learned not to hold my breath.
Where do I go next? I see two possibilities. The reason I must use LDAP over TLS and not just plain LDAP is that I'm not stupid enough to make all my users pass their passwords across the network in plain text. Sending passwords out in cipher text is still a minor risk, but it's a far cry better than shouting, "Sniff my passwords in plain text!" Thus, I either need to authenticate without passing the passwords around (Kerberos) or I need to find another encrypted authentication protocol that Microsoft supports. Those are my two solutions: Kerberos and NIS.
But wait, I already tried Kerberos. Well, yes. However, there is another Kerberos solution. Instead of using Kerberos directly and explicitly, I can use LDAP without TLS but with Kerberos. That is, LDAP has three modes of authentication: Simple, Kerberos, and SASL. Simple requires sending the password across the link. This is what we've done already. Kerberos isn't currently supported, as far as I can tell, with our Gentoo Linux distro. However, SASL authentication can be made to use Kerberos under the hood via its GSSAPI authenticator.
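The reasoning about which combinations put a password on the wire can be written down as a tiny model. This is only a sketch of the argument above, not real LDAP client code; `BindMode` and `password_exposed` are names I made up, and the old LDAP Kerberos bind is left out since we can't use it anyway.

```python
from enum import Enum


class BindMode(Enum):
    SIMPLE = "simple"            # the password itself is sent in the bind
    SASL_GSSAPI = "sasl-gssapi"  # Kerberos tickets prove identity instead


def password_exposed(mode: BindMode, tls: bool) -> bool:
    """True if a sniffer on the network could read the user's password."""
    if mode is BindMode.SIMPLE:
        # A simple bind carries the password, so only TLS hides it.
        return not tls
    # With SASL/GSSAPI no password is sent to the LDAP server at all,
    # so the answer does not depend on TLS.
    return False


if __name__ == "__main__":
    for mode in BindMode:
        for tls in (False, True):
            print(mode.value, "tls" if tls else "no tls",
                  "EXPOSED" if password_exposed(mode, tls) else "ok")
```

Only one of the four combinations, a simple bind without TLS, exposes the password, which is why dropping TLS is only on the table once the bind itself goes through GSSAPI.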
Thus, I am now going to try to set up LDAP via SASL via GSSAPI/Kerberos to perform authentication. Maybe this will work. Maybe not. If not, I'm going to give the Microsoft NIS server a try. Nightmare or not, this dream is already bad enough, so why not?
Not trusting at this point that anything will work, I have one more fallback: I can set up an independent KDC on Linux to perform authentication, then make Windows trust that KDC and use it as the Windows authentication system. If I can't get this to work, the only solutions left require spending lots of money (explicitly this time, rather than in the form of my time). I can either beg Microsoft to come look at our problem and fix it or I can decide Microsoft is full of it and replace it altogether with something else...that something else would probably have to be Novell NetWare. I'd really rather it be NetWare anyway, but I don't really want to run a NetWare server without training, it would cost us a lot more since Microsoft practically gives us their software, and the changes required to make the switch make my head hurt.
I already work too much and already feel like I've wasted half my life fixing this problem. As I like to say during times like this: "I hate computers." Maybe I should accelerate my plans to go into seminary because problems like this will eventually eat me alive.
