25% of disk used

At $work, we’ve been running very tight on disk space for a few months now. So tight that we’ve had to regularly drop data out of the tables and store it offline, in the hope that we can restore it later. Luckily it’s only mail transaction logs, but they’re still fairly important. I campaigned via my director to get something decent and scalable in place – perhaps an IBM SAN component. It scales to multiple terabytes, supports multiple enclosures, hot-swaps everything and has a fair chunk of redundancy.

It finally arrived on Friday – couriered straight from Germany by a courier company called HopHop. That had me laughing. Good thing I understand some basic (VERY basic) German, because the courier didn’t seem to speak English! The box was over 5 feet long, 3 feet wide and 2 feet deep – all this for a 4U rack-mountable array? Turns out it was a bit of a Russian-doll box, containing 10 separate hard drive boxes, 4 QLogic FC HBAs, 2 GBICs, 2 fibre patch cables, a partridge in a pear tree, and plenty of documentation. Probably weighed in at 80 kilos all told.

15 minutes later, one of our loaner Dell PowerEdge 2850 servers came back to the roost, so I even had a server to test the new toy with. A few bits of office furniture also chose to show up on Friday, so the office was a bit chaotic for a while. The rest of Friday is a bit of a blur, mostly from the disk array not behaving itself (or me doing things wrong, though I was following the instructions). It was sort of working, but the agent software kept causing errors in the SCSI driver under Linux, and the HBA BIOS tools complained of SCSI errors too. It should have tipped me off that the HBA tools saw more than 10 hard drives inside the array, but hindsight is always 20/20 or better.

Skip over Saturday – I spent a few hours in Birmingham at the Home Show. Wasn’t bad, but wasn’t the most useful.

Sunday morning, I was at work for 7 A.M., full of energy and ready to conquer the disk array. Bad move: it conquered me for the next few hours. After spending ~4 hours getting an installation of Windows 2000 loaded properly (primarily rebooting, because each install of the IBM management software or other patches requires a reboot – and each reboot took 5–10 minutes due to driver issues), and getting all the drivers sorted out, we were able to talk to the disk array. In-band management worked once we found out that Windows had loaded the wrong version of the HBA driver – well, the fault isn’t really Windows’s: the driver CD for the HBA was outdated and wouldn’t work with the software that shipped with the unit. Must complain to IBM about that.

Once we proved we could do in-band and out-of-band management on the array controllers, we turned back to installing Linux (20 minutes!) and I went browsing the net. While the IBM install docs are good, they don’t hold a candle to the Redbooks some days. Redbooks are written by people who have used the kit, have tried to do things with it, and have already hit the same stumbling blocks. Lo and behold, paraphrasing: ‘Linux does not support RDAC or the SM agent for in-band management.’ What? Why the hell does the installation documentation tell me to install them, then?!

Out-of-band management, here we come. Everything works. I even got the firmware updated so that the storage manager on the unit was at the same level as the manager on our management stations – which enabled the storage-partitions licence (or at least the ability to add the licence), tidied up the interface and hid a few things.

I went home somewhere around 7 P.M. on Sunday. I was back at work by 7:30 A.M. on Monday, doing final array tests, speed tests, hot-spare tests, ‘oops, a controller just failed’ tests and more. It all worked. We planned for installation that evening – it can’t take long, there’s only 70 gig of SQL tables to be copied and drivers to be installed on 2 servers. Shouldn’t have said that, really: we didn’t leave Manchester until 2 A.M., and we started on time! In the end, the real-world performance numbers weren’t as good as my tests. I think the aacraid driver under the 2.4-series kernels is a bit slow, because that’s the only real explanation I have for a pair of mirrored 10K RPM SCSI disks offloading data at 5 MB/s – my tests indicated that the physical drives could push 30+ MB/s. So a ~1 hour data copy took almost 3 hours.
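For the curious, here’s a rough back-of-envelope sketch of where those estimates come from, assuming a flat sequential transfer rate across the full 70 GB (the real copy obviously wasn’t that uniform, and the 5 MB/s figure is just what I saw coming off the mirrored pair):

    # Rough copy-time estimates for ~70 GB of SQL tables at a flat
    # sustained transfer rate. A flat-rate sequential copy is an
    # assumption; the real job had other overheads on top.
    data_mb = 70 * 1024

    for label, rate_mb_s in [("expected (drives benchmarked at ~30 MB/s)", 30),
                             ("observed (~5 MB/s off the mirrored pair)", 5)]:
        hours = data_mb / rate_mb_s / 3600
        print(f"{label}: ~{hours:.1f} hours")

    # Prints roughly 0.7 hours versus 4.0 hours - in the same ballpark as
    # the planned ~1 hour and the almost-3-hours it actually took.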

Hosting facilities (well, this one) aren’t the most pleasant of places to spend a weekend. Dry air, lots of noise, no bathroom without calling the porter, no food allowed. Dirty air too – after 2 hours my skin felt horribly grimy. I did get to look at the nice cluster of xSeries servers in 4 racks with an attached SAN though, and at the set of interconnected SGI servers.

And as for the title of this post? When the copy was finished, we went from 100% capacity used (400 MB to spare!) to 25% used. It should last 6 months before I need to request more drives for the array – and at least that’ll be easier to do. Go to Manchester, insert the drives in the array enclosure, drive back to work, and manage the enclosure remotely.
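A quick sanity check on what that 25% implies, assuming the ~70 GB of copied tables makes up essentially all of the space now in use on the new volume (an assumption on my part):

    # Implied usable capacity of the new volume, assuming the ~70 GB of
    # copied tables is essentially all of the 25% now in use.
    used_gb = 70
    used_fraction = 0.25
    total_gb = used_gb / used_fraction      # ~280 GB usable
    headroom_gb = total_gb - used_gb        # ~210 GB of breathing room
    print(f"usable: ~{total_gb:.0f} GB, headroom: ~{headroom_gb:.0f} GB")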