OpenSolaris: Ready for Prime-Time?

Introduction

I've been running an OpenSolaris server for about a year now, in a live "production" environment hosting approximately 8 TB of home/family/personal data. (Mostly photo and video data, which roughly doubles in size each year. In other words, each year we generate at least as much photo and video data as in all of our digital lives before that year--a fairly common phenomenon for households and businesses.) Currently I am running the most recent "stable" release: build snv_111b (version 2009.06).

Primer: OpenSolaris is the first (and currently only) open-source version of true UNIX (as defined by the Open Group, which owns the "UNIX" trademark). It was derived from Solaris (which is still closed-source and sold commercially), a branch of UNIX owned by the former Sun Microsystems (recently purchased by Oracle). As such, it is a "POSIX"-like operating system, or at least it is supposed to be. Solaris and OpenSolaris are marketed primarily as server operating systems, and secondarily as desktop operating systems.

In a nutshell

  • The ZFS file system is amazing, and almost makes up for everything wrong with OpenSolaris.
  • OpenSolaris is absolutely not ready for production server use, much less "mission-critical" production.
  • OpenSolaris requires way too much manual intervention, and too many workarounds, hacks, and intensive research on-line to keep running. If you insist on trying it, just be prepared to pull a significant amount of hair out.
  • OpenSolaris as a hassle-free, productive desktop OS? Not gonna happen! As a notebook OS? Crazier still. Both are actually advertised usage scenarios. No way, Hose A.
  • I do realize and appreciate that OpenSolaris is free/libre, and that many people vastly more talented than I (most employed by former Sun Microsystems) put in countless hours developing it...so that I can download it for free and criticize it. This doesn't change the fact that it is just not ready for prime-time. Yet, or likely anytime soon. And that is my opinion.

Upshot

  • OpenSolaris shows incredible promise and potential. Version 2010.03 is right around the corner. It will bring some ZFS enhancements that make it even more attractive, but from what I understand so far, few if any of the gripes I have below will go away.
  • I would actually encourage using OpenSolaris, if only to build a large, supportive community that encourages more and faster development, so it can start fulfilling its great potential as a killer server environment. Just not in a production environment! At least not yet...and not for me--I don't plan on running it again in the future.
    • The only thing I really like that is unique to OpenSolaris now is the ZFS file system, which is being ported to other operating systems as I write this, some even in semi-stable release (e.g. FreeBSD and Nexenta). Give me ZFS on a more classically POSIX-like OS any day.

Background

Hardware specs

  • Intel 5400 motherboard chipset.
  • Two quad-core Xeon CPUs.
  • 16 GiB of RAM
  • Rackmount chassis with 20 hot-swappable SATA bays.
    • Most bays populated with 7200 RPM, 1 TB, SATA-II drives
      • Typically arranged in mirrored pairs, striped as a single pool.
      • Each mirror pair typically consists of:
        • One "server-grade" SATA-II drive.
        • One "consumer-grade" SATA-II drive from a different manufacturer but otherwise purchased at the same time.
        • This strategy seems to be a reasonable compromise between the slightly longer average life of so-called "server-grade" drives, and the cheaper price of consumer grade.
    • Several "permanent" SDD drives mounted inside for things such as ZFS L2ARC and ZIL.
  • 26 SATA ports
    • On-board ICH (6 ports)
    • SiI 3114 adapter (4 ports)
    • Two LSI SAS HBAs (8 ports each)
  • Dual gigabit NICs
    • One Intel e1000 for max OpenSolaris compatibility
    • One NetGear, chosen specifically so the second NIC isn't the same brand or chipset as the first. (Which makes sense once you understand OpenSolaris's maze of network administration tools, and how much more complicated two NICs of the same make and/or chipset would make things.)
  • 1,400 watt power supply (about 250 w continuous draw when idle).
  • Humdrum NVidia video card. (Do not attempt OpenSolaris with anything else or you will be sorry, even though you probably won't care about accelerated graphics on a server.)
  • Fair-sized UPS capable of keeping it alive for about 20 minutes.

Server and administrative use case

Even though it is "just" a home-based file server, it must nevertheless support occasional high-bandwidth, multi-client access scenarios (high IOPS and throughput). Furthermore, uptime is no less important to me than it is for an enterprise server--if not in dollar terms. (Even though the server has yet to achieve anywhere near the number of nines required.)

Let me also clarify my own role as something of a quasi-"expert" OpenSolaris server administrator:

  • I don't want to be a server administrator. Not a Windows Server admin, not Linux, not UNIX. I hate administering servers. I just want the damn things to...well, serve.
  • It's true that before having a family, I did have the luxury of being an "OS fiddler/tweaker". Those days are long over.

Because I'm just a geek in general, I tend to do things that demand much from my computer systems--more than they can typically reliably deliver off-the-shelf.

  • This is why I always build my own computers (save for notebooks)...at least, before I had children.

The most frustrating--if not outright insane--part about the current state of computer technology, for me, is storage reliability.

  • Every year at least three hard drives bite the dust on me (usually more--once a record 12 drives in a single year).
    • This is simple math: average hard drive lifespan, multiplied by the number of drives I have spinning on any given day.
  • Recovering from drive failure means either restoring from backup, or more likely, rebuilding the OS from scratch if the drive was a system drive.
    • I always keep data on separate drives, and run mirrors, parity RAID, or some kind of redundancy. This makes life a little easier; just throw the bad hard drive away, insert a new one, and let the array rebuild.
  • Lately my data storage needs are growing so rapidly, that a regular ol' Windows Server and/or Linux box with RAID-5 just isn't cutting the mustard.
  • After much research, I jumped into OpenSolaris for its amazing ZFS file system. I don't regret the move for discovering the magic of ZFS, but I very much do regret it overall.

The knowledge I gained on OpenSolaris doesn't even apply to any other operating system; not Linux, BSD, or any other UNIX or UNIX-like derivative. It is just a sunk cost in brain cells. Had I known the learning curve ahead of me then, I would have figured something else out--even just cutting way back on my production of digital media so as not to need ZFS in the first place!

The Good

ZFS pool redundancy magic

RAID-Z
  • RAID-Z does not suffer from the potential "Write Hole" data corruption problem that most other parity-based RAID implementations (e.g. standard RAID-5) may experience under certain conditions (such as an unexpected controller, service, or system failure).
    • I experienced this problem more than once, about a decade ago, on some early versions of Promise's cheap consumer-grade "fake RAID" controllers.
  • ZFS offers single-parity (analogous to RAID-5), double-parity (analogous to RAID-6), and triple-parity options for RAID-Z. (See the sketch at the end of this subsection.)
    • Single-parity RAID is essentially dead (for modern production use), due to a risk of data loss during post-failure rebuilds that is rapidly approaching "certainty".
      • The problem is that hard drives are getting so large (doubling in capacity roughly every year) that rebuilding an array after a single drive failure takes so long that the odds of another drive in the array failing during the rebuild are approaching mathematical certainty. With single parity, a second drive failure before the first has been recovered from means losing the entire volume and the data in it.
        • I have unfortunately experienced this problem as well. One drive on a Windows Server RAID-5 volume died. While the volume was rebuilding on a replacement drive, a second original drive died, which resulted in total volume loss. Fortunately I never relied on storage redundancy as a substitute for backups.
      • Double-parity RAID buys a few more years in the useful lifetime of parity-based RAID, and triple-parity RAID buys a few more years on top of that.
        • Eventually however, as long as we use spinning platters and they double in capacity each year, no amount of redundancy will solve the exponential problem.
        • The only robust long-term solution that scales reasonably with exponentially increasing drive capacity currently involves (and probably always will involve) simple mirroring at its core, whether or not other solutions for data protection and/or performance, such as striping, are layered on top or underneath.
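
To make the parity options concrete, here is a minimal sketch of creating RAID-Z pools from the shell. The pool names and cXtYdZ device names are placeholders for whatever your controllers actually expose (check with the format utility), and each command creates a separate, alternative pool layout. Note that triple-parity (raidz3) arrived in development builds newer than the 2009.06 release I am running.

    # Single-parity RAID-Z (analogous to RAID-5): survives one drive failure
    zpool create tank1 raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

    # Double-parity RAID-Z2 (analogous to RAID-6): survives two drive failures
    zpool create tank2 raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

    # Triple-parity RAID-Z3 (newer builds only): survives three drive failures
    zpool create tank3 raidz3 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0

    # Verify the layout and health of a pool
    zpool status tank2
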
Mirroring
  • A pool (aka "volume" in other OS parlance) is always striped across available virtual devices. Considering that a "virtual device" can be a single drive, a file, or an entire array, one begins to see how flexible a ZFS pool can be.
  • Mirroring is potentially more expensive than parity-based RAID.
    • The storage cost for parity RAID can be expressed: [$ per usable TB] = ([$ per drive] / [TB per drive]) / (1 - ([parity count] / [number of drives])).
      • With a reasonable configuration, this is (variably) cheaper than mirroring.
    • For simple mirroring, the cost formula is equally simple: [$ per usable TB] = ([$ per drive] / [TB per drive]) * [drives per mirror].
      • In other words, for the common case of two-drive mirrors, your cost per usable TB doubles.
      • However, considering that the price/capacity ratio for drive storage is cut in half roughly every year, all this really means is that you will always be one year behind the price/capacity curve. That doesn't seem so bad in context, for the benefit of mirrored data. (A worked example and a pool-creation sketch follow this list.)
  • Mirroring can--and, depending on the use case, should--be used in conjunction with other solutions such as striping.
    • E.g., RAID 1+0 or just "10", which means data striped across multiple mirrors.
      • Note: RAID 10 is not the same thing as RAID 01. RAID 10 is a "stripe of mirrors" and is much more tolerant of multiple drive failures. RAID 01 is a "mirror of stripes", is much less tolerant of multiple failures (in fact less reliable than just RAID 1 mirroring), and is generally not advised for any use case.
    • A mirrored set does not have to come in just pairs. Depending on the value of your data and/or your distrust of hard drives, you may wish to have three-way or even four-way mirrored sets (e.g. with a four-way mirror, all four drives contain exactly the same data, and any three of them could fail at once without losing the volume or its data).
  • Mirrored data is computationally cheap to manage. There are no XOR parity calculations to perform, so no processor offloading is necessary, and therefore mirroring done in software is as fast (in real-world use) as hardware-based mirroring--and vastly more portable.
  • Redundancy achieved through mirrors (especially stripes of mirrors), all else being equal, generally performs significantly better than parity-based RAID, whether or not hardware offloading is performed for parity calculation. This is one reason why parity-based RAID is often considered unsuitable for database use.
  • There is no significant write performance penalty with mirrors, as there is with parity-based RAID.
    • It should be noted, however, that ZFS eliminates much of the penalty for parity-RAID writes with its "Copy-On-Write" design. In this scheme, parity for existing data does not need to be read and recalculated during a write, as it does with more traditional forms of parity RAID.
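
To put numbers on the cost formulas above (using hypothetical $100, 1 TB drives): ten drives arranged as five two-way mirrors yield 5 TB usable at $200 per usable TB, while the same ten drives in double-parity RAID-Z2 yield 8 TB usable at $125 per usable TB. Below is a minimal sketch of building and growing a stripe of mirrors (RAID 10 style) from the shell; the pool name and cXtYdZ device names are again placeholders.

    # Create a pool striped across two mirrored pairs (a "stripe of mirrors")
    zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0

    # Grow the pool later by striping in a third mirrored pair
    zpool add tank mirror c1t2d0 c2t2d0

    # Turn a two-way mirror into a three-way mirror by attaching another drive
    zpool attach tank c1t0d0 c3t0d0

    # Replace a failed drive; the mirror resilvers while the pool stays online
    zpool replace tank c2t1d0 c3t1d0
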

Other ZFS benefits

  • Native support for data snapshots, which allow administrators or users to roll back an entire system state, or recover individual user data files. (Something like Mac OS X's "Time Machine", only more robust and more natural for the file system to provide.)
  • All ZFS data, regardless of whether underlying redundancy is present or not, is checksummed on-disk. This adds extra protection against not only parity RAID data corruption, but other forms of data corruption. For redundant pools, it also allows for automatic self-healing of corrupted data.
  • The ZFS file system is comparatively fast, especially with built-in support for SSDs as second-level read cache and write log, which greatly increases IOPS (generally meaning it performs well under multiple simultaneous loads).
  • A ZFS pool may be non-destructively grown, fairly simply by:
    • Adding virtual devices to the pool. (A virtual device can be a RAID-Z or mirrored array.)
    • Swapping the existing drives of a virtual device (e.g. an array) out for larger ones, letting the array rebuild in between each swap. When all drives of an array have been swapped out for larger ones, the entire virtual device (and the pool that contains it) expands into the additional space.
  • The administrative model, even though it is terminal-based only, is fairly easy and straightforward. (A brief sketch follows this list.)
    • That is, once you get the drives up, running, and recognized, prior to managing them with the ZFS model.
  • Full support for hot-swapping drives.
  • Native support for transparent on-the-fly compression.
  • The pending next release (2010.03) will also support:
    • Native, on-the-fly block level data deduplication
    • Native, on-the-fly encryption
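
As a taste of that administrative model, here is a minimal sketch of the everyday commands behind the features above. The pool name (tank), dataset name (tank/home), snapshot name, and cXtYdZ devices are placeholders from my own layout, and the dedup/encryption properties slated for 2010.03 are deliberately omitted since I have not used them.

    # Take a snapshot of a dataset, and roll back to it later if needed
    zfs snapshot tank/home@2010-03-01
    zfs rollback tank/home@2010-03-01

    # Enable transparent on-the-fly compression for a dataset
    zfs set compression=on tank/home

    # Add an SSD as a second-level read cache (L2ARC) and another as a write log (ZIL)
    zpool add tank cache c4t0d0
    zpool add tank log c4t1d0
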

You may notice that everything I love about OpenSolaris has to do with ZFS. In fact, there is almost nothing I don't love about ZFS, save for the extended ACL system (which is optional).
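
For a taste of why I make that exception, here is a minimal sketch of the NFSv4-style extended ACLs at the command line, assuming a hypothetical file path and user; the exact output and permission names can vary by build.

    # Show a file's ACL entries in compact (-V) or verbose (-v) form
    ls -V /tank/home/report.txt

    # Add an entry granting one user read and write access
    chmod A+user:alice:read_data/write_data:allow /tank/home/report.txt

    # Remove the ACL entry at index 0
    chmod A0- /tank/home/report.txt
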

The Bad

  • Supported on precious little hardware.
    • It was an absolute miracle that some of my hardware happened to be supported, and even then only because I consciously try to use the most popular and widely supported hardware when building systems.
    • Even then, I had to swap out some hardware for components OpenSolaris would actually work with (e.g. storage controllers, NICs, video card).
  • Existing knowledge and skill working with Linux--or pretty much any other UNIX-like OS--is more or less useless on OpenSolaris.
    • Though some of them are innovative and welcome improvements, OpenSolaris relies on so many terminal utilities that exist only on OpenSolaris (some not even on its commercial Solaris sibling) that being a whiz with Linux, BSD, or any other UNIX-like variant gives you essentially no advantage.
    • You might as well hop over straight from Windows (or nothing) and have just as steep a learning curve.
    • I have no idea how OpenSolaris manages to still be considered "UNIX" by The Open Group, owner of the UNIX trademark and standard.
  • The universe of additional software that can be installed is practically non-existent (compared to any other server OS, including Windows or Linux).
  • Regular updates for bugs? Forget it! They rarely, if ever, happen. If OpenSolaris doesn't work for you now, you should not realistically anticipate an inter-release patch to fix it.
  • Community support is threadbare. And even then, support comes mostly from programmers and hardcore professional system administrators--so the help often assumes significant foreknowledge of OpenSolaris particulars. The fact is, just not that many people use OpenSolaris.
  • Not to contribute to the already considerable (and mostly unfounded) FUD surrounding the purchase of Sun by Oracle, but there really is no roadmap or bankable statement from Oracle yet on the future of OpenSolaris. (In fairness, OpenSolaris has never had a publicly communicated roadmap--which was also a major drawback.)

The Ugly

  • Don't even think about letting the system power off accidentally! If it does shut down unexpectedly, the odds are unconscionably high that it won't fully come back up again on its own without significant administrative intervention. The various problems I've experienced after power outages were legion--too many to itemize. And most of them were the result of old, known, documented, but unfixed bugs. To save yourself untold headaches:
    • Get the biggest UPS you can't afford, dedicated solely to your OpenSolaris server.
    • Consider a backup generator to the UPS.
    • Put duct tape over all power buttons.
    • Make sure nothing else is stored in the same room--no other servers, not even a broom. The only time you want to be in the same locked room as the server, is when you intend to spend hours there.
    • Make sure the room the server is in can only be accessed by OpenSolaris admins.
  • Which reminds me: I can't reboot the system. It has to actually power off completely; otherwise it gets stuck in an endless self-reboot loop and never fully comes up. This makes remote management all but impossible when reboots are required (and forget about WoL support!). The problem didn't exist on version 2008.11, and I'd wager it won't on 2010.03 either. But it does exist on this hardware, across multiple installs of the current 2009.06 release. It is a known bug yet to be fixed with an inter-release patch.
  • Managing network interfaces is incredibly clumsy and error-prone. I never could get Jumbo Frames and/or IPv6 working properly.
  • Bootup errors when mounting directories: If you suddenly can't log in, you may discover that there are no user home directories. (In which case you would be placed in a single-user Maintenance Mode anyway.)
    • It's because, sometimes, inexplicably, things mapped to the "/export" directory (possibly other directories as well) won't mount.
    • It is a known bug and involves the kernel trying to mount things in the wrong order (and then finding that the next mountpoint directory isn't empty, as it would be if the mounts were executed in the correct order).
    • The manual fixes are an incredible headache, and not very well known as many people apparently just give up, reinstall, and/or move on to some other OS.
    • There is only one paragraph on the entire World Wide Web that documents how to fix it. (Well, now two, with my link below.) And I only found that after poring through page after page of documentation, bugfix databases, forum posts, and chat transcripts. And it's a known bug! (I sketch the general shape of the manual fix after this list.)
  • Error messages are often cryptic, if not misleading or outright incorrect.
  • The built-in CIFS/SMB service, although easy enough to enable by itself, is tied very closely to the new ZFS ACL security model. While this new extended security model is a welcome move towards something like the more intuitive and powerful Microsoft NTFS ACL model, in practice it is a nightmare to configure and troubleshoot. And unlike the similar extended model on Linux, there are no GUIs available to help. And in this case, since the command-line model is so complex, a GUI is all but essential just to understand what security is in place at the moment.
  • Getting the alternative SMB service (samba.org's Samba) running is rife with issues, bugs, and workarounds on OpenSolaris. (But once it works, it is arguably a better Windows file sharing service than Microsoft offers, though the same is true of Samba running on any operating system. And better yet, with Samba you don't have to mess with ZFS ACLs, instead using the simpler [albeit less powerful] legacy *nix permissions model.)
  • Getting a VNC server running requires way too many workarounds and tweaking of obscure features.
  • The ZFS Snapshot service can be very finicky and is easily broken.
  • Getting OpenSolaris to boot from a mirrored root volume is an exercise in sheer frustration. It requires all kinds of hacks and workarounds and frankly if I had to do it again, I wouldn't!
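
For reference, here is the general shape of the manual fix for the mount-ordering bug mentioned above--a minimal sketch only, assuming the usual "directory is not empty" symptom. The /export/home path reflects my layout, and the exact steps can differ from system to system.

    # From maintenance mode, see which datasets failed to mount and where they belong
    zfs list -o name,mounted,mountpoint

    # Try mounting everything; note any "directory is not empty" errors
    zfs mount -a

    # Inspect the offending mountpoint; anything in it was written to the bare
    # directory while the real dataset was unmounted
    ls -la /export/home

    # Move the stray contents out of the way, then retry the mounts
    mkdir -p /var/tmp/stray
    mv /export/home/* /var/tmp/stray/
    zfs mount -a
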

Conclusion and next steps

The bottom line, for me at least, is that I have a file server now, finally, after a year of frustration and becoming something of an expert OpenSolaris server admin against my will. And inasmuch as it is in my control, it will never, ever be shut down or rebooted again!

  • I will still have to keep doubling available storage every year, without shutting the server down or rebooting. Fortunately I can do this without necessarily even logging in, at the console or remotely.
  • Assuming I can string enough UPSes together (because I refuse to turn the server off, which swapping them for one big UPS would require), it will survive the future and too-frequent northern California storm blackouts. Twelve hours seems to be an extreme upper bound for an outage here, and I don't think that is too unreasonable a target for just one server and a bank of batteries!
  • I have resolved not to touch the server at all, and just let it operate for as long as it will. The only real hardware failures I expect in that kind of scenario (i.e. a secure, clean, and stable environment) are drives dying, which is fine; I just replace those while everything stays running, and everything adjusts automagically and non-destructively.
  • Upgrading to version 2010.03 (out in mere weeks) is completely out of the question! It took me months to "recover" from upgrading from 2008.11 to 2009.06, and I've only recently gotten things stabilized, so there is no way I would voluntarily subject myself to that kind of torture again.
  • It seems reasonable that it should be able to run continuously for the next five years. I've easily gotten 1+ years out of Windows Servers on cheap consumer hardware before, sitting out in the open in high-traffic rooms, and the only reasons for eventual downtime were to replace dead drives. Take dead drives out of the equation (thanks to hot-swapping), consider that even OpenSolaris should in theory be more solid than Windows Server, and remember that the server lives in a perpetually clean and safe environment; five years does not seem unrealistic to me.

In short, I only plan on running OpenSolaris for as long as my current version continues running without intervention other than hot-swapping dead drives. Once it gives up the ghost, surely by then the ZFS file system will be supported more fully on other operating systems (it is already brewing on BSD, Nexenta [an OpenSolaris/Debian hybrid], and Linux [for now only in userland, until Oracle changes the license model]). Or another file system as robust will finally be out of the gate (e.g. Oracle's own btrfs). Either way, I do not anticipate that I will ever find myself running OpenSolaris (of any version) again, and I find that unfortunate--even sad (but not unfair)--to say.

This work is licensed by James R. "Jim" Collier in 2010 under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.