Demystifying data storage: Archiving options for PACS
Dr. Nagy is the Director of the Radiology Informatics Laboratory, Medical College of Wisconsin, Milwaukee, WI. Mr. Farmer is the Chief Technology Officer of Cambridge Computer, Waltham, MA.
There are a number of options when it comes to choosing the right storage technology for your medical images. The first challenge is that the storage industry is evolving rapidly, and you don't want to be locked into an obsolete storage strategy. Hard drive capacity has increased by a factor of 17.6 million since its invention in 1952 [1]. There has been a sustained average annual growth in capacity of 33%. Tomorrow, there will be higher-capacity drives at lower prices, and you don't want to be stuck with older, slower, and more expensive equipment.
The second challenge is that storage can take a big bite out of your budget if you don't fully understand what you are paying for. Data storage can be one of the single greatest expenses for many picture archiving and communication systems (PACS), and its cost and complexity are often the barrier to adopting PACS. It's easy to overbuy, and there is no shortage of enthusiastic storage salespeople out there to take your money! In this article, we will demystify some of the terminology thrown about in the storage arena and lay the foundation for a flexible, robust, and cost-effective storage strategy.
Disk devices versus jukeboxes: Online, nearline, and offline storage
Storage is usually classified functionally as online, nearline, or offline. Online storage refers to data that is stored on magnetic hard drives, with access times in milliseconds and transfer rates in the range of tens to hundreds of megabytes per second (MB/sec). Online storage is immediately available to your PACS application.
Nearline storage typically refers to a tape or optical jukebox in which robotic arms retrieve the media automatically and insert them into a drive to read or write data. Generally, a nearline system can access data within 60 seconds and can transfer data at a few MB/sec. Offline storage is removable tape or optical media that is cataloged, stored on a shelf, and retrieved manually. Today, there is little use for offline storage, as it is very slow in retrieving data and can cause data loss if media are mislabeled, misplaced, or mishandled.
The relationship between online (hard drives) and nearline (tape and optical media) storage has historically been considered a direct trade-off between cost and performance. A type of software application called hierarchical storage management (HSM) would manage a relatively small portion of online storage and a larger amount of nearline storage together as one storage pool. The HSM would try to predict which studies would be requested and keep them on the online portion of the system. As the online portion filled up, older studies would be retired to the nearline storage system. In PACS, the rule of thumb is that roughly 80% of immediately relevant prior studies are from within the last 6 months, and 90% are from within 1 year (Figure 1). It therefore makes sense to buy at least 1 year's worth of online storage. This ensures that you won't experience delays in retrieving prior studies for 80% to 90% of your cases.
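The retirement step described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a purely age-based rule with a 365-day cutoff; real HSM products use richer prediction logic, and the names `Study`, `assign_tier`, and `partition` are hypothetical, not any vendor's interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Study:
    study_id: str   # hypothetical identifier for illustration
    acquired: date  # date the study was acquired


def assign_tier(study: Study, today: date, online_days: int = 365) -> str:
    """Return 'online' for studies within the cutoff age, else 'nearline'."""
    age = today - study.acquired
    return "online" if age <= timedelta(days=online_days) else "nearline"


def partition(studies, today: date, online_days: int = 365):
    """Split studies into (online, nearline) lists using the age rule."""
    online, nearline = [], []
    for study in studies:
        if assign_tier(study, today, online_days) == "online":
            online.append(study)
        else:
            nearline.append(study)
    return online, nearline
```

With a 1-year cutoff, a study acquired 3 months ago stays online while one acquired 2 years ago is retired to nearline storage.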
At the Medical College of Wisconsin, we generate roughly 10 terabytes (TB) of data annually for the 225,000 radiological procedures performed. This ratio would be higher for cancer centers that do a higher percentage of computed tomography (CT) procedures. Relative to other healthcare applications, PACS requires a disproportionate amount of data storage, 100 to 1000 times as much, and must be able to scale in capacity indefinitely.
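As a rough sizing aid, the figures above can be turned into an average study size and a first-year online capacity target. The sketch below assumes decimal units (1 TB = 1,000,000 MB); the function names and the optional `headroom` parameter are illustrative assumptions, not part of any PACS product.

```python
TB_IN_MB = 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB


def average_study_mb(annual_tb: float, annual_procedures: int) -> float:
    """Average data volume per procedure, in MB."""
    return annual_tb * TB_IN_MB / annual_procedures


def online_capacity_tb(annual_tb: float, years_online: float = 1.0,
                       headroom: float = 0.0) -> float:
    """Online capacity needed to hold `years_online` of studies.

    `headroom` is an optional safety margin (0.2 means 20% extra).
    """
    return annual_tb * years_online * (1.0 + headroom)
```

For 10 TB and 225,000 procedures per year, `average_study_mb` gives roughly 44 MB per procedure, and holding one year of studies online requires about 10 TB before any headroom is added.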
Moving from jukeboxes to disk arrays
Early adopters of PACS were forced to rely disproportionately on nearline storage versus online storage. Disk storage was too expensive to accommodate a sufficient cache of prior studies, and studies had to be pulled from the nearline jukebox regularly. Jukeboxes have a limited number of drives, not to mention all kinds of slow mechanical processes. As such, requests would queue up, which caused delays, and failures of the robotic systems were extremely inconvenient and costly.
Another problem with jukeboxes is that they lock you into today's cost of storage. At any given time, removable media is cheaper per MB than hard drives, but jukebox systems require you to buy most of your storage technology up front, in anticipation of your long-term needs. By the time you grow into your anticipated needs, the cost of hard disks could have dropped well below the original cost of the jukebox. Once you factor in the cost of maintaining the jukebox and the software and expertise to manage it, the disk approach is cheaper.
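The effect of buying up front versus buying as you grow can be illustrated with a simple model. This sketch assumes the price per TB falls by a fixed fraction each year and ignores maintenance, media price differences, and the time value of money; the prices and decline rates you feed it are hypothetical inputs, not market data.

```python
def upfront_cost(tb_per_year: float, years: int, price_per_tb: float) -> float:
    """Cost of buying all capacity for `years` at today's price."""
    return tb_per_year * years * price_per_tb


def incremental_cost(tb_per_year: float, years: int, price_per_tb: float,
                     annual_price_decline: float) -> float:
    """Cost of buying one year's capacity at the start of each year.

    `annual_price_decline` is the assumed fractional price drop per year
    (0.25 means each year's price is 75% of the previous year's).
    """
    total = 0.0
    price = price_per_tb
    for _ in range(years):
        total += tb_per_year * price
        price *= (1.0 - annual_price_decline)
    return total
```

Under any positive price decline, the incremental total comes out lower than the up-front total for the same capacity, which is the lock-in effect described above.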
Using disks for nearline storage
The historical cost advantage of removable media, such as tape and optical, over hard drives no longer exists. On the contrary, hard drives are not only much faster but also cheaper than tape or optical media for data storage. The role of nearline storage must change to one of disaster recovery and obsolescence protection.
Today, the preferable solution is to use disk technology, rather than a jukebox, for a nearline system. You still have nearline storage, but you are storing a second copy to a disk device, rather than to a jukebox. The software that manages the disk-based nearline archive might be the very same HSM software that manages a jukebox.
If you are just starting out with a PACS and your budget is constrained, you could build your system entirely with online storage and later add nearline storage as the need becomes apparent. This helps with obsolescence protection, as it makes data migration much easier when the time comes to move data to newer hardware platforms or between systems from different vendors.
Dissecting online storage (disk systems)
PACS do not require "enterprise" storage solutions
In the past, the only place to buy high-capacity, scalable disk arrays was in the enterprise computing marketplace, and, as such, there is a common misconception that storage systems designed for corporate data centers are required for PACS. Corporate data centers often have hundreds of servers, each running different applications and operating systems. Each server in an enterprise needs only a few gigabytes (GB) of storage for its textual information. Managing storage in such diverse environments is a nightmare, and a very expensive nightmare at that. It makes a lot of sense to manage all those servers from a central storage area network to reduce complexity.
Enterprise-class storage systems can be cost-justified for complex data centers, but they are overkill for most PACS. The best bet is to buy dedicated storage systems for your PACS archive (Table 1).
DAS (Direct attached storage)
There are three ways your PACS can use online storage. With direct attached storage (DAS), the hard drives sit directly in the server running the PACS application. This is the simplest model and the one that, historically, all PACS vendors started with. Unfortunately, the DAS model has scalability limitations in a PACS environment, in which you need lots of drives: next year, when you need to buy more TB of storage, you will be limited by the number of drives you can fit into the server. You can purchase an external small computer system interface (SCSI) drive system to extend a DAS, but this will buy you only 1 to 2 years before it reaches capacity. The DAS model is also not very fault tolerant. If that server goes down, you will lose access to the data on that server.
Storage area network
One of the most popular solutions for the corporate data center is the storage area network (SAN), in which the storage is independent of the servers. A SAN is a dedicated network for connecting storage devices to computers. This means you can add storage each year without having to take the servers down. The storage can be accessed from more than 1 server, so your system can suffer the loss of a server without losing access to the storage.
Network attached storage
Another popular solution for enterprise data centers is network attached storage (NAS). In a SAN, the storage is accessed on a separate dedicated network controlled by the servers. In contrast, NAS is freestanding storage sitting on the network. NAS is not directly attached to the servers, and the storage is accessed using standard network protocols. An analogy for NAS would be attaching your printer to a network as opposed to attaching it directly to your computer. When you have only one computer, it is simpler to attach the printer directly, but when you want many computers to access it, you are better off attaching it to the network.
Not being tightly coupled to the PACS vendor and its software can be a real advantage for NAS. It does not need the same level of validation with the PACS application for every upgrade. This gives the customer more freedom to choose among various NAS vendors, rather than being locked into the storage vendor the PACS vendor prefers.
SCSI versus SATA
There are four different types of hard drives on the market: SCSI, Fibre Channel, advanced technology attachment (ATA), and serial ATA (SATA). SCSI and Fibre Channel drives are typically used in enterprise storage arrays and servers. ATA and SATA are typically used in personal computers and storage arrays with less demanding performance requirements. Whether you use SAN, NAS, or DAS, you should consider using storage systems based on ATA and SATA drives.
SCSI hard drives -- These are the common hard drives used for enterprise storage. They typically run at high rotational speeds of 10,000 to 15,000 revolutions per minute (rpm). They are more expensive than commodity ATA drives, primarily because ATA drives ship in up to 8 times the volume of SCSI drives.
ATA hard drives -- ATA drives were originally designed and marketed to the personal computer marketplace. Individual ATA and SATA drives are slower than Fibre Channel and SCSI drives because they rotate at 7200 rpm, versus up to 15,000 rpm for SCSI. Interestingly, the overall system performance differences are not likely to be noticeable on a PACS. In fact, depending on the controller and number of drives used in your storage systems, ATA and SATA disk systems could perform as fast as or faster than some Fibre Channel and SCSI systems. Meanwhile, ATA and SATA drives are less expensive and are available in higher capacities. This translates to a significantly lower cost per TB without any trade-offs from a PACS perspective. PACS is all about moving big files around a network with high throughput. SCSI drives are better for transactional storage, such as email servers and databases, which need access to small files many thousands of times per second.
The peak load on a PACS storage system for a large hospital with multiple simultaneous requests is approximately 30 to 50 MB/sec. The limits on performance are mostly at the application level, in how fast the workstation software can receive the data.
Disaster recovery and fault tolerance
Online storage systems span data across multiple hard drives in a redundant array of inexpensive disks (RAID). There are several different techniques for implementing RAID. RAID 5 is the configuration that is most suitable for PACS.
RAID technology compensates for the failure of an individual hard drive. In RAID 5, the disk controllers write data across multiple drives in such a way that if one drive fails, your data is still intact. If two drives fail, however, your data is lost, so it is common to keep at least one online spare drive in the cabinet. In the event of a drive failure, the online spare is automatically substituted and the disk array gradually returns to a fault-tolerant state.
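The way RAID 5 survives a single drive failure can be shown with exclusive-or (XOR) parity. The sketch below works on one stripe only and assumes equal-length blocks; real controllers rotate the parity block across drives and do this in hardware or firmware, and the function names here are illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity(data_blocks):
    """Compute the parity block for one stripe of data blocks."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity_block):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])
```

If the stripe holds three data blocks and the drive with the second block fails, `rebuild` applied to the first and third blocks plus the parity block returns the lost block. If two blocks are missing, the XOR no longer has enough information, which is why a second failure loses data.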
Make sure you configure the system so that you will be alerted to a failure. In the past, when PACS was situated in the department, a blinking red light on a failed drive might have been enough to alert an attentive administrator. Today, the PACS is buried in the back of the data center, where visual error lights might not be observed. Ensure that your storage system utilizes an alerting mechanism for any failure that will require human intervention. This will ensure that you get an email or page when a drive dies on your server. Simple network management protocol (SNMP) tools are available that can trap all the errors from your servers and storage devices so you can see what is going on from one location.
Due to their mechanical operation, hard drives are prone to failure. The more drives you have, the higher the probability that you will have a failure. There are other components that can malfunction as well. Be sure to ask your vendor to identify other single points of failure. Power supplies are another common point of failure. Be sure that you have redundant power supplies and be sure to plug them into an uninterruptible power supply (UPS) system, which is a device that uses batteries to back up the electrical power in case of a power outage.
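The link between drive count and failure risk can be made concrete with a simple probability model. This sketch assumes every drive fails independently with the same annual failure probability, which is a simplification; the rate you pass in is a hypothetical input, not a measured figure.

```python
def prob_any_failure(num_drives: int, annual_failure_prob: float) -> float:
    """Probability that at least one of `num_drives` fails within a year.

    Assumes independent failures with identical per-drive probability.
    """
    return 1.0 - (1.0 - annual_failure_prob) ** num_drives
```

Under this model the chance of seeing at least one failure rises steadily as drives are added, which is why online spares and failure alerts matter more as the array grows.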
Mirroring refers to a technique for writing your data to two locations at the same time. If one storage location fails, the PACS can access the other copy relatively easily. Replication refers to a type of mirroring in which data is written to one place and then copied to another place. Depending on the type of network connecting the two places, the second copy could be a bit out of sync with the original.
Backup systems for PACS might use similar hardware and software as enterprise data center backup systems. The biggest difference is that your data is largely cumulative. That is, you are adding data rather than updating previous data, and you almost never delete data. Enterprise data centers typically use a backup strategy that involves making full backups once a week and backing up only the changes during the week. This approach could be costly and unnecessary for a PACS. A PACS could be backed up in full only once, with new files added to the backup system incrementally.
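Because the archive is cumulative, an incremental pass only has to find files that the backup catalog has not seen yet. The following is a minimal sketch of that selection, assuming files are never modified or deleted once archived; the function names and the use of plain file-name sets as the catalog are illustrative assumptions, not a real backup product's interface.

```python
def files_to_back_up(archive_files, backed_up):
    """Return archive files that are not yet in the backup catalog."""
    return sorted(set(archive_files) - set(backed_up))


def run_backup(archive_files, backed_up):
    """Simulate one incremental pass.

    Returns the list of newly copied files and the updated catalog.
    """
    new_files = files_to_back_up(archive_files, backed_up)
    updated_catalog = set(backed_up) | set(new_files)
    return new_files, updated_catalog
```

After the single initial full backup, each later pass copies only the studies added since the previous run, and a pass with no new studies copies nothing.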
Optical versus tape
If you plan to have offsite storage and want to use a removable media format, you should look at tape storage rather than optical. There are two reasons for this. The first is that the cost density is currently superior, with an industry standard of 500 GB per tape for Super Advanced Intelligent Tape (SAIT). The optical industry standard is still in the range of 5 to 30 GB, with the latest optical media being 30 GB Ultra Dense Optical (UDO). The second reason is performance. If you need to retrieve data from removable media after a disaster, tape has a much faster sequential throughput rate. The fastest read rate from an optical drive is 2 to 3 MB/sec, whereas a tape can retrieve at 20 to 30 MB/sec. In the event of a disaster, retrieving 1 TB of data (approximately 25,000 studies) from an optical drive would take about 4 days, as opposed to about 9 hours with tape.
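The restore times quoted above follow directly from the throughput figures. The sketch below reproduces the arithmetic, assuming decimal units (1 TB = 1,000,000 MB), a single drive, and sustained sequential throughput with no media-swap or seek overhead; the function name is illustrative.

```python
def restore_hours(data_tb: float, throughput_mb_per_sec: float) -> float:
    """Hours needed to read `data_tb` at a sustained sequential rate."""
    data_mb = data_tb * 1_000_000  # assumption: 1 TB = 1,000,000 MB
    seconds = data_mb / throughput_mb_per_sec
    return seconds / 3600.0
```

At the fast end of each range, 1 TB takes about 93 hours (3.9 days) at 3 MB/sec and about 9.3 hours at 30 MB/sec. At the slow end, the same restore takes about 5.8 days at 2 MB/sec and about 13.9 hours at 20 MB/sec.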
Conclusion
You should strongly consider putting all of your PACS storage online; it is not only economical but also good protection against obsolescence. Also, consider using SATA hard drives, which offer a level of reliability once available only for enterprise storage, at near-desktop prices. The cost of storage will continue to decrease every year, while capacity increases, as the computer industry continues to innovate. The best way to take advantage of this is to purchase only the storage you need for the upcoming year and to decouple your storage from the PACS server by employing a network attached storage technique. With careful planning, you should be able to stay ahead of your storage requirements without having the archive consume a significant portion of your hard-earned PACS budget.