|
YOUR FEEDBACK
|
TODAY'S TOP SOA & WEBSERVICES LINKS Storage Protocols Simplifying the Storage Algorithm
Shared storage in a Linux cluster environment
By: Iain Findleton
Dec. 16, 2004 12:00 AM
Current storage solution paradigms are facing a bumpy road ahead with the rapid emergence of the Linux-based cluster as a vehicle of choice in the enterprise storage market. Most readers will be familiar with the problems associated with the use of the Direct Attached Storage (DAS) paradigm in cluster applications, and many will have experienced the limitations of the Storage Area Network (SAN) paradigm. In large clusters, DAS results in fragmented islands of data that number in the same order as the number of nodes in the cluster. SAN implementations can reduce the number of islands by about an order of magnitude but, in the enterprise world, that is not a great advance. In either case, the fragmentation results in escalating storage management complexity and in comparably escalating costs for software solutions to the management issues. DAS and SAN solutions are workable on storage clusters of order 10 nodes, whereas for clusters of order 100, 1,000, or 10,000 nodes, these approaches are simply unworkable. The Network Attached Storage (NAS) paradigm, the workhorse of the previous millennium for shared storage applications, has the design characteristics to address the shared storage requirements of many cluster-based applications, but lacks features such as a unified name space, local cache synchronization, and global locking services that are needed for the implementation of parallel applications on clusters. While NAS is effective in serving the needs of an array of independent processes running on cluster nodes and accessing independent storage objects, it doesn't lend itself to problems such as the scaling of DBMS server applications, which imply the sharing of the same data objects by many cluster nodes in a highly synchronized fashion. Global File System (GFS) products do present the possibility of addressing the needs of clusters for shared access to storage by many processes running on many nodes of a cluster. GFS designs found in the storage networking marketplace today provide access, through either a unified name space or an aggregated name space, to objects stored on a pool of physical storage that is shared by many nodes in a cluster. A unified name space allows all of the applications on the nodes of the cluster to access data through an identical view of, for example, the directory structure that describes the metadata that relates to the actual data of interest. An aggregated name space, on the other hand, presents an aliased view of a number of independent metadata views maintained by, for instance, a pool of NAS servers. Whatever the approach, these schemes rely on some form of centralized locking service to ensure that the metadata view is consistent and coherent across all clients of the storage system. The Scalability IssueGene Amdahl, a pioneer in the development of multiprocessor systems, enunciated the critical issue relating to the scalability of parallel applications, stating that the marginal improvement in the performance of a parallel application delivered by adding more processing units to the parallel array is bounded by the amount of strictly serial processing in the application. In essence, Amdahl's Law states that the increased performance of a parallel application, when plotted against the number of parallel-processing elements, becomes asymptotic at a rate determined by the percentage of serial code in the application. The shock comes when a study of Amdahl's Law effect reveals that the level of asymptotic performance is reached, even for applications with inherent 95% parallel content, after about 50 nodes (see Figure 1).In the storage networking context, the serial component of the algorithm is the locking mechanism. Coordination of access to storage objects through the use of a lock manager service implies that all access is serialized. Depending on the nature of the storage access, this serialization will manifest the Amdahl's Law effect either very rapidly, as in the case of many accesses to relatively small objects, or in a somewhat leisurely fashion where large blocks of data are being transferred for each access to the lock service. Based on the experience with current products in the storage networking marketplace, Amdahl's Law makes its impact at order 10 nodes. Advanced storage networking schemes that make use of approaches like connection routing, object storage targets with offload engines, and products aimed at very highly parallel applications appear to be able to push back the manifestation of asymptotic behavior to something less than 100 nodes. Eliminating the Traffic CopAlmost all implementations of shared access to network storage make use of some form of a traffic cop to coordinate access to data objects on the shared storage media. The MetaData Controller (MDC) acts as a traffic cop by scheduling the access to stored objects by client application processes. A process requests access to an object by contacting the MDC, which then schedules that access in some reasonable manner. Typically, the storage algorithm wishes to ensure that, for example, one process doesn't overwrite an object while another is reading it. DBMS applications may implement complex access rules that are coordinated using shared locks.On the other hand, what if the accesses to data objects could be scheduled in such a manner that undesirable collisions never occurred? What if no two data access requests for the same data object are allowed to occur during the same time that either one of them needs to complete? What if this scheduling facility could be effected by the storage targets themselves, in a totally asynchronous and independent fashion? The need for a traffic cop would be eliminated. The serial component of the shared storage algorithm would be eliminated. The shared storage algorithm would become 100% parallel, and the effect of Amdahl's Law eliminated. Each additional storage processing element would result in an improvement in the performance of the storage system equal to the performance capacity of the new element. Terrascale Technologies Inc. succeeded in developing its Shared Access Scheduling Algorithm (SASS), which does eliminate the traffic cop in the shared storage access algorithm. The SASS algorithm is implemented as an intelligent, iSCSI-based, Object Storage Target (OST). The SASS-based OSTs in a storage cluster configuration are driven from storage client nodes of a cluster through a block device driver using the standard iSCSI protocol. The block device driver makes available to client applications a standard block device interface for each of the OSTs configured in a storage configuration (see Figure 2). In the Linux context, the block devices look like disk drives, and each makes available a quantity of disk blocks. The collection of OSTs implements an IP SAN style of storage pool. Using any standard Linux tool, such as the Multiple Device Driver (md) or the Logical Volume Manager (lvm), the individual iSCSI devices seen by the clients can be aggregated into a single logical device that presents to applications a unified storage space with a logical block address space. Using the unified logical storage space, an application, such as a file system, can be run that makes use of storage space provided by the OSTs. Scalability CharacteristicsThe OSTs that implement the SASS algorithm operate independently and asynchronously of each other. There is no communication between individual OSTs. In an implementation that uses md to aggregate OSTs, I/O operations will be stripped over the individual storage targets. Each OST receives I/O requests from client application nodes, and carefully schedules them so that there are no collisions that would violate the usual data access semantics. Because the OSTs are oblivious to each other's presence, adding an OST to the configuration contributes the full available performance of the additional target to the overall performance of the combined pool. Amdahl's Law does not apply because there is no serializing component in the storage structure.Similarly, clients do not have any knowledge of each other. The OSTs accept iSCSI-based connections from storage client nodes, each of which knows of the existence of the OSTs in the storage pool, but have no knowledge of each other. Again, there is no potential for the serialization of I/O requests based on interclient communication, so adding a client to the system delivers the full performance capacity of the storage system to the applications running on the client. The effective available storage capacity is determined by the aggregate performance capacity of the OSTs and the throughput capacity of the network being used. The performance of a shared storage pool built using SASS-based OSTs scales linearly to the capacity of the interconnection network used to make the storage pool available to the storage client nodes of the cluster. The name space presented to the client nodes is unified. The application interface is at the block level. Using the standard Linux RAID 0 implementation in md effectively distributes the I/O load across all of the participating OSTs, eliminating the need for periodic and complex container management operations to eliminate hot spots and rebalance the storage system (see Figure 3). The TerraGrid ProductTerraGrid is a software bundle that demonstrates how to use SASS-based OSTs to build a global file system implementation for a Linux cluster. It consists of the SASS-based OST, the device driver needed to drive the OST, and a modified version of the standard ext2 file system, called extz, which is used to implement a global file system. The only other component of the design is md, which is part of Linux. A storage system is built by running the OST code on one or more storage targets, each of which contributes the same amount of physical storage to the combined storage pool. On the storage client nodes, the device driver and md are run to create the storage aggregation interface. After building an ext2 file system from one client on the aggregated device, using mke2fs, the extz file system is mounted on all client nodes (see Figure 4).Applications on client nodes have fully synchronous, coherent access to data stored in the extz file system via a unified name space. A cache invalidation protocol ensures that metadata views across all client nodes are consistent. Since there is no requirement for either the client nodes to have knowledge of other clients, nor for the OSTs to have knowledge of each other, there are no intrinsic scalability limits to the system. Since the SASS algorithm ensures that each client will receive a temporally accurate copy of the data stored on the OSTs when it makes an access, there is no need to implement special access coordination mechanisms within the application. All applications run as if they were the unique and exclusive owners of the entire shared storage resource. Scalable Database ImplementationsAn issue of significant interest in the DBMS marketplace is the scaling of the performance of DBMS server applications. The traditional approach is to use SMP machines to deliver more processing horsepower to the server. Shared memory-locking schemes provide for the coordination of DBMS operations among the process running on the processor array. This approach requires the use of increasingly sophisticated memory subsystems, dramatically increasing the cost of SMP machines as the number of processors is increased.The lock management scheme used in DBMS server designs is not intellectually different from the MDC approach that's widely used in the storage networking realm. Implementation of a SASS-based scheme with the view to eliminating the scalability constraints on the DBMS algorithm is a near-term goal of Terrascale's. Implementation of a scalable DBMS infrastructure using the TerraGrid product to handle the locking of file-based data and metadata records is straightforward. The SASS algorithm will handle the correct provision of access to the underlying file structure by multiple instances of the DBMS server application running on an array of client nodes. Within a single client node, the DBMS server code can handle multiple sessions. In principle, a single node instance of a DBMS server application, such as mysqld, can be cloned and run on top of the extz file system to build a scalable DBMS platform that can deliver large aggregate transaction rates. SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
|
SYS-CON FEATURED WHITEPAPERS MOST READ THIS WEEK |
|||||||||||||||||||||||||||||