Originating Author: David Floyer
Original Post Here: http://wikibon.org/wiki/v/Designing_Systems_and_Infrastructure_in_the_Big_Data_IO_Centric_Era
In January of 2008, EMC landed a haymaker by announcing the incorporation of flash SSD technology into Symmetrix, and started the IO-centric era. At the same time, Fusion-io had just come out of stealth mode with the announcement of the first flash PCIe card. Later in 2008 IBM and Fusion-io announced the first million IOPS Quicksilver benchmark. Just four years later, the Big-Data IO-Centric Era came of age when Fusion-io and HP demonstrated 1 Billion IOPS using eight servers and 64 flash PCIe cards. The key technology making this possible is atomic writes directly from the server memory to flash. The reduction in latency and overhead allows the real-time ingestion of transactional and event data and the very near real-time creation of metadata and indexes to allow effective real-time analytics.
Wikibon analyzed the costs of four different 1 Billion IOPS configurations, shown in Figure 1. The differences are so great that a log graph is required. Figure 1 shows that:
- The Fusion-io/HP Atomic write configuration costs $3 million;
- The best configuration using a traditional IO software stack with PCIe flash cost an order of magnitude more, $34 million;
- The cost of the cheapest configuration using disk and SSDs cost about two orders of magnitude more, $220 million;
- The cost of a disk only configuration costs about three orders of magnitude more, $2,276 million.
The introduction of flash with atomic writes has allowed nearly a thousand-fold increase in the affordability of high IO applications in just four years. The impact of four years on the amount of space required to provide the computing for 1 billion IOPS is shown in Figure 2 below, reduced from the size of 2.7 football fields to a ½ rack.
The IO writes in this demonstration were small (64 bytes), and no work was done other than move the data from the server to the flash memory. It would be easy to dismiss the demonstration as trivial and not real-world.
Wikibon believes that would be a mistake, as the cost and capability implications are very significant. Many data sources such as Twitter, instant messages, Machine to machine (M2M) communication (such as refrigerators, automobiles, central heating), mobile location coordinates, metadata about a picture, etc., consist of a few bytes per input and are often event-driven. The ability to deal with massive numbers of IO at low cost will drive the development of applications to exploit these data torrents.
This research note analyzes how 1 Billion IOPS was achieved with atomic writes, looks in detail at the alternative ways that 1 Billion IOPS can be achieved and the cost differences, and looks at the implications on both IO infrastructure and system design going forward.
Wikibon concludes that atomic writes direct to flash will open up completely new ways of designing systems with enormous potential for improving business and societal efficiency, and providing new services to and from consumers, businesses, and governments. Wikibon believes that the overwhelming returns on new IO centric applications will drive infrastructure change from storage centric to IO centric over the remainder of this decade. This is a once-in-a-lifetime opportunity to disrupt the current status in most business segments, and the spoils will go to those who understand well and invest early and often.
Details of One Billion IOPS Demonstration
In San Francisco on January 5, 2012 Fusion-io and HP demonstrated a system driving a billion IOs per seconds (IOPS). Figure 3 shows the process flow. The “secret sauce” to achieving this performance is the elimination of the traditional IO stack by writing directly to flash memory. These writes use the Fusion-io VSL software with a beta version of an Auto Commit Memory API. The API performs an atomic write by communicating directly with the controller in a PCIe Fusion-io Drive. An atomic write guarantees in one pass that the data is secure, and the ioDrive controller is then responsible for any recovery of the data. This reduction of latency enabled by avoiding the traditional IO stack increases the throughput of the ioDrives from 1 million to more than 15 million IOs a second. The overall throughput of the server is now 125 million IOPS, and eight servers with a total of 64 ioDrives deliver 1 billion IOPS. The three-year total hardware and facilities cost of the solution is about $3 million. More detail of the assumptions behind this figure can be found in Footnote 1 & 2.
This capability is of great interest to database vendors and operators, as IO write time to persistent storage (particularly for database logs) is almost always a limiting factor for database throughput. The beta version of the Auto Commit VSL software does not yet include the locking features to ensure flash memory is not overwritten by another task, hence the name “non-locking atomic writes” used in this post. Fusion-io has indicated that it is working closely with some data-base vendors on a locking capability within the Auto Commit Memory API, which is likely to come to market in 2012.
Footnote 2 in the Footnotes section below give greater detail about the configurations shown in Figure 1 and Figure 4, and is optional reading. Wikibon analyzed the benchmark and created three other “thought experiment” configurations with the same throughput to look at the impact at the cost benefits of atomic writes.
Figure 4 looks at a breakdown of the costs of the most reasonable configurations. The detailed cost assumptions are shown in Table 1 in Footnote 1.
Implications for IO Infrastructure
This finding illustrates what Wikibon has been indicating for some time, that hard disk drives suck the life out of system performance, and are a major constraint to performance improvements and cost reductions. Indeed disk drive performance degrades annually as more and more data is stored under the actuator, a trend not offset by minimal improvements in seek times and spin speeds. Very high IO systems can be designed using flash-only approaches for active data at a dramatically lower cost than previously possible. This leads to an IO-centric design for both applications and infrastructure.
Figure 5 describes the Wikibon IO centric model. There are five layers in this model, three physical and two management:
- Local working flash layer
- This layer places flash very close to clusters of processors and uses locking atomic writes. This enables all the operational and event IO from multiple sources to be processed. Because of the removal of the constraints of writing out to hard disks, the metadata and indexes can be created for use by security, compliance, real-time analytics, and archiving. This obviates the need to separate operational and data warehouse systems.
- A reference model for equipment in layer 1 could be HP Proliant servers and Fusion-io cards with atomic writes using the Auto Commit Memory (ACM) API.
- End-to-end security models would be managed at this level. A reference model could the the Stealth Technology from Unisys.
- Active Data Management
- This layer manages the integrity and movement of data between the local working flash layer (1) and the distributed, shared flash layer (3). Short-term backup and recovery would be controlled in this layer.
- A reference model for equipment could be the later stages of Project Lightening from EMC with a plan to integrate tiering and cache coherency between the servers and storage arrays.
- Distributed Shared Flash Layer
- This layer provides lower latency access but is shared both locally and remotely by server clusters. Most metadata and indexes would be held in this layer, as well as other active transactional, event, and analytic data.
- Compression and de-duplication of data at this level would lower costs and facilitate movement of data over distance.
- A reference architecture could be a combination of flash-only arrays from vendors such as SolidFire that include flash IO tiering and management and federated arrays from HP 3PAR, EMC and NetApp.
- Archive Management layer
- This layer would manage the interface between the metadata and indexes and the archive disk layer and minimize the access and load times for data on disk.
- Frameworks for this layer that could be considered would be object systems such as WOS from DDN and Cleversafe’s dispersed storage technology. These would require integration with cross-industry (e.g., email archiving) and industry specific archive software.
- Distributed Archive/Long-Term Backup Layer
- This layer provides the lowest cost storage (on hard disk drives or tape for the foreseeable future), where the detailed data is geographically distributed under the management of the metadata/indexes in layer 3 and the layer 4 management layer.
- Reference models could be DDN S2A9900 architecture for sustained sequential IO and distributed cloud systems such as Nirvanix.
The thickness of the arrows in Figure 6 indicate the amount of data being transferred, and the arrow, the direction. Note that major flow is down through the layers. In practice there would be minimal transfer upwards from the lowest layer, as the IO and transfer costs are so high from hard disk drives.
Implications for System & Application Design
The results of the change in capability and cost of IO infrastructure is that the design of systems will also change dramatically:
- Current Operational Systems could be consolidated or separate databases independent linked to provide a single source of “truth” of the state of a large and small business and government organizations.
- Big Data Transaction and Event Systems can leverage the vast amount of information coming from people, systems, and M2M, ingest it and use the information to manage business operations, government operations, battlefields, physical distribution systems, transport systems, power distribution systems, agricultural systems, weather systems, etc., etc., etc.
- Big Data Analytical Systems allow the extraction of metadata, index data and summary data directly from the input stream of operational big data. This in turn allows the development of much smarter analytical systems designed as an extension of the operational system in close to real time instead of a much delayed extract of available operational data.
- Archive Systems would have the same ability to exploit the extracted metadata and indexes in real time. This could at last allow a complete separation of backup and archiving, improve the functionality of archive systems, and reduce the cost of implementing them. The ability to hold archive metadata and indexes in the active storage layers will allow much richer ability to mine data, and at the same time allow more effective access to detailed data records and the deletion of old data.
Systems design has been constrained for decades by the low speed of disk drive technology. The impact of this constraint has been increasing as server technologies improve while mechanical disk rotation speed remains constant. The cost per gigabyte has improved enormously; the cost per IO has changed very slowly.
The technology of atomic writes direct to flash removes these constraints. It allows the potential for unification of big transactional data and big-data analytics in the same system and will open up completely new ways of designing systems. Wikibon has written about theenormous potential for improving business and societal efficiency, and providing new services to and from consumers, businesses and governments.
Over the next decade, Wikibon believes that the overwhelming returns on new IO-centric applications that combine operational and analytic system will drive infrastructure change from storage-centric to IO-centric.
Action Item: CIOs should lead the business discussion on the potential of these system to radically improve the efficiency and reactiveness of their organizations. If their systems designers are not drooling over the potential of IO Centric, they should be assigned maintenance tasks. This is a once-in-a-lifetime opportunity to disrupt the current status in most business segments, and the spoils will go to those who understand well and invest early and often.
The footnotes provide more detail on the configuration details and cost assumptions of the comparisons, as well as more technical detail about VSL.
Detailed Assumptions Behind Configuration Costs
Table 1 below shows the detailed assumptions and costs underlying the previous figures.
This section gives greater detail about the configurations shown in Figure 1 and Figure 4 and is optional reading. Wikibon analyzed the benchmark and created three other “though experiment” configurations with the same throughput to look at the impact at the cost benefits of atomic writes. Figure 6 shows the four detailed configurations and the constraint on performance. One configuration is proven capable and the other three are theoretically capable of reaching one billion IOPS.
- The non-locking atomic write configuration is based on eight HP DL370 servers taking 2U of rack space, each with eight Fusion-io ioMemory2 Duo cards with 2.4TB of MLC flash. The total space is 16U of rack space (in the demonstration more space was left between the servers),and the total power requirement is 15kW. The system is well balanced, with the ultimate constraint being the flash cards, which topped out at 15,625,000 IOPS per card.
- The traditional write to PCIe flash uses the same type of configuration that was used in earliest one million IOPS demonstrations.
- 1 million IOPS achieved by IBM and Fusion-io with the quicksilver project in 2008.
- In 2009, HP reduced the server configuration required down to four quad-core AMD Opteron processors with five 320GB Fusion-io ioDrive Duos and six 160GB ioDrives.
- At HP Discover in November 2011, HP achieved 1.55 million random 4K IOPS within a single HP ProLiant DL580 G7 server and ten ioDrives.
- The “thought” configuration to achieve one billion 64 byte IOPS requires 250 HP ProLiant DL580 G7 servers configured with 4 x Ten-Core Intel® Xeon® E7-4870 2.40GHz 6.4GT/s QPI 30MB Cache (130W), 256 GB of 1333MHz ECC/REG DDR-III Memory, and 2,000 0.4TB ioDrive2 flash cards from Fusion-io. The performance constraint is the server. This delivers an IOPS rate of about 4 million/server. The rack space/server is 4U.
- The third 1 billion configuration assumes that the data is written to disk. To be able to balance the IOs, flash SSDs are used to block the 64byte writes into 1 MB chunks. For each server, five Anobit Genesis SSD drives were assumed with 200GB signal processing MLC. The form factor is 2.5” 6Gb/s SAS 2.0 drives with 510MB/s read and write speed and 50,000 P/E cycles. The output from the four drives is written to a single 2.5” Seagate® Savvio® 15K.3 6-Gb/s 146GB Hard Drive at a rate of 40 IOPS and 40 megabytes/second. The server assumed is a HP DL370 with 2 x Six-Core Intel® Xeon® X5660 2.80GHz 6.4GT/s QPI 12MB L3 Cache (95W), 72GB 1333MHz ECC/REG DDR-III Memory, in 2U of rack space. 5,000 servers are required.
- The fourth “thought” configuration assumes no flash or battery protected RAM, just disk drives: DAS/JBOD gone wild. It is assumed that with sequential write 300 IOPS could be achieved with a Hitachi 2.5″ drive. 5,000 servers are required to meet the server throughput (as in configuration 3), and 667 drives are required for each server to handle the IO write rate. In total 5,000 servers and 3.3 million disk drives are required, taking 8,588 racks to house the configuration and 90,875 kW of power.
Deep Dive: Fusion-Io VSL And Auto Commit Memory API
The VSL memory architecture provides multiple 2TB address space chunks. The first 2TB is used for system files. The remaining 2TB allocation chunks are for user data or directory files. A large file takes the whole chunk, while multiple small files share a single chunk. There can be up to 273 2TB chunks.
VSL provides a set of functions that allow the flash storage to act as a memory subsystem. The key addition is that these functions allow the higher levels of the storage hierarchy to exploit the persistent nature of flash. The architecture pushes down the sector, buffer, and log management of flash to the flash controller, which takes responsibility for guaranteeing any write action or allocation request, and maintaining recoverability. The VSL layer is responsible for block erasures, reliability, and wear-leveling. In the event of a crash, the VSL driver can reconstruct its metadata from the flash device.
With the addition of the Auto Commit Memory (ACM) API, applications can declare a region of a VSL 2TB virtual address space as persistent. The application can then program to this region using regular memory semantics, such as pointer dereferences, and access the memory directly via CPU load and store operations. These operations bypass the normal IO stack components of the operating system, and as a result latency is significantly reduced.
In the event of a failure or restart, VSL and ACM work in cooperation with the existing platform hardware and CPU memory hierarchy, to apply the memory updates and ensure data persistence and integrity.
One obvious initial use of ACM is for acceleration of database and file-system logs, which will allow almost instantaneous releasing of locks held by a transaction. Fusion-io has indicated it is working with system and application developers to apply this technology to other opportunities in operating systems, hypervisors, and databases, and directly in applications.