VM World Update: Fusion-io in the Wikibon Cube

August 28, 2012

Designing Systems and Infrastructure in the Big Data IO Centric Era (Wikibon Repost)

January 31, 2012

Originating Author: David Floyer

Original Post Here:  http://wikibon.org/wiki/v/Designing_Systems_and_Infrastructure_in_the_Big_Data_IO_Centric_Era


Executive Summary

In January 2008, EMC landed a haymaker by announcing the incorporation of flash SSD technology into Symmetrix, starting the IO-centric era. At the same time, Fusion-io had just come out of stealth mode with the announcement of the first flash PCIe card. Later in 2008, IBM and Fusion-io announced the first million-IOPS benchmark, Quicksilver. Just four years later, the Big-Data IO-Centric Era came of age when Fusion-io and HP demonstrated 1 billion IOPS using eight servers and 64 flash PCIe cards. The key technology making this possible is atomic writes direct from server memory to flash. The reduction in latency and overhead allows the real-time ingestion of transactional and event data, and the very near real-time creation of metadata and indexes for effective real-time analytics.

Figure 1 – Log Graph of Total 3-year Hardware & Facilities Cost for Four Alternative 1 Billion IOPS Write Configurations. Source: Wikibon 2012

Wikibon analyzed the costs of four different 1 Billion IOPS configurations, shown in Figure 1. The differences are so great that a log graph is required. Figure 1 shows that:

  • The Fusion-io/HP atomic-write configuration costs $3 million;
  • The best configuration using a traditional IO software stack with PCIe flash costs an order of magnitude more, $34 million;
  • The cheapest configuration using disk and SSDs costs about two orders of magnitude more, $220 million;
  • A disk-only configuration costs about three orders of magnitude more, $2,276 million.

The introduction of flash with atomic writes has made high-IO applications nearly a thousand times more affordable in just four years. Figure 2 below shows the impact of those four years on the floor space required to deliver 1 billion IOPS: a reduction from the equivalent of 2.7 football fields to half a rack.
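As a quick back-of-the-envelope check of those ratios and of the thousand-fold affordability claim, here is a minimal Python sketch using only the headline costs quoted above (the configuration labels are shortened for readability):

```python
# Back-of-the-envelope check of the cost ratios in Figure 1.
# Headline 3-year hardware & facilities costs in US$ millions, from the text above.
from math import log10

costs_millions = {
    "Atomic writes to PCIe flash (Fusion-io/HP)": 3,
    "Traditional IO stack to PCIe flash": 34,
    "Disk plus SSD": 220,
    "Disk only": 2276,
}

baseline = costs_millions["Atomic writes to PCIe flash (Fusion-io/HP)"]
for name, cost in costs_millions.items():
    ratio = cost / baseline
    print(f"{name}: ${cost}M, {ratio:.0f}x the baseline (~10^{log10(ratio):.1f})")
```

The disk-only configuration works out to roughly 760 times the cost of the atomic-write configuration, which is the basis of the “nearly a thousand-fold” statement above.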

The IO writes in this demonstration were small (64 bytes), and no work was done other than moving the data from the server to the flash memory. It would be easy to dismiss the demonstration as trivial and not real-world.

Wikibon believes that would be a mistake, as the cost and capability implications are very significant. Many data sources such as Twitter messages, instant messages, machine-to-machine (M2M) communication (from refrigerators, automobiles, central heating, and the like), mobile location coordinates, and metadata about a picture consist of a few bytes per input and are often event-driven. The ability to deal with massive numbers of IOs at low cost will drive the development of applications that exploit these data torrents.

This research note analyzes how 1 billion IOPS was achieved with atomic writes, examines in detail the alternative ways that 1 billion IOPS can be achieved and their cost differences, and looks at the implications for both IO infrastructure and system design going forward.

Wikibon concludes that atomic writes direct to flash will open up completely new ways of designing systems, with enormous potential for improving business and societal efficiency and for providing new services to and from consumers, businesses, and governments. Wikibon believes that the overwhelming returns on new IO-centric applications will drive infrastructure change from storage-centric to IO-centric over the remainder of this decade. This is a once-in-a-lifetime opportunity to disrupt the status quo in most business segments, and the spoils will go to those who understand well and invest early and often.

Figure 2 – Space, to scale, taken by Four Alternative 1 Billion IOPS Write Configurations, from ½ rack (the little red dot at the top left of the green sphere) to 2.7 football fields. Source: Wikibon 2012

Details of One Billion IOPS Demonstration

Figure 3 – Process flow for one HP DL370 server with 8 Fusion-io ioMemory2 Duo flash cards with VSL and non-locking Atomic Write API.  Source: Wikibon 2012

Figure 4 – Breakdown of Total 3-year Hardware & Facilities Cost for Three Alternative 1 Billion IOPS Write Configurations. Source: Wikibon 2012

In San Francisco on January 5, 2012, Fusion-io and HP demonstrated a system driving a billion IOs per second (IOPS). Figure 3 shows the process flow. The “secret sauce” behind this performance is the elimination of the traditional IO stack by writing directly to flash memory. These writes use the Fusion-io VSL software with a beta version of an Auto Commit Memory API. The API performs an atomic write by communicating directly with the controller on a PCIe Fusion-io ioDrive. An atomic write guarantees in one pass that the data is secure, and the ioDrive controller is then responsible for any recovery of the data. The reduction in latency from avoiding the traditional IO stack increases the throughput of the ioDrives from 1 million to more than 15 million IOs per second. The overall throughput of each server is 125 million IOPS, and eight servers with a total of 64 ioDrives deliver 1 billion IOPS. The three-year total hardware and facilities cost of the solution is about $3 million. More detail on the assumptions behind this figure can be found in Footnotes 1 and 2.
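The arithmetic behind the headline number is straightforward. A minimal sketch using the per-card rate quoted in Footnote 2 and the card and server counts described above:

```python
# How 1 billion IOPS is composed: 8 servers x 8 cards, ~15.6M 64-byte IOPS per card.
iops_per_card = 15_625_000        # per-card rate with atomic writes (Footnote 2)
cards_per_server = 8
servers = 8

iops_per_server = iops_per_card * cards_per_server      # 125,000,000 per server
total_iops = iops_per_server * servers                   # 1,000,000,000 in total

bytes_per_io = 64
payload_gb_per_second = total_iops * bytes_per_io / 1e9  # ~64 GB/s of 64-byte payloads

print(f"{iops_per_server:,} IOPS per server, {total_iops:,} IOPS in total")
print(f"~{payload_gb_per_second:.0f} GB/s of aggregate 64-byte payload")
```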

This capability is of great interest to database vendors and operators, as IO write time to persistent storage (particularly for database logs) is almost always a limiting factor for database throughput. The beta version of the Auto Commit VSL software does not yet include the locking features that ensure flash memory is not overwritten by another task, hence the name “non-locking atomic writes” used in this post. Fusion-io has indicated that it is working closely with some database vendors on a locking capability within the Auto Commit Memory API, which is likely to come to market in 2012.

Footnote 2 in the Footnotes section below gives greater detail about the configurations shown in Figure 1 and Figure 4, and is optional reading. Wikibon analyzed the benchmark and created three other “thought experiment” configurations with the same throughput to look at the cost benefits of atomic writes.

Figure 4 looks at a breakdown of the costs of the most reasonable configurations. The detailed cost assumptions are shown in Table 1 in Footnote 1.

Implications for IO Infrastructure

This finding illustrates what Wikibon has been indicating for some time, that hard disk drives suck the life out of system performance, and are a major constraint to performance improvements and cost reductions. Indeed disk drive performance degrades annually as more and more data is stored under the actuator, a trend not offset by minimal improvements in seek times and spin speeds. Very high IO systems can be designed using flash-only approaches for active data at a dramatically lower cost than previously possible. This leads to an IO-centric design for both applications and infrastructure.

Figure 5 – Logical Model of the five layers in an IO-Centric Infrastructure. Source: Wikibon 2012

Figure 5 describes the Wikibon IO-centric model. There are five layers in this model, three physical and two management:

  1. Local working flash layer
    • This layer places flash very close to clusters of processors and uses locking atomic writes. This enables all the operational and event IO from multiple sources to be processed. Because the constraint of writing out to hard disks is removed, metadata and indexes can be created for use by security, compliance, real-time analytics, and archiving. This obviates the need to separate operational and data warehouse systems.
    • A reference model for equipment in layer 1 could be HP Proliant servers and Fusion-io cards with atomic writes using the Auto Commit Memory (ACM) API.
    • End-to-end security models would be managed at this level. A reference model could be the Stealth technology from Unisys.
  2. Active Data Management
    • This layer manages the integrity and movement of data between the local working flash layer (1) and the distributed, shared flash layer (3). Short-term backup and recovery would be controlled in this layer.
    • A reference model for equipment could be the later stages of Project Lightning from EMC, with a plan to integrate tiering and cache coherency between the servers and storage arrays.
  3. Distributed Shared Flash Layer
    • This layer provides lower latency access but is shared both locally and remotely by server clusters. Most metadata and indexes would be held in this layer, as well as other active transactional, event, and analytic data.
    • Compression and de-duplication of data at this level would lower costs and facilitate movement of data over distance.
    • A reference architecture could be a combination of flash-only arrays from vendors such as SolidFire that include flash IO tiering and management and federated arrays from HP 3PAR, EMC and NetApp.
  4. Archive Management layer
    • This layer would manage the interface between the metadata and indexes and the archive disk layer and minimize the access and load times for data on disk.
    • Frameworks for this layer that could be considered would be object systems such as WOS from DDN and Cleversafe’s dispersed storage technology. These would require integration with cross-industry (e.g., email archiving) and industry specific archive software.
  5. Distributed Archive/Long-Term Backup Layer
    • This layer provides the lowest cost storage (on hard disk drives or tape for the foreseeable future), where the detailed data is geographically distributed under the management of the metadata/indexes in layer 3 and the layer 4 management layer.
    • Reference models could be DDN S2A9900 architecture for sustained sequential IO and distributed cloud systems such as Nirvanix.

The thickness of the arrows in Figure 5 indicates the amount of data being transferred, and the arrowheads show the direction. Note that the major flow is down through the layers. In practice there would be minimal transfer upward from the lowest layer, as the IO and transfer costs from hard disk drives are so high.
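To make the shape of the model concrete, here is a purely illustrative Python sketch of the five layers and the dominant downward data flow. The layer names, the physical/management split, and the reference examples are taken from the list above; the data structure itself is not any vendor's API, just a compact restatement.

```python
# Illustrative restatement of the Wikibon five-layer IO-centric model (Figure 5).
from dataclasses import dataclass, field

@dataclass
class Layer:
    number: int
    name: str
    kind: str                      # "physical" or "management"
    examples: list = field(default_factory=list)

io_centric_model = [
    Layer(1, "Local working flash", "physical",
          ["HP ProLiant servers with Fusion-io ACM atomic writes"]),
    Layer(2, "Active data management", "management",
          ["later stages of EMC Project Lightning"]),
    Layer(3, "Distributed shared flash", "physical",
          ["SolidFire flash-only arrays", "federated HP 3PAR / EMC / NetApp arrays"]),
    Layer(4, "Archive management", "management",
          ["DDN WOS", "Cleversafe dispersed storage"]),
    Layer(5, "Distributed archive / long-term backup", "physical",
          ["DDN S2A9900", "Nirvanix cloud storage"]),
]

# The dominant data flow is downward through the layers; upward flow out of the
# lowest layer is minimal because disk IO and transfer costs are so high.
for upper, lower in zip(io_centric_model, io_centric_model[1:]):
    print(f"Layer {upper.number} ({upper.name}, {upper.kind}) --> "
          f"Layer {lower.number} ({lower.name})")
```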

Implications for System & Application Design

The result of the change in the capability and cost of IO infrastructure is that the design of systems will also change dramatically:

  1. Current Operational Systems could be consolidated, or separate databases independently linked, to provide a single source of “truth” about the state of large and small business and government organizations.
  2. Big Data Transaction and Event Systems can ingest the vast amount of information coming from people, systems, and M2M devices and use it to manage business operations, government operations, battlefields, physical distribution systems, transport systems, power distribution systems, agricultural systems, weather systems, etc., etc., etc.
  3. Big Data Analytical Systems allow the extraction of metadata, index data and summary data directly from the input stream of operational big data. This in turn allows the development of much smarter analytical systems designed as an extension of the operational system in close to real time instead of a much delayed extract of available operational data.
  4. Archive Systems would have the same ability to exploit the extracted metadata and indexes in real time. This could at last allow a complete separation of backup and archiving, improve the functionality of archive systems, and reduce the cost of implementing them. The ability to hold archive metadata and indexes in the active storage layers will allow much richer ability to mine data, and at the same time allow more effective access to detailed data records and the deletion of old data.

Conclusions

Systems design has been constrained for decades by the low speed of disk drive technology. The impact of this constraint has been increasing as server technologies improve while mechanical disk rotation speed remains constant. The cost per gigabyte has improved enormously; the cost per IO has changed very slowly.

The technology of atomic writes direct to flash removes these constraints. It allows the potential for unification of big transactional data and big-data analytics in the same system, and it will open up completely new ways of designing systems. Wikibon has written about the enormous potential for improving business and societal efficiency, and for providing new services to and from consumers, businesses, and governments.

Over the next decade, Wikibon believes that the overwhelming returns on new IO-centric applications that combine operational and analytic systems will drive infrastructure change from storage-centric to IO-centric.

Action Item: CIOs should lead the business discussion on the potential of these systems to radically improve the efficiency and responsiveness of their organizations. If their systems designers are not drooling over the potential of IO-centric design, they should be assigned maintenance tasks. This is a once-in-a-lifetime opportunity to disrupt the status quo in most business segments, and the spoils will go to those who understand well and invest early and often.

Links

Link to interview between John Furrier and David Floyer at the Node Summit 1/28/2012

Footnotes

The footnotes provide more detail on the configuration details and cost assumptions of the comparisons, as well as more technical detail about VSL.

Detailed Assumptions Behind Configuration Costs

Footnote 1

Table 1 below shows the detailed assumptions and costs underlying the previous figures.

Table 1 – Detailed Assumptions and Costs for Four Alternative 1 Billion IOPS Write Configurations.   Source: Wikibon 2012.

Deep Dive: Comparison With Other Billion IOPS Configurations

Footnote 2

This section gives greater detail about the configurations shown in Figure 1 and Figure 4 and is optional reading. Wikibon analyzed the benchmark and created three other “thought experiment” configurations with the same throughput to look at the cost benefits of atomic writes. Figure 6 shows the four detailed configurations and the constraint on performance. One configuration is proven capable and the other three are theoretically capable of reaching one billion IOPS.

Figure 6 – Details of Four Alternative 1 Billion IOPS Write Configurations.
Source: Wikibon 2012
  • The non-locking atomic write configuration is based on eight HP DL370 servers, each taking 2U of rack space and each with eight Fusion-io ioMemory2 Duo cards with 2.4TB of MLC flash. The total space is 16U of rack space (in the demonstration more space was left between the servers), and the total power requirement is 15kW. The system is well balanced, with the ultimate constraint being the flash cards, which topped out at 15,625,000 IOPS per card.
  • The traditional write to PCIe flash uses the same type of configuration that was used in the earliest one-million-IOPS demonstrations.
    • 1 million IOPS was achieved by IBM and Fusion-io with the Quicksilver project in 2008.
    • In 2009, HP reduced the server configuration required down to four quad-core AMD Opteron processors with five 320GB Fusion-io ioDrive Duos and six 160GB ioDrives.
    • At HP Discover in November 2011, HP achieved 1.55 million random 4K IOPS within a single HP ProLiant DL580 G7 server and ten ioDrives.
    • The “thought” configuration to achieve one billion 64 byte IOPS requires 250 HP ProLiant DL580 G7 servers configured with 4 x Ten-Core Intel® Xeon® E7-4870 2.40GHz 6.4GT/s QPI 30MB Cache (130W), 256 GB of 1333MHz ECC/REG DDR-III Memory, and 2,000 0.4TB ioDrive2 flash cards from Fusion-io. The performance constraint is the server. This delivers an IOPS rate of about 4 million/server. The rack space/server is 4U.
  • The third 1 billion IOPS configuration assumes that the data is written to disk. To balance the IOs, flash SSDs are used to block the 64-byte writes into 1MB chunks. For each server, five Anobit Genesis SSD drives were assumed with 200GB of signal-processing MLC. The form factor is 2.5” 6Gb/s SAS 2.0 drives with 510MB/s read and write speed and 50,000 P/E cycles. The output from the SSDs is written to a single 2.5” Seagate® Savvio® 15K.3 6-Gb/s 146GB hard drive at a rate of 40 IOPS and 40 megabytes/second. The server assumed is an HP DL370 with 2 x Six-Core Intel® Xeon® X5660 2.80GHz 6.4GT/s QPI 12MB L3 Cache (95W) and 72GB of 1333MHz ECC/REG DDR-III memory, in 2U of rack space. 5,000 servers are required.
  • The fourth “thought” configuration assumes no flash or battery-protected RAM, just disk drives: DAS/JBOD gone wild. It is assumed that with sequential writes, 300 IOPS could be achieved with a Hitachi 2.5″ drive. 5,000 servers are required to meet the server throughput (as in configuration 3), and 667 drives are required for each server to handle the IO write rate. In total 5,000 servers and 3.3 million disk drives are required, taking 8,588 racks to house the configuration and 90,875 kW of power; a quick check of this arithmetic appears in the sketch below.
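A minimal check of the disk-only arithmetic in the last bullet, using only the assumptions quoted above:

```python
# Disk-only "thought experiment": how many drives does 1 billion IOPS need?
target_iops = 1_000_000_000
iops_per_drive = 300          # assumed sequential-write rate for a 2.5" drive
servers = 5_000               # same server count as configuration 3

drives_needed = target_iops / iops_per_drive      # ~3.33 million drives
drives_per_server = drives_needed / servers       # ~667 drives per server

print(f"{drives_needed:,.0f} drives in total, ~{drives_per_server:.0f} per server")
```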

Deep Dive: Fusion-io VSL And Auto Commit Memory API

The VSL memory architecture provides multiple 2TB address space chunks. The first 2TB is used for system files. The remaining 2TB allocation chunks are for user data or directory files. A large file takes the whole chunk, while multiple small files share a single chunk. There can be up to 273 2TB chunks.

VSL provides a set of functions that allow the flash storage to act as a memory subsystem. The key addition is that these functions allow the higher levels of the storage hierarchy to exploit the persistent nature of flash. The architecture pushes down the sector, buffer, and log management of flash to the flash controller, which takes responsibility for guaranteeing any write action or allocation request, and maintaining recoverability. The VSL layer is responsible for block erasures, reliability, and wear-leveling. In the event of a crash, the VSL driver can reconstruct its metadata from the flash device.

With the addition of the Auto Commit Memory (ACM) API, applications can declare a region of a VSL 2TB virtual address space as persistent. The application can then program to this region using regular memory semantics, such as pointer dereferences, and access the memory directly via CPU load and store operations. These operations bypass the normal IO stack components of the operating system, and as a result latency is significantly reduced.
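To illustrate what “regular memory semantics” means in practice, here is a minimal, generic sketch. It is emphatically not the Fusion-io ACM API (which is a C-level interface to VSL); an ordinary memory-mapped file stands in for the persistent region, so the example still pays the page-cache and flush costs that ACM is designed to avoid. The point is only the programming model: the application updates a persistent region with loads and stores rather than with read()/write() calls through the block IO stack.

```python
# Generic illustration of memory-semantics access to a persistent region.
# NOT the Fusion-io ACM API: an ordinary memory-mapped file is the stand-in.
import mmap
import os

REGION_SIZE = 4096
path = "acm_stand_in.bin"

# Create and size the backing region once.
with open(path, "wb") as f:
    f.truncate(REGION_SIZE)

fd = os.open(path, os.O_RDWR)
region = mmap.mmap(fd, REGION_SIZE)

# The application writes with memory semantics (a slice assignment here),
# rather than calling write()/fsync() through the block IO stack.
record = b"txn-00001:commit"
region[0:len(record)] = record

# Make the update durable. With ACM, durability is guaranteed by the flash
# controller on the card rather than by an explicit flush like this one.
region.flush()

region.close()
os.close(fd)
```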

In the event of a failure or restart, VSL and ACM work in cooperation with the existing platform hardware and CPU memory hierarchy, to apply the memory updates and ensure data persistence and integrity.

One obvious initial use of ACM is the acceleration of database and file-system logs, which will allow almost instantaneous release of locks held by a transaction. Fusion-io has indicated it is working with system and application developers to apply this technology to other opportunities in operating systems, hypervisors, and databases, and directly in applications.


FLASH UPDATE: Wikibon Repost

January 4, 2012

Wikibon has just published an updated Flash Report — comprehensive — the usual great work we get from Dave Vellante et al.

Read more here:  WIKIBON FLASH UPDATE


Wikibon Repost: Is Virident Poised for Cloud?

December 15, 2011

Turning IT Costs into Profits: New Storage Requirements Emerge for NextGen Cloud Apps

Last Update: Dec 14, 2011 | 06:39
Originating Author: David Vellante

One of the more compelling trends occurring in the cloud is the emergence of new workloads. Specifically, many early cloud customers focused on moving data to the cloud (e.g. archive or backup) whereas in the next twelve months we’re increasingly going to see an emphasis on moving applications to the cloud; and many will be business/mission critical (e.g. SAP, Oracle).

These emergent workloads will naturally have different storage requirements and characteristics. For example, think about applications like Dropbox. The main storage characteristic is cheap and deep. Even Facebook, which has much more complex storage needs and heavily leverages flash (e.g. from Fusion-io), is a single application serving many tenants. In the case of Facebook (or say Salesforce), it has control over the app, end-to-end visibility and can tune the behavior of the application to a great degree.

In 2012, cloud service providers will begin to deploy multitenant/multi-app environments enabled by a new type of infrastructure. The platform will not only provide block-based storage but it will deliver guaranteed quality of service (QoS) that can be pinned to applications. Moreover, the concept of performance virtualization – where a pool of performance (not capacity) is shared across applications by multiple tenants – will begin to see uptake. The underpinning of this architecture will be all-flash arrays.
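As a purely illustrative sketch of the performance-virtualization idea (no vendor API is implied, and the tenant names and IOPS numbers are invented), a shared pool of IOPS, rather than capacity, is carved into per-application guarantees:

```python
# Toy model of performance virtualization: reserving IOPS from a shared pool.
class PerformancePool:
    def __init__(self, total_iops: int):
        self.total_iops = total_iops
        self.reservations = {}            # application name -> guaranteed IOPS

    def reserve(self, app: str, iops: int) -> None:
        committed = sum(self.reservations.values())
        if committed + iops > self.total_iops:
            raise ValueError(
                f"pool exhausted: {committed:,} of {self.total_iops:,} IOPS reserved")
        self.reservations[app] = iops

pool = PerformancePool(total_iops=500_000)
pool.reserve("tenant-a/oracle", 200_000)   # mission-critical app gets a hard floor
pool.reserve("tenant-b/sap", 150_000)
pool.reserve("tenant-c/backup", 100_000)
print(pool.reservations)
```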

This means that users can expect a broader set of workload types to be run in the cloud. The trend is particularly interesting for smaller and mid-sized customers that want to run Oracle, SAP and other mission-critical apps in the cloud as a way to increase flexibility, reduce CAPEX and share risk, particularly security risk.

The question is who will deliver these platforms? Will it be traditional SAN suppliers such as EMC, IBM, HP, HDS and NetApp, or emerging specialists like SolidFire, Nimbus and Virident? The bet is that while the existing SAN whales will get their fair share of business, especially in the private cloud space, in the growing cloud service provider market a new breed of infrastructure player will emerge with the vision, management, QoS and performance ethos to capture market share and grow with the hottest CSP players.

For data center buyers, where IT is a cost center, the safe bet may be traditional SAN vendors. For cloud service providers, where IT is a profit center, the ability to monetize service levels by providing QoS and performance guarantees on an application-by-application basis will emerge as a differentiator.

Action Item: All-flash arrays will emerge as an enabler for new applications in the cloud. The key ingredients will be not just the flash capability, but more importantly management functions that enable controlling performance at the application level. Those shops that view IT as a profit center (e.g. cloud service providers) should aggressively pursue this capability and begin to develop new value-based pricing models tied to mission critical applications.


Wikibon Report Card — How are we doing?

February 16, 2010

From November 10th, 2009:

Wikibon’s Take

We see five main priorities facing Xiotech, including:

1) Channel. Xiotech has flip-flopped on the channel several times and now that the channel is performing well for the company, Xiotech needs to provide consistent support.

2) Enterprise. Xiotech needs to grow beyond SMB and secure a number of large reference accounts for its products. Atkinson will use his connections in the customer base to open doors but Xiotech needs to demonstrate it can successfully service large enterprises.

3) OEM. Xiotech needs to secure some OEMs. Its product is well-suited for the OEM market because it’s rock solid and performs well. An OEM strategy will take time to develop but could pay dividends in 2012 and beyond.

4) Focus. Xiotech was one of the first companies to market with virtual disks, virtual ports and virtual controller groups but it never really caught the ‘Tier 1.5’ wave as did players such as Compellent, 3PAR and Lefthand. Xiotech needs to carve out a niche of its own by articulating the value of its controller-less vision and demonstrating superior cost, performance, ease-of-use and reliability to customers.

5) Buzz. Xiotech still needs to do a better job marketing its product and developing a new brand to change its image. Atkinson is capable of doing this because of who he is, who he can attract and the way he’ll market Xiotech’s experts.

How’re we doing so far?


Xiotech CEO Sends 2010 Signals

January 10, 2010

Full Disclosure:  I started with XIOTECH this week here in the Bay Area.  As if on cue, WIKIBON publishes the following:

I received an email blast yesterday from Alan Atkinson, the new CEO of Xiotech. It was a teaser to keep the company on people’s radar screens. I wrote about Xiotech back in November after I met with Alan. As I said at the time, Atkinson would both shake up Xiotech and get more marketing value out of its big name people assets. He’s doing just that.

In his note he pointed out that Xiotech has made several additions to its management team, including Jim McDonald as Chief Strategy Officer and Brian Reagan as SVP of Marketing and Business Development. Atkinson also announced that Richie Lary has been appointed corporate Fellow. McDonald worked with Atkinson at Goldman Sachs and then WysDM, which was later sold to EMC, where he did a stint after the acquisition. Reagan is former EMC and IBM (via Arsenal Digital), and Richie Lary is a scary smart dude who looks like Cosmo Kramer and is generally accepted as the father of Digital’s StorageWorks. Atkinson also pointed to Steve Sicola, Rob Peglar and Mark Glasgow to underscore the depth of Xiotech’s bench.

I went digging to find out a bit more about what’s going on at Xiotech. First, Atkinson mentioned a good second half of 2009, but he didn’t give any specifics. Indications are that second-half revenues were up nearly 40% year on year. I’m sure the comparison is against a miserable 2nd half of 2008, but that’s still good progress, which is largely attributed to the growth of Xiotech’s ISE architecture. Some of the growth supposedly came from Wall Street, where both Atkinson and McDonald know 99% of the IT guys from their tenure at Goldman. I couldn’t confirm Goldman as an account yet, but it’s an obvious bet that Atkinson and McDonald will get Xiotech in the door.

One of the things Atkinson said when we met last fall is that he wanted to get his OEM business going. I mean it’s a no-brainer strategy, right? You have some good IP with an architecture that is cost effective and since 2008 has had like four drive failures in the field over a sample of more than 30,000 disk drives. According to the Carnegie Mellon study you’d expect two orders of magnitude more failures annually than Xiotech customers are experiencing. Some OEMs will be interested in these types of results because it will keep their costs down. But the OEM business will take time to build and I don’t expect tons of noise this year on that front. Xiotech has been talking about being qualified behind IBM’s SAN Volume Controller but that’s only a baby step; although it’s another avenue for Xiotech’s reseller channel.

Another point Atkinson made in his letter was “We also will position ISE as the storage answer to dilemmas in virtualization, cloud and other performance-starved applications.” I’m not entirely clear on what this means but about a month ago, Jim McDonald wrote a blog that touched on the role of the hypervisor and how increasingly complex array-based function is moving back toward the host. Interesting perspective. I picked up on this trend back in November and used it for a blog on flash threats and opportunities. A key here will be how well Xiotech is able to integrate with the vStorage APIs announced by VMware in 2008. So far very few vendors have stepped up to this task. It’s a big resource commitment that further underscores the importance of OEMs that can provide resources to get integration work done.

For too long Xiotech’s been a ship at sea without a clear navigation plan. I have no doubt Atkinson is bringing focus and I’m interested in hearing more about the company’s vision, messaging and strategy later on this year.

Stay tuned.

reposted verbatim from wikibon