SSDs choked by crummy disk interfaces: NVMe and SCSI Express Explained

December 13, 2011

This is the complete repost of Chris Mellor's terrific article from last week:

Gotta be PCIe and not SAS or SATA

By Chris Mellor

Posted in Storage, 7th December 2011 15:43 GMT


A flash device that can put out 100,000 IOPS shouldn’t be crippled by a disk interface geared to dealing with the 200 or so IOPS delivered by individual slow hard disk drives.

Disk drives suffer from the wait before the read head is positioned over the target track: 11msecs for a random read and 13msecs for a random write on Seagate’s 750GB Momentus. Solid state drives (SSDs) do not suffer from this lag, and PCIe flash cards from vendors such as Fusion-io have shown how fast NAND storage can be when directly connected to servers, delivering 350,000 and more IOPS from its ioDrive 2 products.
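As a back-of-the-envelope check on those IOPS figures, per-operation latency sets a hard ceiling on serial IOPS: a drive that needs 11msecs per random read can complete at most about 91 of them per second. A quick sketch (the SSD latency here is an illustrative ~100 microseconds, not a measured figure):

```python
# Rough IOPS ceiling implied by per-operation latency.
# HDD figures are the Momentus examples from the article; real drives
# vary with queueing, caching and access patterns.

def max_iops(latency_seconds: float) -> float:
    """Upper bound on serial operations per second at a given latency."""
    return 1.0 / latency_seconds

hdd_random_read = max_iops(0.011)     # 11 ms -> ~91 IOPS
hdd_random_write = max_iops(0.013)    # 13 ms -> ~77 IOPS
ssd_random_read = max_iops(0.000100)  # ~100 us NAND read (illustrative) -> ~10,000 IOPS

print(round(hdd_random_read), round(hdd_random_write), round(ssd_random_read))
```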

Generation 3 PCIe delivers 1GB/sec per lane, with a 4-lane (x4) gen 3 PCIe interface shipping 4GB/sec.
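Those bandwidth numbers are just lane count times per-lane rate; a one-liner makes the scaling explicit (using the article's rounded 1GB/sec figure rather than the exact 128b/130b-encoded rate):

```python
# PCIe throughput scales linearly with lane count.
GEN3_PER_LANE_GB_S = 1.0  # article's rounded figure; raw gen 3 is ~0.985 GB/s after encoding

def pcie_bandwidth(lanes: int, per_lane_gb_s: float = GEN3_PER_LANE_GB_S) -> float:
    return lanes * per_lane_gb_s

print(pcie_bandwidth(4))  # x4 link -> 4.0 GB/s
print(pcie_bandwidth(8))  # x8 link -> 8.0 GB/s
```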

You cannot hook an SSD directly to such a PCIe bus with any standard interface.

You can hook up virtually any disk drive to an external USB interface or an internal SAS or SATA one, and the host computer’s O/S will have standard drivers that can deal with it. Ditto for an SSD using these interfaces, but the SSD is sluggardly. To operate at full speed, and so deliver data fast and help keep a multi-core CPU busy, it needs a direct interface to a server’s PCIe bus, not one mediated through a disk drive gateway.


If you could hook an SSD directly to the PCIe bus, you could dispense with an intervening HBA, which requires power and slows down the SSD through a few microseconds of added latency and a design based on hard disk drive connectivity.

There are two efforts to produce standards for this interface: the NVMe and the SCSI Express initiatives.


NVMe, standing for Non-Volatile Memory Express, is a standards-based initiative by some 80 companies to develop a common interface. An NVMHCI (Non-Volatile Memory Host Controller Interface) work group is directed by a multi-member Promoter Group of companies – formed in June 2011 – which includes Cisco, Dell, EMC, IDT, Intel, NetApp, and Oracle. Permanent seats in this group are held by these seven vendors, with six other seats held by elected representatives from amongst the other work group member companies.

It appears that HP is not an NVMe member, and most if not all NVMe supporters are not SCSI Express supporters.

The work group released a v1.0 specification in March this year, and details can be obtained at the NVM Express website.

A white paper on that site says:

The standard includes the register programming interface, command set, and feature set definition. This enables standard drivers to be written for each OS and enables interoperability between implementations that shortens OEM qualification cycles. …The interface provides an optimised command issue and completion path. It includes support for parallel operation by supporting up to 64K command queues within an I/O Queue. Additionally, support has been added for many Enterprise capabilities like end-to-end data protection (compatible with T10 DIF and DIX standards), enhanced error reporting, and virtualisation.

The standard has recommendations for client and enterprise systems, which is useful as it means it will embrace the spectrum from notebook to enterprise server. The specification can support up to 64,000 I/O queues with up to 64,000 commands per queue. It is multi-core in scope: each processor core can implement its own queue. There will also be a means of supporting legacy interfaces, meaning SAS and SATA, somehow.
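The per-core queueing idea can be sketched in a few lines. This is a toy illustration of the scaling model only; real NVMe uses memory-mapped submission/completion queue pairs and doorbell registers, and the class and method names here are invented:

```python
# Toy model of NVMe-style per-core submission queues: each core owns
# its own queue, so no lock is shared between cores on the issue path.
from collections import deque

MAX_QUEUES = 64_000       # spec ceiling on I/O queues
MAX_QUEUE_DEPTH = 64_000  # spec ceiling on commands per queue

class SubmissionQueue:
    def __init__(self, core_id: int, depth: int = 1024):
        assert depth <= MAX_QUEUE_DEPTH
        self.core_id = core_id
        self.commands = deque(maxlen=depth)

    def submit(self, lba: int, opcode: str) -> None:
        # Only this core touches this queue: no cross-core locking.
        self.commands.append((opcode, lba))

# One queue per core, well under the spec limit.
queues = [SubmissionQueue(core) for core in range(8)]
queues[3].submit(lba=4096, opcode="read")
print(len(queues), len(queues[3].commands))  # 8 queues, 1 pending command
```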

A blog on the NVMe website discusses how the ideal is to have an SSD whose flash controller chip, a system-on-chip (SoC), includes the NVMe functionality.

What looks likely to happen is that, with comparatively broad support across the industry, SoC suppliers will deliver NVMe SoCs, O/S suppliers will deliver drivers for NVMe-compliant SSD devices, and then server, desktop and notebook suppliers will deliver systems with NVMe-connected flash storage, possibly in 2013.

What could go wrong with this rosy outlook?

Plenty; this is IT. There is, of course, a competing standards initiative called SCSI Express.

SCSI Express

SCSI Express uses the SCSI protocol to have SCSI targets and initiators talk to each other across a PCIe connection; very roughly, it’s NVMe with added SCSI. HP is a visible supporter, with a SCSI Express booth at its HP Discover event in Vienna and support at the event from Fusion-io.

Fusion said its “preview demonstration showcases ioMemory connected with a 2U HP ProLiant DL380 G7 server via SCSI Express … [It] uses the same ioMemory and VSL technology as the recently announced Fusion ioDrive2 products, demonstrating the possibility of extending Fusion’s Virtual Storage Layer (VSL) software capabilities to a new form factor to enable accelerated application performance and enterprise-class reliability.”

The SCSI Express standard “includes a SCSI Command set optimised for solid-state technologies … [and] delivers enterprise attributes and reliability with a Universal Drive Connector that offers utmost flexibility and device interoperability, including SAS, SATA and SCSI Express. The Universal Drive Connector also preserves legacy investments and enables support for emerging storage memory devices.”

An SNIA document states:

Currently ongoing in the T10 committee is the development of SCSI over PCIe (SOP), an effort to standardise the SCSI protocol across a PCIe physical interface. SOP will support two queuing interfaces – NVMe and PQI (PCIe Queuing Interface).

PQI is said to be fast and lightweight. There are proprietary SCSI-over-PCIe products available from PMC, LSI, Marvell and HP but SCSI Express is said to be, like PQI, open.

The support of the NVMe queuing interface suggests that SCSI Express and NVMe might be able to come together, which would be a good thing and would prevent the industry working on separate SSD PCIe-interfacing SoCs and operating system drivers.


There is no SCSI Express website, but HP Discover in Vienna last month revealed a fair amount about SCSI Express, which is described in a Nigel Poulton blog.

He says that a 2.5-inch SSD will slot into a 2.5-inch bay on the front of a server, for example, and that “[t]he [solid state] drive will mate with a specially designed, but industry standard, interface that will talk a specially designed, but again industry standard, protocol (the protocol enhances the SCSI command set for SSD) with standard drivers that will ship with future versions of major Operating Systems like Windows, Linux and ESXi”.

HP SCSI Express card from HP Discover at Vienna

Fusion-io 2.5-inch, SCSI Express-supporting SSDs plug into the top two ports of the card pictured above. Poulton says these ports are SFF 8639 ones. The other six ports appear to be SAS ports.

A podcast on HP social media guy Calvin Zito’s blog has two HP staffers at Vienna talking about SCSI Express.

SCSI Express productisation

SCSI Express productisation, according to HP, should occur around the end of 2012. We are encouraged (listen to podcast above) to think of HP servers with flash DAS formed from SCSI Express-connected SSDs, but also storage arrays, such as HP’s P4000, being built from ProLiant servers with SCSI Express-connected SSDs inside them.

This seems odd, as the P4000 is an iSCSI shared SAN array: why would you want to get data at PCIe speeds from the SSDs inside it to its x86 controller/server, only to ship it across a slower iSCSI link to the other servers running the apps that need the data?
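The mismatch is easy to put numbers on. Using the article's round figures, and ignoring protocol overhead, a gen 3 x4 PCIe link outruns a GigE iSCSI front end by more than an order of magnitude:

```python
# Why front-ending PCIe flash with iSCSI throttles it: compare raw
# link bandwidths in GB/s (decimal units, protocol overhead ignored).
pcie_gen3_x4 = 4 * 1.0  # 4 GB/s, per the article's per-lane figure
gige_iscsi = 1 / 8      # 1 Gbit/s -> 0.125 GB/s
ten_gige = 10 / 8       # 10 Gbit/s -> 1.25 GB/s

print(f"GigE is {pcie_gen3_x4 / gige_iscsi:.0f}x slower than the PCIe link")
print(f"10GbE is {pcie_gen3_x4 / ten_gige:.1f}x slower")
```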

It only makes sense to me if the P4000 is running the apps needing the data as well; that is, if the P4000 and the app-running servers are collapsed or converged into a single (servers + P4000) system. Imagine HP’s P10000 (3PAR) and X9000 (Ibrix) arrays doing the same thing: its Converged Infrastructure ideas seem quite exciting in terms of getting apps to run faster. Of course this imagining could be just us blowing smoke up our own ass.

El Reg’s takeaway from all this is that NVMe is almost a certainty because of the weight and breadth of its backing across the industry. We think it highly likely that HP will productise SCSI Express, with support from Fusion-io and that, unless there is a SCSI Express/NVMe convergence effort, we’re quite likely to face a brief period of interface wars before one or the other becomes dominant.

Concerning SCSI Express and NVMe differences, EMC engineer Amnon Izhar said: “On the physical layer both will be the same. NVMe and [SCSI Express] will be different transport/driver implementations,” implying that convergence could well happen, given sufficient will.

Our gut feeling is that PCIe interface convergence is unlikely, as HP is quite capable of going its own way; witness the FATA disks of recent years and also its individual and admirably obdurate flag-waving over Itanium. ®

Nimbus Data Reveals S-Class at Flash Memory Summit

August 26, 2010

Next-generation Flash Storage System

   •  Enterprise Flash Storage System
   •  Up to 1.65 M IOps and 72 Gbps throughput
   •  Up to 6,000 IOps per watt and 675,000 IOps per floor tile
   •  Modular design from 24 to 600 redundant flash blades
   •  From 2.5 TB – 250 TB of solid state storage capacity
   •  4 – 12 auto-negotiating 10 GbE / GbE network ports
   •  Features Nimbus HALO storage operating system
   •  Full solutions starting under $25,000 (USA list price)

Xiotech Debuts ISE NAS

April 14, 2010

Today Xiotech launched a turn-key, scale-out, network-attached storage (NAS) solution pairing our patented Intelligent Storage Element (ISE™) blades with Symantec’s (Nasdaq: SYMC) FileStore product, a file serving platform that delivers industry-leading performance and scale to enterprise storage environments. The joint solution allows cloud service providers, high-performance computing environments and data storage users to achieve cost effective and comprehensive file-based data management and protection with maximum performance and scalability.


Headed to SNW? Critical meetings you MUST attend

April 7, 2010

Monday, April 12, 2010

9:20 am – 10:05 am:  Executive Overview and Current Topics on Solid State Storage

Rob Peglar, Senior Fellow, Xiotech

This tutorial provides introductory material and discussion of solid state storage. A comprehensive overview of the technology, from components to devices to systems, is provided along with an overview of several current topics surrounding the integration, deployment, use and application of solid state storage. The material is intended for those who are not familiar with solid state storage in the enterprise and wish to develop a working understanding of the technology and its usage.

11:10 am – 11:55 am:  A New Storage Model for the Cloud

Richard Lary, Corporate Fellow, Xiotech Corporation

Despite an explosion of data being stored, the technologies to manage all that data have changed little since the advent of the Storage Area Network (SAN) a decade ago or even since the advent of RAID well before that. Cloud computing promises to make it easier to get more value out of all that data, but only if we adopt a very different storage architecture. Today’s SANs are complex. They require significant attention. And, they are not elastic enough for the decapitalized business model of cloud computing. This presentation will discuss why it’s time for the industry to rethink storage by deconstructing the storage array. By adapting a more flexible, element-based approach to storage, organizations eliminate large periodic capital expenses, benefit from self-management of basic storage functions, and gain the flexibility to use the storage management functionality their unique needs require.

Storage Virtualization I & II – Implementing and Managing Storage Virtualization

1:00 pm – 1:45 pm and 1:55 pm – 2:40 pm

Rob Peglar, Senior Fellow, Xiotech

Storage Virtualization is one of the buzzwords in the industry, especially with the near-ubiquitous deployment of storage networks. But besides all the hype, there is a lot of confusion, too. Companies use the term virtualization and its characteristics in various and different forms. This tutorial describes the reasons for and benefits of virtualization in a technical and neutral way. The audience will understand the various terms and will receive a clear picture of the different virtualization approaches. Links to the SNIA Shared Storage Model and the usage of the new SNIA Storage Virtualization Taxonomy will help to achieve this goal. This tutorial is intended for IT Managers, Storage and System Administrators who have responsibilities for IT infrastructures and storage management tasks.

Learning Objectives:
  • Understand the definition of storage virtualization and its taxonomy
  • Learn about the three categories/methods of storage virtualization and their architectures
  • Understand which storage virtualization techniques apply to various new and existing infrastructures and the potential benefits to storage management

Wednesday, April 14, 2010

4:05 pm – 4:30 pm Executive Interview — Exploding Data and Imploding Budgets: Can Technology Really Shrink this Elephant in the Room?

John Gallant, Senior Vice President and Chief Content Officer, IDG Enterprise
Jim McDonald, Chief Strategy Officer, Xiotech
Garry Olah, Vice President, Business Development, Citrix Systems

Consider Element-based Storage to Support Application-centric Strategies

March 29, 2010

What is Element-based storage?

Element-based storage is a new concept in data storage that packages caching controllers, self-healing packs of disk drives, intelligent power/cooling, and non-volatile protection into a single unit to create a building-block foundation for scaling storage capacity and performance. By encapsulating key technology elements into a functional ‘storage blade’, storage capability – both performance and capacity – can scale linearly with application needs. This building-block approach removes the complexity of frame-based SAN management and works in concert with application-specific function that resides in host servers (OSes, hypervisors and applications themselves).

How are Storage Elements Managed?

Storage elements are managed by interfacing with applications running on host servers (on top of either OSes or hypervisors) and working in conjunction with application function, via either direct application control or Web Services/REST communication. For example, when running a virtual desktop environment with VMware or Citrix, a highly available database environment with Oracle’s ASM, or database-level replication and recovery with Microsoft SQL Server 2008, the host OSes, hypervisors, and applications control their own storage through embedded volume management and data movement. The application can communicate directly with the storage element via REST, which is the open-standard technique called out in the SNIA Cloud Data Management Interface (CDMI) specification. CDMI forms the basis for cloud storage provisioning and cloud data movement/access going forward.
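As a rough illustration of that REST path, a CDMI-style container creation is a single HTTP PUT with a CDMI content type and version header. The endpoint, container name, and sizes below are hypothetical, not Xiotech's actual interface:

```python
# Hypothetical CDMI-style container-provisioning request, built with
# only the standard library. Endpoint and payload are illustrative.
import json
import urllib.request

def build_create_container_request(base_url: str, name: str,
                                   size_bytes: int) -> urllib.request.Request:
    """Build (but do not send) a CDMI 1.0 container PUT request."""
    payload = json.dumps({"metadata": {"cdmi_size": str(size_bytes)}}).encode()
    return urllib.request.Request(
        url=f"{base_url}/cdmi/{name}/",
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/cdmi-container",
            "X-CDMI-Specification-Version": "1.0",
        },
    )

req = build_create_container_request(
    "http://storage-element.example", "vmware-datastore", 500 * 2**30)
# urllib.request.urlopen(req) would issue the PUT to the storage element.
```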

The main benefits of the element-based approach are:

  • Significantly better performance – more transactions per unit time, faster database updates, more simultaneous virtual servers or desktops per physical server.
  • Significantly improved reliability – self-healing, intelligent elements.
  • Simplified infrastructure – use storage blades like DAS.
  • Lower costs – significantly reduced opex, especially maintenance and service.
  • Reduced business risk – avoiding storage vendor lock-in by using heterogeneous application/hypervisor/OS functions instead of array-specific functions.

Action Item: Organizations are looking to simplify infrastructure, and an application-centric strategy is one approach that has merit. Practitioners should consider introducing storage elements as a means to support application-oriented storage strategies and re-architecting infrastructure for the next decade.

Rob Peglar is VP of Technology at Xiotech and a Xiotech Senior Fellow.  A 32-year industry veteran and published author, he leads the shaping of strategic vision, emerging technologies, defining future offering portfolios including business and technology requirements, product planning and industry/customer liaison. He is the Treasurer of the SNIA, serves as Chair of the SNIA Tutorials, as a Board member of the Green Storage Initiative and the Solid State Storage Initiative, and as Secretary/Treasurer of the Blade Systems Alliance.  He has extensive experience in storage virtualization, the architecture of large heterogeneous SANs, replication and archiving strategy, disaster avoidance and compliance, information risk management, distributed cluster storage architectures and is a sought-after speaker and panelist at leading storage and networking-related seminars and conferences worldwide.  He was one of 30 senior executives worldwide selected for the Network Products 2008 MVP Award.    Prior to joining Xiotech in August 2000, Mr. Peglar held key technology specialist and engineering management positions over a ten-year period at StorageTek and at their networking subsidiary, Network Systems Corporation. Prior to StorageTek, he held engineering development and product management positions at Control Data Corporation and its supercomputer division, ETA Systems.     Mr. Peglar holds the B.S. degree in Computer Science from Washington University, St. Louis Missouri, and performed graduate work at Washington University’s Sever Institute of Engineering.  His research background includes I/O performance analysis, queuing theory, parallel systems architecture and OS design, storage networking protocols, clustering algorithms and virtual systems optimization.

repost from WIKIBON: 

Data Storage TCO Analysis

December 26, 2009

There are many ways to parse the cost of the next Terabyte of storage for your enterprise — here is one way to start understanding storage TCO — you can use this simple calculator as the basis for other financial metrics like ROI and IRR.  Here are the high-level data points:

  • HARDWARE ACQUISITION: Hardware purchase costs are often overemphasized as they are the easiest to identify and calculate.
  • MAINTENANCE: Hardware and software maintenance costs are an integral part of TCO. Manufacturers may use maintenance costs to influence future buying behavior through premature obsolescence.
  • ADMINISTRATION & OPERATIONS: Storage administration and business operations are important considerations for overworked staff and underfunded IT budgets. How a system is administered and its impact to the organization, whether positive or negative, last the life of the system.
  • RESOURCE USAGE: Power, cooling, and floor space are increasing in importance for organizations, especially those located on either coast or with limited data center space.
  • DOWNTIME EVENTS: The cost of downtime is often glossed over until it occurs and is visible to the board of directors or makes headline news.
  • Xiotech Sample TCO Analysis -- Detail Page
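Those five buckets drop straight into a simple calculator. All the dollar figures below are placeholders to show the structure, not numbers from the Wikibon analysis:

```python
# Minimal storage TCO model over a system's life: acquisition plus
# recurring costs plus downtime events. Inputs are illustrative only.

def storage_tco(hardware, annual_maintenance, annual_admin,
                annual_power_cooling_space, downtime_events,
                cost_per_downtime_event, years=5):
    recurring = (annual_maintenance + annual_admin +
                 annual_power_cooling_space) * years
    downtime = downtime_events * cost_per_downtime_event
    return hardware + recurring + downtime

total = storage_tco(
    hardware=100_000,                   # acquisition (often overemphasized)
    annual_maintenance=12_000,          # hardware + software maintenance
    annual_admin=20_000,                # administration & operations
    annual_power_cooling_space=5_000,   # resource usage
    downtime_events=1,
    cost_per_downtime_event=50_000,
    years=5,
)
print(total)  # 100000 + 37000*5 + 50000 = 335000
```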


Self Healing Storage Explained (again)

December 25, 2009

(Repost from Tech Target)

Xiotech Corp. Emprise 5000 and 7000 systems

Xiotech’s Emprise systems are put together using a building block it calls the Intelligent Storage Element (ISE). ISE is based on Advanced Storage Architecture technology which Xiotech acquired from disk drive maker Seagate Technologies Inc. in November 2007. According to Xiotech, an ISE reduces the two greatest causes of drive failure — heat and vibration — to provide more than 100 times the reliability of a regular disk drive enclosed in a typical storage system drive bay.

Xiotech’s product can power-cycle disks, perform diagnostics and error correction on bad drive sectors, and write ‘around’ them if necessary. Xiotech also claims its product will incur zero service events in five years of operation, and guarantees this under warranty.

Xiotech’s system comes in three models. The dual-controller Emprise 7000 SAN system supports up to 64 ISEs and includes the same management features as the Xiotech Magnitude 3D 4000 platform, including intelligent provisioning and a replication suite. Like the Magnitude 3D, the Emprise 7000 supports Fibre Channel or iSCSI. It scales to 1 PB.

The single-controller Emprise 7000 Edge is positioned for branch offices and the midmarket. It supports up to 10 ISEs for a total maximum capacity of 160 TB. The Emprise 5000 is a DAS system that consists of one ISE. It supports Fibre Channel only. Both the 7000 Edge and the 5000 can be upgraded to a Model 7000.

What users are saying about self-healing storage

Since Xiotech introduced ISE at Storage Networking World in the spring of 2008, the new technology has rapidly come to account for more than 80 percent of the company’s revenue. Scott Ladewig, manager of networking and operations for Washington University in St. Louis, traded in an older Xiotech model, the Magnitude 3000, for an Emprise 7000 last summer.

“In the past, it’s not like we’ve spent hundreds of man-hours on drives, our whole SAN is 24 TB or so,” he said. “But if a drive failed, we’d have to spend a Saturday night watching it rebuild, and drives are growing larger and larger, and taking longer periods of time to rebuild, when they’re vulnerable to a double disk failure.”

The five-year warranty included in the cost of the Xiotech ISE system proved irresistible to Rick Young, network systems manager at Texas A&M College of Veterinary Medicine, who also replaced a Magnitude 3D 3000 with an Emprise 5000. “Right now we spend around $11,000 a year to maintain disk trays on the 3D,” he said. “Multiply that by five years on the Emprise system, and it’s no small amount of savings. We can put what we would’ve spent on maintenance towards our next refresh.”

For Richard Alcala, chief engineer of New Hat LLC, a post-production firm in Santa Monica, Calif., more typical scale-out storage products with clustered file systems proved too cumbersome to manage in a performance-intensive environment. “The highest priority for us is the number of real-time streams” the system can feed to artists working on videos. With the older system, “we spent a lot of time doing maintenance, trying to heal the system and recover data,” he said. “Once every three months we’d spend about four hours running diagnostics.” Alcala replaced that system with Data Direct Networks’ S2A 9900.

Xiotech has been “going gangbusters” in the enterprise with ISE, according to Data Mobility Group’s Harris, but generally, Harris said he thinks the most advanced self-healing storage products will get the most traction in specialized vertical markets like media and entertainment and high-performance computing (HPC). “That said, how hard is it to power-cycle a hard drive?” he said. “That ought to be SOP for every disk array out there.”

Xiotech ISE: Is this the product we’ve all been waiting for?

December 23, 2009

The Xiotech blog was on fire last week with updates from the new SVP Marketing Brian Reagan as well as a lengthy discussion of SSD as a component of ISE (Intelligent Storage Element) by Rob Peglar.  More here:  XIOTECH BLOG

Why PCIe-based SSDs Are Important

November 20, 2009

There’s an old expression I like: “Different isn’t better, it’s just different.”

When it comes to SSDs based around a SATA or SAS format, that’s pretty much the case in my view. Yes, there are exceptional products suited for the enterprise, like Pliant and STEC. And, yes, there are more conventional items for consumers, like Intel and OCZ (and about 20 others). And yes, the standard-package 3.5″ form factor of these devices makes them suitable for shared storage as well as for integration into heterogeneous and homogeneous storage environments like you might find in a typical data center. Embracing these SSDs you will find the usual manufacturers like EMC, NetApp, SUN, and others. Their use of SSD is evolutionary, easy to digest.

PCIe-based SSDs are very different. For one thing, they sit on the server system bus right next to the CPU. This is a direct-attached storage (DAS) model that has numerous advantages for certain types of processing. We agree that not all PCIe-based SSDs are suitable for all applications — but in terms of applications that can take advantage of bandwidth, throughput, and latency enhancements, these devices are indeed a superior architecture.

There are some challenges:

1)  Not all servers are created equal.  PCIe-based devices require strict adherence to the PCIe specifications at the server level.  Ping me if you want to learn more about why this is critical.

2)  Many servers do not have enough PCIe slots configured appropriately for PCIe devices.  This is especially true when creating HIGH AVAILABILITY (or HA) environments.

3)  Only a very few servers have enough of the right type of slots to be meaningful from a value perspective.  It makes no sense to refresh a server for a PCIe-based SSD if you have to spend 2x or 3x to get the right slots, power, etc.

4)  Applications may not be optimized for SSD DAS.  No kidding.  OLTP or DBMS applications that can take the most advantage of SSD DAS are today optimized for high-latency disk access over networks such as NAS.  These applications are totally comfortable sending out 1000s or 10s of 1000s of transaction requests to build up a queue depth for the CPUs.  The net result of this is that the CPUs appear very busy but in fact aren’t doing very much.  These limitations are known and well defined.  Over time, application vendors such as SUN, Oracle, and Microsoft will implement fixes to optimize for PCIe-based storage.
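The queue-depth point is just Little's Law: achieved IOPS equals outstanding requests divided by per-request latency. An app tuned for ~10ms networked disk needs a queue depth in the thousands to hit high IOPS; on ~100-microsecond local flash the same rate needs a queue depth of about 10 (illustrative numbers):

```python
# Little's Law applied to storage: throughput = concurrency / latency.
def achieved_iops(queue_depth: int, latency_s: float) -> float:
    return queue_depth / latency_s

# Networked disk at 10 ms needs deep queues for throughput.
print(round(achieved_iops(1000, 0.010)))   # ~100,000 IOPS at QD=1000
# Local PCIe flash at ~100 us reaches the same rate at QD=10.
print(round(achieved_iops(10, 0.000100)))  # ~100,000 IOPS at QD=10
```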

Aside from these items, there is a discussion regarding the suitability of NAND flash devices in the data center, as well as the MLC/SLC issue.  I’ll tackle those in another post.  In my view, MySpace is leading the way — and there are many others who have not come forward publicly, preferring to keep the ROI and GREEN advantages all to themselves.

The latest announcements from Fusion-io, Texas Memory Systems, Micron and others point out these differences.  FULL DISCLOSURE:  I am a former employee of Fusion-io.