The Pendulum Swings Hard Towards Metal
Kubernetes and new storage engines, like Mayastor, can take advantage of modern metal
Although the computer industry is constantly changing, in storage, the changes often occur very slowly. This makes sense — the value proposition of storage is “store my data and give it back when I ask.” If the velocity of change is high, then “give it back” becomes risky. Generally, any failure to “give it back” can be catastrophic, hence the conservative approaches to technology change.
The pendulum in this technology change is external vs internal storage. This pendulum has swung hard towards internal storage when paired with cloud-native technologies. Here we’ll explore the fastest storage server on the planet and the software that makes it hum.
In the bad old days, when disks were small, techniques such as RAID were invented to create larger, virtual disks. For example, if an application needed 1TB but the disk technology only provided 200GB disks, a simple RAID-0 stripe or concatenation let 5 disks act as a single 1TB virtual disk. Unfortunately, that 1TB virtual disk was only 1/5 as reliable as a single 200GB disk, so in order to “give it back,” redundancy such as RAID-5, RAID-6, erasure coding, and mirroring was added. Also, 5 disks didn’t physically fit into many servers. So the pendulum swung towards the external storage market, where massive RAID arrays could have dozens or hundreds of disks configured as virtual disks.
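The striping and reliability trade-off above can be sketched in a few lines. This is an illustrative model, not a real RAID driver: `raid0_map` shows the round-robin block placement, and `stripe_survival` shows why a 5-disk stripe is roughly 1/5 as reliable, assuming independent per-disk failure probabilities.

```rust
// Sketch: RAID-0 striping across N disks (illustrative, not a real driver).
// A logical block address (LBA) maps round-robin to (disk index, disk LBA).
fn raid0_map(lba: u64, num_disks: u64) -> (u64, u64) {
    (lba % num_disks, lba / num_disks)
}

// With no redundancy, the stripe fails if ANY member fails.
// Assumes independent failure probability p per disk (a simplification).
fn stripe_survival(p_disk_fail: f64, num_disks: u32) -> f64 {
    (1.0 - p_disk_fail).powi(num_disks as i32)
}

fn main() {
    // Five 200GB disks as 1TB: logical block 7 lands on disk 2, offset 1.
    assert_eq!(raid0_map(7, 5), (2, 1));
    // Five disks at a hypothetical 2% annual failure rate each:
    // the stripe survives the year with only ~90% probability.
    println!("stripe survival: {:.3}", stripe_survival(0.02, 5));
}
```

This is exactly the math that pushed the industry toward RAID-5/6 and mirroring: the stripe multiplies capacity but also multiplies exposure to any single failure.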
To access the virtual disks, servers often used some sort of efficient block protocol, with serial SCSI becoming dominant over fabrics built from Fibre Channel and Ethernet. On low-end systems, SATA was a natural evolution up from the floppy disk, but never really made sense in distributed systems. These protocols and fabrics evolved together, with the fabric being consistently faster than the disks.
Meanwhile, the features of the virtual disk grew in complexity and number. Array-based replication, snapshots, and clones gained market acceptance and significantly improved the resiliency of “give it back” options. The RAID array evolved from being a custom hardware design to a clustered, general-purpose computing design. The magic features, decoupled from custom hardware, became a software story. The software-defined storage (SDS) industry was born.
A funny thing happened on the way to the cloud — the economic justification for a RAID array collided with the fundamental nature of distributed computing. It is simply not cost effective to have a RAID array allocated to every server, rack of servers, or row of servers. Even if the cost were somehow solved, the networking constraints are difficult to solve for hundreds or thousands of servers. As a resolution, cloud engineers developed distributed storage solutions with many servers contributing in parallel, rather than a small number of RAID arrays.
Resiliency is implemented by replication (aka mirroring) to another availability zone, along with perhaps an erasure-coded (think RAID-5/6) store in each zone. Call it infinite scalability, build a marketing campaign around it, and everyone was impressed. To really make everyone happy, cost models were developed and customers were trained to believe that high performance and virtual disk size are inextricably related, and performance was throttled “for your own good and the good of your neighbors (please ignore our price-gouging model).”
But that didn’t really cause the pendulum to swing away from RAID arrays. The storage services were still software-defined, and who cared whether they ran on a pair of controllers or were spread across a dozen? In each case, the features were similar and the bottleneck was ultimately the fabric or network connecting them all.
Oops, I called S3 a storage array. Well, in many respects it can be considered a service provided by a number of storage arrays, and nobody really cares how many as long as “give it back” works. The big innovation is not architectural, it is ease of consumption and pricing — and that’s ok.
To swing the pendulum back toward internal storage requires core technological changes in the hardware and an easier way to access the storage for software.
Remember the 1TB solution that required 5 disks that could barely fit in a single server, leading to the evolution of the RAID array? Well, it is now in the phone in your pocket, plus 50% more. Around 10 years ago we were touting the first “peta-rack” RAID array solutions with 1PB in a single rack. Today, we’re talking about PB per rack unit, not PB per rack. Also, SSD sizes are increasing at a rate of 2x every 2 years, so look for a 4TB phone around the time you want to upgrade your iPhone 12.
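The 2x-every-2-years growth rate compounds quickly. A quick sketch, using the article's rate and a hypothetical 1TB starting point, shows where the projection comes from:

```rust
// Projected SSD capacity assuming the article's rate of 2x every 2 years.
// The starting point (1 TB today) is an illustrative assumption.
fn projected_tb(today_tb: f64, years: u32) -> f64 {
    today_tb * 2f64.powf(years as f64 / 2.0)
}

fn main() {
    // Two doublings in 4 years: 1 TB -> 4 TB.
    println!("4 years out: {} TB", projected_tb(1.0, 4));
    // Three doublings in 6 years: 1 TB -> 8 TB.
    println!("6 years out: {} TB", projected_tb(1.0, 6));
}
```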
On the fabric side, Ethernet won. Get over it. The most important Ethernet fabrics for the next few years are 100GbE. But it is still slower than storage, by 6x or so. It takes about 2 SSDs to fill a 100GbE pipe.
Pendulum swings hard
The pendulum has now swung hard towards the metal. To take advantage of the size and performance of the storage, the workload moves to the same CPU directly connected to the storage. Yes, it is direct-attached storage for the win. But before you worry about the storage features you know and love, there is good news — those are delivered via software. More on that later; for now, let’s get a perspective on server specs.
24 NVMe U.2 slots operating at PCIe generation 4 speeds (twice as fast as PCIe generation 3, the most common NVMe speed)
2 servers, each with an AMD EPYC 7002 series CPU, for a total of 256 PCIe generation 4 lanes, and up to 256 processors (as defined by what the OS reports as a processor)
bandwidth: measured > 125 GB/sec @64k sequential (around 11x faster than 100 GbE)
IOPS: measured > 23M IOPS @4k random (around 10x faster than 100 GbE)
2 rack units (RU) in size, or 360 TB/RU with 30 TB SSDs
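The ratios in the spec list are easy to sanity-check. A rough arithmetic sketch, assuming a 100GbE line rate of 12.5 GB/sec and ignoring framing overhead (which is why the article's ~11x is slightly higher than the raw 10x here):

```rust
// Back-of-envelope check on the spec-list ratios (illustrative arithmetic).
fn wire_bandwidth_gbs(gbe: f64) -> f64 {
    gbe / 8.0 // gigabits/sec -> gigabytes/sec, ignoring framing overhead
}

fn main() {
    let pipe = wire_bandwidth_gbs(100.0); // 100 GbE line rate = 12.5 GB/sec
    assert_eq!(pipe, 12.5);
    // Measured server bandwidth: >125 GB/sec sequential.
    println!("bandwidth ratio: {:.0}x one 100GbE pipe", 125.0 / pipe);
    // 4 KiB random IOPS the pipe can carry, vs 23M measured on the metal:
    println!("wire limit: ~{:.1}M 4k IOPS", pipe * 1e9 / 4096.0 / 1e6);
}
```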
This is a lightning fast server with a low parts count, delivering good resiliency and low cost. I compared it to an equivalent system at a major cloud vendor, where the price was $140k/year. Really? I’ll have a half dozen, thanks… not!
About those storage features…
The story does not end with the metal. For the pendulum swing to be complete, the features and benefits of the storage software stack are important. This server has exposed the storage software stack as the bottleneck.
To understand the magnitude of this problem, recall that most of the RAID software stacks were developed in the days of HDDs, really slow HDDs. To be fair, the CPUs were also slow back then, but Moore’s law took care of that. To create many of the really cool features for storage, such as compression, RAID, deduplication, snapshots, and clones, the CPU had plenty of time to compute while waiting for the HDD to rotate. Also, HDDs and some early generation SSDs have a single-queue model of storage. With NVMe, multicore processors, and many PCIe lanes all operating in parallel to really fast (and also parallel) SSDs, the budget of time for waiting on the HDD to rotate or the SCSI/SATA queue to process is gone. The immediate impact is that managing the data in the storage software is slower than actually getting the data to the medium.
Nowhere is the speed of the medium more important than in the design of caches. To be truly effective as a cache, the size of the cache must accommodate the working set size and the latency to cache should be an order of magnitude faster than the medium. In the early days of RAID arrays, it was effective to have RAM caches on the order of 1GB with latency of 4-10μs vs 10-20ms for HDDs. This left plenty of CPU time to design fancy caching algorithms to deal with multiple block sizes and implement various clever prefetch policies to increase the effective working set size. As we all know, software gets bigger over time and all of these fancy algorithms chew up CPU cycles and expose memory access issues (read: NUMA) in the systems architecture. Today low-latency SSDs are on the order of 10μs and the overhead costs of cache management can easily approach several μs. There just isn’t enough time to get very clever with cache management when the storage is very fast. Besides, many applications cache their important data anyway, so this feature of external storage is becoming less interesting.
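The standard average-access-time formula makes the point concrete. A minimal sketch, with illustrative numbers (5μs RAM cache, 2μs of cache-management overhead, and the HDD/SSD latencies from the paragraph above):

```rust
// Sketch: average access latency in front of a cache (times in microseconds).
// effective = hit_rate * (cache + mgmt) + (1 - hit_rate) * (medium + mgmt)
// All specific numbers below are illustrative assumptions.
fn effective_latency_us(hit_rate: f64, cache_us: f64, medium_us: f64, mgmt_us: f64) -> f64 {
    hit_rate * (cache_us + mgmt_us) + (1.0 - hit_rate) * (medium_us + mgmt_us)
}

fn main() {
    // HDD era: a 5us RAM cache in front of a 15ms disk. A 90% hit rate
    // takes the average from 15,000us down to ~1,500us -- transformative.
    let hdd = effective_latency_us(0.9, 5.0, 15_000.0, 2.0);
    // NVMe era: the same cache in front of a 10us SSD, with a few us of
    // cache-management overhead, barely moves the needle (and can hurt).
    let ssd = effective_latency_us(0.9, 5.0, 10.0, 2.0);
    println!("HDD: {:.0}us effective -> SSD: {:.1}us effective (raw SSD: 10us)", hdd, ssd);
}
```

With an HDD behind it, the cache wins by an order of magnitude; with a 10μs SSD behind it, the management overhead eats most of the benefit — which is the paragraph's argument in numbers.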
Redundancy and resiliency are another area where fancy storage software is stressed. Calculated parity and checksums are very common features of modern storage stacks. Traditionally, these costs could also be hidden behind the slow HDDs. Now they are exposed and are relatively more expensive in the latency budget. For speed, mirroring is best. For space, some form of N+M parity (e.g., RAID-5 or erasure coding) is best, but slower.
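The space side of that trade-off is simple arithmetic. A sketch of the capacity efficiency of each scheme (the specific N+M layouts are illustrative examples):

```rust
// Sketch: usable-capacity fraction of common redundancy schemes.
// Mirroring keeps R full copies; N+M parity stores N data + M parity chunks.
fn mirror_efficiency(replicas: u32) -> f64 {
    1.0 / replicas as f64
}

fn parity_efficiency(n_data: u32, m_parity: u32) -> f64 {
    n_data as f64 / (n_data + m_parity) as f64
}

fn main() {
    assert_eq!(mirror_efficiency(2), 0.5);    // 2-way mirror: 50% usable, fastest
    assert_eq!(parity_efficiency(4, 1), 0.8); // RAID-5-style 4+1: 80% usable
    assert_eq!(parity_efficiency(8, 2), 0.8); // RAID-6-style 8+2: 80% usable,
                                              // survives two failures, but every
                                              // write pays the parity computation
}
```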
Big challenges in the software stack go far beyond these architectural trade-offs. The good news: advances in software development are also applicable. For example, at data rates > 1M IOPS the effects of locking (mutex, reader-writer, et al.) begin to significantly impact data flows. To have a chance of approaching 10M IOPS, the storage software stack must be lockless. Above 2M IOPS, memory access in NUMA machines also rears its ugly head. To mitigate this, global variables need to be changed to per-CPU variables.
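The per-CPU-variable idea can be sketched with sharded counters: instead of every core contending on one shared counter (and bouncing its cache line between NUMA nodes), each thread increments its own shard and the shards are summed only on read. A minimal sketch using scoped threads; a real per-CPU layout would also pad each shard to a cache line to avoid false sharing.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Spawn one thread per shard; each increments only its own counter, so
// cores never contend on a shared hot cache line. (Illustrative: real
// per-CPU variables are also cache-line padded and pinned to cores.)
fn sharded_count(shards: usize, per_thread: u64) -> u64 {
    let counters: Vec<AtomicU64> = (0..shards).map(|_| AtomicU64::new(0)).collect();
    thread::scope(|s| {
        for c in &counters {
            s.spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed ordering suffices for a statistics counter.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // Reads are rare (stats queries), so summing the shards here is cheap.
    counters.iter().map(|c| c.load(Ordering::Relaxed)).sum()
}

fn main() {
    let total = sharded_count(8, 100_000);
    assert_eq!(total, 800_000);
    println!("counted {total} I/Os with no lock and no shared hot line");
}
```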
Retrofitting an old storage software stack to address these issues can be a substantial undertaking (perhaps not worth the effort). A far better approach: take a fresh look at the storage software problem and build a new stack with modern tools, and an eye towards deploying on high speed, very parallel systems.
I’ve been working with MayaData on the open-source Mayastor storage engine. All of us recognize the issues of trying to run legacy storage software stacks on modern hardware. To tackle these issues, the Mayastor project team made some very wise decisions:
Container scheduling using the Kubernetes container orchestrator and leveraging the large set of readily available cloud-native tools
Using the Rust language and toolset for developing lockless, highly parallel, and memory-safe storage features
Poll-mode drivers reduce latency to NVMe disks and NICs
Poll-mode drivers use processors inefficiently: trading CPU cycles for better storage performance. We have 256 processors at our disposal, so allocating a few dozen to SSDs and a few dozen to NICs still leaves a few dozen for applications
Very few kernel-level dependencies, freeing Mayastor to evolve faster than the kernel
For redundancy, replication across nodes is implemented using NVMe over Fabrics (NVMe-oF), which matches very well with both NVMe drives and modern 100GbE NICs
Fully automated CI/CD pipeline and agile development process, ensuring code changes are broadly tested and converge to quality fast
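The poll-mode idea in the list above is worth a sketch. Instead of sleeping until an interrupt announces a completion, a dedicated core repeatedly checks the completion queue itself — trading CPU cycles for latency, as noted above. This is a mock with an in-memory queue, not real driver code (Mayastor's actual poll-mode machinery comes from SPDK, which polls NVMe completion queues in hardware memory):

```rust
use std::collections::VecDeque;

// Mock stand-in for an NVMe completion queue (illustrative only).
struct CompletionQueue {
    pending: VecDeque<u64>, // completed I/O ids
}

impl CompletionQueue {
    // Non-blocking check: returns immediately whether or not work arrived.
    // This is the heart of poll mode -- no interrupt, no sleep, no wakeup cost.
    fn poll(&mut self) -> Option<u64> {
        self.pending.pop_front()
    }
}

// The poller loop: check, process, repeat.
fn drain(cq: &mut CompletionQueue) -> Vec<u64> {
    let mut done = Vec::new();
    while let Some(id) = cq.poll() {
        done.push(id); // a real poller would complete the I/O callback here
    }
    done // a dedicated core would spin forever; we stop for the demo
}

fn main() {
    let mut cq = CompletionQueue { pending: VecDeque::from([101, 102]) };
    assert_eq!(drain(&mut cq), vec![101, 102]);
}
```

The loop burns a core even when idle — which is exactly the trade the next list item defends: with 256 processors, dedicating a few dozen to polling SSDs and NICs is a bargain for the latency it removes.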
These choices allow Mayastor to reach maturity quickly and reliably, especially considering the huge efforts and time required to develop storage software stacks in the past. Deployment and testing can be accomplished on modern infrastructure, including cloud infrastructure.
Will the pendulum swing back?
It is unclear when the pendulum will swing back towards external storage. The hardware technology has clearly shifted towards internal storage. With software-defined and cloud-native storage stacks being designed to match the capabilities of the modern hardware, the need for external storage is dramatically reduced. To some degree, this is also the story being told by the computational storage vendors: CPU cores are cheap, move them to the storage. Whether those cores run in a general-purpose, but very fast server, or a special-purpose SSD, is the only question. Anyone can write storage software in Rust and exploit containerization in either environment. But one thing is completely sure: best performance is available on the metal without sacrificing cloud-native workflows or applications.
Kudos to the team at Viking Enterprise Solutions for building very fine storage servers and software. A warm shout out to the team at MayaData working on container attached storage (CAS), Mayastor, cStor, and other open-source cloud-native projects.