CASL with Nimble Storage
November 25, 2013
I was fortunate enough to spend an hour with Dmitriy Sandler from Nimble Storage to see what all the fuss was about with their product, and more specifically their Cache Accelerated Sequential Layout (CASL) file system.
Hardware Overview
Let's cover some of the basics before we dive into CASL. The storage array comes fully loaded with all the bells and whistles out of the box. All the software features ship with this iSCSI array, including:
- App Aligned Block Sizes
- Inline Data Compression
- Replication
- Instant Snapshots
- Thin Provisioning
- Zero Copy Clones
- Non-Disruptive Upgrades
- Scale Out to Storage Cluster
And the list goes on and on. WAN replication, for instance, is very efficient thanks to the inline compression performed during writes.
Additional specs can be seen below.
I would like to point out that Nimble arrays use Active-Standby controllers, which seems like a bit of a waste of capacity but is AWESOME if you have a failure. It was pointed out to me that many Active-Active controller systems take a significant performance hit during a failure because the surviving controller is overloaded. This shouldn't be the case with a Nimble array.
File System
So the features are nothing to sneeze at in their own right, but what differentiates Nimble from other storage arrays? The Nimble philosophy is that hybrid storage is the right way to handle 95% of storage workloads. All-flash arrays are expensive, and SSD wear is a limitation. All-spinning-disk arrays typically just don't have the performance oomph that companies want these days. So what is the best way to use both SSDs and spinning disks?
Enter CASL
CASL stands for Cache Accelerated Sequential Layout, and the name says it all. The file system is specifically written to make the most of a hybrid design. Let's look at a typical write sequence first.
Writes are sent to the device in multiple block sizes depending on the application using it. Nimble arrays don't care what the block sizes are and will accept any block size thrown at them. However, each volume can be automatically tuned for the specific application's block size to optimize both performance and capacity efficiency. The blocks enter the PCIe NVRAM device on the array and are immediately copied to the Standby Controller across a 10Gb bus. Once both controllers have the write in NVRAM, the write is acknowledged, making for some very snappy response times and low latency for application writes.
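To make the ordering concrete, here is a minimal Python sketch of that acknowledgment flow, with the controllers and their NVRAM reduced to toy objects. All names here are hypothetical, not Nimble's actual internals:

```python
# Sketch of a mirrored-NVRAM write path: the write is acknowledged only
# after both controllers hold a copy. Purely illustrative.

class Controller:
    def __init__(self):
        self.nvram = []  # stand-in for the PCIe NVRAM buffer

    def stage(self, block):
        self.nvram.append(block)

def write(active: Controller, standby: Controller, block: bytes) -> str:
    active.stage(block)    # land the block in the active controller's NVRAM
    standby.stage(block)   # mirror it across the 10Gb bus to the standby
    return "ACK"           # only now is the write acknowledged to the host

active, standby = Controller(), Controller()
assert write(active, standby, b"app data") == "ACK"
```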
Now that the writes are in NVRAM, they are individually compressed in memory before ever being written to disk. The data is "serialized" into a 4.5MB stripe that is laid out evenly across the entire set of SAS disks. This write is quick because the data goes down sequentially, avoiding the seeks a random write pattern would require. What's really cool about this process is that the CASL algorithm looks at the origin of the writes and places related blocks next to each other on disk. Since data written together is likely to be read together, this helps read performance as well.
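Here is a rough sketch of the compress-then-serialize idea, using zlib as a stand-in for whatever compression CASL actually uses, and the 4.5MB stripe size mentioned above. Everything else is invented for illustration:

```python
# Sketch: blocks are individually compressed in memory, then packed into
# a fixed-size stripe that is written out as one sequential write.
import zlib

STRIPE_SIZE = int(4.5 * 1024 * 1024)  # the 4.5MB stripe described above

def build_stripe(blocks):
    stripe, used = [], 0
    for block in blocks:
        compressed = zlib.compress(block)     # inline compression per block
        if used + len(compressed) > STRIPE_SIZE:
            break                             # stripe is full; flush it
        stripe.append(compressed)
        used += len(compressed)
    return b"".join(stripe)                   # one sequential write to disk

stripe = build_stripe([b"x" * 4096 for _ in range(100)])
print(f"stripe holds {len(stripe)} bytes of compressed data")
```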
Pretty neat, huh? But wait a second, what about those SSDs in the system? We skipped them during this process. Well, during the write to the SAS disks, the CASL algorithm looks for "cache worthy" data, segments it into smaller stripes for the SSDs, and writes a copy to them as well. A graphic of this process is found below.
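Alongside that graphic, here is a small code sketch of the dual write. The cache-worthiness test below is a placeholder of my own; the real heuristic is Nimble's and isn't described in this post:

```python
# Sketch of the "cache worthy" idea: while a stripe goes to the SAS
# disks, blocks judged likely to be re-read also get a copy written to
# the SSD tier in smaller stripes.

def write_stripe(stripe_blocks, sas_tier, ssd_tier, is_cache_worthy):
    sas_tier.extend(stripe_blocks)            # full stripe to spinning disk
    worthy = [b for b in stripe_blocks if is_cache_worthy(b)]
    for i in range(0, len(worthy), 2):        # smaller stripes for the SSDs
        ssd_tier.append(worthy[i:i + 2])

sas, ssd = [], []
write_stripe([b"hot1", b"cold", b"hot2"], sas, ssd,
             is_cache_worthy=lambda b: b.startswith(b"hot"))
print(len(sas), ssd)  # 3 blocks on SAS; the hot blocks also cached on SSD
```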
Reads are served first from NVRAM, which is a nice add! Data that has just been written is often read right away, so being able to read from NVRAM is a fast way to handle those requests. NVRAM can't hold much data, however, so reads next fall back to the SSDs. On the first cache miss, data is copied up from the spinning disks to the SSDs, along with a prefetch of surrounding relevant blocks to accelerate subsequent application read requests.
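A toy version of that tiered read path, with each tier reduced to a plain dict (again, purely illustrative, not Nimble's implementation):

```python
# Sketch of the tiered read path: check NVRAM, then the SSD cache, and
# only on a miss go to spinning disk, promoting the block (plus nearby
# blocks) into the SSD cache for later reads.

def read(addr, nvram, ssd_cache, disk, prefetch=4):
    if addr in nvram:                # recently written data, fastest path
        return nvram[addr]
    if addr in ssd_cache:            # warm data already promoted to flash
        return ssd_cache[addr]
    # cache miss: fetch from disk and prefetch surrounding blocks
    for a in range(addr, addr + prefetch):
        if a in disk:
            ssd_cache[a] = disk[a]
    return disk[addr]

disk = {i: f"block-{i}" for i in range(10)}
ssd_cache, nvram = {}, {}
read(2, nvram, ssd_cache, disk)      # miss: promotes blocks 2..5 to SSD
assert 3 in ssd_cache                # subsequent nearby reads now hit cache
```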
Because CASL is built around SSDs, all writes to the SSDs are done as full pages, eliminating any write amplification. Since the SSDs are just a cache, there is no need to waste any of them on hot spares or RAID, offering much higher usable capacity and a lower $/GB. This also allows Nimble to use MLC drives, whereas most systems are still bound to higher-cost eMLC or SLC technology.
Garbage Collection
When I first heard about the CASL filesystem and how writes are done, I didn't think the writes to spinning disk were that much different from NetApp's Write Anywhere File Layout (WAFL), but digging into garbage collection, the difference becomes clearer.
WAFL opportunistically tries to dump NVRAM to disk into the available open blocks, similarly to CASL. The problem is that as blocks are modified in WAFL, the layout becomes fragmented like Swiss cheese. CASL has the same challenge, but during garbage collection those blocks are pulled back into NVRAM and rewritten sequentially, which keeps the system running nice and smooth.
Unlike WAFL, CASL is built as a fully log-structured filesystem (LFS). In other words, every time data is written down to disk, it's done in an optimally sized sequential stripe, offering great write performance (thousands of IOPS from 7.2K drives) as well as the ability to maintain that performance over time as the system fills up, by intelligently leveraging the low-priority but always-on garbage collection engine.
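For intuition, here is a minimal sketch of what "fully log structured" means, with the disk reduced to an array of stripe slots and a write head that only ever moves forward. This is my own simplification, not CASL's code:

```python
# Sketch of a log-structured layout: full stripes are appended one after
# another; nothing is overwritten in place, so writes never seek back.

class LogStructuredDisk:
    def __init__(self, capacity_stripes):
        self.stripes = [None] * capacity_stripes
        self.head = 0                      # next free stripe slot

    def append_stripe(self, data):
        self.stripes[self.head] = data     # one full sequential stripe
        self.head += 1                     # the head only advances
        return self.head - 1               # stripe id for a metadata index

disk = LogStructuredDisk(capacity_stripes=8)
first = disk.append_stripe(b"stripe 0 payload")
second = disk.append_stripe(b"stripe 1 payload")
assert (first, second) == (0, 1)
```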
So if CASL is just like WAFL with Garbage Collection, do you still have the performance cliff issues when you hit 80-85% full on aggregates?
Good question.
I'm sure a Nimble employee would love to jump on this question, but I believe Nimble arrays don't have the same issue when they get full, because garbage collection gets a higher priority as the array fills up. Also, when the layout gets fragmented, garbage collection lays the data back down in new full sequential stripes, which mitigates this "Swiss cheese" effect.
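To illustrate the idea, here is a small sketch of such a garbage collection pass. The liveness tracking and threshold are made up for illustration: fragmented stripes are emptied, and their live blocks are rewritten as one new sequential stripe at the end of the log.

```python
# Sketch of a GC pass: find stripes whose live data has dropped below a
# threshold, pull the live blocks back into memory, and rewrite them as
# a new full stripe, freeing the fragmented stripes for reuse.

def garbage_collect(stripes, live_fraction, threshold=0.5):
    """stripes: list of block lists; live_fraction: liveness per stripe."""
    survivors = []
    for stripe, live in zip(stripes, live_fraction):
        if live < threshold:                  # fragmented: reclaim it
            live_blocks = stripe[: int(len(stripe) * live)]
            survivors.extend(live_blocks)     # pulled back into memory
            stripe.clear()                    # stripe freed for reuse
    if survivors:
        stripes.append(survivors)             # rewritten as one new stripe
    return stripes

stripes = [[b"a", b"b", b"c", b"d"], [b"e", b"f", b"g", b"h"]]
garbage_collect(stripes, live_fraction=[0.25, 1.0])
print(stripes)  # first stripe emptied; its live block re-striped at the end
```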
Comments
Chris,
Great question, and an equally great response by Eric! I'll add a little more detail just in case. The bigger issue NTAP has that causes the performance cliff is that WAFL, like the name implies, is very opportunistic in its writes. In other words, on a clean system it writes well, but as the system fills up and gets fragmented, performance tanks. CASL is specifically written to avoid these issues from the onset. This is accomplished in a handful of ways.

First, all writes are done in a sequential stripe that has been tuned/optimized for the layout on disk. CASL is an implementation of LFS, so think of it as starting at sector 0 and laying down stripe after stripe until you're at the end of the disk, always writing in a sequential stripe no matter what. This means that performance is not degraded even if the system is 95% full; it's actually the exact same performance as a brand new system just out of the box. Of course, over that time some data has been marked for deletion, snapshots have expired, etc., so there is also some fragmentation going on behind the scenes. However, CASL by its nature is designed to write in full stripes and not just fill in those small holes (which is what causes the performance cliff). Behind the scenes is an always-on, low-priority GC process that scans for fragmentation and, when the system begins to get moderately full, finds fragmented stripes, pulls them back up into memory, recombines them, and writes the data back down as a new and fully sequential stripe at the end of the filesystem. It can then go back and delete the fragmented stripes, making room for brand new stripes of data to be written later.

This brings me to a second point: metadata. Metadata is kept separate from the data itself, and an active copy is always stored on SSD. This not only enables some fairly cool data integrity capabilities, but also makes it very quick and easy to determine whether GC has anything to do and, if so, which stripes to "defragment".
This all allows Nimble not only to leverage all the capacity without performance degradation, but also not to worry about the performance overhead of GC itself. All in all, one pretty sweet system!
Hope this (very long response) helps!
To be clear, CASL is nothing like WAFL with garbage collection. Nimble is a variable-block, log-structured file system; WAFL is a 4K-block, write-anywhere (hole-filling) file system that runs a defrag job (one that usually never finishes!). Once CASL finds fragmented 4.5MB stripes, it moves them back into NVRAM, then compresses and re-stripes them sequentially again along with new writes. The sweeping engine is a constant low-priority CPU process until the array hits 90% full, at which point it moves to a high-priority process. The author left out InfoSight, which is Nimble's support and monitoring system (included with the array) that sends daily autosupports and five-minute heartbeats back to Nimble. Nimble collects over 30 million sensor points per array per day. InfoSight would have started alerting you once your array got to about 75% full and would tell you exactly what day you're going to run out of storage.
It is a good question, and the answer is no – CASL will sustain performance right up to 95% full across the storage pool. This is a result of the lightweight garbage collection process that Eric described. So with all data compressed and all volumes thin provisioned, you can fit a lot of logical data while maintaining high performance even as the array approaches capacity.
Guys, please stop the FUD. CASL will not maintain high perf when the array is 95% full. It will not even maintain high perf at 90% full. Nimble's guideline for optimal perf is 85% full. If anyone wants to bother to read the actual Nimble KB on this, ask your Nimble rep for KB-000038.
Garbage collection only cleans up space after you’ve deleted snapshots of volumes. If you’re just filling the array with data and pass the 85% mark, you WILL experience perf degradation. There’s no magic bullet – even with CASL…and I like Nimble…just not FUD.
@forbsy While you're correct that Nimble does not recommend running arrays past 85% capacity, writes don't slow down until 90%.
@forbsy I think there’s some confusion here so hopefully the KB article text you referred to will help clear things up. Per the KB:
"This critical notification is generated when the array has less than 10% free space available. Free space represents disk space that is available to the array for write operations. Utilization includes both volume and snapshot space consumption. Once the system has reached the 10% threshold on arrays with OS prior to 1.4.x, or 5% on the newer systems, write performance will be significantly impacted, continuing to degrade as available space decreases. This is due to the overhead associated with processes required to return unused blocks to the file system, in conjunction with the inefficiencies inherent in seeking available space. As noted in the alert, if corrective action is not taken, write workload will cease and an outage will occur. The guideline threshold of array space utilization for optimal performance is 85%."
As you'll notice, the watermark for write performance degradation is listed as 5% available (or 95% full) in any of the 2.x code, which has been shipping for roughly the past year. This is well tested in the field and is not FUD; it's fact. Now, as Tony mentioned, the guideline is 85%, and this is mainly so we have enough time to figure out how not to hit the 95% mark.
As for GC, it is far more than just cleanup of deleted snapshots. I could spend a good 30 minutes minimum at a whiteboard going through that process alone.
Hope this helps clear things up.
Hi Dmitriy. I hear what you're saying, but I think it adds more confusion. It states that the guideline for optimal performance is 85%. To me that means anything over 85% is not optimal. I think it's more about the message you want to tell customers. If 85% is the communicated number, then stick with that rather than trying to explain that things will still be fine up to 95%. It sends a confusing message to customers.