Understanding RAID Penalty

Which type of RAID to use when building a storage solution largely depends on two things: capacity and performance. Performance is the topic of this post.

We measure disk performance in IOPS, or input/output operations per second. One read request or one write request = 1 IO. Each disk in your storage system can provide a certain number of IOPS based on its rotational speed, average latency and average seek time. I’ve listed some averages for each type of disk below.

DiskIOPS

sources: 

http://www.techrepublic.com/blog/datacenter/calculate-iops-in-a-storage-array/2182

http://www.yellow-bricks.com/2009/12/23/iops/

http://en.wikipedia.org/wiki/IOPS

For some basic IOPS calculations, assume we have three JBOD disks at 5400 RPM, each providing roughly 50 IOPS. That gives us a maximum of 150 IOPS, calculated by multiplying the number of disks by the IOPS each disk can provide.

But now assume that these disks are in a RAID setup. We can’t get this maximum number of IOPS, because some sort of calculation needs to be done when writing data to disk so that we can recover from a drive failure. To illustrate, let’s look at an example of how parity is calculated.

Let’s assume that we have a RAID 4 system with four disks. Three of these disks will hold data, and the last disk will hold parity info. We use an XOR calculation to determine the parity info. As seen below, we have our three disks that have had data written to them, and then we have to calculate the parity info for the fourth disk. We can’t complete the write until both the data and the parity info have been completely written to disk, in case one of the operations fails. Waiting the extra time for the parity info to be written is the RAID Penalty.

RAID4Parity
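
To make the XOR idea concrete, here is a minimal Python sketch. The helper name xor_parity and the one-byte block values are made up for illustration; a real controller works on much larger blocks and does this in hardware or firmware. The sketch computes the parity for three data disks and then shows that a lost disk can be rebuilt from the survivors plus the parity.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical one-byte blocks, one per data disk.
disk1 = bytes([0b00001111])
disk2 = bytes([0b11110000])
disk3 = bytes([0b10101010])

# Parity block that would be stored on the fourth disk.
parity = xor_parity([disk1, disk2, disk3])

# Pretend disk2 failed: XOR the surviving data blocks with the parity.
rebuilt_disk2 = xor_parity([disk1, disk3, parity])

print(format(parity[0], "08b"))
print(rebuilt_disk2 == disk2)
```

The same function handles both jobs because XOR is its own inverse, which is why a single parity block is enough to survive the loss of any one disk.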

Notice that since we don’t have to calculate parity for a read operation, there is no penalty associated with this type of IO.  Only when you write to disk will you see the RAID penalty come into play.  A RAID 0 stripe also has no write penalty associated with it, since there is no parity to be calculated.  Having no RAID penalty is expressed as a penalty of 1.

RAIDPenalties

RAID 1

It is fairly simple to calculate the penalty for RAID 1 since it is a mirror. The write penalty is 2 because two writes take place, one to each of the disks.

RAID 5

RAID 5 takes quite a hit on the write penalty because of how the data is laid out on disk. RAID 5 is used over RAID 4 in most cases because it distributes the parity data over all the disks. In a RAID 4 setup, one of the disks is responsible for all of the parity info, so every write requires that single parity disk to be written to, while the data is spread out over 3 disks. RAID 5 changed this by striping the data and parity over different disks.

The write penalty ends up being 4 though in a RAID 5 scenario because for each change to the disk, we are reading the data, reading the parity and then writing the data and writing the parity before the operation is complete.
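
The four I/Os can be modeled in a few lines of Python. This is only a sketch: the function name raid5_small_write and the one-byte values are hypothetical, and no real disk is touched. The counter simply tallies the operations described above, and the new parity is derived from the old data and old parity rather than by re-reading the whole stripe.

```python
def raid5_small_write(old_data, old_parity, new_data):
    """Model a single-block RAID 5 update.

    Returns the new parity value and the number of back-end I/Os used.
    """
    io_count = 0
    io_count += 1  # read the old data block
    io_count += 1  # read the old parity block
    # Remove the old data's contribution from the parity, then add the new one.
    new_parity = old_parity ^ old_data ^ new_data
    io_count += 1  # write the new data block
    io_count += 1  # write the new parity block
    return new_parity, io_count


# Hypothetical stripe: three data blocks and their parity.
a, b, c = 0b00001111, 0b11110000, 0b10101010
parity = a ^ b ^ c

new_a = 0b00011110
new_parity, ios = raid5_small_write(a, parity, new_a)

# Cross-check against recomputing parity from the full stripe.
print(new_parity == new_a ^ b ^ c)
print(ios)
```

Note that the update never needs the values of the other data blocks, which is why the cost stays at four I/Os no matter how many disks are in the set.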

RAID 6

RAID 6 is almost identical to RAID 5, except that instead of calculating parity once it has to do it twice. That means three reads and then three writes, giving us a penalty of 6.

RAID DP

RAID DP is the tricky one.  Since RAID DP also has two sets of parity, just like RAID 6, you would think that the penalty would be the same.  The penalty for RAID DP is actually very low, probably because of how the Write Anywhere File Layout (WAFL) writes data to disk.  WAFL will basically write the new data to a new location on the disk and then move pointers to the new data, eliminating the reads that have to take place.  Also, these writes are written to NVRAM first and then flushed to disk, which speeds up the process.  I welcome any NetApp experts to post comments explaining in more detail how this process cuts down the write penalties.

Calculating the IOPS

Now that we know the penalties we can figure out how many IOPS our storage solution will be able to handle.  Please keep in mind that other factors could limit the IOPS such as network congestion for things like iSCSI or FCoE, or hitting your maximum throughput on your fibre channel card etc.

Raw IOPS = Disk Speed IOPS * Number of disks

Functional IOPS = (Raw IOPS * Write % / RAID Penalty) + (Raw IOPS * Read %)

To put this in a real-world example, let’s say we have five 5400 RPM disks. That gives us a total Raw IOPS of 250 IOPS. (50 IOPS * 5 disks = 250 IOPS).

If we were to put these disks in a RAID 5 setup, we would have no penalty for reads, but the writes would have a penalty of four. Let’s assume 50% reads and 50% writes.

(250 Raw IOPS * .5 / 4) + (250 * .5) = 156.25 IOPS
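
The two formulas above translate directly into a small Python sketch. The function names and the RAID_PENALTY dictionary are just illustrative; the penalty values are the ones given in this post, and the 50 IOPS per 5400 RPM disk is the same assumed average used in the example.

```python
# Write penalties as described in this post (1 means no penalty).
RAID_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 5": 4, "RAID 6": 6}


def raw_iops(disk_iops, disk_count):
    """Raw IOPS = Disk Speed IOPS * Number of disks."""
    return disk_iops * disk_count


def functional_iops(raw, write_fraction, raid_penalty):
    """Functional IOPS = (Raw * Write % / Penalty) + (Raw * Read %)."""
    read_fraction = 1.0 - write_fraction
    return (raw * write_fraction / raid_penalty) + (raw * read_fraction)


# Five 5400 RPM disks at an assumed 50 IOPS each, RAID 5, 50% writes.
raw = raw_iops(50, 5)
result = functional_iops(raw, 0.5, RAID_PENALTY["RAID 5"])
print(raw, result)  # 250 156.25
```

Swapping in a different RAID level or read/write mix is just a matter of changing the arguments, which makes it easy to compare layouts before buying disks.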

33 Responses to Understanding RAID Penalty

  1. RAID DP avoids many of the pitfalls of R6 because of the benefits of the Write Anywhere File Layout architecture. As you correctly point out, the old bits and old parity bits need not be read in this design. The only back end disk operations are the writing of the new bit, and the writing of the two new parity bits, for a total write penalty of three. Again as you point out, some of this is further masked because the writes are executed as NVRAM flushes, rather than as individual I/O write operations.

    This architecture moves the resource constraint up to the file heads, as they must manage the incoming writes into the NVRAM buffer, and manage the pointer tables. As a result, you typically find that NetApp arrays are controller bound, rather than disk bound (you saturate the controllers rather than run out of disk slots). This isn’t a bad thing, per se, just a different architectural model and design consideration.

    • Great follow up, John. You are right, it’s more common for NetApp implementations to be controller bound vs disk bound.

      I do always like to reinforce that NetApp allows (even encourages) the implementation of fairly large RAID sets as well. So your overhead for RAID parity can be much lower. For instance it’s common to see anywhere from 10 to 20 disk RAID sets depending on the drive architecture (FC vs SAS vs SATA vs SSD). You get a lot of data written for a fairly small amount of Parity penalty (I like saying that).

      Also be aware that the Filer actually stores the writes in system memory. NVRAM is a journal that tracks those writes and can reconstruct them if necessary. The write data isn’t actually stored in NVRAM, however. Think of it like a Database’s log file. The great thing is that we “commit” a write as soon as the data is written to memory and the transaction is logged into NVRAM. Therefore NetApp storage systems can be very high performance when recording small random writes, because they don’t have to wait for the spindles to line up and the write actually to be written to disk.

      Full disclosure… I’m an SE for NetApp, albeit a fairly new one (about 18 months in). All of this stuff is pretty well documented, though. Feel free to let me know if you think I’ve mis-represented anything. @tkresler is the best way to reach me.

      • Hi Tim,

        Thanks for your valuable information!
        I have been working on NetApp filers for the last 2 years and I just have one doubt. NetApp filers store only journals in NVRAM, not the actual data. But if the data is stored in the memory buffer, how can the filer replay the journals after a sudden reboot, given that the data will vanish from memory after the reboot?

        Can you please help me to understand this

        Thanks
        Vivek

      • It is my understanding that RAID DP has a low write penalty due to the Write Anywhere File Layout that NetApp utilizes. In this situation there will be two writes: Write-Data, then Write-Parity. So the write penalty ends up being 2. Once the writes are done, the controller changes the data pointers from the old location to the new location, but this doesn’t count against the write penalty.
        I will admit that it is difficult to find much information comparing RAID DP to other RAID types, but to my knowledge it has a write penalty of 2. Other NetApp engineers may be able to elaborate on this subject more.

        Thanks for reading!

    • Sure,
      RAID 6 splits the parity information across disks to avoid a “hot” disk, meaning a disk that does more writing than the others and so may fail first. RAID DP stores the parity on two disks, but with NetApp you’re writing to NVRAM first and then it flushes these writes to disk, which creates a nice even distribution among disks and alleviates the “hot” disk issue.
      I hope this helps.

  2. Raid 6 uses all of its disks for parity, like RAID 5. And all your storage vendors are trying to write to their Write Cache when they can instead of disk to improve write performance. That’s certainly not a RAID type differentiator…

  3. > The write penalty ends up being 4 though in a RAID 5 scenario because for each
    > change to the disk, we are reading the data, reading the parity and then writing the
    > data and writing the parity before the operation is complete.

    But you never say *WHY* that has to be.

    Why can’t a *WRITE* operation… just be “write data”, “write data”, “write parity”… period. and no “read” (or “read” and “read” as you say) at all?

    I’m *WRITING* data…. not reading it.

      • Say you have four disks A,B,C,D and in fifth position data is on A5, B5, C5 and Parity is on D5.

        Parity can be calculated (for example) by simply XOR’ing all the data, so D5 = A5 XOR B5 XOR C5
        With A5 = 0000 1111 B5=1111 0000 and C5 = 1010 1010 this would give
        A5 XOR B5 = 1111 1111
        (A5 XOR B5) XOR C5 = 0101 0101 => D5
        (would give the same result if positions were changed).

        So say in A5 you’d like to write “0001 1110” instead. The controller has to write A5 (obviously) and change D5 to the new parity, which is
        0001 1110 XOR 1111 0000 XOR 1010 1010 = 1110 1110 XOR 1010 1010 = 0100 0100

        But the controller *doesn’t know* the values of B5 and C5, so it has to read them back to calculate D5 –> two read operations for a four-disk RAID5.
        If you have only three disks this obviously changes to one additional read, for five and more disks it would change to three and more reads (n disks => n-2 reads), which is bad.

        The controller can easily limit this to only two reads: A5 and D5 (overwritten data plus parity) and then calculates
        D5(old) XOR A5(old) = 0101 0101 XOR 0000 1111 = 0101 1010
        0101 1010 XOR A5(new) = 0101 1010 XOR 0001 1110 = 0100 0100 = D5(new)
        which is the same as above.

        So the variable number of reads is reduced to always (max) two — plus of course the two writes. Maximum because in case of three disks the controller (in theory) could only read B5, but methinks for simplicity even then both old value and old parity are read in.

        Given enough cache in the controller (if any, cheap ones and fake-raids don’t have it or not much) it is possible that some of the values are in the (read-)cache of the controller already so the read operations are from cache and not disk, but I don’t know if this is really used and obviously only works for recently-worked-with data.

        WAFL (if I understand correctly) does things differently: it waits until it has enough data for a whole new “line” and then writes A6, B6, C6, D6, E6, F6 plus parity G6 and H6 in one single write operation, and changes the pointers in the file system to the new values. That’s why you can’t give a simple “write penalty” for NetApp systems. But then again this tends to fragment systems quite a bit, which impacts read performance. A plus is that deduplication is easy, since copy-on-write is inherent to the system.

  4. Thanks for the post. I have been searching for a method to calculate the theoretical maximum throughput of a RAID array based on the number of disks, RAID level and disk rpm speed. I found several calculations including this post and they are all consistent in the method of calculation / RAID IO penalty, etc… I am just thinking that there are a couple of assumptions that have not been mentioned or that there is something I don’t understand. It would be really helpful if you help / explain on this.
    1- The IO penalty assumes that there are 4 disks in the array right? In a 6 disk RAID5 array, 1 IO will cause 6 IOs thus the penalty will be 6. This means that the penalty increases with the number of disks, correct?
    2- RAID penalty calculation says that 1 IO causes 4 IOs in case of RAID5, therefore, when calculating IOPS we divide by 4. What I don’t get here is;
    Are we assuming that 1IO = 1IOPS and that 4IOs will take the same time and thus 4IOs will also take 1 second thus 4IOPS?

    • Hello Waseem,

      I also had similar questions about this topic. Please let me know if you had already found answers to your questions.

      Thanks,
      Madhu.

  5. Pretty Baby
    this might help you understand why there are reads for writing data for raid 5
    http://rickardnobel.se/raid-5-write-penalty/

    It is the parity that causes the reads.
    Parity is calculated against ALL the disks in the RAID.

    read the old data (IOP1)
    XOR old data and new data to get NewDataDelta
    read Old parity (IOP2)
    XOR NewDataDelta and Oldparity to get New Parity
    Write new data (IOP3)
    write new parity (IOP4)

    the reason for this is parity is calculated amongst all disks.
    doing it this way you dont have to read all the disks to create parity.

  6. A great article! Very interested in the DP stuff.

    The overall IOPS estimation method is over-estimating though; the stated 156.25 IOPS prediction with 50% writes works back to 390 IOPS RAW (78 IOPS from reads plus 78 * 4 IOPS caused by the writes). Try instead:

    Total Array IOPS / ((write penalty * write %) + (read penalty * read%))

    so for the 5x 50 IOPS RAID-5 example that gives 250 / ( (4 * 50%) + (50%) ) = 100. Working back, 50 IOPS from reads + 50 * 4 from writes gives the 250 total IOPS available.

  7. I don’t know why everybody thinks there is no read penalty. Saying “Notice that since we don’t have to calculate parity for a read operation, there is no penalty associated with this type of IO.” is true, but on a read we have to CHECK the parity. Or else we assume that if the read operation does not return a read CRC error, the controller or software will not check the parity. I don’t know what the real implementation is! On the other hand, multiplying the RAW IOPS by the number of reads is not correct either, because in a 5 drive RAID 5 every 5th read is on the same drive, where we have to wait for the previous read to finish, so the read penalty is at minimum 1+1/number of drives!

    • Haraszti, you wrote: “but on read we have to CHECK the parity”, however this is not really correct. The parity in RAID5 is combined from all disks in the RAID set, e.g. all ten disks in a RAID5 set of ten. So the parity cannot be used to verify whether a READ against a certain disk is correct, since that read + the parity will not be enough.

      If someone wanted to implement such a feature, every single read would cause reads on ALL disks plus a recalculation of the parity. In a RAID5 set of ten disks the read penalty would be x10, which really would be impossible to handle performance-wise.

  8. Ritesh….assuming it’s a CLARiiON/VNX array, you’ll get 3 RAID5 groups in that DAE (4+1).

    What type of drives are we talking about? SAS? FC? NL-SAS? SATA?

    Need those numbers to proceed.

  9. It’s a wonderful thread with some good information on RAID penalty. I was more interested because I saw RAID-DP. There isn’t much information on the internet about the RAID-DP penalty. There are a lot of documents and TRs on RAID-DP, but the exact numbers aren’t explained anywhere. Most of them just mention the penalty as a percentage, and 2 % is the figure I have seen quoted.

    I am just trying to understand the exact numbers in terms of RAID-DP penalty in-case when you can find a full stripe and when you don’t find a full stripe. [Hopefully NetApp folks might give some input here]

    RAID PENALTY: For FULL STRIPE
    As per design, WAFL will continue to push FULL STRIPES in chain.[Even though WAFL stands for write anywhere file layout, but do not take the meaning literally, it basically collects all the random IOs, and then turns them into sequential FULL stripes] But, just for the analogy if we consider the penalty for full stripe writes :>

    RP=ROW PARITY
    DP=DIAGONAL PARITY

    READS = None, WRITES=New Data + RP + DP with parity calculated in the memory, so it writes FULL Stripe and then writes the RP + DP.

    RAID PENALTY for RAID DP: FULL STRIPE

    In terms of IOPS:[XOR’ing in memory to calculate both RP & DP, as no reads are required]
    =1[New-data] + 1[RP] + 1[DP]
    =3 IOs Total for full stripe.

    I guess, when you chain write these stripes, 3 IOs do not happen for every single IO.

    What if the WAFL is fragmented, or high disk utilization makes FULL stripes impossible? In that scenario, modified blocks will have to be replaced [provided they aren’t locked by snapshot], and this is where I am guessing large stripe [parity computation by re-calculation] and small stripe [parity computation by subtraction] come into the picture. Is that correct? [Question mark, because I am not sure, and would love to get confirmation from NetApp].

    If we consider the above theory, RAID penalty may look like this:

    RAID PENALTY:LARGE STRIPE [writing more than half blocks in a stripe]
    In terms of IOPS: [Parity = XOR’ring blocks not writing to + New data + old_RP & XOR’ring diagonal stripe for DP]
    =1[for reading blocks not writing to] + 1[old_RP] + 1[old_DP] => 3 IOs
    +
    =1[new_data] + 1[new_RP] + 1[new_DP] => 3 IOs
    = 3 + 3 = 6 IOs

    RAID PENALTY:SMALL STRIPE [writing to less than half the blocks in a stripe]
    In terms of IOPS:
    =1[reading old_data] + 1[old_RP] + 1[old_DP] => 3 IOs
    +
    =1[new_data] + 1[new_RP] + 1[new_DP] => 3 IOs
    =3 + 3 = 6 IOs

    So, basically, unless we have these scenarios, the full stripe penalty would ideally be 3 IOs for RAID-DP, as mentioned by John at the very beginning of the comments?

    Is that a correct assumption ?

  10. Hello,

    I am writing on this post after a really long time, but I just wanted to ask a question regarding this topic. I know that we discussed the penalty assuming that we are changing the data on a single disk of the RAID group, that is, changing one chunk in the stripe. If my application is different and I am using smaller chunk sizes, meaning I have to write a complete stripe in order to accommodate an IO, what would my write penalty be in such instances? Please shed some light if my assumptions or my understanding are wrong. Thank you very much in advance.

  11. RAID 6 parity is split all over the array. There are no hot disks. Following from what you said, if these 2 disks fail the entire RAID 6 array is lost, which is not true. Any 2 disks in a RAID 6 array may fail and the array will survive.

  12. Sorry about that, I modified the comment to reflect that RAID 6 parity is split across disks to eliminate a hot disk. Also, correct about 2 disks failing. RAID 6 can recover from losing 2 disks but not 3.

  13. There is a battery for the NVRAM and a flash chip on board. It flushes the whole NVRAM to that chip. After a reboot, it reloads the data from the chip. This is a new appliance on new systems. Old systems have another approach which requires more battery time.

  14. Interesting information.
    NetApp storage comes with RAID 4 and RAID DP. I found here information about RAID-DP, but I am not sure how to calculate RAID-4.
    So, what is a RAID Penalty for RAID 4?

    • I believe the articles are coming from two different angles.
      Here, we’re saying that the array has ~250 total IOPS but can only serve ~156 functional iops due to the RAID Penalty.
      Duncan’s post looks at this the other way and says that that VM requesting ~156 IOPS would actually produce ~250 IOPS due to the write penalty.

  15. Hi there,

    Thanks for this great post.
    I’m a bit confused for RAID 5 and 6 READ performance.
    I fully understand the WRITE performance and its penalty, which you explained quite well, but I’m unsure about READ.

    You state that read performance = N (number of disks) * x (iops of single disk)

    But with RAID 5, one disk is holding parity, thus won’t be used for the read operation. Same for RAID 6 but with two unused disks.

    Is that correct ?

    I’m trying to figure out if there would be any performance difference between two RAID 6 configurations :
    48 total disks
    Config 1 : 8 groups of 4+2 RAID 6
    Config 2 : 6 groups of 6+2 RAID 6

    If I use your maths, both configurations would perform the same.
    If I use Read iops = (N – 2) * x per subgroup of R6, assuming we are in 50r/50w config, and individual disk iops = 75, I find this :

    Config 1 : ( (4*75)*0.5 + (6*75/6)*0.5 ) * 8 = 1500 iops
    Config 2 : ( (6*75)*0.5 + (8*75/6)*0.5 ) * 6 = 1650 iops

    What is wrong in my logic ?

    Thanks
