Tuesday, 18 October 2011

Equallogic Automated Performance Load-balancing


About a week ago I was troubleshooting some latency issues on our Equallogic SAN, and came across a nice illustration of APLB in action. The basic test was this – run a consistent load on a LUN to see if we could reproduce the latency spikes I was intermittently picking up on our SQL instances. That way we could determine if it was a constant issue, or something to do with the load that SQL was putting on the SAN.

Our SAN is made up of the following: 1 x PS6510X and 3 x PS6010S units. All 10GbE, running on their own set of switches. We upgraded to FW5.1.1-H1 about a month ago (just before H2 was released) and we had enabled load balancing to allow unused SSD capacity to enhance our SUMO.

IOMeter settings:

8K request size
Single worker
100% Read
100% Random
1 outstanding request (To make sure it wasn’t going to affect the rest of the SAN)
10GB test file (To make sure it wasn’t cached)
LUN on the SUMO (to begin with…)
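
For anyone without IOMeter to hand, the access pattern above (8K, 100% random, 100% read, one request in flight) can be roughly approximated with a short Python script. This is only a sketch, not what was run in the test: the function name `random_read_test` is made up for illustration, the demo uses a tiny temporary file rather than a 10GB file on a SAN volume, and unlike a properly configured IOMeter run it does nothing to bypass the operating system's page cache.

```python
import os
import random
import tempfile
import time


def random_read_test(path, request_size=8 * 1024, num_requests=1000, seed=None):
    """Issue aligned random reads one at a time; return (avg_latency_s, iops)."""
    rng = random.Random(seed)
    num_blocks = os.path.getsize(path) // request_size
    if num_blocks == 0:
        raise ValueError("file is smaller than one request")

    latencies = []
    # buffering=0 avoids Python-level read-ahead; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        for _ in range(num_requests):
            # Pick a random request-aligned offset (100% random, 100% read).
            offset = rng.randrange(num_blocks) * request_size
            start = time.perf_counter()
            f.seek(offset)
            data = f.read(request_size)
            latencies.append(time.perf_counter() - start)
            if len(data) != request_size:
                raise IOError("short read at offset %d" % offset)

    total = sum(latencies)
    avg_latency = total / len(latencies)
    # Only one request is ever outstanding, so IOPS is requests / busy time.
    iops = num_requests / total if total > 0 else float("inf")
    return avg_latency, iops


if __name__ == "__main__":
    # Demo against a small throwaway file (512 KB), purely to show the calls.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(64 * 8 * 1024))
        demo_path = tmp.name
    try:
        avg, iops = random_read_test(demo_path, num_requests=200, seed=1)
        print("avg latency: %.3f ms, IOPS: %.0f" % (avg * 1000, iops))
    finally:
        os.remove(demo_path)
```

To get numbers that mean anything, point it at a file on the volume under test that is much larger than any cache in the path, which is the reason for the 10GB test file above.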

Initially, we were seeing reasonably consistent performance and also managed to get some of those spikes to appear.
Avg latency – 8ms
Avg IOPS – 180

I then left this running overnight to see if the issue became more obvious over time.
In the morning, it looked like this:
Avg latency - <1ms
Avg IOPS - 1K+

What had happened is that APLB had kicked in and, over time, worked out that the blocks in this LUN were doing more work than some of the blocks that had already been migrated to the SSD members, and then proceeded to move them over. By morning, the entire 10GB LUN had been moved and was running an order of magnitude faster than it was the day before.

You may be wondering why there were only 1K IOPS when this was sitting on SSD, but remember that I only had a single request outstanding, and therefore was more than likely being throttled by the network latency rather than the SSD storage itself.
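
The ceiling imposed by a single outstanding request can be checked with Little's Law (throughput = requests in flight / average latency). The snippet below is just that arithmetic; the function name `max_iops` is made up for illustration.

```python
def max_iops(outstanding_requests, avg_latency_ms):
    """Little's Law: upper bound on IOPS for a given queue depth and latency."""
    if avg_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return outstanding_requests * 1000.0 / avg_latency_ms


# One request in flight at 1 ms per round trip cannot exceed 1000 IOPS.
print(max_iops(1, 1.0))  # 1000.0
# The same queue depth at 8 ms per request tops out at 125 IOPS.
print(max_iops(1, 8.0))  # 125.0
```

At 1 ms this gives 1,000 IOPS, which lines up with the "1K+" at "<1ms" seen in the morning. At 8 ms it gives 125 IOPS, in the same range as the 180 measured on the first day; the quoted averages are evidently rounded, so an exact match should not be expected. Either way, with a queue depth of 1 the only way to get more IOPS is lower per-request latency.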

To be fair, EQL SSD units are not the fastest – in fact I would go as far as to say that they are among the slowest rackmount 10GbE SSD implementations out there. However, they are faster than any configuration of HDDs you could fit into the same form factor, and we purchased them on the basis that they were good enough for what we wanted to do with them at the time.

So in conclusion:
· Equallogic load balancing across SAN members works, and does so especially well on consistent loads (the jury is still out on volatile workloads).
· Mixing SSD and HDD models can have large benefits, in particular if there is a subset of a large LUN that you want to accelerate, but you don’t want to put the whole LUN on SSD.
· The load balancing happens automatically – you just turn it on for the group and it will do the rest. This is both a positive and a negative – it works, but you lose control of the placement of the data. You can use volume pinning to fix a volume to a member, but that’s another story …
· Load balancing is not instant – it will take time to kick in. It appears to me that capacity balancing kicks in first, and then performance load balancing kicks in afterwards.

4 comments:

  1. Nice post Brett.

    It would be interesting to see how well APLB works for mixed members, for example a PS6010X with a PS4010E. Can you even mix members of different models (in this case different network speed and drive types)?

    cheers.
    Vic

  2. Hi Vic,

    My understanding is that if they are in the same group and can be added to the same pool, APLB should work. Having said that, I'm not sure that I would risk such a big discrepancy unless I both understood the workload extremely well, and was very tolerant of increased latency.

    The way to judge it is to assume your workload is going to run on the 4000 (as a worst case scenario). If you can't risk it, then I don't think APLB is a good fit for your application.

    Next time I chat to the EQL guys, I will see if I can get some feedback on whether this mix would work.

  3. So I spoke with a member of the senior tech support team at Dell UK today, and got the following advice:
    Mixing 4000's and 6000's is not recommended, but is OK. You may get some benefit from the APLB, but controlling placement of the data is more than likely a better idea.

    Mixing 10GbE with 1GbE is a BIG NO NO! Servers connecting to the 10GbE network will potentially flood the 1GbE SAN with requests, and the net result would be significantly degraded performance from the 1GbE SAN. Keep them in separate pools to ensure the correct servers are connecting to the correct units.

    Hope this helps!

  4. Great illustration, thank you for posting. I came across your post while evaluating pooling of a PS6010XV Raid50 + PS6110E Raid6.
