About a week ago I was troubleshooting some latency issues on our EqualLogic SAN, and came across a nice illustration of APLB (Automatic Performance Load Balancing) in action. The basic test was this: run a consistent load on a LUN to see if we could reproduce the latency spikes I was intermittently picking up on our SQL instances. That way we could determine whether it was a constant issue or something to do with the load SQL was putting on the SAN.
Our SAN is made up of the following: 1 x PS6510X and 3 x PS6010S units. All 10GbE, running on their own set of switches. We upgraded to FW 5.1.1-H1 about a month ago (just before H2 was released), and we had enabled load balancing to allow unused SSD capacity to enhance our SUMO (the PS6510X).
IOMeter settings (a rough code equivalent follows the list):
8K request size
Single worker
100% Read
100% Random
1 outstanding request (To make sure it wasn't going to affect the rest of the SAN)
10GB test file (To make sure it wasn't cached)
LUN on the SUMO (to begin with…)
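If you want to approximate this workload without IOMeter, here's a minimal sketch in Python. Assumptions: a Linux-ish host and a pre-created 10GB file at testfile.bin (a hypothetical path) on the test LUN; note that without O_DIRECT the host page cache can serve some reads, which IOMeter would bypass.

```python
# Rough stand-in for the IOMeter run: single worker, 8K, 100% random,
# 100% read, 1 outstanding request (each read completes before the next).
import os
import random
import time

BLOCK = 8 * 1024             # 8K request size
PATH = "testfile.bin"        # hypothetical pre-created 10GB test file
DURATION = 60                # seconds to run

fd = os.open(PATH, os.O_RDONLY)
try:
    blocks = os.fstat(fd).st_size // BLOCK
    count, busy = 0, 0.0
    deadline = time.monotonic() + DURATION
    while time.monotonic() < deadline:
        offset = random.randrange(blocks) * BLOCK   # 100% random
        t0 = time.monotonic()
        os.pread(fd, BLOCK, offset)                 # 100% read, QD=1
        busy += time.monotonic() - t0
        count += 1
finally:
    os.close(fd)

print(f"Avg IOPS: {count / DURATION:.0f}")
print(f"Avg latency: {busy / count * 1000:.2f} ms")
```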
Initially, we were seeing reasonably consistent performance, and also managed to get some of those spikes to appear.
Avg latency: 8ms
Avg IOPS: 180
I then left this running overnight to see if the issue
became more obvious over time.
In the morning, it looked like this:
Avg latency: <1ms
Avg IOPS: 1K+
What had happened is that APLB had kicked in: over time it realised that some blocks in this LUN were doing more work than other blocks that had already been migrated to the SSD member, and proceeded to move the hot blocks over. By morning, the entire 10GB LUN had been moved and was running an order of magnitude faster than it was the day before.
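For intuition only, here's a heavily simplified sketch of what this kind of heat-based tiering might look like. This is emphatically not EqualLogic's actual algorithm (Dell doesn't publish it); it just illustrates the "count accesses, promote the hottest blocks" behaviour observed above.

```python
# Conceptual sketch of heat-based tiering -- an assumption, not EQL's code.
from collections import Counter

class TieringSketch:
    def __init__(self, ssd_capacity_blocks: int):
        self.heat = Counter()                  # accesses per block this window
        self.ssd_capacity = ssd_capacity_blocks
        self.on_ssd: set[int] = set()          # blocks currently on the SSD member

    def record_io(self, block: int) -> None:
        self.heat[block] += 1

    def rebalance(self) -> None:
        # Keep the most-accessed blocks on SSD, up to capacity; anything
        # that falls out of the hot set goes back to the HDD member.
        self.on_ssd = {b for b, _ in self.heat.most_common(self.ssd_capacity)}
        self.heat.clear()                      # start a fresh observation window
```

Run something like this repeatedly against a steady 8K random-read pattern and the whole hot region migrates to SSD over a few windows, which matches the overnight behaviour above.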
You may be wondering why there were only 1K IOPS when this was sitting on SSD, but remember that I only had a single request outstanding, and was therefore more than likely being throttled by network round-trip latency rather than by the SSD storage itself.
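A quick back-of-envelope check makes the ceiling obvious: with a queue depth of 1, IOPS can never exceed queue depth divided by average round-trip latency (Little's Law). Using the observed numbers:

```python
# Little's Law: IOPS = outstanding requests / average latency.
# With 1 outstanding request and ~1 ms per round trip over iSCSI,
# ~1K IOPS is the hard ceiling regardless of how fast the SSDs are.
queue_depth = 1
round_trip_s = 0.001                  # ~1 ms observed average latency
print(queue_depth / round_trip_s)     # -> 1000.0, i.e. ~1K IOPS
```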
To be fair, EQL SSD units are not the fastest; in fact I would go as far as to say that they are one of the slowest rackmount 10GbE SSD implementations out there. However, they are faster than any configuration of HDDs you could fit into the same form factor, and we purchased them on the basis that they were good enough for what we wanted to do with them at the time.
So in conclusion:
· EqualLogic load balancing across SAN members works, and does so especially well on consistent loads (the jury is still out on volatile workloads).
· Mixing SSD and HDD models can have large benefits, in particular if there is a subset of data within a large LUN that you want to accelerate but don't want to put the whole LUN on SSD.
· The load balancing happens automatically: you just turn it on for the group and it will do the rest. This is both a positive and a negative, in that it works but you lose control of the placement of the data. You can use volume pinning to fix a volume to a member, but that's another story…
· Load balancing is not instant; it will take time to kick in. It appears to me that capacity balancing kicks in first, and then performance load balancing kicks in afterwards.
Nice post, Brett.
It would be interesting to see how well APLB works for mixed members, for example a PS6010X with a PS4010E. Can you even mix members of different models (in this case different network speeds and drive types)?
cheers.
Vic
Hi Vic,
My understanding is that if they are in the same group and can be added to the same pool, APLB should work. Having said that, I'm not sure I would risk such a big discrepancy unless I both understood the workload extremely well and was very tolerant of increased latency.
The way to judge it is to assume your workload is going to run on the 4000 (as a worst-case scenario). If you can't risk it, then I don't think APLB is a good fit for your application.
Next time I chat to the EQL guys, I will see if I can get some feedback on whether this mix would work.
So I spoke with one of the senior tech support team for Dell UK today, and got the following advice:
Mixing 4000s and 6000s is not recommended, but it is OK. You may get some benefit from the APLB, but controlling placement of the data is more than likely a better idea.
Mixing 10GbE with 1GbE is a BIG NO NO! Servers connecting to the 10GbE network will potentially flood the 1GbE SAN with requests, and the net result would be significantly degraded performance from the 1GbE SAN. Keep them in separate pools to ensure the correct servers are connecting to the correct units.
Hope this helps!
Great illustration, thank you for posting. I came across your post while evaluating pooling of a PS6010XV RAID 50 + PS6110E RAID 6.