About a week ago I was troubleshooting some latency issues on our Equallogic SAN, and came across a nice illustration of APLB (Automatic Performance Load Balancing) in action. The basic test was this – run a consistent load on a LUN to see if we could reproduce the latency spikes I was intermittently picking up on our SQL instances. That way we could determine if it was a constant issue, or something to do with the load that SQL was putting on the SAN.
Our SAN is made up of the following: 1 x PS6510X and 3 x PS6010S units, all 10GbE, running on their own set of switches. We upgraded to FW5.1.1-H1 about a month ago (just before H2 was released), and had enabled load balancing to allow unused SSD capacity to enhance our SUMO (the PS6510X).
IOMeter settings (a rough scripted equivalent of this workload is sketched after the list):
· 8K request size
· Single worker
· 100% Read
· 100% Random
· 1 outstanding request (to make sure it wasn't going to affect the rest of the SAN)
· 10GB test file (to make sure it wasn't cached)
· LUN on the SUMO (to begin with…)
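For reference, here's a minimal sketch of the same access pattern in Python: a single worker issuing 8K random reads with one request outstanding against a pre-existing test file on the LUN. The path and duration are placeholders, os.pread is POSIX-only, and like the IOMeter run it relies on the file being far larger than cache rather than bypassing the cache outright.

```python
import os
import random
import time

PATH = "/mnt/testlun/testfile.dat"  # hypothetical path to a file on the LUN under test
FILE_SIZE = 10 * 1024**3            # 10GB file (must already exist at this size)
BLOCK = 8 * 1024                    # 8K request size
DURATION = 60                       # seconds to run

fd = os.open(PATH, os.O_RDONLY)
blocks = FILE_SIZE // BLOCK

ios, latencies = 0, []
start = time.time()
while time.time() - start < DURATION:
    # Single worker, 1 outstanding request: issue one read, wait for it, repeat.
    offset = random.randrange(blocks) * BLOCK  # 8K-aligned random offset
    t0 = time.time()
    os.pread(fd, BLOCK, offset)                # 100% read, 100% random
    latencies.append(time.time() - t0)
    ios += 1
os.close(fd)

elapsed = time.time() - start
print(f"Avg IOPS: {ios / elapsed:.0f}")
print(f"Avg latency: {sum(latencies) / len(latencies) * 1000:.2f} ms")
```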
Initially, we were seeing reasonably consistent performance, and also managed to get some of those spikes to appear.
Avg latency – 8ms
Avg IOPS – 180
I then left this running overnight to see if the issue
became more obvious over time.
In the morning, it looked like this:
Avg latency – <1ms
Avg IOPS – 1K+
What had happened is that APLB had kicked in: over time it realised that some blocks in this LUN were doing more work than others that had already been migrated to the SSD members, and proceeded to move these hotter blocks over. By morning, the entire 10GB LUN had been moved and was running an order of magnitude faster than it had been the day before.
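Dell doesn't publish APLB's internals, so purely as a conceptual sketch (and emphatically not EqualLogic's actual algorithm), the behaviour observed here is what you'd expect from performance tiering that samples per-block I/O counts over a window and then swaps the hottest HDD-resident blocks with the coldest SSD-resident ones, a bounded number per pass:

```python
# Toy illustration of hot-block promotion between tiers. This is NOT
# EqualLogic's algorithm, just the general shape of performance tiering.
access_counts = {}  # block_id -> I/O count seen in the current sampling window

def record_io(block_id):
    access_counts[block_id] = access_counts.get(block_id, 0) + 1

def rebalance(ssd_blocks, hdd_blocks, moves_per_pass=100):
    """Swap the hottest HDD blocks with the coldest SSD blocks.

    Capping moves_per_pass is why promotion is gradual rather than
    instant: the array trickles data between members in the background.
    """
    heat = lambda b: access_counts.get(b, 0)
    hot_hdd = sorted(hdd_blocks, key=heat, reverse=True)[:moves_per_pass]
    cold_ssd = sorted(ssd_blocks, key=heat)[:moves_per_pass]
    for hot, cold in zip(hot_hdd, cold_ssd):
        if heat(hot) > heat(cold):  # only swap if it actually helps
            hdd_blocks.remove(hot)
            ssd_blocks.remove(cold)
            ssd_blocks.append(hot)
            hdd_blocks.append(cold)
    access_counts.clear()  # start a fresh sampling window
```

Under a steady load like the overnight IOMeter run, repeated passes of this kind would eventually walk the whole 10GB working set onto the SSD members, which is exactly what the morning's numbers showed.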
You may be wondering why there were only 1K IOPS when this was sitting on SSD, but remember that I only had a single request outstanding, and was therefore more than likely being throttled by network round-trip latency rather than by the SSD storage itself.
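The back-of-envelope maths bears this out: with a fixed number of requests in flight, throughput is capped at outstanding requests divided by round-trip latency (Little's Law), so a queue depth of one with a ~1ms round trip tops out at about 1,000 IOPS no matter how fast the SSDs are. A quick sanity check:

```python
# Little's Law at a fixed queue depth: IOPS ceiling = outstanding / latency.
def iops_ceiling(outstanding_ios, avg_latency_s):
    return outstanding_ios / avg_latency_s

print(iops_ceiling(1, 0.001))   # a ~1ms round trip caps a QD1 test at ~1,000 IOPS
print(iops_ceiling(32, 0.001))  # deeper queues are how SSD arrays post big numbers
```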
To be fair, EQL SSD units are not the fastest – in fact I
would go as far as to say that they are one of the slowest rackmount 10GbE SSD
implementations out there. However, they are faster than any configuration of
HDDs you could fit into the same form factor, and we purchased them on the
basis that they were good enough for what we wanted to do with them at the
time.
So in conclusion:
· Equallogic load balancing across SAN members works, and does so especially well on consistent loads (the jury is still out on volatile workloads).
· Mixing SSD and HDD models can have large benefits, in particular if there is a subset of data within a large LUN that you want to accelerate but don't want to put the whole LUN on SSD.
· The load balancing happens automatically – you just turn it on for the group and it will do the rest. This is both a positive and a negative – it works, but you lose control of the placement of the data. You can use volume pinning to fix a volume to a member, but that's another story …
· Load balancing is not instant – it will take time to kick in. It appears to me that capacity balancing kicks in first, and then performance load balancing kicks in afterwards.