About a week ago I was troubleshooting some latency issues on our Equallogic SAN, and came across a nice illustration of APLB in action. The basic test was this – run a consistent load on a LUN to see if we could reproduce the latency spikes I was intermittently picking up on our SQL instances. That way we could determine if it was a constant issue, or something to do with the load that SQL was putting on the SAN.
Our SAN is made up of the following: 1 x PS6510X and 3 x PS6010S units. All 10GbE, running on their own set of switches. We upgraded to FW5.1.1-H1 about a month ago (just before H2 was released) and we had enabled load balancing to allow unused SSD capacity to enhance our SUMO.
The test settings were:
· 8K request size
· 1 outstanding request (to make sure it wasn’t going to affect the rest of the SAN)
· 10GB test file (to make sure it wasn’t cached)
· LUN on the SUMO (to begin with…)
Initially, we were seeing reasonably consistent performance, and we also managed to get some of those spikes to appear.
Avg latency - 8ms
Avg IOPS – 180
I then left this running overnight to see if the issue became more obvious over time.
In the morning, it looked like this:
Avg latency - <1ms
Avg IOPS - 1K+
What had happened is that APLB had kicked in and, over time, realised that some blocks in this LUN were doing more work than others that had already been migrated to an SSD member, and then proceeded to move the busier blocks over. By morning, the entire 10GB LUN had been moved and was running an order of magnitude faster than it was the day before.
You may be wondering why there were only 1K IOPS when this was sitting on SSD, but remember that I only had a single request outstanding, and therefore was more than likely being throttled by the network latency rather than the SSD storage itself.
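To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming the test issues one request at a time and waits for each to complete (Little's law: throughput = outstanding requests / latency). The function names are my own for illustration, and the latencies are just the round figures quoted above, so expect order-of-magnitude agreement with the measured IOPS rather than an exact match.

```python
def max_iops(outstanding_requests: int, latency_ms: float) -> float:
    """Upper bound on IOPS from Little's law: concurrency / latency."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return outstanding_requests * 1000.0 / latency_ms


def throughput_mb_s(iops: float, request_kb: float = 8.0) -> float:
    """Data rate in MB/s for a given IOPS figure and request size in KB."""
    return iops * request_kb / 1024.0


# Round latency figures from the post: ~8 ms before migration, ~1 ms after.
for label, latency_ms in (("before (HDD)", 8.0), ("after (SSD)", 1.0)):
    iops = max_iops(outstanding_requests=1, latency_ms=latency_ms)
    print(f"{label}: {iops:.0f} IOPS max, {throughput_mb_s(iops):.2f} MB/s at 8K")
```

With a single outstanding request, a 1 ms round trip caps the test at about 1,000 IOPS no matter how fast the disks behind it are. Getting more out of the SSD members would need more outstanding requests, not faster storage.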
To be fair, EQL SSD units are not the fastest – in fact I would go as far as to say that they are among the slowest rackmount 10GbE SSD implementations out there. However, they are faster than any configuration of HDDs you could fit into the same form factor, and we purchased them on the basis that they were good enough for what we wanted to do with them at the time.
So in conclusion:
· Equallogic load balancing across SAN members works, and does so especially well on consistent loads (the jury is still out on volatile workloads)
· Mixing SSD and HDD models can have large benefits, in particular if there is a subset of data in a large LUN that you want to accelerate, but you don’t want to put the whole LUN on SSD.
· The load balancing happens automatically – you just turn it on for the group and it will do the rest. This is both a positive and a negative – it works but you lose control of the placement of the data. You can use volume pinning to fix a volume to a member, but that’s another story …
· Load balancing is not instant – it will take time to kick in. It appears to me that capacity balancing kicks in first, and then performance load balancing kicks in afterwards.