Wednesday 22 February 2012

Compellent iSCSI Optimisation on Windows


We now have our shiny new pair of Compellent Storage Centers installed, up and running.
This means that I get to spend some quality time testing, benchmarking and evaluating them before they go into production.
Previously I mentioned that the Compellents were too smart for SQLIO, which disappointed me no end.
Not only did we use it as a tool to compare arrays through the purchase decision, but I find it significantly easier to test with than IOMeter.

Thanks to a suggestion from @tonyholland00 (thank you Tony!!), it turns out that if you pre-allocate the disk and turn off caching, you can get realistic results from SQLIO again. Now most other array admins would cry out in horror at the idea of turning off caching on their arrays, but that's because their arrays are completely dependent on their write cache to deliver their advertised performance. Compellent arrays are different - they only have a tiny (512MB) write cache, are properly architected around the number of spindles required for their target workload, and don't need cache as a crutch. In addition, when testing the SSD tier we have, the SSDs are actually faster than the cache. Pre-allocation is slightly out of the ordinary, but if that last 1-2% is that important to you for bulk loading, you would be pre-allocating your LUN anyway!

So there I was merrily testing away, and I wanted to both confirm the performance I was promised in our POC, and to see how far I could push one of these arrays. I'll definitely do some more posts on my findings, but I came across a bottleneck that may be common in many iSCSI configurations.

We are using 10GbE and I configured a blade with 4 ports, but I was only able to get 2.4GB/sec even at larger block sizes. Interestingly, the performance data looked almost identical to the POC data I received from Dell, but I had assumed they had only used 2 HBAs.
All targets were connected, but I couldn't get as much information on the connections as I had been getting with my EQL arrays, as there is no connections tab comparable to what you get with the EQL HIT Kit.
After a bit of investigation, it turned out that we were effectively getting one session on each fault domain (VLAN), which restricted us to just over 20Gb/sec.

The solution is to manually create a session for each NIC to each target. Through the GUI, you can go to each connection, open the properties, and add a new session (remembering to enable multi-pathing both when you create the initial connection and when you add the additional session).
Now that's great if you are setting up a single server, going to a single array, but it's going to get very tedious when you are doing large numbers of servers, and connecting to more than one array.



We have 2 arrays, dual controller, each with 6 ports - resulting in 48 sessions to be manually configured. That's 10 min of tedium per server that I want to avoid. Thankfully, I found this link:
http://mrshannon.wordpress.com/2010/01/08/making-iscsi-targets-via-cmd/
You configure your list of targets once, and then on each server you simply choose which NICs you want to use. Run the batch file, reboot, and you should be good to go.
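
To give an idea of what such a batch file looks like, here is a minimal sketch along the same lines. Everything in it is a placeholder for illustration: the portal IPs, the Compellent target IQNs, the initiator instance name and the port indexes all need to be substituted with your own values (iscsicli ListTargets and iscsicli ListInitiators will show them). The key detail is login flag 0x2, which marks each extra session as multipath-enabled.

@echo off
rem Illustrative sketch only - portal IPs, target IQNs, initiator instance and port indexes are placeholders.

rem Register one iSCSI portal per fault domain (only needed once per server).
iscsicli QAddTargetPortal 10.10.10.10
iscsicli QAddTargetPortal 10.10.20.10

rem Create one persistent session per initiator NIC (port index) per target.
rem Ports 0 and 1 are assumed here to be the two NICs in this fault domain.
rem Portal address/socket are left as * so the initiator uses the discovered portal for each target.
rem Login flag 0x2 = multipath-enabled session.
for %%T in (iqn.2002-03.com.compellent:5000d31000aaaa01 iqn.2002-03.com.compellent:5000d31000aaaa02) do (
  for %%P in (0 1) do (
    iscsicli PersistentLoginTarget %%T T * * ROOT\ISCSIPRT\0000_0 %%P * 0x2 * * * * * * * * * 0
  )
)

rem After a reboot, iscsicli SessionList should show the extra sessions.

Run once per server, adjusting the port indexes for the NICs in each fault domain, and the sessions persist across reboots - no GUI clicking required.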

After configuring the additional sessions, I was able to max out bandwidth on all 4 10GbE ports using a block size as low as 128K on a single LUN (an example of the kind of run I used is below).
I suspect this oversight is common, as I think the same thing happened even in our POC setup, and it's not intuitive that you need to do it.
Coming from EQL, you expect it to be done automatically... the HIT Kit makes you lazy... :-)
Hopefully this will help even if you have fewer NICs, by letting you saturate all the bandwidth you do have.
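
For what it's worth, the bandwidth runs were along these lines (the file path and parameters are illustrative rather than my exact test matrix):

rem Large-block sequential reads against a pre-created test file,
rem several threads and a deep queue - the kind of run that pushes pure bandwidth.
rem -b128 = 128KB blocks, -BN = no Windows buffering.
sqlio -kR -s60 -fsequential -t4 -o16 -b128 -LS -BN E:\iobw.tst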


Friday 3 February 2012

Compellent too clever for SQLIO


I was doing some testing during my Compellent admin training yesterday, and came across some rather interesting behaviour.
My IO testing tool of choice is SQLIO; it allows for very quick and easy performance baselining of your storage.
It's not SQL-specific as the name might suggest, it's command-line driven, and it can easily be adapted to test almost any IO profile.
The best way to test is to create a test file larger than the cache on the SAN, to ensure that the test is not completely cached.
(Unless of course you want to test cache performance...)
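
For reference, a typical run looks something like this (the file path, size and parameters here are purely illustrative, not the exact ones I used):

rem param.txt - one line per test file: <path> <threads> <mask> <size in MB>
rem e.g.  E:\iobw.tst 8 0x0 10240

rem A short sequential write pass first - this creates the test file if it doesn't exist yet.
sqlio -kW -s10 -fsequential -o8 -b64 -Fparam.txt

rem 8KB random reads for 2 minutes, 8 outstanding IOs per thread, with a latency histogram.
sqlio -kR -s120 -frandom -o8 -b8 -LS -Fparam.txt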

In this instance I created a 10GB file, which is significantly bigger than the 3.5GB read cache on the Compellent.
I ran my tests, but considering that the underlying disks were only 5 x 15K drives, the performance was WAY too high: around 30K IOPS, when five 15K spindles should realistically manage somewhere in the region of 1,000 random IOPS between them. (If it were real, it would save a lot on SSDs.)
In addition, RAID 5 was writing as fast as RAID 10 - so something was fishy...

Having a look at the disk utilisation on the SAN, I could see why straight away: the SAN was only storing the changed data!
My 10GB file was using less than 150MB, because the rest was filled with 0's and the Compellent does not write large blocks of sequential zeros...
Volume with 10GB data file...

As a result, my test file was easily fitting into cache - it was basically just metadata - which also explains why R5 and R10 were performing the same: 100% cache hit for both read and write!!

Disabling the cache allowed me to get a bit further but the fantastic thin write technology means that I'm going to have to get creative if I want to continue using SQLIO in future.

So lessons learned (by me) are:
1) When doing testing, always have expectations of what the outcome should be, both as a sanity check and to make sure the results are valid.
2) Make sure that your synthetic test works for the platform you are testing!


P.S. I subsequently found a good explanation by Grant Fritchey of how SQLIO works, and of similar phenomena, which confirms my observations:
http://www.simple-talk.com/community/blogs/scary/archive/2011/06/28/102058.aspx

Sunday 15 January 2012

Framed VS Frameless SAN

It's been a while since I posted on storage theory - this time I'm briefly covering the difference between framed SANs like Compellent, 3Par, EMC etc. and frameless SANs like Equallogic, Lefthand, Solidfire etc.

Framed

There are now 2 major SAN philosophies in terms of design. The traditional design is called framed and is characterised by head units with disks attached. This stems from the principle of taking a pair of servers, attaching a lot of disks and sharing them out over a SAN. 
The upside is that framed SANs have been around for a long time and vendors have become extremely good at building them. However, their reputation has generally led to them charging a premium when you want additional functionality such as replication.
They can potentially be scaled up to the biggest storage clusters and can have extensive connectivity options. The biggest downside is that it is very easy to spec the head unit incorrectly and buy too small or too big. Buy too small and the head units become a bottleneck, with excessively high upgrade costs – in some cases a complete replacement. Buy too big and you massively over-spend on a unit that will be underutilised for its entire life.
There are exceptions where a vendor only offers one head unit option no matter what capacity you buy. In general, though, performance on framed SANs can only scale by adding spindles, and once a bottleneck is reached in the head unit, any further scalability can be costly.



Frameless

Frameless SANs are a newer generation, characterised by groups of self-contained units, each with their own controllers, disk, cache and connectivity. As more units are added, capacity and performance scale with them. Groups of these units are usually managed as a single SAN, and volumes usually span multiple units to gain performance or redundancy - sometimes both. The biggest advantage of this model is that you are never overspending on a head unit, and the head unit does not become a bottleneck for the disks. The downside is that adding capacity generally gets more expensive the larger you grow, as you are buying more than a shelf of disks each time you add to the SAN: you are buying controllers, cache and capacity, and this all adds to the incremental cost. Maximum scalability is also capped at the maximum number of units in a group.



As this model is relatively new, the vendors behind it needed to add to the value proposition, and several of them throw in all the software value-adds for the base price. This means that, from a cost point of view, they can be far more attractive than the old framed SAN model if you want all the functionality they can provide, even if the incremental expansion cost is higher.


The bottom line - neither solution is perfect, and both have use cases in almost every level of business.

Thursday 12 January 2012

Why Compellent is going to be more efficient than Equallogic … for me…


I tweeted about EQL and Compellent efficiency the other day, and thought I would elaborate a little, to clarify my point.

First off, our Equallogic units are great, especially our SUMOs. We are still going to be using them for the next few years, and I do think we made the correct choice in buying them just over a year ago.

Going forward we will be using Compellent and these are a few of the efficiency reasons why:

Block size
Equallogic uses a 15MB block size compared to Compellent's 2MB or 512KB. For structured data that makes the EQL an order of magnitude less efficient when it comes to snapshots: a small change can dirty an entire 15MB block, all of which then has to be preserved by the snapshot. I have seen a log file of a few hundred MB generate a snapshot of hundreds of GB on the EQL, which becomes expensive on their SSD units. On the flip side, if your app is sequentially adding data to a file, the block size is completely irrelevant, so it does come down to your application.
I do not believe this is something they can change on a whim; it is something you buy into when you buy EQL.

Thin provisioning
On EQL right now there is no space reclamation on thin provisioned volumes. As a result, once you have written to a LUN, the array doesn't know when you have deleted data (under Windows anyway), and the volume eventually becomes fat provisioned. Compellent has an agent that communicates with the array and tells it which blocks can be freed up, allowing thin provisioning to reclaim space. In addition, Compellent only writes thin - there are no fat provisioned LUNs - and if it sees the OS zeroing out a large amount of disk space, it doesn't commit those zeros to disk. No more accidental full formats on LUNs…

Pre-allocation
On the EQL you have to pre-allocate LUN space and snapshot space among other things. This is great for knowing that you won’t run out of capacity, but does reduce the efficiency of your space utilisation. Space allocated to snapshots on one LUN cannot be temporarily used by another LUN. On the Compellent arrays you don’t need to pre-allocate snapshot space per LUN but do run a greater risk of running out of capacity on badly managed systems.

Tiering / Data Progression
EQL's implementation of auto-tiering is great for consistent workloads, and fantastic at balancing loads across similar arrays. Where it's not as good is with inconsistent workloads, or in giving you granular control over how data is tiered down to lower-performance disks. I will do a post at some stage about Compellent's tiering, but for now let's just say that it's far more robust and has some efficiency advantages, like restriping snapshot data from R10 to R5. This robust tiering will allow us to use our SSD tier on the Compellent more effectively than we are able to right now on the EQL.

For me these enhancements are still theoretical, as our arrays have arrived but have not gone into production yet. Over the next few months I'm pretty sure I'm going to be giving the Compellents some stick when I get frustrated by their quirks, but for now I'm just really excited to see if they can live up to my expectations!