Wednesday 22 February 2012

Compellent iSCSI Optimisation on Windows


We now have our shiny new pair of Compellent Storage Centers installed, up and running.
This means that I get to spend some quality time testing, benchmarking and evaluating them before they go into production.
Previously I mentioned that the Compellents were too smart for SQLIO, which disappointed me no end.
Not only did we use it as a tool to compare arrays during the purchase decision, but I find it significantly easier to test with than IOMeter.

Thanks to a suggestion from @tonyholland00 (thank you Tony!!), it turns out that if you pre-allocate the disk and turn off caching, you can get realistic results from SQLIO again. Now most other array admins would cry out in horror at the idea of turning off caching on their arrays, but that's because their arrays are completely dependent on their write cache to deliver their advertised performance. Compellent arrays are different - they only have a tiny (512MB) write cache, are properly architected around the number of spindles required for their target workload, and don't need cache as a crutch. In addition, when testing the SSD tier we have, the SSDs are actually faster than the cache. Pre-allocation is slightly out of the ordinary, but if that last 1-2% is that important to you for bulk loading, you would be pre-allocating your LUN anyway!

So there I was merrily testing away, and I wanted to both confirm the performance I was promised in our POC, and to see how far I could push one of these arrays. I'll definitely do some more posts on my findings, but I came across a bottleneck that may be common in many iSCSI configurations.

We are using 10GbE, and I configured a blade with 4 ports but was only able to get 2.4GB/sec even at bigger block sizes. Interestingly, the performance data looked almost identical to the POC data I received from Dell, but I assumed they had only used 2 HBAs.
All targets were connected, but I couldn't get as much info on the connections as I had been getting with my EQL arrays, as there is no connections tab comparable to what you get with the EQL HIT Kit.
After a bit of investigation, it turned out that we were effectively getting one session on each fault domain (VLAN), restricting us to just over 20Gb/s.
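The back-of-envelope maths lines up with what I was seeing. A minimal sketch (the ~95% efficiency figure for protocol overhead is my own assumption):

```python
def usable_gbytes_per_sec(link_gbps: float, efficiency: float = 0.95) -> float:
    """Convert aggregate line rate in Gb/s to usable GB/s.

    `efficiency` is an assumed factor for TCP/iSCSI protocol overhead.
    """
    return link_gbps * efficiency / 8

# One session per fault domain means only 2 of the 4 x 10GbE ports carry IO:
print(round(usable_gbytes_per_sec(2 * 10), 2))  # ~2.38 GB/s - about the 2.4GB/sec observed
```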

The solution is to manually create a session from each NIC to each target. Through the GUI, you can go to each connection, open its properties, and add a new session. (Remember to enable multi-pathing both when you create the initial connection and when you add the additional session.)
Now that's great if you are setting up a single server going to a single array, but it's going to get very tedious when you are doing large numbers of servers and connecting to more than one array.



We have 2 arrays, dual controller, each with 6 ports - resulting in 48 sessions to be manually configured. That's 10 min of tedium per server that I want to avoid. Thankfully, I found this link:
http://mrshannon.wordpress.com/2010/01/08/making-iscsi-targets-via-cmd/
You configure your list of targets once, and then on each server simply choose which NICs you want to use. Run the batch file, reboot, and you should be good to go.
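If you'd rather generate the session list yourself, the combinatorics are simply every initiator NIC crossed with every target portal. A toy sketch (the IP addresses are hypothetical placeholders, not our real fault domains):

```python
from itertools import product

# Hypothetical flat addressing, purely for illustration - substitute your
# own initiator NICs and Compellent target portal IPs.
nics = [f"10.10.5.{n}" for n in (11, 12, 13, 14)]    # 4 x 10GbE ports
targets = [f"10.10.5.{t}" for t in range(101, 113)]  # 12 target ports

# One session per NIC/portal pair (fault-domain routing ignored for brevity):
sessions = list(product(nics, targets))
print(len(sessions))  # 48 - the same count as the manual exercise above
for nic, portal in sessions[:2]:
    print(f"login initiator {nic} -> portal {portal}:3260")
```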

After configuring the additional sessions, I was able to max out bandwidth on all 4 10GbE ports using a block size as low as 128K on a single LUN.
I suspect this oversight is common - I think the same happened even in our POC setup - and it's not intuitive that you need to do it.
Coming from EQL, you expect it to be done automatically... the HIT Kit makes you lazy... :-)
Hopefully this will help even if you have fewer NICs, by letting you saturate all the bandwidth available.


Friday 3 February 2012

Compellent too clever for SQLIO


I was doing some testing during my Compellent admin training yesterday, and came across some rather interesting behaviour.
My IO testing tool of choice is SQLIO, as it allows for very quick and easy performance baselining of your storage.
It's not SQL specific as the name would suggest; it's command line driven and can easily be adapted to test almost any IO profile.
The best way to test is to create a test file larger than the cache on the SAN, to ensure that the test is not completely cached.
(Unless of course you want to test cache performance...)

In this instance I created a 10GB file, which is significantly bigger than the 3.5GB read cache on the Compellent.
I ran my tests, but considering that the underlying disks were only 5 x 15K drives, the performance was WAY too high. (30K IOPS - that would save a lot on SSDs!)
In addition, RAID 5 was writing as fast as RAID 10 - so something was fishy...
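The sanity check here is simple arithmetic (the ~180 IOPS per 15K spindle is a rough rule-of-thumb planning figure, not a measured one):

```python
def expected_random_iops(spindles: int, iops_per_spindle: int = 180) -> int:
    """Rough ceiling for random IOPS from rotating disks.

    ~180 IOPS per 15K drive is a common rule-of-thumb planning figure.
    """
    return spindles * iops_per_spindle

expected = expected_random_iops(5)  # ~900 IOPS from 5 x 15K drives
measured = 30_000
if measured > 10 * expected:
    print("Order of magnitude too high - suspect caching or thin writes")
```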

Having a look at the disk utilisation on the SAN, I could see why straight away: the SAN was only storing changed data!
My 10GB file was using less than 150MB, because the rest was filled with zeros and the Compellent does not write large blocks of sequential zeros...
Volume with 10GB data file...
As a result my test file was easily fitting into cache - it was basically just metadata, which also explains why R5 and R10 were performing the same - 100% cache hit for both read and write!!
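The behaviour is easy to mimic: treat the volume as fixed-size pages and only "store" pages containing a non-zero byte. A toy sketch (the 4KB page size here is arbitrary - Compellent's real pages are 512KB or 2MB):

```python
def thin_written_bytes(data: bytes, page_size: int = 4096) -> int:
    """Bytes an array with zero-elimination would actually commit:
    pages that are entirely zeros are never written to disk."""
    stored = 0
    for off in range(0, len(data), page_size):
        if any(data[off:off + page_size]):  # any non-zero byte in the page?
            stored += page_size
    return stored

# A "10MB" file that is almost all zeros stores next to nothing:
f = bytearray(10 * 1024 * 1024)
f[0:4096] = b"x" * 4096              # only one page of real data
print(thin_written_bytes(bytes(f)))  # 4096 - the rest is just metadata
```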

Disabling the cache allowed me to get a bit further but the fantastic thin write technology means that I'm going to have to get creative if I want to continue using SQLIO in future.
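One way of getting creative: pre-fill the test file with random data before pointing SQLIO at it, so there are no zero pages for the array to skip. A minimal sketch (the file path and size are placeholders):

```python
import os

def prefill_random(path: str, size_bytes: int, chunk: int = 1024 * 1024) -> None:
    """Write incompressible random data so zero-elimination and thin
    writes cannot shrink the test file on the array."""
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining:
            n = min(chunk, remaining)
            fh.write(os.urandom(n))
            remaining -= n

# prefill_random(r"T:\sqlio_test.dat", 10 * 1024**3)  # hypothetical 10GB test file
```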

So lessons learned (by me) are:
1) When doing testing, always have expectations of what the outcome should be, as both a sanity check and to make sure the results are valid.
2) Make sure that your synthetic test works for the platform you are testing!


P.S. I subsequently found a good explanation of how SQLIO works, and of a similar phenomenon, by Grant Fritchey, which confirms my observations:
http://www.simple-talk.com/community/blogs/scary/archive/2011/06/28/102058.aspx

Sunday 15 January 2012

Framed VS Frameless SAN

It's been a while since I posted on storage theory - this time I'm briefly covering the difference between framed SANs like Compellent, 3Par, EMC etc. and frameless SANs like Equallogic, Lefthand, Solidfire etc.

Framed

There are now two major SAN design philosophies. The traditional design is called framed, and is characterised by head units with disks attached. This stems from the principle of taking a pair of servers, attaching a lot of disks and sharing them out over a SAN.
The upside is that the design has been around for a long time and vendors have gotten extremely good at building them. However, that reputation has generally led to vendors charging a premium when you want additional functionality such as replication.
They can potentially be scaled up to the biggest storage clusters and can have extensive connectivity options. The biggest downside is that it is very easy to incorrectly spec the head unit and buy too small or too big. Buying too small, the head units become a bottleneck and upgrade costs are excessively high - in some cases a complete replacement. Buying too big means a massive amount of over-expenditure on a unit that will be underutilised for its entire life.
There are exceptions, where a vendor has only one option for a head unit no matter what capacity you buy. In general, performance on all framed SANs can only scale in terms of additional spindles, and once a bottleneck is reached in the head unit, any further scalability can be costly.



Frameless

Frameless SANs are a newer generation, characterised by groups of self-contained units, each with their own controllers, disk, cache and connectivity. As more units are added, capacity and performance scale with them. Groups of these units are usually managed as a single SAN, and volumes usually span multiple units to gain performance or redundancy - sometimes both. The biggest advantage of this model is that you never overspend on a head unit, and the head unit does not become a bottleneck for the disks. The downside is that adding capacity generally gets more expensive the larger you grow, as you are buying more than a shelf of disks each time you add to the SAN - you are buying controllers, cache and capacity, and this all adds to the incremental cost. Maximum scalability is also capped at the maximum number of units in a group.



As this model is relatively new, the vendors behind it needed to add to the value proposition, and several of them throw in all the software value-adds for the base price. This means that from a cost point of view, they can be far more attractive than the old framed SAN model if you want all the functionality they can provide, even if the incremental expansion cost is higher.


The bottom line - neither solution is perfect, and both have use cases in almost every level of business.

Thursday 12 January 2012

Why Compellent is going to be more efficient than Equallogic … for me…


I tweeted about EQL and Compellent efficiency the other day, and thought I would elaborate a little, to clarify my point.

First off, our Equallogic units are great, especially our SUMOs. We are still going to be using them for the next few years, and I do think that we made the correct choice in buying them just over a year ago.

Going forward we will be using Compellent and these are a few of the efficiency reasons why:

Block size
Equallogic uses a 15MB block size compared to Compellent's 2MB or 512KB. For structured data that means the EQL is an order of magnitude less efficient when it comes to snapshots. I have seen a log of a few hundred MB generate a snapshot of hundreds of GB on the EQL, which becomes expensive on their SSD units. On the flip side, if your app is sequentially appending data to a file, the block size will be completely irrelevant - so it does come down to your application.
I do not believe this is something they can change on a whim; it is something you buy into when you buy EQL.
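The snapshot cost of scattered small writes is just pages touched times page size, which is how a few hundred MB of log can preserve hundreds of GB. A rough model (it assumes the worst case, where every write lands on a different page):

```python
MB = 1024 ** 2

def snapshot_preserved(pages_touched: int, page_bytes: int) -> int:
    """Copy-on-write cost: each distinct page touched after a snapshot
    is preserved in full, regardless of how few bytes changed in it."""
    return pages_touched * page_bytes

# 10,000 scattered 8KB writes (~78MB of actual change), worst case:
print(snapshot_preserved(10_000, 15 * MB) // 1024**3)    # ~146 GB on 15MB pages
print(snapshot_preserved(10_000, 512 * 1024) // 1024**3) # ~4 GB on 512KB pages
```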

Thin provisioning
On EQL right now there is no space reclamation on thin provisioned volumes. As a result, once you have written to a LUN, the array doesn't know when you have deleted data (under Windows anyway), and the LUN eventually becomes fat provisioned. Compellent has an agent that communicates with the array and tells it which blocks can be freed up, allowing thin provisioning to reclaim space. In addition, Compellent only writes thin - there are no fat provisioned LUNs - and if it sees the OS zeroing out a large amount of disk space, it doesn't commit the zeros to disk. No more accidental full formats on LUNs...

Pre-allocation
On the EQL you have to pre-allocate LUN space and snapshot space, among other things. This is great for knowing that you won't run out of capacity, but it does reduce the efficiency of your space utilisation: space allocated to snapshots on one LUN cannot be temporarily used by another LUN. On the Compellent arrays you don't need to pre-allocate snapshot space per LUN, but you do run a greater risk of running out of capacity on badly managed systems.

Tiering / Data Progression
EQL's implementation of auto tiering is great for consistent workloads, and fantastic at balancing loads across similar arrays. Where it's not as good is with inconsistent workloads, or in giving you granular control over how data is tiered down to lower performance disks. I will do a post at some stage about Compellent's tiering, but for now let's just say that it's far more robust and has some efficiency advantages, like restriping snapshot data from R10 to R5. This robust tiering will allow us to use our SSD tier more effectively on the Compellent than we are able to right now on the EQL.

For me these enhancements are still theoretical, as our arrays have arrived but have not yet gone into production. Over the next few months I'm pretty sure I'm going to be giving the Compellents some stick when I get frustrated by their quirks, but for now I'm just really excited to see if they can live up to my expectations!

Sunday 20 November 2011


Direct Attached Storage

DAS is the de facto standard against which all other storage is compared. Most servers have built-in storage, and every server admin is comfortable using it. It is easy to implement with SQL, and it is a known quantity in terms of support, configuration and performance. It can potentially be the cheapest option for your server and can have a relatively small footprint in terms of rack space. Until recently, DAS was somewhat limited in terms of maximum capacity and maximum IO performance, but this has changed drastically in recent years by virtue of large capacity near-line SAS drives and PCIe SSDs. Most SMB workloads can be configured in one way or another to fit on DAS, which generally means that a SAN is only required when shared storage is desirable.

The biggest downsides to DAS are that in general it is considered to be dumb storage and it is less flexible than a SAN. There is no fancy functionality such as snapshotting, replication or thin provisioning. Online volume expansion is not really possible if there is no existing spare capacity available and clustering (shared disks) is not an option. Moving volumes is not trivial and unused capacity is stranded on islands scattered among the individual servers.
Of late, the functionality gap has started to be filled by software solutions. Some vendors offer products providing drive replication, snapshotting etc., which allow DAS to fulfil more of the roles that SAN used to dominate. Examples are Steeleye Datakeeper from SIOS and DoubleTake from Vision Solutions. Some of these packages go as far as integrating with MSCS and allow for clustering with DAS, but I have never personally tried it.

PCIe SSD

PCIe SSDs are a disruptive technology which forces us to re-think our standards. It is now possible to get high end SAN performance or better in a single, half-height card that can slot into any modern server. It integrates its own form of chip level redundancy on the card to handle individual chip failures, and is directly connected to the PCIe bus, resulting in the lowest latency possible. Over and above this, bandwidth and capacity can be scaled by striping data over multiple cards, and card redundancy can be achieved by mirroring cards. In addition, capacities of over a terabyte can be handled by a single card, which means that almost any workload can be run on a standard 2U or 4U server. There are multiple vendors with products like this, the best known being FusionIO (who also happen to give a lot back to the SQL community in the UK). At present, no SAN can compete with PCIe SSDs in terms of latency, and few can compete in terms of price/performance. The biggest downside in my opinion is that you cannot use them for clustering, but this has forced me to re-evaluate the use of mirroring in our environment. Cost per TB can also climb quickly when adding redundancy.
At present I see two very interesting use cases for PCIe SSDs in conjunction with SANs in the near future. Say you presently have one or two mission critical DBs which also have high IO requirements, but you can't afford a second SAN for redundancy. Using a PCIe SSD for a mirrored copy will give you complete storage independence on a second copy, give you the IO performance you need at a significantly lower cost, and use hardly any rack space. The second use case is for TempDB when SQL Server 2012 (Denali) comes out. Not only do you get the benefit of lower latency for TempDB, but you also reduce the load on the SAN and your HBAs, thereby making your entire storage subsystem more efficient.

There is also the PCIe SSD SAN acceleration option which I will discuss later.


Storage Area Networks

SANs have now become commonplace in all but the smallest businesses. While they used to be the purview of large corporations, SAN functionality has filtered down in terms of cost and footprint to the point where you can download a trial version and run it as an application on a server. There is a distinction between SAN software and physical SAN infrastructure, but the majority of the benefits are present in both.

The fundamental use of a SAN is to share storage between multiple servers. Storage is consolidated into a smaller number of large pools which can be accessed by several servers simultaneously. Generally an individual SAN consists of a pair of servers acting as head units with a large number of storage interconnects and a large number of communications ports. This allows these head units to connect to very large volumes of storage and to share it out over a variety of connection methods to a large number of application servers. By virtue of the large volume of storage that could be connected, SANs were able to deliver both large capacity and potentially higher performance than was possible through DAS. Over time this has been mitigated, but for very large volumes of data, some kind of pooled storage is going to be desirable.

Along with capacity and performance, SAN vendors also started to introduce functionality to justify the additional cost of the commodity hardware they were packaging. This is where the value add of SANs comes in: online capacity expansion, snapshots, thin provisioning, LUN cloning, SAN replication, boot from SAN, storage tiering and increased availability are all possible benefits, depending on the vendor.

The main downsides to SAN solutions are their cost, complexity and, of late, latency at the high end.
Some SAN vendors charge a ridiculous premium that is not justified in terms of the value it brings to the business. If you are not going to be using clustering and the value-add features that a SAN can bring, there is no reason to invest in the technology. A SAN does add complexity, and usually adds an additional layer of abstraction that can make troubleshooting more difficult. At the very high end, SANs will battle to beat PCIe SSDs in terms of latency.

Next up - framed vs frameless SAN...

Wednesday 9 November 2011

Storage Concepts - Part 2


IO Latency

Disk IO latency is the most accurate way to find out if you have a bottleneck on your storage subsystem. Latency is measured in milliseconds, with general guidance suggesting that log latency should be below 10ms and data latency below 20ms. In the SME world the rules are not so hard and fast: I find that we don't have super consistent workloads, and that short periods of higher latency are acceptable. In data warehouse applications, it is generally preferable to have slightly higher latency if it means that higher overall throughput is achievable. On the other hand, if your business is handling credit card transactions, the faster you can store them, the more money you make, and that 10ms for log is FAR too high. (New PCIe SSDs have latencies in the microseconds...)

How do I know if disk latency is an issue? Firstly, monitor it and see what your latencies are, then look at your wait stats. If the SQL disk related waits are high and your perfmon counters are telling you that disk latencies are consistently high, then more than likely you are encountering a storage subsystem bottleneck. (Provided the server has sufficient RAM and you have tuned your most expensive queries.)

To measure disk latency, use the Avg. Disk sec/Read and Avg. Disk sec/Write counters. Trend these over time, and look out for busy spikes through the day/week/month. E.g. you will probably notice latency spikes when running backups, but this is generally acceptable if the backups run outside the busiest office hours.
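As a trivial example of trending those counters, you can flag a volume when more than a small fraction of samples breach the threshold (the 5% tolerance here is my own arbitrary choice, matching the "short spikes are acceptable" view above):

```python
def latency_acceptable(samples_ms, threshold_ms, tolerance=0.05):
    """True if at most `tolerance` of the perfmon latency samples
    (e.g. Avg. Disk sec/Read, converted to ms) exceed the threshold."""
    over = sum(1 for s in samples_ms if s > threshold_ms)
    return over <= tolerance * len(samples_ms)

log_latencies = [2, 3, 2, 4, 2, 3, 25, 3, 2, 2]  # ms; one backup-window spike
print(latency_acceptable(log_latencies, threshold_ms=10))  # False - 10% of samples over 10ms
```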

Throughput

Throughput is a measure of the maximum sum of concurrent IOs that can be passed through your storage connectivity medium. In other words - how big is the pipe connecting the disks to the server? In general this will be your SAN HBA speed, or on local disks, your disk connectivity type or RAID controller connection speed. If you have a single disk, this will be the SAS or SATA speed - 3 or 6Gb/s depending on how old it is. If you have a RAID controller or PCIe SSD, it will be the bus speed - 4x or 8x lanes, which translates into a figure in Gb/s. If you have a fibre channel SAN it will probably be 4 or 8Gb/s, and if you are using iSCSI, 1 or 10Gb/s.
At small IO sizes or low IO loads, throughput generally does not play a part. However, if the IOs are large, the pipe can quickly get saturated and limit the rate at which data gets into the system. Especially in 1GbE iSCSI environments, this will probably be the first storage bottleneck you hit, as a single physical disk can read more data than 1GbE can transfer. Remember that when dealing with networking and storage connectivity, speeds are measured in bits - not bytes! To make matters worse for iSCSI, there is TCP overhead and other inefficiencies that reduce the realistic throughput of a 1GbE link to just under 100MB per second. To compensate, most iSCSI SANs use multiple links to give greater overall throughput. This phenomenon is not limited to iSCSI - if you have an older fibre channel SAN you may be limited to 2Gb/s, which is almost as bad.
Just because you have a low throughput limit does not mean your access profile will necessarily hit it. If you are hammering your storage array with 8K IOs, chances are that you are not hitting a throughput limit unless you have a very large number of spindles. There is a direct correlation between IO size and bandwidth - you will need to determine which of these is your bottleneck, if any at all.
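Working out whether the pipe or the spindles is the limit is straightforward arithmetic (the 90% efficiency factor is an assumed allowance for protocol overhead):

```python
def link_saturated(iops: float, io_kb: float, link_gbps: float,
                   efficiency: float = 0.9) -> bool:
    """True if the workload's bandwidth demand exceeds the usable link rate."""
    demand_mb_s = iops * io_kb / 1024          # MB/s the workload pushes
    usable_mb_s = link_gbps * 1000 / 8 * efficiency  # bits -> bytes, minus overhead
    return demand_mb_s > usable_mb_s

print(link_saturated(10_000, 8, 1))  # False: 8K IOs = ~78MB/s, fits in 1GbE
print(link_saturated(2_000, 64, 1))  # True: 64K IOs = 125MB/s, 1GbE is the bottleneck
```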

Queue Depth

Queue depth used to be the standard indicator of a storage bottleneck. That was true because you knew how many disks were attached to your server, and if you multiplied that number by 2, you had a value by which you could judge whether you had a backlog of outstanding IO requests. Since then, a series of changes have invalidated this view. Firstly, with a SAN it is difficult to know exactly how many spindles your files reside on unless you are the SAN admin as well. Secondly, depending on workload, a higher number of pending IO requests may be beneficial when trying to drive higher throughput - especially on warehouse data drives. Thirdly, SSDs redefine the number of requests that a particular drive can handle, so the 2x multiplier may be completely invalid.
On the other hand, extremely low queue depths may indicate that your storage is not a problem, so the metric does have its uses - it should just never be used in isolation.
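For what it's worth, the old rule of thumb looked like this, and the SSD case shows exactly where it falls over (the per-device numbers are illustrative):

```python
def backlog_by_old_rule(queue_depth: float, spindles: int) -> bool:
    """Legacy heuristic: flag a backlog when outstanding IOs > 2 x spindles."""
    return queue_depth > 2 * spindles

print(backlog_by_old_rule(25, 10))  # True: 25 outstanding on 10 disks looked bad
print(backlog_by_old_rule(25, 1))   # True - but a single SSD may service a
                                    # queue of 32+ without breaking a sweat
```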

Next up - some info on DAS and SAN

Tuesday 8 November 2011

Volume placement in a mixed Equallogic environment


Situation:
So you have a mixture of HDD based EQLs and SSD based EQLs, you add the units into the same pool, and now you have lost control over what data resides where! Your dev/test environment is using valuable SSD, and half your production DB is sitting on SATA - what do you do?

Maybe if we had some 6110S's we would have some space for snapshots...
Background:
First - the reasons for this happening: EQLs try to proportionately distribute data across all members in a pool. So if you have 8TB of SATA and 2TB of SSD, it will put 80% of each LUN on SATA and 20% on SSD. Which 20% is out of your control, unless you have Automated Performance Load Balancing turned on, in which case it should be the most used 20%. When you turn on APLB, you will also find that the percentages start to skew over time, so the percentage on SSD should start to drop on quiet LUNs and grow a little on hot LUNs. However, you still can't directly control what goes where.
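The proportional placement is easy to model (the member capacities below are illustrative):

```python
def proportional_placement(lun_gb: float, members: dict) -> dict:
    """Each pool member holds a slice of the LUN proportional to its
    share of total pool capacity - the default EQL behaviour."""
    total = sum(members.values())
    return {name: round(lun_gb * cap / total, 1) for name, cap in members.items()}

print(proportional_placement(100, {"SATA member": 8000, "SSD member": 2000}))
# {'SATA member': 80.0, 'SSD member': 20.0}
```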

Your options:
The easy option is to split the SSD and HDD units into separate pools and move the LUNs accordingly, but then you basically have trapped islands of capacity and performance. You don't get any APLB benefit, and when you run out of capacity in a pool, you are out.

On the other hand there are 2 other mechanisms that can be employed – neither is perfect, but at least you have some options.

Firstly there is volume binding. This is a CLI-only operation allowing you to specify where a LUN must reside. In our SSD/HDD example, you could bind a LUN to the SSD member, and then you know the data will always be there. On the negative side - what happens if you run out of snapshot space on that member? You also cannot bind a LUN that is bigger than a single member, which could be an issue if you are using multiple SSD units to get the capacity you require.

Next there is RAID preference. If your SSD and HDD members are on different RAID policies, you can specify the RAID type you want per LUN. This means that if the SSD member is RAID50 and the SATA member is RAID10, specifying a RAID50 preference on your target LUN would force it to reside on the SSD members even if APLB is enabled. This is a best-effort operation: if the LUN's space utilisation grows past the preferred RAID capacity, any other pool capacity is used and the data is redistributed as per normal. Once capacity is available again, the RAID preference is honoured and data is moved back to the RAID type of choice.
The big advantages of this method are that LUNs can span multiple units, and there are no hard limits in terms of capacity - you just get degraded performance.

Summary:
You have 3 options:
Pool separation - putting members in separate pools - manual, and you lose APLB.
Volume binding - specifying a single member to host a LUN - limited in terms of size and flexibility.
RAID preference - specifying a preferred RAID level if members are on differing RAID levels - best effort, but most flexible.