Sunday, 20 November 2011


Direct Attached Storage

DAS is the de facto standard against which all other storage is compared. Most servers have built in storage and every server admin is comfortable using it. It is easy to implement with SQL and it is a known quantity in terms of support, configuration and performance. It can potentially be the cheapest option for your server and can have a relatively small footprint in terms of rack space. Until recently, DAS was slightly more limited in terms of maximum capacity and maximum IO performance, but this has been drastically changed in recent years by virtue of large capacity near-line SAS drives and PCIe SSD’s. Most SMB workloads can be configured in one way or another to fit on DAS and generally means that a SAN is only required when shared storage is desirable.

The biggest downsides to DAS are that in general it is considered to be dumb storage and it is less flexible than a SAN. There is no fancy functionality such as snapshotting, replication or thin provisioning. Online volume expansion is not really possible if there is no existing spare capacity available and clustering (shared disks) is not an option. Moving volumes is not trivial and unused capacity is stranded on islands scattered among the individual servers.
Of late the functionality gap has started to be filled by software solutions. Some vendors are offering solutions that provide drive replication, snapshotting etc. which allow DAS to fulfil more of the roles that SAN used to dominate. Examples of these are Steeleye Datakeeper from SIOS and DoubleTake from Vision Solutions. Some of these packages will go as far is integrating with MSCS and allow for clustering with DAS, but I have never personally tried to use it.

PCIe SSD

PCIe SSD’s are a disruptive technology which causes us to re-think our standards. It is now possible to get high end SAN performance or better, in a single, half-height card that can slot into any modern server. It integrates its own form of chip level redundancy on the card to handle individual chip failures, and is directly connected to the PCIe bus resulting in the lowest latency possible. Over and above this, bandwidth and capacity can be scaled by striping data over multiple cards and card redundancy can be achieved by mirroring cards. In addition capacities of over a terabyte can be handled by a single card, which means that almost any workload can be run on a standard 2U or 4U server. There are multiple vendors with products like this, the best know being FusionIO (Who also happen to give a lot back to the SQL community in the UK). At present, no SAN can compete with PCIe SSD’s in terms of latency, and few can compete in terms of price/performance. The biggest downside in my opinion is that you cannot use them for clustering, but this has forced me to re-evaluate the use of mirroring in our environment. Cost per TB can also climb quickly when adding redundancy.
At present I see 2 very interesting use cases for PCIe SSD’s in conjunction with SAN’s in the near future. Say you presently have one or 2 mission critical DB’s which also have high IO requirements, but you can’t afford a second SAN for redundancy. Using the PCIe SSD on a mirrored copy will give you complete storage independence on a second copy, give you the IO performance you need at a significantly lower cost and use hardly any rack space. The second use case is for the TempDB when SQL Server 2012 (Denali) comes out. Not only do you get the benefit of lower latency for the TempDB, but you also reduce the load on the SAN and your HBA’s, thereby making your entire storage subsystem more efficient.

There is also the PCIe SSD SAN acceleration option which I will discuss later.


Storage Area Networks

SAN’s have now become common place in all but the smallest businesses. While they used to be the purview of the large corporations, SAN functionality has filtered down in terms of cost and footprint to the point where you can download a trial version and run it as an application on a server. There is a distinction between SAN software and physical SAN infrastructure, but the majority of benefits are present in both.

The fundamental use of a SAN is to share storage between multiple servers. Storage is consolidated into a smaller number of large pools which can be accessed by several servers simultaneously. Generally an individual SAN consists of a pair of servers acting as head units with a large number of storage interconnects and a large number of communications ports. This allows these head units to connect to very large volumes of storage and to share it out over a variety of connection methods to a large number of application servers. By virtue of the large volume of storage that could be connected, SANs were able to deliver both large capacity and potentially higher performance than was possible through DAS. Over time this has been mitigated, but for very large volumes of data, some kind of pooled storage is going to be desirable.

Along with capacity and performance, SAN vendors also started to introduce functionality that could be used to justify the additional cost of the commodity hardware they were packaging. This is where the value add of SAN’s comes in: online capacity expansion, snapshots, thin provisioning, LUN cloning, SAN replication, boot from SAN, storage tiering and increased availability are all possible benefits, depending on the vendor.

The main downsides to SAN solutions are their cost, complexity and of late, latency at the high end.
Some SAN vendors charge a ridiculous premium that is not justified in terms of the value it brings to the business. If you are not going to be using clustering and the value-add features that a SAN can bring, there is no reason to invest in the technology. A SAN does add complexity and usually adds an additional layer of abstraction that can make troubleshooting more difficult. At the very high end, SAN’s will battle to beat out PCIe SSD’s in terms of latency.

Next up - framed vs frameless SAN...

Wednesday, 9 November 2011

Storage Concepts - Part 2


IO Latency

Disk IO latency is the most accurate way to find out if you have a bottleneck on your storage subsystem. Latency is measured in milliseconds, with general guidance suggesting that log latency should be below 10ms and data latency should be below 20ms. In the SME world, the rules are not so hard and fast. I find that we don’t have super consistent workloads, and that short periods of higher latency are acceptable. In data warehouse applications, it is generally preferable to have slightly higher latency if it means that higher overall throughput is achievable. On the other hand, if your business is handling credit card transactions, the faster you can store them, the more money you make, and hence that 10ms for log is FAR too high. (New PCI SSD’s have latencies in the microseconds…) How do I know if disk latency is an issue? Firstly monitor it and see what your latencies are, and then look at your wait stats. If the SQL disk related waits are high and your perfmon counters are telling you that disk latencies are consistently high, then more than likely you may be encountering a storage system bottleneck. (Provided the server has sufficient RAM and you have tuned your most expensive queries.) To measure disk latency, the disk sec per read/ write counters are used. Trend these over time, and look out for busy spikes through the day/week/month. E.g. you will probably notice  latency spikes when running backups, but this is generally acceptable if the timing of the backups is out of the busiest office hours.

Throughput

Throughput is a measure of the maximum sum of concurrent IO’s that can be passed through your storage connectivity medium.  In other words – how big is the pipe connecting the disks to the server. In general this will be your SAN HBA speed or on local disks, your disk connectivity type or RAID controller connection speed. If you have a single disk, this will be the SAS or SATA speed – 3 or 6Gb/s depending on how old it is. If you have a RAID controller or PCIe SSD it will be the bus speed – 4x or 8x and this will translate into a figure in Gb’s. If you have a fibre channel SAN it will probably be 4 or 8Gb/s and if you are using iSCSI, it will be 1 or 10Gb/s.
At small IO sizes or low IO loads, throughput generally does not play a part. However if the IO’s are large this can quickly get saturated and limit the rate at which data can get into the system. Especially in 1GbE iSCSI environments, this will probably be the first storage bottleneck you hit as a single physical disk can read more data than 1GbE can transfer. Remember that when dealing with networking and storage connectivity, sizes are measured in bits – not bytes! To make matters worse for iSCSI, there is TCP overhead and other inefficiencies that reduce the realistic throughput on a 1GbE link to just under 100 MB per second. To compensate for this, most iSCSI SANS use multiple links to give greater overall throughput. This phenomenon is not limited to iSCSI – if you have an older fibre channel SAN you may be limited to 2Gb/s which is almost as bad.
You have to take into account that just because you have a low throughput limit, your access profile may not necessarily be limited by this. If you are hammering your storage array with 8K IO’s chances are that you are not hitting a throughput limit, unless you have a very large number of spindles. There is a direct correlation between IO size and bandwidth – you will need to determine which of these is your bottleneck, if any at all.

IO Size vs Throughput

Queue depth used to be the standard indicator of a storage bottleneck. This was true because you knew how many disks were attached to your server, and if you took that number, multiplied it by 2, you had value by which you could judge if you had a backlog of outstanding IO requests. Since then, a series of changes have invalidated this view. Firstly, with a SAN, it is difficult to know exactly how many spindles your files reside on unless you are the SAN admin as well. Secondly, depending on workload, a higher number of pending IO requests may be beneficial when trying to drive higher throughput – especially on warehouse data drives. Thirdly, SSD’s redefine the number of requests that a particular drive may be able to handle, and hence the 2x multiplier may be completely invalid.
On the other hand, extremely low queue depths may indicate that your storage is not a problem, so the metric does have its uses but should never be used in isolation. 

Next up - some info on DAS and SAN

Tuesday, 8 November 2011

Volume placement in a mixed Equallogic environment


Situation:
So you have a mixture of HDD based EQL’s and SSD based EQL’s in your array, you add the units into the same pool and now you have lost control over what data resides where! Your dev/test environment is using valuable SSD, and half your production DB is sitting on SATA – what do you do?

Maybe if we had so 6110S's we would have some space for snapshots...
Background:
First – the reasons for this happening: EQL’s try to proportionately distribute data across all members in a pool. So you have 8TB of SATA and 2TB on SSD, it will put 80% of each LUN on SATA and 20% on SDD. Which 20% is out of your control, unless you have Automated Performance Load Balancing turned on, in which case it should be the most used 20%. When you turn on APLB, you will also find that the percentages will start to skew over time, so the percentage on SSD should start to drop on quiet LUNS and grow a little on hot LUNs. However, you still can’t directly control what goes where.

Your options:
The easy option is to split the SSD and HDD units into separate pools and move the LUNS accordingly, but then you basically have trapped islands of capacity and performance. You don’t get any APLB benefit and when you run out of capacity in a pool, you are out.

On the other hand there are 2 other mechanisms that can be employed – neither is perfect, but at least you have some options.

Firstly there is volume binding. This is a CLI only operation allowing you to specify where a LUN must reside. In our SSD/HDD example, you could bind a LUN to the SSD and then you know that the data will always be there. On the negative side – what happens if you run out of snapshot space on that member? You also cannot bind a LUN that is bigger than a single member which could be an issue if you are using multiple SSD units to give you the capacity you require.

Next there is RAID preference. If your SSD and HDD members are on different RAID policies, you could specify the RAID type that you want per LUN. This means that if the SSD is RAID50 and the SATA member is RAID10, specifying RAID50 preference on your target LUN would force it to reside on the SDD members even if APLB is enabled. This is a best effort operation, and if the LUN space utilisation grows past the preferred RAID capacity, any other pool capacity is used and the data is re-distributed as per normal. Once capacity is available again, the RAID preference is honoured and data is moved back to the RAID type of choice.
The big advantages of this method are that LUNS can span multiple units, and there are no hard limits in terms of capacity – you just get degraded performance.

Summary:
You have 3 options:
Pool separation – putting members in separate pools – manual and you lose APLB.
Volume binding – specifying a single member to host a LUN – Limited in terms of size and flexibility.
RAID preference – specifying a preferred RAID level if members are on differing RAID levels – best effort but most flexible.

Sunday, 6 November 2011

Storage Concepts - Part 1



SQL Storage 101

Having completed my session at SQLBits, I felt that there was too much information that I didn’t cover as comprehensively as I would have liked. So what I’m going to do is break up some of my session into a series of posts to try flesh out a high level understanding of SQL storage.
Welcome to the first part of SQL Storage for the Real World – Storage 101.

Disk IO

Disk IO are the physical reads and writes that make their way to the storage medium of your SQL server. The two types are writes and reads, and as their name entails, are either written to / read from the drive, from / into memory on the server. SQL server predominantly uses 8K – 64K block sizes, but these can be bigger when the server is doing large sequential operations such as read-ahead’s, backups, checkDB’s etc. In reality you will want to test your storage subsystem on 8k, 64K, 256K and possibly 1024K block sizes to get a feel for what it is capable of. I will expand on testing your storage, especially with SQLIO in the near future.

Read / Write Ratio

To understand the requirements that your storage subsystem will have to fulfil, you will need to find out what the ratio is between reads and writes under peak loads. The reason for this is, depending on the storage you choose, the number of writes the system can sustain will be drastically different depending on this ratio. For example, SSD’s generally have different specifications for reads and writes. Some can be drastically slower to write data, and some can be significantly faster. This will come down to manufacturer and model. Equallogic SSD’s for example have much lower write speeds than read speeds and as a result the specified IO performance of a member is highly dependent on the R/W ratio. On the other hand, FusionIO PCI SSD’s are able to write much faster than read, so read speed may well be the limiting criteria.
On the other hand, individual hard drives have more symmetrical R/W ratios, but are generally used in RAID configurations. Depending on the type of RAID, this can have a massive effect on the number of writes a set number of disks can handle. For example, a RAID 0 stripe has no overhead, and hence the write throughput should be roughly equal to the sum of the number of drives in the stripe and also be equal to the number of reads possible. On the other hand, RAID10 arrays have to write each block twice, and hence are only able to write at half the number of IO’s that they are able to read. This gets progressively worse for RAID 5 and RAID 6. This phenomenon is termed the RAID Factor. Some vendors have put in place mechanisms to mitigate the impact of this – Netapp is an example that comes to mind.
To determine your read/write ratio is relatively easy – simply run perfmon over the busiest periods of disk activity, capturing reads/sec and writes/sec, put this into excel and work out a 95th percentile average. It is common to see a 70/30 read/write ratio on an OLTP or mixed workload. However this can skew massively in both directions - on one of my systems I have seen a continuous 45/55 ratio - meaning that there were more writes than reads.

Random vs Sequential

There are 4 main pillars of disk performance:

·         Random Read
·         Random Write
·         Sequential Read
·         Sequential Write

Each of these profiles needs to be tested and validated against what you will be expecting from your system. Depending on your requirements, some types of data access will be more valuable than others. For example: sequential writes are extremely important for log files, while random reads are more important for OTLP data files. On the other hand, data warehouses generally prefer sequential reads. The good / bad news is that no load is 100% random and no load is 100% sequential and I’m not aware of an accurate way of measuring random vs sequential. You can argue that some access types will be more prevalent in certain types of databases, but remember that almost all disk access is abstracted or virtualised in some way. For example, some SSD controllers cache and then compress incoming writes before they are stored in NAND, more than likely your SAN is not storing your writes in the exact order they came in, on a single disk. In the world of hypervisors, almost all access will end up being random in any case.
As a result, all you can really do is understand what your system is capable of, so that you can firstly check that it is performing as you expect it to, and secondly test your application and see if its performance is sufficient for your needs. How do you know if it is sufficient?

That's where IO latency comes in, and that will be covered in the next post.