Sunday 20 November 2011


Direct Attached Storage

DAS is the de facto standard against which all other storage is compared. Most servers have built-in storage and every server admin is comfortable using it. It is easy to implement with SQL, and it is a known quantity in terms of support, configuration and performance. It can potentially be the cheapest option for your server and can have a relatively small footprint in terms of rack space. Until recently, DAS was more limited than SAN in terms of maximum capacity and maximum IO performance, but large-capacity near-line SAS drives and PCIe SSDs have drastically changed this in recent years. Most SMB workloads can be configured in one way or another to fit on DAS, which generally means that a SAN is only required when shared storage is desirable.

The biggest downsides to DAS are that it is generally considered to be dumb storage and that it is less flexible than a SAN. There is no fancy functionality such as snapshotting, replication or thin provisioning. Online volume expansion is not really possible unless spare capacity is already available, and clustering (shared disks) is not an option. Moving volumes is not trivial, and unused capacity is stranded on islands scattered among the individual servers.
Of late the functionality gap has started to be filled by software solutions. Some vendors offer products that provide drive replication, snapshotting etc., which allow DAS to fulfil more of the roles that SAN used to dominate. Examples of these are Steeleye Datakeeper from SIOS and DoubleTake from Vision Solutions. Some of these packages will go as far as integrating with MSCS and allow for clustering with DAS, but I have never personally tried to use it.

PCIe SSD

PCIe SSDs are a disruptive technology that forces us to re-think our standards. It is now possible to get high-end SAN performance or better in a single, half-height card that can slot into any modern server. The card integrates its own form of chip-level redundancy to handle individual chip failures, and it is connected directly to the PCIe bus, resulting in the lowest latency possible. Over and above this, bandwidth and capacity can be scaled by striping data over multiple cards, and card redundancy can be achieved by mirroring cards. In addition, capacities of over a terabyte can be handled by a single card, which means that almost any workload can be run on a standard 2U or 4U server. There are multiple vendors with products like this, the best known being FusionIO (who also happen to give a lot back to the SQL community in the UK). At present, no SAN can compete with PCIe SSDs in terms of latency, and few can compete in terms of price/performance. The biggest downside in my opinion is that you cannot use them for clustering, but this has forced me to re-evaluate the use of mirroring in our environment. Cost per TB can also climb quickly when adding redundancy.
At present I see two very interesting use cases for PCIe SSDs in conjunction with SANs in the near future. Say you presently have one or two mission-critical DBs which also have high IO requirements, but you can’t afford a second SAN for redundancy. Using a PCIe SSD to host a mirrored copy gives you complete storage independence for that second copy, gives you the IO performance you need at a significantly lower cost, and uses hardly any rack space. The second use case is for TempDB when SQL Server 2012 (Denali) comes out. Not only do you get the benefit of lower latency for TempDB, but you also reduce the load on the SAN and your HBAs, thereby making your entire storage subsystem more efficient.

There is also the PCIe SSD SAN acceleration option which I will discuss later.


Storage Area Networks

SANs have now become commonplace in all but the smallest businesses. While they used to be the purview of large corporations, SAN functionality has filtered down in terms of cost and footprint to the point where you can download a trial version and run it as an application on a server. There is a distinction between SAN software and physical SAN infrastructure, but the majority of the benefits are present in both.

The fundamental use of a SAN is to share storage between multiple servers. Storage is consolidated into a smaller number of large pools which can be accessed by several servers simultaneously. Generally, an individual SAN consists of a pair of servers acting as head units, with a large number of storage interconnects and communications ports. This allows the head units to connect to very large volumes of storage and to share it out, over a variety of connection methods, to a large number of application servers. By virtue of the large volume of storage that could be connected, SANs were able to deliver both larger capacity and potentially higher performance than was possible through DAS. Over time this advantage has been eroded, but for very large volumes of data, some kind of pooled storage is still going to be desirable.

Along with capacity and performance, SAN vendors also started to introduce functionality that could be used to justify the premium over the commodity hardware they were packaging. This is where the value-add of SANs comes in: online capacity expansion, snapshots, thin provisioning, LUN cloning, SAN replication, boot from SAN, storage tiering and increased availability are all possible benefits, depending on the vendor.

The main downsides to SAN solutions are their cost, their complexity and, of late, latency at the high end.
Some SAN vendors charge a ridiculous premium that is not justified by the value it brings to the business. If you are not going to use clustering or the value-add features that a SAN can bring, there is no reason to invest in the technology. A SAN does add complexity, and it usually adds an additional layer of abstraction that can make troubleshooting more difficult. At the very high end, SANs will struggle to beat PCIe SSDs in terms of latency.

Next up - framed vs frameless SAN...

Wednesday 9 November 2011

Storage Concepts - Part 2


IO Latency

Disk IO latency is the most accurate way to find out whether you have a bottleneck in your storage subsystem. Latency is measured in milliseconds, with general guidance suggesting that log latency should be below 10ms and data latency should be below 20ms. In the SME world the rules are not so hard and fast: I find that we don’t have super-consistent workloads, and that short periods of higher latency are acceptable. In data warehouse applications it is generally preferable to tolerate slightly higher latency if it means that higher overall throughput is achievable. On the other hand, if your business is handling credit card transactions, the faster you can store them the more money you make, and that 10ms for log is FAR too high. (New PCIe SSDs have latencies in the microseconds…)

How do I know if disk latency is an issue? Firstly, monitor it and see what your latencies are, and then look at your wait stats. If the SQL disk-related waits are high and your perfmon counters are telling you that disk latencies are consistently high, then you are more than likely encountering a storage subsystem bottleneck (provided the server has sufficient RAM and you have tuned your most expensive queries). To measure disk latency, use the Avg. Disk sec/Read and Avg. Disk sec/Write counters. Trend these over time, and look out for busy spikes through the day/week/month. For example, you will probably notice latency spikes when running backups, but this is generally acceptable if the backups are timed outside the busiest office hours.
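To make that concrete, here is a minimal Python sketch of the kind of check I’d run against exported perfmon data. The file and column names are hypothetical placeholders – adjust them to whatever your collector (relog, typeperf etc.) actually produces – and remember the counters report seconds, hence the conversion to milliseconds.

import csv

# General guidance from above: log below 10ms, data below 20ms.
LOG_LIMIT_MS = 10.0
DATA_LIMIT_MS = 20.0

def samples_over_limit(path, column, limit_ms):
    """Count perfmon samples whose latency exceeds the limit (counter values are in seconds)."""
    over = total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if float(row[column]) * 1000.0 > limit_ms:
                over += 1
    return over, total

for name, path, limit in [("data", "data_drive.csv", DATA_LIMIT_MS),
                          ("log", "log_drive.csv", LOG_LIMIT_MS)]:
    over, total = samples_over_limit(path, "avg_disk_sec_per_read", limit)
    print(f"{name}: {over} of {total} samples above {limit}ms")

A handful of breaches during a backup window is fine; a high percentage during business hours is the point at which I start looking at wait stats and the storage subsystem.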

Throughput

Throughput is a measure of how much data can be pushed through your storage connectivity medium at once – in other words, how big the pipe connecting the disks to the server is. In general this will be your SAN HBA speed or, on local disks, your disk connectivity type or RAID controller connection speed. If you have a single disk, this will be the SAS or SATA speed – 3 or 6Gb/s depending on how old it is. If you have a RAID controller or PCIe SSD it will be the PCIe bus speed – x4 or x8 – which translates into a figure in Gb/s. If you have a fibre channel SAN it will probably be 4 or 8Gb/s, and if you are using iSCSI it will be 1 or 10Gb/s.
At small IO sizes or low IO loads, throughput generally does not play a part. However, if the IOs are large, the pipe can quickly become saturated and limit the rate at which data can get into the system. Especially in 1GbE iSCSI environments this will probably be the first storage bottleneck you hit, as a single physical disk can read data faster than 1GbE can transfer it. Remember that when dealing with networking and storage connectivity, speeds are measured in bits – not bytes! To make matters worse for iSCSI, there is TCP overhead and other inefficiencies that reduce the realistic throughput of a 1GbE link to just under 100 MB per second. To compensate for this, most iSCSI SANs use multiple links to give greater overall throughput. This phenomenon is not limited to iSCSI – if you have an older fibre channel SAN you may be limited to 2Gb/s, which is almost as bad.
Bear in mind that just because you have a low throughput limit, your access profile will not necessarily run into it. If you are hammering your storage array with 8K IOs, chances are that you are not hitting a throughput limit unless you have a very large number of spindles. There is a direct correlation between IO size and bandwidth – you will need to determine which of these is your bottleneck, if any at all.
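As a rough back-of-the-envelope check, here is a small Python sketch – the figures are illustrative assumptions, not measurements – that converts a link speed from gigabits per second into usable megabytes per second and works out the bandwidth generated by a given IO load, so you can see whether the pipe or the spindles will give out first.

def link_mb_per_sec(gbits, efficiency=0.8):
    """Convert a link speed in Gb/s to usable MB/s, allowing for protocol overhead."""
    return gbits * 1000 / 8 * efficiency

def load_mb_per_sec(iops, io_size_kb):
    """Bandwidth generated by a given IO rate and IO size."""
    return iops * io_size_kb / 1024

print(link_mb_per_sec(1))            # 1GbE iSCSI: roughly 100 MB/s usable
print(load_mb_per_sec(5000, 8))      # 5,000 x 8K IOs: ~39 MB/s - IOPS-bound, not pipe-bound
print(load_mb_per_sec(500, 256))     # 500 x 256K IOs: ~125 MB/s - saturates a 1GbE link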

Queue Depth

Queue depth used to be the standard indicator of a storage bottleneck. This was true because you knew how many disks were attached to your server; if you took that number and multiplied it by two, you had a value against which you could judge whether you had a backlog of outstanding IO requests. Since then, a series of changes have invalidated this view. Firstly, with a SAN it is difficult to know exactly how many spindles your files reside on unless you are the SAN admin as well. Secondly, depending on workload, a higher number of pending IO requests may be beneficial when trying to drive higher throughput – especially on warehouse data drives. Thirdly, SSDs redefine the number of requests that a particular drive can handle, and hence the 2x multiplier may be completely invalid.
On the other hand, extremely low queue depths may indicate that your storage is not a problem, so the metric does have its uses but should never be used in isolation. 
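For what it is worth, here is the old rule of thumb expressed as a tiny Python sketch. The spindle count and queue figures are made-up inputs, and as explained above, the 2x multiplier deserves suspicion on SANs and SSDs – treat it as one signal among several.

def queue_backlog_suspected(current_queue, spindles, multiplier=2):
    """Classic guidance: a sustained queue deeper than spindles x 2 suggests a backlog."""
    return current_queue > spindles * multiplier

print(queue_backlog_suspected(current_queue=48, spindles=12))  # True - worth a closer look
print(queue_backlog_suspected(current_queue=10, spindles=12))  # False - probably fine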

Next up - some info on DAS and SAN

Tuesday 8 November 2011

Volume placement in a mixed Equallogic environment


Situation:
So you have a mixture of HDD-based EQLs and SSD-based EQLs in your array; you add the units into the same pool and now you have lost control over what data resides where! Your dev/test environment is using valuable SSD, and half your production DB is sitting on SATA – what do you do?

Maybe if we had some 6110S's we would have some space for snapshots...
Background:
First – the reason this happens: EQLs try to distribute data proportionately across all members in a pool. So if you have 8TB of SATA and 2TB of SSD, it will put 80% of each LUN on SATA and 20% on SSD. Which 20% is out of your control, unless you have Automated Performance Load Balancing turned on, in which case it should be the most used 20%. When you turn on APLB, you will also find that the percentages start to skew over time, so the percentage on SSD should start to drop on quiet LUNs and grow a little on hot LUNs. However, you still can’t directly control what goes where.
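As a quick illustration of that proportional placement, here is a small Python sketch using the 8TB/2TB example above – the numbers are purely illustrative and nothing here is EQL-specific.

def proportional_split(lun_gb, member_capacities_gb):
    """Split a LUN across pool members in proportion to each member's capacity."""
    total = sum(member_capacities_gb.values())
    return {m: round(lun_gb * cap / total) for m, cap in member_capacities_gb.items()}

# A 500GB LUN in a pool with 8TB of SATA and 2TB of SSD lands as 400GB / 100GB.
print(proportional_split(500, {"SATA member": 8000, "SSD member": 2000}))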

Your options:
The easy option is to split the SSD and HDD units into separate pools and move the LUNs accordingly, but then you basically have trapped islands of capacity and performance. You don’t get any APLB benefit, and when you run out of capacity in a pool, you are out.

There are, however, two other mechanisms that can be employed – neither is perfect, but at least you have some options.

Firstly there is volume binding. This is a CLI-only operation allowing you to specify which member a LUN must reside on. In our SSD/HDD example, you could bind a LUN to the SSD member and then you know that the data will always be there. On the negative side – what happens if you run out of snapshot space on that member? You also cannot bind a LUN that is bigger than a single member, which could be an issue if you are using multiple SSD units to give you the capacity you require.

Next there is RAID preference. If your SSD and HDD members are on different RAID policies, you can specify the RAID type that you want per LUN. This means that if the SSD members are RAID50 and the SATA member is RAID10, specifying a RAID50 preference on your target LUN will force it to reside on the SSD members even if APLB is enabled. This is a best-effort operation: if the LUN’s space utilisation grows past the preferred RAID capacity, any other pool capacity is used and the data is redistributed as per normal. Once capacity is available again, the RAID preference is honoured and data is moved back to the RAID type of choice.
The big advantages of this method are that LUNs can span multiple units, and there are no hard limits in terms of capacity – you just get degraded performance.

Summary:
You have 3 options:
Pool separation – putting members in separate pools – manual and you lose APLB.
Volume binding – specifying a single member to host a LUN – Limited in terms of size and flexibility.
RAID preference – specifying a preferred RAID level if members are on differing RAID levels – best effort but most flexible.

Sunday 6 November 2011

Storage Concepts - Part 1



SQL Storage 101

Having completed my session at SQLBits, I felt that there was too much information that I didn’t cover as comprehensively as I would have liked. So what I’m going to do is break up some of my session into a series of posts to try to flesh out a high-level understanding of SQL storage.
Welcome to the first part of SQL Storage for the Real World – Storage 101.

Disk IO

Disk IOs are the physical reads and writes that make their way to the storage medium of your SQL Server. The two types are reads and writes: data is either read from the drive into memory on the server, or written from memory out to the drive. SQL Server predominantly uses 8K–64K block sizes, but these can be bigger when the server is doing large sequential operations such as read-ahead, backups, CheckDB etc. In reality you will want to test your storage subsystem at 8K, 64K, 256K and possibly 1024K block sizes to get a feel for what it is capable of. I will expand on testing your storage, especially with SQLIO, in the near future.
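Until that post, here is a small Python sketch of the test matrix I have in mind – block sizes, access patterns and operations to step through with whichever tool you prefer (SQLIO, IOMeter etc.); nothing here is tool-specific, it simply enumerates the runs.

from itertools import product

block_sizes_kb = [8, 64, 256, 1024]
patterns = ["random", "sequential"]
operations = ["read", "write"]

# Each combination is one test run - record IOPS, MB/s and latency for every one.
for size_kb, pattern, op in product(block_sizes_kb, patterns, operations):
    print(f"test: {size_kb}K {pattern} {op}")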

Read / Write Ratio

To understand the requirements that your storage subsystem will have to fulfil, you will need to find out what the ratio between reads and writes is under peak load. The reason is that, depending on the storage you choose, the number of writes the system can sustain will differ drastically depending on this ratio. For example, SSDs generally have different specifications for reads and writes. Some can be drastically slower at writing data, and some can be significantly faster; this comes down to manufacturer and model. Equallogic SSDs, for example, have much lower write speeds than read speeds, and as a result the specified IO performance of a member is highly dependent on the R/W ratio. On the other hand, FusionIO PCIe SSDs are able to write much faster than they read, so read speed may well be the limiting criterion.
Individual hard drives, by contrast, have more symmetrical read/write performance, but are generally used in RAID configurations. Depending on the type of RAID, this can have a massive effect on the number of writes a set number of disks can handle. For example, a RAID 0 stripe has no overhead, and hence its write throughput should be roughly equal to the sum of the individual drives’ throughput, and also equal to its read throughput. RAID 10 arrays, on the other hand, have to write each block twice, and hence are only able to write at half the number of IOs that they are able to read. This gets progressively worse for RAID 5 and RAID 6. This phenomenon is termed the RAID factor (often called the write penalty). Some vendors have put mechanisms in place to mitigate the impact of this – Netapp is an example that comes to mind.
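To put numbers on the RAID factor, here is a rough Python sketch using the commonly quoted write penalties (RAID 0 = 1, RAID 10 = 2, RAID 5 = 4, RAID 6 = 6). The per-disk IOPS figure is an assumption for illustration, not a specification.

WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}

def usable_iops(disks, iops_per_disk, raid_level, read_fraction):
    """Front-end IOPS an array can sustain once the RAID write penalty is applied."""
    raw = disks * iops_per_disk
    write_fraction = 1 - read_fraction
    # Each front-end write costs 'penalty' back-end IOs; each read costs one.
    return raw / (read_fraction + write_fraction * WRITE_PENALTY[raid_level])

# 24 disks at an assumed 150 IOPS each, 70/30 read/write mix
for level in WRITE_PENALTY:
    print(level, round(usable_iops(24, 150, level, 0.7)))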
Determining your read/write ratio is relatively easy – simply run perfmon over the busiest periods of disk activity, capturing reads/sec and writes/sec, put the data into Excel and work out a 95th-percentile average. It is common to see a 70/30 read/write ratio on an OLTP or mixed workload. However, this can skew massively in both directions – on one of my systems I have seen a continuous 45/55 ratio, meaning that there were more writes than reads.
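If you would rather skip Excel, here is a minimal Python equivalent. It assumes you have already extracted the reads/sec and writes/sec samples into lists – the values below are placeholders, not real data.

import statistics

reads_per_sec = [820, 1100, 950, 2300, 1800, 760, 1500]    # placeholder perfmon samples
writes_per_sec = [300, 450, 380, 900, 700, 280, 610]

def p95(samples):
    """95th percentile of a list of perfmon samples."""
    return statistics.quantiles(samples, n=100)[94]

read_p95, write_p95 = p95(reads_per_sec), p95(writes_per_sec)
total = read_p95 + write_p95
print(f"R/W ratio at the 95th percentile: {read_p95 / total:.0%} / {write_p95 / total:.0%}")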

Random vs Sequential

There are 4 main pillars of disk performance:

·         Random Read
·         Random Write
·         Sequential Read
·         Sequential Write

Each of these profiles needs to be tested and validated against what you will be expecting from your system. Depending on your requirements, some types of data access will be more valuable than others. For example, sequential writes are extremely important for log files, while random reads are more important for OLTP data files. Data warehouses, on the other hand, generally prefer sequential reads. The good / bad news is that no load is 100% random and no load is 100% sequential, and I’m not aware of an accurate way of measuring random vs sequential. You can argue that some access types will be more prevalent in certain types of databases, but remember that almost all disk access is abstracted or virtualised in some way. For example, some SSD controllers cache and then compress incoming writes before they are stored in NAND, and more than likely your SAN is not storing your writes in the exact order they came in, on a single disk. In the world of hypervisors, almost all access will end up being random in any case.
As a result, all you can really do is understand what your system is capable of, so that you can firstly check that it is performing as you expect it to, and secondly test your application and see if its performance is sufficient for your needs. How do you know if it is sufficient?

That's where IO latency comes in, and that will be covered in the next post.

Tuesday 18 October 2011

Equallogic Automated Performance Load-balancing


About a week ago I was troubleshooting some latency issues on our Equallogic SAN, and came across a nice illustration of APLB in action. The basic test was this – run a consistent load on a LUN to see if we could reproduce the latency spikes I was intermittently picking up on our SQL instances. That way we could determine if it was a constant issue, or something to do with the load that SQL was putting on the SAN.

Our SAN is made up of the following: 1 x PS6510X and 3 x PS6010S units, all 10GbE, running on their own set of switches. We upgraded to FW 5.1.1-H1 about a month ago (just before H2 was released) and we had enabled load balancing to allow unused SSD capacity to enhance our SUMO.

IOMeter settings:

8K request size
Single worker
100% Read
100% Random
1 outstanding request (To make sure it wasn’t going to affect the rest of the SAN)
10GB test file (To make sure it wasn’t cached)
LUN on the SUMO (to begin with…)

Initially we were seeing reasonably consistent performance and also managed to get some of those spikes to appear.
Avg latency – 8ms
Avg IOPS – 180

I then left this running overnight to see if the issue became more obvious over time.
In the morning, it looked like this:
Avg latency - <1ms
Avg IOPS - 1K+

What had happened is that APLB had kicked in: over time it realised that some blocks in this LUN were doing more work than blocks that had already been migrated to the SSD members, and it proceeded to move them over. By morning the entire 10GB LUN had been moved and was running an order of magnitude faster than it had the day before.

You may be wondering why there were only 1K IOPS when this was sitting on SSD, but remember that I only had a single request outstanding, and therefore was more than likely being throttled by the network round-trip latency rather than the SSD storage itself.
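The arithmetic behind that: with a single outstanding request, each IO has to complete before the next one is issued, so the achievable IOPS is bounded by the round-trip time per IO. A quick sketch with illustrative latencies:

def max_iops_at_queue_depth_one(round_trip_ms):
    """With queue depth 1, throughput is limited to one IO per round trip."""
    return 1000.0 / round_trip_ms

print(round(max_iops_at_queue_depth_one(8)))     # ~125 IOPS if every IO took a constant 8ms
print(round(max_iops_at_queue_depth_one(0.8)))   # ~1,250 IOPS at a 0.8ms round trip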

To be fair, EQL SSD units are not the fastest – in fact I would go as far as to say that they are one of the slowest rackmount 10GbE SSD implementations out there. However, they are faster than any configuration of HDDs you could fit into the same form factor, and we purchased them on the basis that they were good enough for what we wanted to do with them at the time.

So in conclusion:
·         Equallogic load balancing across SAN members works, and does so especially well on consistent loads (the jury is still out on volatile workloads)
·         Mixing SSD and HDD models can have large benefits, in particular if there is a subset of a large LUN that you want to accelerate but don’t want to put the whole LUN on SSD.
·         The load balancing happens automatically – you just turn it on for the group and it will do the rest. This is both a positive and a negative – it works but you lose control of the placement of the data. You can use volume pinning to fix a volume to a member, but that’s another story …
·         Load balancing is not instant – it will take time to kick in. It appears to me that capacity balancing kicks in first, and then performance load balancing kicks in afterwards.

Wednesday 5 October 2011

My final slide deck from SQLbits 9.
http://bit.ly/qtlLyM

Thanks to all who attended - please send in your feedback on my session.

Sunday 25 September 2011

SQLBits Slide Deck

Here is my provisional slide deck for my SQLBits 9 session:
SQL Storage for the Real World:

Any feedback will be appreciated!
I will be presenting in the Aintree room, on Saturday, at 16:00 - hope to see you there!

Wednesday 21 September 2011

SQLBits Prep...

An overview of what I’ll be covering in my SQLBits session

I have pretty much finished up my slides for my session at SQLBits, and now I’m going over the content I want to discuss for each slide. To give you some idea of what I’m going to discuss, here is the summary page from the session:

  • Requirements
  • Storage concepts
  • Storage options and priorities
  • SAN / DAS
  • HA / DR implications
  • Decide on a storage strategy
  • General guidance

It feels fairly easy to talk about storage concepts and technologies for ages on end, but the problem is that it can come across as a little dry. I did a quick run-through with my team last week and I nearly put some of them to sleep. So this week and next are all about making the content interesting!


I was originally going to avoid using vendor names and naming products, but then the whole discussion becomes very theoretical and difficult to relate to. Therefore I’m going to try to be fair and balanced with everyone I discuss and not mention any products that I really don’t like...


I promise to make the session as informative and interesting as possible!

Tuesday 20 September 2011

The reason for my silence is...

My submission for the SQLBits 9 conference (Query across the Mersey) has been accepted!!!
Been spending a lot of time working on my presentation, so hopefully it will go well!

I have never presented at a conference before, so I'm a little anxious but also very excited at the prospect!
My session is the last session on Saturday (community day), so if you can make it I would be most grateful!
It is titled: SQL Storage for the real world.


If it goes well, I will expand it into a series of articles here on my blog.

More info on the conference:
http://sqlbits.com/default.aspx

And this is what I was talking about...

In my previous entry I mentioned that you could build a SAN that performed better and cost less than your commercial SANs out there. Now I'm not going to make light of doing so, but here is an article to validate what I was trying to say:

http://www.theregister.co.uk/2011/09/20/nexenta_vmworld_2011/

"Cheap and cheerful storage from Nexenta matched EMC and NetApp's multi-million-dollar systems in VMworld 2011's Hands On Lab, and took up some of the slack from its rivals when difficulties arose on the first day. EMC and NetApp have cooly brushed off Nexenta's claims it outperformed them."

Really good to see some competition out there, as well as some proof that there isn't too much magic underneath the SAN covers...

Monday 15 August 2011

Build Your Own Storage Array?

So for me it's a massive temptation to build my own storage array and pop it into production...
It's significantly cheaper and possibly more performant than buying a pre-configured brand-name unit, and it would be fun to build. Ideas of slapping multiple PCIe SSDs into a Supermicro chassis and then filling the drive cages with big SATA tubs really excite me! You can get the best of both worlds - blazing fast SSDs plus big storage, all in one unit. Spec up lots of RAM for a really fat cache and you have the potential to blow away most units on the market.

You get some amazing software to give you as little or as much functionality as you could possibly want - eOpen, SAN Symphony, Starwind etc. all give you amazing features - or you could go the VSA route and try Lefthand or another virtual SAN solution...

Yes, I know these ideas are naive, but they are the basis for a lot of the storage companies out there...
So if the hardware is there, and the software is there, then why not...?

Simple - when this solution has problems, all responsibility will lie with you alone. You can't call in the cavalry when you suddenly find there is a massive compatibility issue with a new software version or firmware edition.
I don't know about you, but I have enough stress in my job already, and don't need to be lying in bed at night wondering if a new firmware version is going to fix a bug or create new ones for me... I would rather leave that to a team of professionals who focus all their time and effort on this exact question.

Is there a time I would do it? Yes, definitely. If cost was the single biggest motivating factor for the company, or if I was unable to attain the performance levels that my system required within a specific budget. These kinds of solutions are also great for dev and test, as long as downtime is permissible and you don't have a five-nines SLA. You can build a very big storage solution very cost-effectively in this manner, and you can also build some very fast systems this way (Tony Rogerson's blog comes to mind). But they are not enterprise-class solutions - and if you consider your SQL installation to be enterprise-worthy, then why risk the whole thing for a few pounds?

Sunday 14 August 2011

On Monitoring

Tonight it occurred to me that monitoring is more important than we probably give it credit for.
The way I see it, it should serve three purposes:

1) Alerting - this is obvious but sometimes ineffective. As an example, that email that comes in at 3am to tell you that a critical job has failed is only going to be read when you wake up in the morning. So not only does the alert have to be accurate and timeous, but it also needs to be appropriately delivered.

2) Provide situational information. So you get the dreaded phone call from the night shift that something is broken. You dial in, open your monitoring tools, and you should immediately be greeted with both what went wrong / is going wrong, and what is causing it. I guess it is a little utopian to believe that both cause and effect would be shown that easily, but the goal is that there is at least enough info to diagnose the problem directly from the metrics the monitoring is displaying to you.

3) Trending. I guess most people would classify this under a different category to monitoring, but I believe that trending and performance prediction are an integral aspect of any monitoring solution. All the data you need for your capacity planning should be captured by your monitoring solution, and it is probably already stored in its database right now - the question is, can you access it easily?

I have a fetish for monitoring, and believe that none of the third-party products I have seen adequately addresses all facets of SQL monitoring. Most do a reasonable job, and you could get by with almost any of them, but what are you inadvertently missing? To combat this, I prefer to run multiple solutions simultaneously on our high-value systems. Yes, this means greater overhead on these servers, but the combination of overlapping redundancy and the different perspectives given by different monitoring solutions can be extremely enlightening. Of course you can be hardcore and roll your own, but I'm too busy doing interesting things to be writing a million scripts that probably only cover 50% of the things that I need to know.