IO Latency
Disk IO latency is the most reliable way to find out whether you have a bottleneck in your storage subsystem. Latency is measured in milliseconds, with general guidance suggesting that log latency should be below 10ms and data latency below 20ms. In the SME world, the rules are not so hard and fast. Our workloads tend not to be very consistent, and short periods of higher latency are acceptable. In data warehouse applications it is often preferable to tolerate slightly higher latency if that buys higher overall throughput. On the other hand, if your business is handling credit card transactions, the faster you can store them, the more money you make, and 10ms for the log is FAR too high. (New PCIe SSDs have latencies measured in microseconds…)
How do I know if disk latency is an issue? First, monitor it and see what your latencies actually are, then look at your wait stats. If the SQL disk related waits (PAGEIOLATCH, WRITELOG and friends) are high and your perfmon counters show that disk latencies are consistently high, then you are more than likely looking at a storage system bottleneck (provided the server has sufficient RAM and you have tuned your most expensive queries).
To measure disk latency, use the Avg. Disk sec/Read and Avg. Disk sec/Write counters. Trend these over time, and look out for busy spikes through the day, week and month. For example, you will probably notice latency spikes while backups are running, but this is generally acceptable if the backups fall outside the busiest office hours.
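If you would rather pull these numbers from inside SQL Server than from perfmon, the sys.dm_io_virtual_file_stats DMV tracks cumulative IO stall times per database file. Below is a minimal sketch; note that the figures are cumulative since the instance last started, so for spike hunting you would snapshot the DMV twice and diff the two samples.

-- Average read/write latency per database file since the instance started.
-- io_stall_read_ms / io_stall_write_ms are the total milliseconds spent
-- waiting on IO, so dividing by the IO counts gives average latency in ms.
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
        vfs.num_of_writes,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY avg_read_ms DESC;

-- Cross-check against the disk related wait types mentioned above.
SELECT  wait_type, waiting_tasks_count, wait_time_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type IN ('PAGEIOLATCH_SH', 'PAGEIOLATCH_EX', 'WRITELOG', 'IO_COMPLETION')
ORDER BY wait_time_ms DESC;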
Throughput
Throughput is the maximum rate at which data can move through your storage connectivity medium. In other words: how big is the pipe connecting the disks to the server? In general this will be your SAN HBA speed or, on local disks, your disk connectivity type or RAID controller connection speed. If you have a single disk, this will be the SAS or SATA speed, 3 or 6Gb/s depending on how old it is. If you have a RAID controller or PCIe SSD it will be the PCIe bus width, x4 or x8, which translates into a figure in Gb/s. If you have a fibre channel SAN it will probably be 4 or 8Gb/s, and if you are using iSCSI, it will be 1 or 10Gb/s.
At small IO sizes or low IO loads, throughput generally does not play a part. However, if the IOs are large, the pipe can quickly become saturated and limit the rate at which data gets into the system. Especially in 1GbE iSCSI environments this will probably be the first storage bottleneck you hit, since a single physical disk can read data faster than 1GbE can transfer it. Remember that when dealing with networking and storage connectivity, sizes are measured in bits, not bytes! So 1Gb/s works out to roughly 125MB/s before any overhead. To make matters worse for iSCSI, TCP overhead and other inefficiencies reduce the realistic throughput on a 1GbE link to just under 100MB per second. To compensate for this, most iSCSI SANs use multiple links to give greater overall throughput. This problem is not limited to iSCSI: if you have an older fibre channel SAN you may be limited to 2Gb/s, which is almost as bad.
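To get a rough idea of how much of that pipe you are actually using, the same sys.dm_io_virtual_file_stats DMV also records bytes read and written. The sketch below takes two snapshots sixty seconds apart and works out MB per second over the interval; the sixty second window is just an arbitrary choice.

-- Rough MB/s over a 60 second window, summed across all database files.
-- Compare the result to the link speed (remember 1Gb/s is roughly 125MB/s raw).
DECLARE @read_bytes BIGINT, @write_bytes BIGINT;

SELECT  @read_bytes  = SUM(num_of_bytes_read),
        @write_bytes = SUM(num_of_bytes_written)
FROM    sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:01:00';   -- arbitrary sample window

SELECT  (SUM(num_of_bytes_read)    - @read_bytes)  / 1048576.0 / 60 AS read_mb_per_sec,
        (SUM(num_of_bytes_written) - @write_bytes) / 1048576.0 / 60 AS write_mb_per_sec
FROM    sys.dm_io_virtual_file_stats(NULL, NULL);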
Bear in mind that just because you have a low throughput ceiling, your access profile will not necessarily run into it. If you are hammering your storage array with 8K IOs, chances are you are not hitting a throughput limit unless you have a very large number of spindles. There is a direct relationship between IO size and the bandwidth you consume, so you will need to determine which of the two is actually your bottleneck, if either.
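One quick way to see which side of that relationship you sit on is to work out your average IO size from the same DMV. Lots of small IOs point towards an IOPS and latency problem; fewer, very large IOs point towards throughput. Again this is a sketch over cumulative counters, so treat the numbers as a long-run average rather than a picture of right now.

-- Average IO size per database file: total bytes divided by number of IOs.
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_bytes_read    / NULLIF(vfs.num_of_reads, 0)  / 1024 AS avg_read_kb,
        vfs.num_of_bytes_written / NULLIF(vfs.num_of_writes, 0) / 1024 AS avg_write_kb
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id;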
Queue Depth
Queue depth used to be the standard indicator of a storage bottleneck. That made sense because you knew how many disks were attached to your server; multiply that number by 2 and you had a value against which you could judge whether you had a backlog of outstanding IO requests. Since then, a series of changes has invalidated this view. Firstly, with a SAN it is difficult to know exactly how many spindles your files reside on unless you are the SAN admin as well. Secondly, depending on workload, a higher number of pending IO requests may be beneficial when trying to drive higher throughput, especially on warehouse data drives. Thirdly, SSDs redefine the number of requests a particular drive can handle, so the 2x multiplier may be completely invalid.
On the other hand, extremely low queue depths may indicate that your storage is not a problem, so the metric does have its uses, but it should never be read in isolation.
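Inside SQL Server you can get a feel for the current backlog from sys.dm_io_pending_io_requests. It is a point-in-time view, so run it a few times during a busy spell rather than trusting a single sample.

-- IO requests SQL Server has issued that have not yet completed.
-- A handful of rows is normal; a consistently long list with high pending
-- times suggests the storage is struggling to keep up.
SELECT  io_type,
        io_pending,            -- 1 = still pending at the OS level
        io_pending_ms_ticks    -- how long the request has been outstanding, in ms
FROM    sys.dm_io_pending_io_requests
ORDER BY io_pending_ms_ticks DESC;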
Next up - some info on DAS and SAN