Well here we are at part 8 of my series on making SharePoint performance management a little bit easier to understand. What is interesting about this series is its timing. Much has been made about SharePoint 2013's stated requirement of 24GB of RAM for a “Single server with a built-in database or single server that uses SQL Server”. While the reality is that requirements depends on what components that you are working with, this series of articles should be just as useful in relation to SharePoint 2013 as for any other version.
If you have been following the previous articles in this series, we have been spending some time examining disk performance, as that is a very common area where a sub-optimal configuration can result in a poor experience for users. In Part 6, we looked at the relationship between the performance metrics of disk latency, IOPS and MBPS. We also touched on the IO characteristics (nerd speak for the manner in which something reads and writes to disk) of SQL Server and some SharePoint components. In the last post, we examined the Windows performance counters that one would use to quickly monitor latency and IOPS in particular. We then finished off by taking a toe dip into the coolness of the SQLIO utility, which is a great tool for stress testing your storage infrastructure by pulverizing it with different IO read and write patterns.
In this post, we will spend a bit of time taking SQLIO to the next step and I will show you how you can run a comprehensive disk infrastructure stress test. Luckily for the both of us, others have done the hard work for us and we can reap the benefits of their expertise and insights. First up, however, I would like to kick things off by spending a little time showing you the relationship between SQLIO results and performance monitor counters. This helps to reinforce what the reported numbers mean.
Performance Monitor and SQLIO
In the previous post when we used Windows Performance Monitor, we plotted IOPS and Latency by watching the counters as they occurred in real-time. While this is nice for a quick analysis, nothing is actually stored for later analysis. Fortunately, performance monitor has the capability to run a trace and collect a much larger data set for a more detailed analysis later. So first up, lets use performance monitor to collect disk performance data while we run a SQLIO stress test. After the test has been run, we will then review the trace data and validate it against the results that SQLIO reports.
So go ahead and start up performance monitor (please consult Part 7 of this series if you are unsure of how to do this). Looking at the top left of the performance manager, you should see several options listed under “Performance”. Click Data Collector Sets and look for a sub menu called User Defined. Now right-click on User Defined and choose New –> Data Collector Set as shown in figure 1.
Figure 1: Using Windows Performance Monitor to create a new data collector set that we'll run in a SQLIO stress test.
This will start a wizard that will ask us to define what performance counters we are interested in and how often to sample performance. I have pasted screenshots of the sequence below (click to enlarge any particular one). First up, we need to give a name to this collection of counters and as you can see in figure 2, I called mine Disk IO Experiments. Once we have given it a name, we have to choose the type of performance data we want to collect. As shown in figure 3, tick the Performance counter option and ensure the others are left unticked.
Figure 2: Naming the data collector set.
Figure 3: Choosing options to create performance counter data logs.
Next, we need to pick what specific counters we need. We will use the same counters that we used in Part 7, and we'll add two additional counters. To remind you of Part 7, the counters we looked at were:
- Avg. Disk sec/Read – (measures latency by looking at how long in seconds, a read of data from the disk took)
- Avg. Disk sec/Write – (measures latency by looking at how long in seconds, a write of data to the disk took)
- Disk Reads/sec – (measures IOPS by looking at the rate of read operations on the disk per second)
- Disk Writes/sec – (measures IOPS by looking at the rate of write operations on the disk per second)
In addition to these counters, we will also add two more to the collector set:
- Avg. Disk Bytes/Read – (Measures size of each read request by reporting the number of bytes each used)
- Avg. Disk Bytes/Write – (Measures size of each write request by reporting the number of bytes each)
We will use these counters to see if the size of the IO request that SQLIO uses is reported correctly.
Depending on your configuration, choose the PhysicalDisk or LogicalDisk performance object (consult Part 7 for the difference between PhysicalDisk and LogicalDisk). You will then find the performance counters I listed above. Before you do anything else, make sure that you pick the right disk or partition from the “Instances of selected object” section (figure 4). We need to specifically pick the disk or partition that SQLIO is going to stress test. Now, from the list of available counters, you should select the aforementioned six performance counters and click the “Add” button. Finally, make sure that you pick the sample interval to be 1 second as shown in figure 5. This is really important because it makes it easy to compare to SQLIO which reports on a per second basis.
Figure 4: Select the right disk or partition in the Instances of selected object area.
Figure 5: Make sure the sample interval is 1 second.
At this point you do not need to configure anything else, so click Finish rather than Next, and the collector should now be ready to go. It will not start by default, but since there is no fun in that, let’s collect some data. Right-click on your shiny new data collector set (figure 6) and choose Start.
Figure 6: Start your data collector set.
Once started, Performance Monitor collects the values of the six counters each and every second. Now let’s run a SQLIO command to give it something to measure. In this example, I am going to run SQLIO with random 8KB writes. But to make it interesting, I will use two threads and simulate eight outstanding IO requests in the queue. If you recall my grocery check-out metaphor of Part 6, this is like having eight people with full shopping carts waiting in line for a single check-out operator. Since the guy at the back of the line has to wait for the seven people in front of him to be processed, he has to wait longer. So with eight outstanding IO requests, latency should increase as each IO request will be sitting in a queue behind the seven other requests.
By the way, if none of that made sense, then you did not read Part 6 and Part 7. I urge you to read them before continuing here, because I am assuming prior knowledge of SQLIO and disk latency characteristics and the big trolley theory.
Here is the SQLIO command and figure 7 shows the result…
SQLIO –kW –b8 –frandom –s120 –t2 –o8 –BH –LS F:\testfile.dat
Figure 7: The results of the test running SQLIO with random 8KB writes.
Now take a note of the results reported. IOPS was 526, MBs/sec was 4.11 and as expected, the average latency was much larger than the SQLIO tests we ran in Part 7. In this case, latency was 29 milliseconds.
Let’s now compare this to what performance monitor captured. First up, return to Performance Monitor, and stop your data collector set by right-clicking on it and choosing Stop and shown in figure 8. Now if you cast your eye to the top left navigation pane, you should see an option called “Reports” listed under “Performance”. Click on “Reports” and look for a sub menu called “User Defined”. Expand “User Defined” and hey presto! Your data collector set should be listed, as shown in figure 9.
Figure 8: Stop your data collector set.
Figure 9: Report data for the six performance counters.
Expand the data collector set and you will find a report for the data you just collected (figure 10). The naming convention is the server name and the date of the collection. Click on this and you will then see the performance data for that collection in the right pane. At the bottom you can see the six performance counters we chose and just by looking at the graph, you can clearly see when SQLIO started and stopped.
Figure 10: Our report with the data collected by the performance counters.
Now we have to do one additional step to make sure that we are comparing apples with apples. Performance monitor will calculate its averages based on the total time displayed. As you can see in figure 10 above, I did not run SQLIO straight away, but the performance counters were collected each second nonetheless. Therefore we have a heap of zero values that will bring the averages down and mislead. Fear not though, it is fairly easy (although not completely obvious) to zoom into the time we are interested in. If you look closely, just below the performance graph, where the time is reported, there is a sliding scale (figure 11). If you click and drag the left and right boxes, you can highlight a specific time you are interested in. This will be shown in the performance graph too, so using this tool, we can get more specific about the time we are interested in. Then in the toolbar above the graph, you will see a zoom button (figure 12). Click it and watch the magic…
Figure 11: Use the sliding scale to set the start and end of the report data you want to examine in depth.
Figure 12: Click the button to zoom in on the report data (or press Ctrl+Z).
As you can see in Figure 13, now we are looking at the performance data for the period when the SQLIO was run. (Now it should be noted that Windows Performance Monitor isn’t particularly granular here. I had to fiddle with the sliding scale a couple of times to accurately set the exact times when SQLIO was started and then stopped.)
Figure 13: The performance data for the period when the SQLIO was run.
Now let’s look at the results reported by Performance Monitor. Figure 13 shows the number of Disk Writes per second. Let’s zoom into the figures for the time period and review the average result over the sample period. To save you squinting, I have pasted it below (figure 14) and called out the counter in question. Performance Monitor has reported average “Disk Writes/Sec” as 525.423. This is entirely consistent with SQLIO’s reported IOPS of 526.
Figure 14: Looking at the Disk Writes/Sec counter from the test.
Latency (reported in seconds via the counter Avg. Disk sec/Write and shown in figure 15) is also fairly consistent with SQLIO. The figure from performance monitor was 0.03 seconds (30 milliseconds). SQLIO reported 29 milliseconds.
Figure 15: Looking at the Avg. Disk sec/Write counter from the test.
What about IO size? Well, that’s what Avg disk bytes/write is for… Let’s take a look shall we (figure 16)? Yup.. 8192 kilobytes, which is exactly the parameters specified.
Figure 16: The Avg disk bytes/write counter shows the right parameters.
SQL IO characteristics revisited (and an awesome script)
Now that we understand what SQLIO is telling us via examining Windows Performance Monitor counters, I’d like to return to the topic of SQL IO patterns. Back at the end of Part 6, I spent some time talking about SQL and SharePoint IO characteristics. As a quick recap, I mentioned SQL reads and writes to databases via 8KB pages. Now based on me telling you that, you might assume that if you had to open a large document from SharePoint (say 1MB or 1024KB), SQL would make 128 IO requests of 8KB each.
While that would be a reasonable assumption, its also wrong. You see, I also mentioned that SQL Server also has a read-ahead algorithm. This algorithm means that means SQL Server will try and proactively retrieve data pages that are going to be used in the immediate future. As a result, even though a single page is only 8KB, it is not unusual to see SQL Server read data from disk in a much wider range if it thinks the next few 8KB pages are likely to be asked for anyway. Now as an aside, if you are running SQL Server Enterprise edition, the possible read-ahead range is from 1 to 128 pages (other editions of SQL Server max out at 32 pages). Assuming that you're using SQL Server Enterprise edition, this translates to between 8KB and 1024KB for a single IO operation. Think about this for a second… based on the 1MB document example I used in the previous paragraph, it is technically possible that this could be serviced with a single IO request by an enterprise edition of SQL server.
Okay, so essentially SQL Server has varying IO characteristics when it comes to reading from and writing to databases. But there is still more to it. This is because there are a myriad of SQL IO operations that we did not even consider in Part 6. As an example, we have not spoken about the IO characteristics of how SQL Server writes to transaction logs (which is sequential as opposed to random IO, and does not use 8k pages at all). Another little known fact with transaction logs is that SQL Server has to wait for them to be “flushed to stable media” before the data page itself can be flushed to stable media. This is known as Write Ahead Logging and is used for data integrity purposes. What it means though is that if logging has a lot of latency, the rest of SQL Server can potentially suffer as well (and if it was not obvious before, yet another good reason why people recommend putting SQL Server data and log files on different disks).
Now I am not going to delve deep into SQL IO patterns any more than this, because we are now getting into serious nerdy territory. However what I will say is this: by understanding the characteristics of these IO patterns, we have the opportunity to change the parameters we pass to SQLIO and more accurately reflect real-world SQL Server characteristics in our testing. Luckily for all of us, others have already done the hard work in this area. First up, Bob Duffy created a table that summarizes SQL Server IO patterns based on the type of operations being performed. Even better than that… Niels Grove-Rasmussen wrote a completely brilliant post, where not only did he list the IO patterns that SQL Server is likely to exhibit, he wrote a PowerShell script that then runs 5-minute SQLIO simulations for each and every one of them!
I have not pasted the script here, but you will find it in Niels' article. What I will say, though, is that aside from the obvious 8KB random reads and writes that we have concentrated on thus far, Niels listed several other common SQL IO patterns that his SQLIO script tests:
- 1 KB sequential writes to the log file (small log writes)
- 64 KB sequential writes to the log file (bulk log writes)
- 8 KB random reads to the log file (rollbacks)
- 64 KB sequential writes to the data files (checkpoints, reindex, bulk inserts)
- 64 KB sequential reads to the data files (read-ahead, reindex, checkdb)
- 128 KB sequential reads to the data files (read-ahead, reindex, checkdb)
- 128 KB sequential writes to the data files (bulk inserts, reindex)
- 256 KB sequential reads to the data files (read-ahead, reindex)
- 1024 KB sequential reads to the data files (enterprise edition read-ahead)
- 1 MB sequential reads to the data files (backups)
The script actually handles more combinations than those listed above because it also tests for differing number of threads (-t ) and outstanding requests (-o ). All in all, over 570 combinations of IO patterns are tested. Be warned here… given that each test takes five minutes to run by default, with a 60-second wait time in between each test, be prepared to give this script at least two days to let it run its course!
The script itself is dead simple to run. Just open a PowerShell window and save Niels' script to the SQLIO installation folder. From there, change to that directory and issue the command:
Then come back in three days! Seriously though, depending on your requirements, you can modify the parameters of the script to reduce the number of scenarios based on editing the first seven lines of code, which is quite self-explanatory.
$Drive = @('G', 'H', 'I', 'J')
$IO_Kind = @('W', 'R') # Write before read so that there is something to read.
$Threads = @(2, 4, 8)
#$Threads = @(2, 4, 8, 16, 32, 64)
$Seconds = 10*60 # Five minutes
$Factor = @('random', 'sequential')
$Outstanding = @(1, 2, 4, 8, 16, 32, 64, 128)
$BlockSize = @(1, 8, 64, 128, 256, 1024)
Now if this wasn’t cool enough, Niels has also written a second script that parses the output from all of the SQLIO tests. This can produce a CSV file that allows you to perform further analysis in Excel. To run this script, we need to know the name of the output file of the first script. By default, the filename is SQLIO_Result.<date>.txt. For example:
./SQLIO-Parse.ps1 -ResultFileName ‘SQLIO_Result.2010-12-24.txt’
By default, the parse script outputs to the screen, but modifying it to write to CSV file is really easy. All one has to do is comment out the second last line of code and uncomment the last one as shown below:
#$Sqlio | Format-Table -Property Kind,Threads,Seconds,Drive,Stripe,Outstanding,Size,IOs,MBs,Latency_min,Latency_avg,Latency_max -AutoSize
$Sqlio | Export-Csv SQLIO_Parse.csv
Figure 17 shows an example of the report in Excel. Neat eh?
Figure 17: Output of SQLIO simulations exported to Excel.
Conclusion and coming up next…
By now, you should be a SQLIO guru and have a much better idea of the sort of IO patterns that SQL Server has beyond just reading from and writing to databases. We have covered the IO patterns of transaction logs, as well as examined a terrific PowerShell script that not only runs all of the IO scenarios that you need to, but parses the output to produce a CSV file for deeper analysis. In short, you now have the tools you need to run a pretty good disk infrastructure stress test and start some interesting conversations with your storage gurus.
However, at this point I feel there are some pieces missing to this disk puzzle:
- We have not yet brought the discussion back to lead and lag indicators. So while we know how to hammer disk infrastructure, how can we be more proactive and specify minimum conditions of satisfaction for our disk infrastructure?
- Microsoft treatment of disk performance (and in particular IOPS and latency) in their performance documentation is inconsistent and in my opinion, confuses more than it clarifies. So in the next post, we are going to look at these two issues. In doing so, we are going to leave SQLIO and Performance Monitor behind and examine two other utilities including one that is lesser known, but highly powerful.