It seems to me that many SharePoint consultancies think their job is done when recommending a topology based on:
- Looking up Microsoft’s server recommendations for CPU and RAM and then doubling them for safety.
- Giving the SQL Database Administrators heart palpitations by proactively warning them about how big SharePoint databases can get.
- Recommending putting database files and log files on different disks with appropriate RAID levels.
Now if you are more serious about SharePoint performance, you've probably read through the body of content created for the SharePoint 2010 platform, which is when Microsoft and the community really got serious about the topic. Chances are you had a crack at reading all 307 pages of Microsoft’s “Planning guide for server farms and environments for Microsoft SharePoint Server 2010.” If you indeed read this document, then it is even more likely that you worked your way through the 367 pages of Microsoft whitepaper goodness known as “Capacity Planning for Microsoft SharePoint Server 2010.” If you really searched around, you might have also taken a look through the older but excellent 23-page whitepaper “Analysing Microsoft SharePoint Products and Technologies Usage.”
Now let me state from the outset that these documents are hugely valuable for anybody interested in building a high-performing SharePoint farm. They have some terrific stuff buried in there – especially the insights from Microsoft’s performance measurement of their own very large SharePoint topologies. But nevertheless, 697 pages is 697 pages (and you thought that my blog posts were wordy!). It is a lot of material to cover.
Having read and digested them myself, as well as chatting to SharePoint luminary Robert Bogue on occasion on all things related to performance, I was inspired to write a couple of blog posts on the topic of SharePoint performance management with the aim of making the entire topic a little more accessible. As such, all manner of SharePoint people should benefit from these posts because performance is an area misunderstood by geeks and business users alike.
Here is what I am planning to cover in these posts.
- Highlight some common misconceptions and traps for younger players in this area.
- Understand the way to think about measuring SharePoint performance.
- Understand the most common performance indicators and easy ways to measure them.
- Outline a lightweight, but rigorous method for estimating SharePoint performance requirements.
In this introductory post, we will start proceedings by clearing up one of the biggest misconceptions about measuring SharePoint performance – and for that matter, many other performance management efforts. As an added bonus, understanding this issue will help you to put a permanent stop to developers who blame the infrastructure when things slow down. Furthermore, you will also prevent rampant over-engineering of infrastructure.
Lead vs. lag indicators
Let’s say for a moment that you are the person responsible for road safety in your city. What is your ultimate indicator of success? I bet many readers will answer something like “reduced number of traffic fatalities per year” or something similar. While that is a definitive metric, it is also pretty macabre. It also suffers from the problem of being measured after something undesirable has happened. (Despite millions of dollars in research, death is still relatively permanent at the time of writing.)
Of course, you want to prevent road fatalities, so you might create road safety education campaigns, add more traffic lights, improve signage on the roads, and so forth. None of these initiatives is guaranteed to make any difference to road fatalities, but they very likely do make a difference nonetheless! Thus, we should also measure these sorts of things because if they contribute to reducing road fatalities, that is a good thing.
So where am I going with this?
In short, the number of road signs is a lead indicator, while the number of road fatalities is a lag indicator. A lead indicator is something that can help predict an outcome. A lag indicator is something that can only be tracked after a result has been achieved (or not). Therefore lag indicators don’t predict anything, but rather, they show the results of an outcome that has already occurred.
Now Robert Bogue made a great point when we were talking about this topic. He said that SharePoint performance and capacity planning is like trying to come up with the Drake equation. For those of you not aware, the Drake equation attempts to estimate how much intelligent life might exist in the galaxy. But it is criticised because there are so many variables and assumptions made in it. If any of them are wrong, the entire estimate is called into question. Consider this criticism of the equation by Michael Crichton:
The only way to work the equation is to fill in with guesses. As a result, the Drake equation can have any value from "billions and billions" to zero. An expression that can mean anything means nothing. Speaking precisely, the Drake equation is literally meaningless…
Back to SharePoint land…
Robert's point was that a platform like SharePoint can run many different types of applications with different patterns of performance. An obvious example is that saving a 10 megabyte document to SharePoint has a very different performance pattern than rendering a SharePoint page with a lot of interactive web parts on it. Add to that all of the underlying components that an application might use (for example, PowerPivot, workflows, information management policies, BCS and Search) and it becomes very difficult to predict future SharePoint performance. Accordingly, it is reasonable to conclude that the only way to truly measure SharePoint performance is via measuring SharePoint response times under some load. At least that performance indicator is reasonably definitive. Response time correlates fairly strongly to user experience.
So now that I have explained lead vs. lag indicators, guess which type of indicator response time is? Yup – you guessed it – a lag indicator. In terms of lag indicator thinking, it is completely true that page response time measures the outcome of all your SharePoint topology and design decisions.
But what if we haven’t determined our SharePoint topology yet? What if your manager wants to know what specification of server and storage will be required? What if your response time is terrible and users are screaming at you? How will response time help you to determine what to do? How can we predict the sort of performance that we will need?
Enter the lead indicator. Lead indicators provide assurance that the underlying infrastructure is sound and will scale appropriately. By themselves, they are not a guarantee of SharePoint performance (especially when there are developers and excessive use of foreach loops involved!). What they do ensure is that you have a baseline of performance that can be used to compare with any future custom work. It is the difference between the baseline and whatever the current reality is that is the interesting bit.
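To make the baseline idea a little more concrete, here is a minimal Python sketch. It assumes the baseline is simply a handful of page response times (in milliseconds) captured before any custom work was deployed, and that we compare medians. The function name, the 20% tolerance and the sample numbers are all made up for illustration; they are not from the Microsoft documentation or from any real farm.

```python
from statistics import median

def compare_to_baseline(baseline_ms, current_ms, tolerance=0.20):
    """Compare median response times (milliseconds) against a baseline.

    tolerance is a hypothetical threshold: the fractional slowdown
    we are prepared to accept before flagging a regression.
    """
    base = median(baseline_ms)
    now = median(current_ms)
    change = (now - base) / base  # 0.5 means 50% slower than baseline
    return {
        "baseline_ms": base,
        "current_ms": now,
        "change": change,
        "regressed": change > tolerance,
    }

# Illustrative numbers only, not real measurements.
baseline = [180, 200, 220, 210, 190]           # clean farm, no custom code
after_custom_code = [390, 410, 400, 420, 380]  # same page after a deployment

result = compare_to_baseline(baseline, after_custom_code)
print(result)
```

If the infrastructure lead indicators have not moved but the response time has doubled, the gap between the two sets of numbers points at the custom work rather than the servers.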
So what lead indicators matter?
The Microsoft documentation lists many performance monitor counters (particularly at a SQL Server level) that are useful to monitor. Truth be told, I was sorely tempted to go through them in this series of posts, but instead I opted to pitch these articles to a wider audience. So rather than rehash what is in those documents, let's look at the obvious ones that are likely to come up in any sort of conversation around SharePoint performance. In terms of lead indicators, there are several important metrics:
- Requests per second (RPS)
- Disk I/O per second (IOPS)
- Disk Megabytes transferred per second (MBPS)
- Disk I/O latency
In the next couple of posts, I will give some more details on each of these indicators (their strengths and weaknesses) and how to go about collecting them.
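As a rough illustration of what these four numbers actually are, the Python sketch below derives each one from raw totals collected over a measurement window. The function names and the sample figures are hypothetical, and it assumes a megabyte is 1024 × 1024 bytes; it is arithmetic to show how the metrics relate, not a substitute for real counters.

```python
def requests_per_second(request_count, window_seconds):
    """RPS: requests served divided by the length of the window."""
    return request_count / window_seconds

def disk_metrics(io_count, bytes_transferred, total_io_time_seconds, window_seconds):
    """Return (IOPS, MBPS, average latency in ms) for one measurement window.

    io_count              - read and write operations completed in the window
    bytes_transferred     - total bytes moved by those operations
    total_io_time_seconds - sum of the time each operation took to complete
    """
    iops = io_count / window_seconds
    mbps = bytes_transferred / window_seconds / (1024 * 1024)  # assumes 1 MB = 1024 * 1024 bytes
    avg_latency_ms = total_io_time_seconds * 1000 / io_count
    return iops, mbps, avg_latency_ms

# Illustrative numbers only, not real measurements.
rps = requests_per_second(request_count=18_000, window_seconds=600)

iops, mbps, latency_ms = disk_metrics(
    io_count=120_000,
    bytes_transferred=120_000 * 64 * 1024,  # every I/O assumed to be 64 KB
    total_io_time_seconds=600,
    window_seconds=60,
)
print(rps, iops, mbps, latency_ms)
```

Note that IOPS and MBPS are tied together by the I/O size: the same disk doing the same number of operations per second moves far more megabytes with 64 KB transfers than with 8 KB ones, which is why neither number means much on its own.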
A final Kaizen addendum
Kaizen? What the?
I mentioned at the start of this post that performance management is not done well in many other industries. Some of you may have experienced the pain of working for a company that chases short-term profit (lag indicator) at the expense of long-term sustainability (measured by lead indicators). To that end, I recently read an interesting book on the Japanese management philosophy of Kaizen by Masaaki Imai. Imai highlighted the difference between Western and Japanese attitudes to management in terms of “process-oriented management vs. result-oriented management.” The contention in the book was that Western attitudes to management are all about results, whereas Japanese approaches are all about the processes used to deliver the result.
In the United States, generally speaking, no matter how hard a person works, lack of results will result in a poor personal rating and lower income or status. The individual's contribution is valued only for its concrete results. Only results count in a result-oriented society.
So as an example, a result-oriented society would look at the revenue from sales made over a given timeframe – the short-term, profit-focused lag indicator. But according to the Kaizen philosophy, process-oriented management would consider factors like:
- Time spent calling new customers
- Time spent on outside customer calls versus time devoted to clerical work
What sort of indicators are these? To me, they are clearly lead indicators as they do not guarantee a sale in themselves.
It’s food for thought when we think about how to measure performance across the board. Lead and lag indicators are two sides of the same coin. You need both of them.