On June 30, 2016, a problem with Exchange Online Protection caused Office 365 tenants in the U.S. to experience delays sending and receiving emails to and from external sources. The apparent cause was a flawed update that caused the Exchange Online Protection (EOP) infrastructure to slow down and be unable to process messages. Microsoft identified the issue and resolved it in nine hours. The question is whether EOP is a weak link for Office 365. Hopefully, Microsoft will move soon to make sure that a similar incident can’t happen in the future.
After writing about how well Office 365 has done in the five years since its introduction, I guess I should not have been surprised that fate sent along one of its little surprises when Exchange Online Protection (EOP) fell over and slowed down message delivery to many tenants on June 30. Life has a nasty habit of hitting you over the head with a 2 x 4…
A system is only as strong as the weakest of its component parts. EOP is a pretty important Office 365 component because it protects tenants against spam, viruses, and malware using an array of different tests. All Office 365 tenants that use Exchange Online automatically use EOP. Some tenants choose to augment EOP by first passing inbound and outbound email streams through a third-party email hygiene service on the basis that defence in depth is better than defence at a single place.
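As a quick illustration of the routing choice described above, a tenant can usually tell whether inbound mail flows straight to EOP or through a third-party hygiene service by looking at the domain's MX records, since EOP-fronted domains point at hosts under mail.protection.outlook.com. Here is a minimal sketch; the hostnames and the helper function are illustrative, not a real tool:

```python
def classify_mx(mx_hosts):
    """Classify a domain's inbound mail routing from its MX hostnames.

    Returns 'eop' when every MX host points at Exchange Online Protection,
    'third-party' otherwise. Hostnames used below are illustrative.
    """
    eop_suffix = ".mail.protection.outlook.com"
    if mx_hosts and all(h.lower().rstrip(".").endswith(eop_suffix) for h in mx_hosts):
        return "eop"
    return "third-party"

# Direct-to-EOP routing, typical of an Office 365 tenant
print(classify_mx(["contoso-com.mail.protection.outlook.com"]))  # eop
# Mail first passes through an external hygiene service
print(classify_mx(["mx1.example-hygiene.net"]))  # third-party
```

A real check would fetch the MX records via DNS first; the point is simply that the MX target decides which service sees the message before Exchange Online does.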
Advanced Threat Protection (ATP) is an optional part of EOP that adds a further layer of tests specifically designed to deal with suspicious attachments of the kind that might carry a variant of the Cerber ransomware. ATP is bundled into the Office 365 E5 plan and is available as an add-on for other plans.
Returning to the June 30 problem, incident EX71674 began at 2:30pm UTC when tenants started to notice delays sending and receiving messages to and from external correspondents (Figure 1 shows how the DownDetector site tracked user reports about the issue). Another symptom of the EOP failure was timeout errors generated for inbound checks against Sender Policy Framework (SPF) and Domain-based Message Authentication, Reporting and Conformance (DMARC) records for Office 365 tenants.
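An SPF check starts with a DNS TXT lookup against the sender's domain, which is why a timeout leaves the verdict unresolved. As a rough sketch of what the receiving side inspects once it has the record, the published policy is a TXT string listing authorized senders. The record contents and helper below are illustrative only:

```python
def parse_spf(txt_record):
    """Split an SPF TXT record into its mechanisms.

    A real verifier would fetch the record via DNS and evaluate each
    mechanism in order against the connecting IP; this sketch only
    shows the record's structure. Example contents are illustrative.
    """
    parts = txt_record.split()
    if not parts or parts[0] != "v=spf1":
        return None  # not an SPF record at all
    # Mechanisms such as include:, ip4:, and the terminal -all policy
    return parts[1:]

# The SPF record Office 365 asks tenants to publish includes Microsoft's senders
record = "v=spf1 include:spf.protection.outlook.com -all"
print(parse_spf(record))  # ['include:spf.protection.outlook.com', '-all']
```

If the lookup itself times out, as happened during this incident, the receiving server cannot evaluate any of these mechanisms and the message stalls or is deferred.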
The incident was finally resolved nine hours afterwards at 11:30pm UTC. The affected tenants were in the U.S. with some knock-on effects being felt in terms of message delays in tenants elsewhere. The fact that the problem was isolated to the U.S. reflects the modular deployment of Office 365 across twelve datacenter regions around the world. A problem has to be pretty catastrophic to be able to leak from one datacenter region to another.
Figure 1: Whoops… that peak of reports doesn’t look good
Microsoft reported to affected customers that “a portion of the infrastructure responsible for processing Exchange Online Protection (EOP) message filtering became degraded, resulting in message transport delays.” They then went on to explain that system management technique 101 had been applied by restarting the impacted services.
Later reports indicated that the reboot had not had the intended effect and that more work was required to shift workload to “a healthy portion of the infrastructure”, which normally means that some network rejigging was necessary to reroute inbound traffic to servers capable of processing messages through the EOP pipeline. Naturally, message queues accumulated, increasing pressure on the infrastructure and leading to further delays. Some users reported receiving non-delivery notifications (NDRs) when they attempted to send messages to external recipients because EOP was unable to process the messages.
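The queueing and NDR behavior described above follows standard SMTP practice: a transient failure causes the sending side to re-queue the message and retry with growing delays, and only when the retry budget is exhausted does it give up and generate an NDR. A minimal sketch of that logic follows; the function name, retry limit, and outputs are illustrative, not Exchange's actual configuration:

```python
def deliver_with_retries(send, max_attempts=4):
    """Attempt delivery until success or the retry budget runs out.

    `send` is a callable returning True on success. Returns
    ('delivered', attempt_number) on success, or ('ndr', max_attempts)
    once retries are exhausted. Limits here are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return ("delivered", attempt)
        # In a real MTA the message would sit in the queue here, with
        # the wait between attempts growing (exponential backoff).
    return ("ndr", max_attempts)

# A degraded filtering pipeline that recovers on the third attempt
attempts = iter([False, False, True])
print(deliver_with_retries(lambda: next(attempts)))  # ('delivered', 3)

# A pipeline that stays down exhausts the retries and triggers an NDR
print(deliver_with_retries(lambda: False))  # ('ndr', 4)
```

This is why a nine-hour EOP slowdown shows up to users both as delayed mail (messages waiting in the retry queue) and, at the extreme, as NDRs.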
Interestingly, Microsoft revealed that a recent update was being reversed because it “may have caused the infrastructure to operate below the expected threshold.” Later, they said, “We identified that a recent update to the environment caused an EOP process that analyzes email to perform below acceptable thresholds, causing email messages to queue from both inbound and outbound sources.”
Anyone who has ever applied a software update to a server only to discover that the expected benefit turned into a degradation can appreciate what might have happened here.
At around 8pm UTC, the situation began to ease as the changes made to address the degraded infrastructure started to have an effect. The remaining time that elapsed before Microsoft declared the incident closed was taken up in restoring the infrastructure to its normal working state. However, during that period the EOP service was online and responsive.
I received lots of email (eventually) from irritated administrators who complained about the outage. Everyone is free to take a shot against Microsoft when things go wrong in a high-profile cloud service such as Office 365. In this case, the facts appear to indicate two things:
First, EOP is potentially a weak component for Office 365. Microsoft needs to review whether sufficient capacity is available to handle load when part of the infrastructure is degraded for some reason, such as a software update that has an unintended effect on performance. This incident reminds me of the June 2014 outage when part of the Azure Active Directory that supports Office 365 failed (also in the U.S.) and caused seven hours of problems for Exchange Online. Microsoft took appropriate action afterwards to increase the resilience of the Azure Active Directory infrastructure and I can’t recall a similar outage since (there will always be glitches in a massive cloud service like Office 365, but no 7-hour directory outages).
Second, it seems that the folks running the Office 365 datacenters responded reasonably quickly to understand and solve the problem. As anyone who has dealt with a malfunctioning IT infrastructure knows, a certain amount of time is necessary to realize that a problem really exists (i.e., it’s more than a temporary glitch), to work out where the problem lies and what its consequences are for end users, and to come up with a diagnosis and action plan. Even then, sometimes the action plan is flawed and needs to be reworked. Solving issues in an on-premises IT environment can be difficult. Solving them when massive scale is involved, when a bad fix can wreak havoc for countless hours, requires cool heads, enormous attention to detail, and great execution. I give Microsoft a passing grade for this incident.
One thing that is improving is the way that Microsoft communicates with customers when problems occur inside Office 365. The text that appeared in the Office 365 Service Health Dashboard (SHD) for this incident was pretty informative and didn’t include quite so much of the cut-and-paste repetition that has marked previous events. Focus needs to continue on providing clear and concise communications because tenant administrators operate in the dark when components fail inside Office 365 datacenters, which is why so many tenants use a third-party monitoring solution to help them understand what’s happening inside Microsoft’s walled garden.
Follow Tony on Twitter @12Knocksinna.
Want to know more about how to manage Office 365? Find what you need to know in “Office 365 for IT Pros”, the most comprehensive eBook covering all aspects of Office 365. Available in PDF and EPUB formats (suitable for iBooks) or for Amazon Kindle.