Building Microsoft's hyper-scale cloud

Thu, 18th Aug 2016


By Karl Ots, Business Development Executive, Microsoft Azure

Microsoft opened its first data center on the Redmond campus (Building 11) in September 1989. Since then, we have invested more than $15 billion in building our global cloud infrastructure and more than $9 billion in research and development to improve the efficiency of our IT solutions, cloud services, and operations.

Motivation

Microsoft is one of the largest cloud operators in the world, and our hyper-scale workloads have changed the way we design and operate data centers. Our focus is on building a resilient cloud infrastructure and cloud services that deliver higher availability and security while lowering overall costs. To do so, we build software applications as distributed systems and drive integration throughout our facilities for greater reliability, scalability, security, efficiency, and sustainability. From the server design to the building itself, we consider every aspect of the physical environment to drive improvements in our data centers and networks.

Designing and building our own data centers provides strategic advantages and a 25-35% reduction in cost of goods sold (COGS) compared to the total cost of leased facilities. While it may seem that enterprise data centers are commoditized, the modern hyper-scale data centers designed by Microsoft are delivering increased performance and efficiency. Through innovations in software, hardware, cooling design, and operations, our internally engineered data centers are more energy efficient than traditional data centers in the industry today, and they use a fraction of the water.

Designing our own servers

Microsoft has developed the Microsoft Cloud Server, which can efficiently and reliably scale to deliver all of our key online services on a common hardware platform. Our cloud server design is fully integrated, from the silicon to the rack to the full data center. It incorporates the blade, storage, network, systems management, and power mechanicals in a single, highly efficient modular design. In 2014, Microsoft shared its cloud server specification and design with the Open Compute Project Foundation to help drive greater efficiency across the industry. This significant contribution demonstrates our continued commitment to sharing our key learnings and experiences from more than 20 years of operating online services.

Hyper-scale paradigm shift

Delivering hyper-scale services requires a radical restructuring of technology, processes, and people. Many design points contribute to the differences between enterprise IT and hyper-scale infrastructures. From the number of customers that need to be serviced and the quantity of data we need to host, to the supply chain, the architecture, hardware reliability, security, network design, systems administration, and operations, the sheer scale demands a very different approach.

It boils down to a simple idea: hyper-scale infrastructure is designed for massive deployments, sometimes on the order of hundreds of thousands of servers at a time. At hyper-scale, failure is an expected operating condition. Whether the cause is a failed server, a tripped circuit breaker, a power interruption, a lightning strike, an earthquake, or human error, the service should gracefully fail over to another cluster or data center while maintaining end-user service level agreements (SLAs). Below is a table to give you a sense of the difference in scale across key areas.