
How Uber Deploys Globally, Safely and Quickly

Author | Mathias Schwarz

Translated by | Wang Qiang

Planning | Tina

Key takeaways of this article

Uber's infrastructure platform allows thousands of engineers to make parallel changes to systems without sacrificing stability.

As the business grew, we extended our legacy systems and gradually raised the level of abstraction from individual hosts to zones, and then to multiple regions.

Abstracting physical hosts and zones away from day-to-day operations has greatly reduced the friction of running Uber's fleet of stateless services.

The deployment process is not only simple to operate; its automated nature is the key to automating the entire stateless infrastructure at scale.

By unifying everything into a single managed control plane, we dramatically improved Uber's ability to efficiently manage stateless workloads across multiple zones and regions.

At QCon Plus, Mathias Schwarz, a software engineer at Uber, showed how Uber deploys safely and quickly at a global scale. Uber is a large company with a wide range of products, most of which are deployed to dozens or hundreds of markets around the world. Our biggest product is Uber Rides, the ride-hailing service that takes you from one part of town to another at the push of a button. Every day, Uber completes 18 million trips, and that is just the figure from the first quarter of 2020. Besides the rides on the Uber platform, Uber also offers Uber Eats for food delivery.


To run all of these products, Uber has a large number of back-end services. There are about 4,000 different microservices deployed on machines in multiple Uber data centers.


At Uber, we produce 58,000 builds and roll out 5,000 changes to production every week. In other words, each of Uber's back-end services is deployed to production more than once a week on average.


Since a service upgrade takes a while to complete, this also means that at any given moment, at least one of our back-end services is in the middle of an upgrade.

Zones and regions

We think of Uber's infrastructure in layers. At the bottom are individual servers: a server is a single machine, the hardware that runs each process. Servers are physically placed in a zone. A zone can be one of our own facilities, such as an Uber data center, or a cloud zone where the machines are part of the GCP or AWS public cloud.

A zone never spans multiple providers; it always belongs to exactly one. A set of zones makes up a region. Regions are essentially collections of zones that are physically close to each other, so calls between processes within a region have low latency; you can expect requests from one zone to another in the same region to complete quickly. Together, these regions form our global infrastructure. Deploying a new build to production is therefore the process of rolling that build out globally, to all relevant servers in all zones and regions of Uber's infrastructure.
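To make the hierarchy concrete, here is a minimal sketch of the server/zone/region model described above. It is illustrative only, assuming hypothetical class and field names rather than Uber's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str          # an individual server running processes

@dataclass
class Zone:
    name: str
    provider: str      # exactly one provider, e.g. "uber-dc", "aws", or "gcp"
    hosts: list[Host] = field(default_factory=list)

@dataclass
class Region:
    name: str          # a set of zones that sit physically close together
    zones: list[Zone] = field(default_factory=list)

def all_hosts(regions: list[Region]) -> list[Host]:
    """A global deploy ultimately reaches every host in every zone of every region."""
    return [h for r in regions for z in r.zones for h in z.hosts]

dca = Region("DCA", [Zone("dca1", "uber-dc", [Host("dca1-host1")])])
phx = Region("PHX", [Zone("phx1", "aws", [Host("phx1-host1")])])
print(len(all_hosts([dca, phx])))  # 2
```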

Early: Unstructured deployment

When Uber started building its deployment strategy and deployment system, the initial approach was similar to that of most other companies. Each of Uber's service teams had a specific set of hosts to which they deployed new builds. Whenever they wanted to release a change, they would log in to those servers manually, or use Jenkins scripts, to roll the build out and make sure every process was upgraded before declaring the release done. This approach had several drawbacks. For example, when a server failed, the team needed to clean it up manually. Worse, if a rolled-out change contained errors, the team had to remove it from the production systems and clean up after it to get everything back into a good state.
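For illustration, a deploy in that era boiled down to a loop like the sketch below: a hand-maintained host list and per-host commands. This is a hedged reconstruction, not Uber's actual Jenkins script; the host names and commands are hypothetical.

```python
import subprocess

HOSTS = ["svc-host1", "svc-host2", "svc-host3"]  # hand-maintained host list
BUILD = "my-service-1.42.tar.gz"

def deploy(host: str, build: str) -> None:
    # Copy the build onto the host and restart the process, one host at a time.
    subprocess.run(["scp", build, f"{host}:/srv/"], check=True)
    subprocess.run(["ssh", host, f"restart-service /srv/{build}"], check=True)

for host in HOSTS:
    deploy(host, BUILD)  # a failed host or a bad build means manual cleanup
```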


Important deployment system features

In 2014, we took a step back and started thinking about how to create a deployment system that would automate all of these operations, making it easy for our engineers to maintain a high-frequency deployment cadence while staying safe. We drew up a list of requirements we wanted the system to fulfill:

We wanted builds to be consistent: they should look the same no matter what language or framework they use and which team builds the service. Uniform builds make the deployment system far easier to manage.

We also wanted zero downtime for all deployments. The system should automatically manage the order in which servers are upgraded, never stopping too many processes at once and never interfering with the traffic flowing into the service.

We wanted outage prevention to be a first-class citizen of the system. Essentially, the system should identify and react to problems as new versions roll out to production.

Finally, we wanted the system to be able to bring the backend back to a good state. Overall, we wanted engineers to deploy new changes easily and trust the system to keep those deployments safe.
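A batch-by-batch rolling upgrade with automatic rollback captures the spirit of these requirements: only a few processes are down at any time, and a failed health check restores the last known-good build. The sketch below is a minimal illustration, assuming hypothetical upgrade and health-check hooks; it is not Micro Deploy's implementation.

```python
BATCH = 2  # never take more than BATCH processes out of rotation at once

def rolling_deploy(hosts, new_build, old_build, upgrade, healthy):
    done = []
    for i in range(0, len(hosts), BATCH):
        batch = hosts[i:i + BATCH]
        for h in batch:
            upgrade(h, new_build)
        if not all(healthy(h) for h in batch):
            # Roll every host we touched back to the last known-good build.
            for h in done + batch:
                upgrade(h, old_build)
            return False
        done.extend(batch)
    return True

# Toy usage with in-memory stand-ins for the real upgrade/health operations.
state = {}
ok = rolling_deploy(
    ["h1", "h2", "h3", "h4"], "v2", "v1",
    upgrade=lambda h, b: state.__setitem__(h, b),
    healthy=lambda h: state[h] == "v2",
)
print(ok, state)
```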

Structured deployments with Micro Deploy (uDeploy)

Based on these requirements, Uber began building the Micro Deploy system, which went live in 2014. That year, we moved all of our back-end services onto the new platform. In Micro Deploy, every build is a Docker image, produced with an in-house build system called Makisu. Together, these two systems mean that all of our Docker images look the same and behave the same way in the deployment system, which significantly simplifies deployment management.


Deploying to clusters in zones

In Micro Deploy, we also changed the level of abstraction for engineers. We told them they no longer had to worry about which servers they were deploying to; they only had to tell us which zones they needed and how much capacity they wanted in each zone. We then deploy to the target zones, and whenever a server fails, we replace it and move the service to new servers without any human intervention. We achieved this by combining Mesos, an open-source cluster management system, with Peloton, a stateless workload scheduler we built inside Uber (and have since open-sourced). Today, you could use Kubernetes to achieve similar goals.
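Conceptually, this zone-level abstraction is a reconciliation loop: the team declares capacity per zone, and the scheduler keeps reality matching the declaration, replacing failed servers along the way. A minimal sketch, with hypothetical zone names and counts:

```python
desired = {"dca1": 50, "phx1": 50}  # instances per zone, declared by the team

def reconcile(desired: dict[str, int], running: dict[str, int]) -> dict[str, int]:
    """Return how many instances to start (+) or stop (-) in each zone."""
    return {zone: want - running.get(zone, 0) for zone, want in desired.items()}

# A host failure in phx1 killed 3 instances; the scheduler starts replacements.
print(reconcile(desired, {"dca1": 50, "phx1": 47}))  # {'dca1': 0, 'phx1': 3}
```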

Safety - Monitoring metrics

We also decided to build safety directly into the deployment platform to make deployments as safe as possible. Our monitoring system, uMonitor, is integrated into the platform. All of our services emit metrics that uMonitor ingests; uMonitor continuously watches these time series and makes sure the metrics do not break certain predefined thresholds. If a metric does break its thresholds, we start a rollback to a safe state. This is automated in Micro Deploy: the system captures the previous state, and when a rollback is initiated it automatically restores the service to that old state.
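Threshold-based deploy monitoring can be sketched in a few lines: compare each metric's recent values against a predefined limit and trigger a rollback on a breach. The metric names and thresholds below are hypothetical, and the rollback hook stands in for Micro Deploy's captured-state restore.

```python
import statistics

THRESHOLDS = {"error_rate": 0.05, "p99_latency_ms": 800}  # hypothetical limits

def breaches(metrics: dict[str, list[float]]) -> list[str]:
    """Return the metrics whose average over the last 5 samples exceeds its limit."""
    return [
        name for name, limit in THRESHOLDS.items()
        if statistics.fmean(metrics[name][-5:]) > limit
    ]

def watch_deploy(metrics, rollback):
    bad = breaches(metrics)
    if bad:
        rollback(reason=f"thresholds exceeded: {bad}")

watch_deploy(
    {"error_rate": [0.01, 0.02, 0.09, 0.12, 0.15], "p99_latency_ms": [420] * 5},
    rollback=lambda reason: print("rolling back:", reason),
)
```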


Safety - White-box integration tests

In addition, for the services most important to Uber, we run white-box integration tests using an in-house system called Hailstorm. When the first instances of a new build are deployed to a zone, Hailstorm runs white-box integration and load tests against those specific instances in production. These tests complement the large number of integration tests that run before the code is released.


The integration tests target the API endpoints of the deployed service and ensure that the APIs still behave as we expect. By running them on the first few instances deployed to a zone, we can catch problems in the production environment before they affect more hosts. If some of these tests fail, we can roll the service back to the last known-good state.
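A simplified version of such an endpoint check against the first deployed instances might look like the following. The paths and expected status codes are hypothetical stand-ins, not Hailstorm's actual checks.

```python
import urllib.request

CHECKS = [("/health", 200), ("/v1/nearby?lat=0&lng=0", 200)]  # hypothetical endpoints

def canary_ok(host: str) -> bool:
    """Probe a freshly deployed instance; any bad answer blocks the rollout."""
    for path, want in CHECKS:
        try:
            with urllib.request.urlopen(f"http://{host}{path}", timeout=2) as resp:
                if resp.status != want:
                    return False
        except OSError:  # connection failures and HTTP errors both count as failure
            return False
    return True

# Run against only the first few instances in a zone; a failure here rolls
# the service back before more hosts are touched.
```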

Safety - Continuous black-box testing

Finally, we built what we call black-box testing. Black-box tests are essentially virtual trips that run continuously in every city where Uber operates. If one of these virtual trips in some city cannot be completed, an engineer gets paged. That engineer must then decide manually whether to roll back or continue the deployment, and figure out which service caused trips on the platform to suddenly start failing. Black-box testing is the last line of defense in our problem-detection mechanisms.
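In spirit, the black-box prober is an endless loop of synthetic trips with paging on failure. A minimal sketch, where request_trip and page are hypothetical stand-ins for the real probing and paging systems:

```python
import time

CITIES = ["amsterdam", "san_francisco", "sao_paulo"]  # hypothetical sample

def blackbox_loop(request_trip, page, cities=CITIES, interval_s=60):
    """Continuously attempt a virtual trip in every city; page a human on failure."""
    while True:
        for city in cities:
            if not request_trip(city):
                # Last line of defense: a human decides whether to roll back.
                page(f"virtual trip failed in {city}")
        time.sleep(interval_s)
```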


Micro Deploy gave us safety at scale. It kept services available even when individual servers failed. But a few years ago, we found we were spending more and more engineering time on managing services. Engineers still had to figure out in which zones to place a service, whether to run it on AWS or in our own data centers, how many instances it needed, and so on. Two years ago, service management was still a task that required a lot of human intervention.

Increase efficiency at scale

So once again, we took a step back and started thinking, how can we build a system that automates all these day-to-day tasks for our engineers and ensures that the platform is self-managing?

We propose three principles to build into the system:

First, we wanted a true multi-cloud architecture: whether a service runs on hosts in our own data centers or on servers in some public cloud should make no difference to engineers. We should be able to deploy anywhere, effortlessly.

Second, we wanted the platform to be fully managed: engineers should only have to make changes, verify that those changes work, and push them to production. They should no longer handle manual management tasks such as placing services in zones or scaling them. At the same time, we still wanted the deployment system to behave predictably.

Finally, we still wanted engineers to be able to understand and predict what happens to their services. Even if the system decides to change a service's scale or move it to a cloud zone, we want to tell the engineer exactly what happened and why.

Up improves infrastructure efficiency

Based on these three principles, we began building Up, the deployment platform we use at Uber today. In Up, engineers take another step up in the level of abstraction they deal with when managing services and deploying changes. For example, we no longer ask them to care about specific zones; instead, they tell us which region they want to deploy to, and Up takes care of the rest. For our engineers today, the process looks like this.


A service is deployed with a canary and into two different regions, in this case called "DCA" and "PHX". We don't tell engineers whether the physical servers run in cloud zones or in our own data centers; they just see the two regions and the number of instances in each.

When engineers deploy to production and the system decides to make changes to the service, they see a plan. The plan lists the steps that have already been performed, so you can see exactly what has happened to the service so far. It also shows what is happening right now, for example which region is currently being upgraded and how far the upgrade has progressed. Finally, it lists the changes that will be applied after the current one, which means engineers can fully predict what will happen throughout the deployment process.
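The plan itself can be thought of as an ordered list of steps, each with a state. A small sketch, with hypothetical step names, of how such a plan might be represented and displayed:

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    state: str  # "done", "running", or "pending"

plan = [
    Step("deploy canary in DCA", "done"),
    Step("upgrade DCA (112/250 instances)", "running"),
    Step("upgrade PHX (0/250 instances)", "pending"),
]

for step in plan:
    print(f"[{step.state:>7}] {step.description}")
```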

Adding a new zone

One of the things we want Up to do is let our back-end engineers stop caring about the infrastructure, especially the topology of the underlying infrastructure. Specifically, we want to be able to add or remove zones without affecting engineers at all. Suppose we are adding a new zone to an existing region.


The infrastructure team will set up the physical servers, provision storage capacity, and physically connect the new zone to the existing infrastructure. The challenge then is to get the owners of our 4,000 services to move at least some of those services into the new zone so that its capacity actually gets used. Before Up, bringing a new zone into service took dozens of engineers and the whole process was labor- and time-intensive, so we wanted Up to automate this step for us.

Declarative configuration

Suppose we have a new zone as described above. With Up, engineers only configure their capacity in terms of regions, that is, physical locations on Earth. They tell us they want 250 instances in the DCA region and 250 instances in the PHX region. They can also give the deployment system some basic information about dependencies on other services and whether they want a canary. Up is then responsible for reviewing this configuration and continually evaluating where the service currently runs and whether that placement is still appropriate. Up keeps comparing the current topology of the infrastructure with these declarative service configurations and works out the best placement for each service.
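A declarative configuration plus a reconciliation function might be sketched as below; the field names and the evaluate helper are hypothetical illustrations, not Up's actual schema.

```python
SERVICE_CONFIG = {
    "name": "demand-service",          # hypothetical service
    "placement": [
        {"region": "DCA", "instances": 250},
        {"region": "PHX", "instances": 250},
    ],
    "dependencies": ["rider-service"],
    "canary": True,
}

def evaluate(config, current: dict[str, int]) -> dict[str, int]:
    """Compare declared intent with reality; return per-region instance deltas."""
    wanted = {p["region"]: p["instances"] for p in config["placement"]}
    return {region: n - current.get(region, 0)
            for region, n in wanted.items() if n != current.get(region, 0)}

# PHX is 70 instances short of the declared capacity, so Up acts on it.
print(evaluate(SERVICE_CONFIG, {"DCA": 250, "PHX": 180}))  # {'PHX': 70}
```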


With this configuration and continuous evaluation loop, what happens when we add a new zone? First, Up automatically discovers better placements: for some services, the new zone may offer much more capacity than the existing ones. After evaluating the constraints and determining that a better location exists, a component we call the Balancer initiates the migration from one zone within the region to another. Engineers no longer need to spend time moving these services by hand, because the Balancer does the job for them automatically.
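The Balancer's core decision, picking a better zone and planning gradual moves toward it, can be sketched as follows. The scoring rule (most free capacity wins) and all names are hypothetical simplifications of whatever Up actually optimizes for.

```python
def pick_better_zone(free_capacity: dict[str, int]) -> str:
    """Choose the zone with the most free capacity (a deliberately naive score)."""
    return max(free_capacity, key=free_capacity.get)

def plan_migration(placement: dict[str, int], free_capacity: dict[str, int], batch=10):
    """Plan small instance moves from current zones toward the best zone."""
    target = pick_better_zone(free_capacity)
    moves = []
    for zone, count in placement.items():
        if zone != target and count >= batch and free_capacity[target] >= batch:
            moves.append((zone, target, batch))
            free_capacity[target] -= batch
    return moves

# A new zone "dca3" comes online with plenty of headroom; the Balancer starts
# shifting instances toward it without any engineer involvement.
print(plan_migration({"dca1": 100, "dca2": 100}, {"dca1": 5, "dca2": 3, "dca3": 400}))
```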

Summary

This article described our journey: from a small-scale setup where engineers managed individual servers, to the Micro Deploy system that automated server management while service management as a whole remained a daily maintenance task for our engineers, and finally to our new Up system, which is fully automated at the regional level. With Up, we can safely deploy 5,000 changes to production every week and manage a system as large as the Uber backend with ease. The key to making this work in practice is automation. Raising the level of abstraction lets us automate many tasks that would otherwise require engineers to manage manually. Whether it's where to deploy, which hosting provider to choose, or how to scale a service, we can hand the job over to the machines entirely.

About the author

Mathias Schwarz has been an infrastructure engineer at Uber for over 5 years. He and his team are responsible for developing a stateless service deployment platform that is used by the entire Uber engineering team. Mathias holds a Ph.D. in Computer Science from the Programming Languages Group at Aarhus University.

https://www.infoq.com/articles/uber-deployment-planet-scale/
