Realizing the New Value of Service Mesh: Precise Control of the "Blast Radius"

Author | Zhijian

Editor | Xin Xiaoliang

Software evolves through continuous iteration. To some extent, we worry less about software being imperfect than about iteration being so slow that it delays improvement. In distributed software, how to verify new versions quickly and safely has long been a concern and a subject of exploration, and the advent of the service mesh has taken that exploration to new heights. The "swimlane" concept is not new in distributed software, but this time we build it on service mesh technology, making full use of the flexible traffic governance that cloud-native technology offers.

This article shares the full-link traffic marking and routing capability that Alibaba Cloud has built up, which offers a new experience of service mesh technology and realizes new value from the service mesh.

Concepts and scenarios

Figure 1 uses the Bookinfo sample application officially provided by Istio to illustrate the key concepts in a usage scenario. The purple rounded boxes represent Envoy. All swimlanes in the figure are the same in nature; the different names only distinguish sub-scenarios or users.

Baseline: The environment into which all services of the business are deployed. The baseline can be the real production environment, or an environment built for development work that is completely separate from production.

Swimlane (traffic lane): A soft environment isolated from the baseline environment. Machines (that is, Pods in K8s) are added to a swimlane by labeling them. Machines that join a swimlane remain interoperable with baseline machines at the network level.

Traffic fallback: A swimlane does not need to deploy the same set of services as the baseline environment. When a service in the call chain does not exist in the swimlane, traffic needs to fall back to the baseline environment and, where necessary, flow back into the swimlane further along the chain. For example, the reviews service that the productpage service depends on does not exist in the dev1 swimlane in Figure 1, so traffic falls back to the reviews service in the baseline (shown by the dark blue line in the figure), and the baseline reviews service then sends the traffic back to the ratings service in the dev1 swimlane.

Traffic label passthrough: Every sidecar must be able to automatically copy the traffic label carried by an inbound request into each outbound request forked from it. This achieves full-link label passthrough and label-based routing; without it, traffic cannot move back and forth between the swimlane and the baseline.

Entry service: The first service that traffic reaches when it enters a swimlane. In Figure 1, an entry service is marked by a triangle on the left border of its graphic.

Figure 1
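The fallback behaviour described above can be sketched in a few lines of Python. This only illustrates the routing decision and is not the mesh's implementation: the `LANE_SERVICES` map, the function names, and the assumption that the dev1 swimlane contains productpage and ratings are hypothetical, chosen to mirror the example in the text.

```python
BASELINE = "baseline"

# Hypothetical deployment map: which services have Pods labeled into each lane.
LANE_SERVICES = {
    "dev1": {"productpage", "ratings"},
}


def pick_environment(service, lane):
    """Return the lane if the service is deployed there, otherwise the baseline."""
    if lane and service in LANE_SERVICES.get(lane, set()):
        return lane
    return BASELINE


def trace_route(call_chain, lane):
    """Resolve every hop of a call chain for traffic labeled with `lane`."""
    return [(service, pick_environment(service, lane)) for service in call_chain]


route = trace_route(["productpage", "reviews", "ratings"], "dev1")
for service, env in route:
    print(f"{service} -> {env}")
```

Because the decision is made independently at every hop, traffic that fell back to the baseline for reviews returns to dev1 for ratings, provided the label is still attached to the request at that point.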

Swimlane technology can be applied to the following scenarios:

Day-to-day development of a single service or of multiple services together. Developers create a swimlane, deploy the services that carry new features into it, and define rules that steer test traffic into the swimlane based on traffic characteristics for validation. Because only the new versions of the services under test need to be deployed in the swimlane, there is no need to build a full-link test environment. In this scenario, pay attention to where the data written by test traffic lands, and clean up the dirty data left behind by development and joint debugging.

Full-link grayscale release. When a major feature launch involves multiple services, swimlanes allow more comprehensive functional verification in the form of a full-link grayscale release. Once the full-link functionality is accepted, the new versions of the services are released to the baseline.

Critical business assurance. For businesses such as retail (for example, POS cash registers), a software failure could trigger a major public backlash. In such cases, business traffic can be isolated through a swimlane to provide an extra level of assurance.

Technical implementation

Traffic marking scheme and implementation

With swimlane technology there are three schemes, which differ in where the traffic is marked. Although the schemes differ, the technical implementation in the service mesh is identical for all of them; they are listed separately only to help the reader understand.

Figure 2 illustrates scheme one. Here, traffic passes through another gateway before reaching the Ingress gateway of the service mesh; we call it the API gateway (for example, Nginx). An API gateway can usually add extra headers, based on traffic characteristics, before forwarding a received request, which completes the marking. For example, the HTTP header x-asm-traffic-lane: dev1 is added to specific traffic, indicating that it should be sent into the dev1 swimlane. In this scheme, Envoy in the service mesh does not need to do any traffic marking.

Figure 2

Figure 3 illustrates scheme two. Here, client traffic goes directly to the Ingress gateway of the service mesh. The Ingress gateway identifies the traffic by its characteristics using Istio's native VirtualService matching rules, adds an HTTP header named x-asm-traffic-lane when forwarding the request, and the traffic is then routed to the corresponding swimlane.

Figure 3

Figure 4 illustrates scheme three. It is essentially identical to scheme two: the traffic is identified by Istio's native VirtualService matching rules and given an HTTP header named x-asm-traffic-lane. The only difference is that Envoy acts as the Ingress gateway in scheme two and as a sidecar in scheme three.

Figure 4

Once traffic is marked, each Envoy in the service mesh passes the label along the full link and routes by label, based on the traffic label and the configuration pushed by the control plane.

Traffic label passthrough

Figure 5 illustrates the traffic details between a service in the service mesh and the Envoy sidecar beside it.

Figure 5

From Envoy's point of view, it forwards both inbound and outbound traffic. I1 is inbound traffic, which Envoy forwards to the local Svc A on receipt; O1 is outbound traffic (a call to another service made in order to handle I1), which Envoy forwards to the external callee on receipt. Inbound and outbound here refer only to requests and have nothing to do with the corresponding responses. One inbound request can lead to multiple outbound requests (that is, "forks"), depending entirely on Svc A's business logic.

The core problem that swimlane technology must solve is this: when inbound traffic carries a label, how do we make every outbound request forked from it carry the same label? Our solution relies on distributed tracing (for example, OpenTelemetry). Distributed tracing uniquely identifies a call tree with a traceId: the root request is assigned a unique traceId, and every new call forked from it must carry an HTTP header with exactly the same value. In other words, service developers must ensure in their code that this header is propagated to subsequent service calls (for example, by calling the OpenTelemetry SDK to propagate headers). Swimlane technology therefore presupposes that every service uses distributed tracing, which is easy to satisfy because tracing is a best practice for microservice architectures. Returning to Figure 5, when Svc A receives and processes request I2 and needs to make call O1, ensuring that the traceId header in I2 is propagated to O1 is a detail that Svc A's developers must pay special attention to.
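The service's responsibility can be illustrated with a minimal sketch. It assumes the trace identifier travels in the x-request-id header (the header Bookinfo propagates, as discussed later in the article); the helper name `build_outgoing_headers` is hypothetical, and a real service would normally delegate this to its tracing SDK rather than copy headers by hand.

```python
# Assumed trace header; the actual name depends on the tracing system in use.
TRACE_HEADER = "x-request-id"


def build_outgoing_headers(incoming_headers, extra_headers=None):
    """Copy the trace header from the call-in request into a call-out request."""
    outgoing = dict(extra_headers or {})
    trace_id = incoming_headers.get(TRACE_HEADER)
    if trace_id is not None:
        outgoing[TRACE_HEADER] = trace_id
    return outgoing


# Headers as Svc A might see them on the call-in request.
incoming = {"x-request-id": "trace-123", "x-asm-traffic-lane": "dev1"}

# Headers Svc A attaches to the call-out request it forks.
outgoing = build_outgoing_headers(incoming, {"accept": "application/json"})
print(outgoing)
```

Note that the service copies only the trace header. It does not need to know about the traffic label at all; restoring the label on the outbound request is left to the sidecar.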

Once every service request in the service mesh carries a traceId, achieving full-link label passthrough in Envoy is straightforward. It roughly involves these steps:

Envoy builds an internal mapping table that records the mapping from traceId to traffic label. For example, the traffic label shown in Figure 5 is carried in the HTTP header x-asm-traffic-lane: the header x-asm-traffic-lane: dev1 denotes the traffic label dev1, and x-asm-traffic-lane: canary denotes the traffic label canary.

When request I1 enters Envoy, Envoy adds a record to the mapping table based on the traceId and traffic label carried in the request.

For each O1 request it receives, Envoy looks up the traffic label in the mapping table using the traceId in the request and adds it to the O2 request before forwarding.
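The three steps above can be sketched as a small in-memory model. This is a simplified illustration in Python, not Envoy's actual code: the class name `LaneSidecar` and its methods are hypothetical, and the default header names follow the ones used in this article.

```python
class LaneSidecar:
    """Toy model of a sidecar that restores the traffic label on outbound calls."""

    def __init__(self, trace_header="x-request-id", lane_header="x-asm-traffic-lane"):
        self.trace_header = trace_header
        self.lane_header = lane_header
        self.lane_by_trace = {}  # step 1: mapping table, traceId -> traffic label

    def on_inbound(self, headers):
        """Step 2: record the traceId-to-label mapping for an inbound request (I1)."""
        trace_id = headers.get(self.trace_header)
        lane = headers.get(self.lane_header)
        if trace_id is not None and lane is not None:
            self.lane_by_trace[trace_id] = lane
        return headers

    def on_outbound(self, headers):
        """Step 3: look up the label by traceId and attach it before forwarding (O1 -> O2)."""
        forwarded = dict(headers)
        lane = self.lane_by_trace.get(headers.get(self.trace_header))
        if lane is not None:
            forwarded[self.lane_header] = lane
        return forwarded


sidecar = LaneSidecar()
sidecar.on_inbound({"x-request-id": "trace-123", "x-asm-traffic-lane": "dev1"})

# The service propagated only the trace header on its call-out request.
o2 = sidecar.on_outbound({"x-request-id": "trace-123"})
print(o2)
```

The sketch never removes entries from the table. How and when records expire is not covered in this article, so that part is deliberately left out.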

The advantage of passing the label through the service mesh based on traceId is that both traffic marking and label passthrough are completely decoupled from the services. The capability sinks into the service mesh, which is already good at traffic governance, so the flexibility of traffic scheduling can be unlocked further.

Defining the traffic label and traceId

We added a new CRD, TrafficLabel, alongside Istio's existing CRDs. We chose to add a new resource rather than extend VirtualService directly because VirtualService was designed from the outset to be scoped to an application. When a business becomes complex enough that many applications need to be put into swimlanes, the VirtualService of every application has to be changed, and the timeliness and operability of doing so become a problem. Another way to extend VirtualService would be to let it configure global rules, but that requires a rule-merging mechanism, which is also problematic in practice. The Istio community has discussed the need to merge multiple VirtualServices; merging is currently supported only on gateways, not on sidecars, because of concerns about failures caused by differences in merge order.

Figure 6 shows how to use a TrafficLabel CR to define a globally effective traffic marking method in the istio-system root namespace. It defines a label named x-asm-traffic-lane, the HTTP request header that holds the traffic label (for example, dev1, dev2, or canary), and obtains the traceId from x-request-id. Users can set this according to the tracing system they have chosen. The figure uses the x-request-id header because Envoy uses x-request-id to uniquely identify a request across the whole call chain. Using x-request-id as the mapping table key also means the swimlane effect can be demonstrated directly with the Bookinfo sample application from the Istio open source community, because every Bookinfo service propagates the x-request-id header from inbound requests to outbound requests.

Figure 6

Routing by traffic label

To support label-based routing, Istio's VirtualService must be extended so that the destination field can specify the traffic destination with a variable such as $x-asm-traffic-lane, as shown in Figure 7. In other words, traffic carrying the header x-asm-traffic-lane: dev2 is sent to the dev2 swimlane, which is backed by a subset named dev2 defined with a DestinationRule, as shown in Figure 8. Note that the name $x-asm-traffic-lane in the VirtualService in Figure 7 must match the name defined in the TrafficLabel in Figure 6.

Figure 7

Figure 8

The DestinationRule in Figure 8 defines only dev2 in addition to baseline, and Figure 7 is the corresponding VirtualService definition. Together they correspond to the baseline and dev2 swimlanes in Figure 1.
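The idea of a variable destination can be illustrated with a short Python sketch. It is not the extended VirtualService syntax itself (Figures 7 and 8 show that); the function `resolve_subset` is hypothetical, and the choice to use the baseline subset when the header is missing or names an undefined subset is an assumption made here to match the fallback behaviour described earlier.

```python
def resolve_subset(subset_template, headers, defined_subsets, default="baseline"):
    """Resolve a destination subset that may be written as a $header-name variable."""
    if subset_template.startswith("$"):
        # Variable form: the subset name is the value of the named request header.
        subset = headers.get(subset_template[1:], default)
    else:
        # Literal form: the subset name is used as written.
        subset = subset_template
    # Assumption: subsets that are not defined resolve to the baseline.
    return subset if subset in defined_subsets else default


# Subsets as described for Figure 8: baseline plus dev2.
subsets = {"baseline", "dev2"}

to_dev2 = resolve_subset("$x-asm-traffic-lane", {"x-asm-traffic-lane": "dev2"}, subsets)
unlabeled = resolve_subset("$x-asm-traffic-lane", {}, subsets)
print(to_dev2, unlabeled)
```

A single rule written with the variable covers every lane, which is why adding a lane does not require rewriting the routing rule for each label value.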

Product implementation

In the cloud-native context, ease of use is in the spotlight, and we appreciate the reasons behind that. When designing the product's interaction, we therefore try to set aside what we already know, think and optimize from the scenario the user faces, and balance functionality with ease of use.

We assume that before using swimlanes the user has already built a baseline environment containing all services. In K8s, the baseline environment is typically deployed in a specific namespace so that its services are easier to operate and manage. To create a swimlane, the user only needs to provide its name. The rest of this section walks through creating a swimlane named dev2.

Once the swimlane is created, services need to be published to it. Because a service being published already exists in the baseline environment and its K8s Service resource has been created, publishing it to the swimlane really means creating another Deployment under that Service, which can be understood intuitively as creating another software version of the existing service. As you might expect, the publish action includes confirming the baseline version, the number of instances, and the container image address.

After publishing services to a swimlane, use the swimlane's service list to confirm that all of them have started correctly. At this point no traffic is entering the swimlane; traffic from the baseline has to be steered into it by configuring drainage (traffic-steering) rules.

Drainage rules can be configured on HTTP headers, URIs, and cookies, so that exactly the traffic under test is selected into the swimlane. The rule in the following figure steers traffic whose HTTP header end-user is dev2 into the dev2 swimlane. When configuring a rule, make sure to specify the entry service correctly.
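To make the matching concrete, here is a minimal Python sketch of a drainage (traffic-steering) rule that matches on headers, a URI prefix, and cookies, then marks the request with the lane header. The `SteeringRule` class and `mark_request` function are hypothetical names; the product expresses such rules through its console and Istio resources, not through this code, and exact-match and prefix-match semantics are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SteeringRule:
    """Hypothetical rule: every configured condition must match."""
    lane: str
    headers: dict = field(default_factory=dict)  # exact-match header values
    uri_prefix: str = ""                         # empty prefix matches any URI
    cookies: dict = field(default_factory=dict)  # exact-match cookie values

    def matches(self, request_headers, uri, request_cookies):
        return (
            all(request_headers.get(k) == v for k, v in self.headers.items())
            and uri.startswith(self.uri_prefix)
            and all(request_cookies.get(k) == v for k, v in self.cookies.items())
        )


def mark_request(rules, request_headers, uri="/", request_cookies=None):
    """Return a copy of the headers, labeled with the lane of the first matching rule."""
    marked = dict(request_headers)
    for rule in rules:
        if rule.matches(request_headers, uri, request_cookies or {}):
            marked["x-asm-traffic-lane"] = rule.lane
            break
    return marked


# The rule described in the text: end-user == dev2 goes to the dev2 swimlane.
rules = [SteeringRule(lane="dev2", headers={"end-user": "dev2"})]

labeled = mark_request(rules, {"end-user": "dev2"}, uri="/productpage")
untouched = mark_request(rules, {"end-user": "alice"}, uri="/productpage")
print(labeled, untouched)
```

Requests that match no rule are left unlabeled, so they continue to be served by the baseline.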

Once the drainage rule is applied, you can sign in on the web page with the username dev2 to see the effect of the services in the dev2 swimlane. The following two images show the page as served entirely by the baseline and as served through the dev2 swimlane, respectively. Because the productpage and details services are not deployed in the dev2 swimlane, both fall back to the baseline, so the "The Comedy of Errors" and "Book Details" content is exactly the same in the two images.

After services are published to the swimlane, the swimlane service list makes it easy to compare the traffic of each service with that of its baseline version, which helps developers understand how services are performing in the swimlane.

In addition, the service topology diagram clearly shows how services in the dev2 swimlane (lane-dev2 in the diagram) are invoked.

Summary and outlook

The service-mesh-based swimlane technology we are exploring lets developers create isolated environments in seconds for development and testing or for critical business assurance, and minimizes the "blast radius" through precise drainage rules. It delivers the new experience and new value of cloud-native service mesh technology.

Next, we will further integrate swimlanes with version grayscale release in a scenario-oriented way, so that users can use these features intuitively. At the functional level, we will extend the protocols supported in swimlanes, such as RocketMQ and Dubbo 3.0, and maximize the value of swimlane technology by enriching its application scenarios.

Finally, we will continue to build a modern service governance platform for microservice architectures under the concept of Service Mesh as Infra, and work with industry partners to accelerate the development and adoption of this new technology.

About the Author:

Li Yun (alias: Zhijian) is the Alibaba Cloud Service Mesh and Hybrid Cloud Product Technology Leader. In 2018 he began leading a team within Alibaba Group to develop and build service mesh technology, and he has given many talks on cloud native and service mesh technology at QCon.