Cui Jianfei (Mu Yi) Ali Developer 2023-05-18 09:02 Posted in Zhejiang

An interesting case of a grayscale design flaw, and a discussion of grayscale scheme design

Ali Mei's guide

Grayscale release is very important, and the grayscale strategy needs to be adjusted flexibly to the actual situation. This article shares a grayscale design bug found some time ago.

1. Case study

I would like to share a grayscale design bug found some time ago. It is an interesting one: the scheme looked flawless, but it turned out to be defective because certain technical characteristics had not been considered. To make the explanation smoother, let me first introduce two terms.

Grayscale release: a release method that transitions smoothly between black and white. A/B testing is one form of grayscale release: some users continue to use A while others start to use B, and if users have no objection to B, the scope is gradually expanded until all users are migrated to B. Grayscale release helps keep the overall system stable, because problems can be found and corrected in the early grayscale stage, which limits their impact. (Quoted from Wikipedia)

Secure production environment (SPE): a production environment that carries grayscale traffic, used to protect online stability. Before a release moves from pre-release to the online production environment, it must pass traffic verification in SPE.

1.1 Case Description

When one application needs to consume the same topic message twice, inconsistent configurations across environments cause some messages to be missed.

1.2 Case Background

The requirement was to replace several settlement models that had diverged over time with a single unified settlement model, and the new model differs considerably from the old ones. For Notify message processing, the original plan was to rewrite the logic inside the existing consumer class, but that was costly. Since the old models were due to be taken offline anyway, a new message processing class was written for the sake of code cleanliness, subscribing through a separate consumer group. As a result, the same topic message is subscribed to by two consumer groups and consumed by two consumer classes.

1.3 Scenario Description

The original grayscale scheme was as follows. A grayscale effective time is configured, and the tail number of the buyer ID in the message is used as the traffic-switching condition. When a message arrives, the consumer first checks whether the current time is later than the grayscale effective time. If it is, the consumer then checks whether the buyer's tail number falls within the grayscale range, and if so, the message goes through the new settlement process. If either condition is not met, the message follows the old settlement process.
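
The two-condition check described above can be sketched roughly as follows. This is only an illustration: the class name `OriginalGrayRouter`, its fields, and the choice of "buyer ID modulo 100" as the tail number are assumptions made for the example, not the project's actual code.

```java
/** Hypothetical sketch of the original two-condition grayscale check. */
class OriginalGrayRouter {
    private final long grayStartMillis; // grayscale effective time (epoch millis), assumed format
    private final int grayPercent;      // buyers whose tail number (0-99) is below this use the new flow

    OriginalGrayRouter(long grayStartMillis, int grayPercent) {
        this.grayStartMillis = grayStartMillis;
        this.grayPercent = grayPercent;
    }

    /** Returns true when the message should go through the new settlement process. */
    boolean useNewSettlement(long buyerId, long nowMillis) {
        // Condition 1: the current time must be later than the grayscale effective time.
        if (nowMillis <= grayStartMillis) {
            return false;
        }
        // Condition 2: the buyer's tail number must fall inside the grayscale range.
        long tail = Math.floorMod(buyerId, 100L);
        return tail < grayPercent;
    }
}
```

If either condition fails, the method returns false and the caller stays on the old settlement process, which matches the rule stated above.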

1.4 Scheme defect analysis

Taken on its own, this is a sound and complete scheme, but against the background of this settlement model migration it has a flaw, and the root of the flaw is the double message. How does double consumption cause problems? When a grayscale configuration is pushed in steps, it must stay in SPE for 1 hour before it reaches online production, and during that hour some messages can be missed. Suppose SPE and online production are both at 10%, and the SPE configuration is then promoted to 20% while online production stays at 10%. If, of the two copies of a message, the old group's copy is delivered to SPE and the new group's copy is delivered to online production, there is a problem: the old group's consumer class in SPE only handles the 20%-100% range of traffic, while the new group's consumer class in online production only handles the 0-10% range, so messages in the 10%-20% range are handled by neither.
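
The gap can be reproduced with a small simulation. The sketch below assumes a simplified model in which the new-group consumer handles a message only when the buyer tail number is inside the local gray range, and the old-group consumer handles it only when the tail number is outside the local gray range. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the dual-group consumption in this case. */
class DualGroupGapDemo {

    /** New-group consumer: handles a message only when the tail is inside the gray range. */
    static boolean newConsumerHandles(int tail, int envGrayPercent) {
        return tail < envGrayPercent;
    }

    /** Old-group consumer: handles a message only when the tail is outside the gray range. */
    static boolean oldConsumerHandles(int tail, int envGrayPercent) {
        return tail >= envGrayPercent;
    }

    /**
     * Tail numbers (0-99) that neither consumer handles when the new group's copy lands in an
     * environment configured at newGroupEnvPercent and the old group's copy lands in an
     * environment configured at oldGroupEnvPercent.
     */
    static List<Integer> missedTails(int newGroupEnvPercent, int oldGroupEnvPercent) {
        List<Integer> missed = new ArrayList<>();
        for (int tail = 0; tail < 100; tail++) {
            if (!newConsumerHandles(tail, newGroupEnvPercent)
                    && !oldConsumerHandles(tail, oldGroupEnvPercent)) {
                missed.add(tail);
            }
        }
        return missed;
    }
}
```

Under this model, `missedTails(10, 20)` (new group's copy in online production at 10%, old group's copy in SPE at 20%) returns the tail numbers 10 through 19, while `missedTails(20, 20)` returns an empty list because both environments agree.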

1.5 Special features of the case

1. The settlement model upgrade, for business reasons, introduced dual message consumption.

2. The new and old consumers both listen to every message without distinction, which leads to missed consumption.

3. The SPE mechanism holds each configuration in secure production for 1 hour.

1.6 Improvement Strategy

1. The master grayscale time switch remains unchanged.

2. The buyer tail-number strategy is changed to key-value pairs in which the tail-number range is the key K and its effective time is the value V. V is a time in the future that satisfies "V > system time + SPE residence time". In the example below, traffic is pushed from 50% to 100%, and the 100% configuration takes effect at a future time.

[
    {
        "rate":5000,
        "whiteList":[
        ],
        "activeTime":1682928000000
    },
    {
        "rate":10000,
        "whiteList":[
        ],
        "activeTime":1683795376000  // a future time
    }
]
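
One possible way to evaluate such a time-keyed configuration is sketched below. It assumes that `rate` is expressed out of 10000 (so 5000 means 50%), that a buyer hits grayscale when "buyer ID modulo 10000" is below the effective rate, and that the rule with the latest `activeTime` not in the future wins. The class `TimedGrayRule` is hypothetical, and the `whiteList` field is left out for brevity.

```java
import java.util.List;

/** Hypothetical rule object mirroring one entry of the JSON configuration. */
class TimedGrayRule {
    final int rate;        // out of 10000, e.g. 5000 = 50% (assumed unit)
    final long activeTime; // epoch millis at which this rule takes effect

    TimedGrayRule(int rate, long activeTime) {
        this.rate = rate;
        this.activeTime = activeTime;
    }

    /** Returns the rate of the most recently activated rule, or 0 if no rule is active yet. */
    static int effectiveRate(List<TimedGrayRule> rules, long nowMillis) {
        TimedGrayRule current = null;
        for (TimedGrayRule rule : rules) {
            boolean active = rule.activeTime <= nowMillis;
            if (active && (current == null || rule.activeTime > current.activeTime)) {
                current = rule;
            }
        }
        return current == null ? 0 : current.rate;
    }

    /** True when the buyer falls inside the currently effective grayscale range. */
    static boolean hitsGray(long buyerId, List<TimedGrayRule> rules, long nowMillis) {
        return Math.floorMod(buyerId, 10000L) < effectiveRate(rules, nowMillis);
    }
}
```

Because the decision depends on the clock rather than on when each environment received the configuration, SPE and online production can both hold the new rule well before it takes effect and then switch at the same moment, provided their clocks agree.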

After following up on this problem, I went through the department's failure review documents from the past year, and almost every failure and asset-loss review mentioned grayscale. As in this case, applying grayscale control to "non-deterministic behavior" is a common pitfall for newcomers. So, drawing on my own experience on the trading line, I will describe grayscale as I understand it: what an MVP version of a grayscale scheme needs at minimum, and what to watch out for in grayscale design.

2. Designing an MVP version of the grayscale scheme

2.1 What is the grayscale scheme?

Before talking about HOW, it is necessary to talk about WHAT and WHY. Start with WHY. Suppose there is no control process and code goes fully live as soon as it is released. If there is then a problem in the system, the data, or the logic, disaster follows, and the larger the business volume, the greater the exposure to risk, public opinion, and asset loss, and the harder the damage is to repair. The release process therefore needs control and supervision that confine risk to a limited scope, and a release process of this kind is grayscale. Smooth transition is the defining feature of grayscale, and it gives grayscale release at least two roles:

1. Reduce release risk. A small number of users try the new version or feature first, so that bugs and performance problems are found early within a small scope and fixed in time, which limits the impact of the new version.

2. Compare the old and new versions to observe the effect of the new feature, which makes for a smoother transition.

2.2 MVP (Minimum Viable Product) version of the grayscale scheme

2.2.1 Clarify the grayscale dimension

Common grayscale rules include user tail number (buyer or seller), business document ID (such as item, order, settlement note, or waybill), blacklists and whitelists, and crowd selection (such as targeted delivery). Blacklists, whitelists, and crowd selection are somewhat similar to A/B testing and allow more precise functional testing online. They are generally used in two scenarios:

1. When the risk is high and functional testing cannot cover everything, a whitelist or crowd selection can serve as the first grayscale stage.

2. When a feature is controversial or is in beta testing, this method can be used to collect feedback.

Using a user ID or business document ID as the grayscale condition is the more general approach.

The tail-number strategy is generally combined with a whitelist strategy for safer control. From my current project practice, here is a fairly general-purpose grayscale formula:

"1. Whitelist (individual users) --> 2. Buyer tail number 1% --> 3. Tail number 3% --> 4. Tail number 10% --> 5. Tail number 30% --> 6. Tail number 50% --> 7. Tail number 70% --> 8. Tail number 100%".

This formula works in most scenarios. Some readers may ask whether the seller ID can be used as the grayscale dimension. In most cases it can, but a seller ID rule is more likely to hit a very large merchant, which can have two effects: 1. a concentrated batch of documents is affected at once; 2. compared with the buyer dimension, the seller dimension is more likely to hit best-selling hot items and cause database hotspots. The choice of grayscale dimension therefore deserves careful thought, and I think the following principles should be followed.

1. Avoid biased samples: keep the sample as random as possible, approximating a uniform distribution.

This is easy to understand. Suppose you want to find out how many people have taken the Hangzhou subway, and you hand out the questionnaire inside a Hangzhou subway station: besides getting a 100% result, you will certainly collect a few eye-rolls. The grayscale dimension must likewise keep the sample random. User IDs and item IDs generally meet everyday grayscale needs well. With something like a business-defined crowd ID, you have to pay strict attention to the randomness of the sample when selecting.

2. Relax the screening conditions gradually, from strict to loose.

For example, when we upgraded the middle platform for a payment and refund business last year, after all verifications were complete the grayscale strategy included a clause that gradually raised the permitted refund amount: "refund orders up to 1 yuan --> 10 yuan --> 30 yuan --> 100 yuan --> 300 yuan --> fully released". This is a typical strict-to-loose grayscale strategy at the risk-control level.

2.2.2 Process observation and advancement

Grayscale is a gradual process from white to black, and during it both "grayscale observability" and "grayscale process management" must be in place.

Grayscale observability

The grayscale process must be observable so that problems are found in time and grayscale delivers its real value. I have summarized four points that form the minimum set for grayscale observability: complete traffic logs, reconciliation alarms, timely feedback channels, and attention to performance. Let's go through each of them.

Strategy 1: Complete traffic logs

Complete logs are essential for finding problems during grayscale. Detailed logs of the processing logic, and especially of errors and exceptions, are a precondition for grayscale monitoring. Engineers can judge traffic health by watching the error logs, and they can also use Sunfire's statistical monitoring to set threshold alarms on errors during the grayscale process.

Strategy 2: Reconciliation alarms

Reconciliation is another common observation strategy in grayscale projects, whether a new service is being rolled out gradually or traffic is being switched gradually from an old service to a new one. When a new, funds-sensitive business is launched, reconciliation during the gradual grayscale process is used to detect invalid documents. I use this strategy often in day-to-day work, for example reconciling the orders marked with the new-logic tag during the RP3 migration grayscale. Complete logs observe from the system-processing perspective, while reconciliation monitoring observes from the data perspective; the two complement each other and neither can be omitted.

Strategy 3: Internal feedback channels and public-opinion monitoring

This strategy is generally used in the whitelist stage. The merchants chosen for the whitelist are friendly ones that communicate frequently with the group's service staff, and before any percentage-based switch, a pilot is run with these individual merchants to watch for anomalies. Once the whitelist stage has passed, if unavoidable risks remain during the traffic-switching stage, engineers need to follow customer service feedback closely and, where necessary, give customer service a unified script for responding.

Strategy 4: Focus on performance issues in the grayscale process

Grayscale is used to observe performance as well as functionality. As grayscale traffic gradually increases, any performance difference between the old and new functions is gradually amplified too. For example, if a change makes the new and old models obtain a certain field through different channels, the performance difference between the two models needs attention. Business log monitoring is then not enough: application-level system monitoring must also be in place. As the grayscale range expands, interface performance deserves particular attention, including rising RT in dependencies, rising RT in the service itself, and database hotspots.

Grayscale process management

A good grayscale process needs to be elegant and reliable. I believe "orderly advance" and "timely rollback" are both essential considerations.

Orderly advance

The grayscale process from 0 to 1 is one of advancing while observing: traffic is released step by step only after effective observation through logs, reconciliation, automation, and other means, and after a sufficient volume of documents has accumulated. One thing to check is whether the old logic can still consume normally when the configuration is invalid. For example, the pseudocode below branches on the user's tail number, which is set on the grayscale configuration platform. The code looks normal, but it hides a significant risk:

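// Expects to read an int value from the grayscale configuration, but grayRange was configured as the string "50%"
int grayRange = GrayHanlder.getConfig("grayscale config id").getInteger("grayRange");
if(userId % 100 < grayRange){
    // go through the new logic
}else{
    // go through the old logic
}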

Executing the code above raises an error, so neither the new nor the old logic runs. Of course, such a basic mistake rarely happens in real work, but the point is that when grayscale advancement depends on configuration, the grayscale logic itself must be verified and each advancement must be made with great care. Grayscale code is usually just an if/else, but the impact behind it is huge.
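
One defensive option is to treat any invalid configuration value as "grayscale off", so that the old logic keeps consuming normally. The sketch below illustrates that idea only; `SafeGrayConfig`, `parseGrayRange`, and `route` are hypothetical names, and the raw configuration value is assumed to arrive as a string.

```java
/** Hypothetical helper that never lets a bad grayscale config break the old logic. */
class SafeGrayConfig {

    /**
     * Parses the configured gray range (expected 0-100).
     * Any missing, malformed, or out-of-range value yields 0, which keeps all traffic on the old logic.
     */
    static int parseGrayRange(String raw) {
        if (raw == null) {
            return 0;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return (value < 0 || value > 100) ? 0 : value;
        } catch (NumberFormatException e) {
            // e.g. "50%" was configured instead of 50
            return 0;
        }
    }

    /** Returns "new" or "old" depending on the user's tail number and the parsed range. */
    static String route(long userId, String rawGrayRange) {
        int grayRange = parseGrayRange(rawGrayRange);
        return Math.floorMod(userId, 100L) < grayRange ? "new" : "old";
    }
}
```

With this fallback, a mistyped value such as "50%" sends every user down the old path instead of raising an error. In practice the parse failure should also be logged and alarmed on, since silently staying at 0% hides the misconfiguration.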

Advancing the grayscale process must be based on effective observation of real traffic, not on blind faith in how long the grayscale has been running. Here is a real failure case:

Case name: The sub-account ID for members of a B-series business was written incorrectly, so refunds could not be processed.

Cause: While fixing a member login bug, the logic for obtaining a rarely used field was changed. A B-series business relied on this field for permission control, so the business was affected.

Grayscale process: The release did include a grayscale stay, but it took place at night, and for the affected B-series business night is a traffic trough. As a result, problems during grayscale were not noticed in time and grayscale did not have its intended effect.

Grayscale improvement: For systems involving member and merchant operations, the grayscale period must cover the peak of merchant activity (8 am - 10 am).

Timely rollback

When the live behavior does not meet expectations, grayscale rollback needs to be considered. The grayscale configuration usually includes a logical switch whose value is true or false; only when it is true does a request go on to the finer-grained grayscale checks. When observation shows unexpected results, the switch can quickly be set to false, through emergency approval if necessary, so that no new traffic enters the grayscale flow.

Rollback only stops the problem from spreading; documents already affected still need to be handled properly. In general there are two options.

1. Correct the data: write a correction interface or change the DB data in batches, paying attention to compliance.

2. Fix quickly and release a new version, "rolling back" by upgrading so that the new release overwrites the changes made in grayscale.

2.3 Grayscale tool

Only the grayscale tools commonly used on the server side are covered here. Such a tool is generally a lightweight configuration platform: the configuration is separated from the code and can be advanced clearly and quickly, so that advancing grayscale only requires a configuration update, not a code release. The configuration can be as simple as a String or a JSON document, as shown below.

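{
    "flag": true,           // master switch: true = on, false = off
    "buyerAccessFlow": 10,  // buyer tail-number control: currently buyers with tail numbers 0-9 enter the recovery flow
    "amountLimit": 30000    // amount control: only amounts below 300 yuan enter the recovery flow
}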
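
A configuration like the one above might be evaluated as in the following sketch. The field names come from the JSON, but the class `RecoveryGrayConfig`, the interpretation of `amountLimit` as an amount in cents (30000 = 300 yuan), and the use of "buyer ID modulo 100" as the tail number are assumptions for illustration.

```java
/** Hypothetical in-memory form of the JSON grayscale configuration. */
class RecoveryGrayConfig {
    final boolean flag;        // master switch
    final int buyerAccessFlow; // tail numbers 0..buyerAccessFlow-1 are in grayscale
    final long amountLimit;    // upper bound on the amount, assumed to be in cents

    RecoveryGrayConfig(boolean flag, int buyerAccessFlow, long amountLimit) {
        this.flag = flag;
        this.buyerAccessFlow = buyerAccessFlow;
        this.amountLimit = amountLimit;
    }

    /** True only when the master switch is on and both finer-grained conditions are met. */
    boolean hit(long buyerId, long amountInCents) {
        if (!flag) {
            return false; // master switch off: nothing enters the grayscale flow
        }
        if (Math.floorMod(buyerId, 100L) >= buyerAccessFlow) {
            return false; // buyer tail number outside the grayscale range
        }
        return amountInCents < amountLimit;
    }
}
```

Checking the master switch first means that setting `flag` to false is enough to stop all new traffic from entering the grayscale flow, whatever the other fields say.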

3. Other common grayscale questions

3.1 Releasing machines in batches is not a grayscale strategy in the strict sense

Many people who are not very familiar with grayscale think that releasing in batches is also a grayscale strategy. Strictly speaking, batch release of an application is done more for system stability than for grayscale verification. Consider rollback: if a problem or dirty data is found during a batch release, the affected data is random, it cannot be located through any characteristic, and so it cannot be corrected or rolled back. That is why batch release is not a grayscale strategy in the strict sense.

3.2 Consistency principles for grayscale design

Grayscale means that two sets of processing rules or flows are live at the same time, and a business document generally produces different results in each. It is therefore essential to keep each document consistent throughout the grayscale process; otherwise online problems are likely. Here is a real example:

Case name: An e-commerce business added a refund payment channel, and an unreasonable grayscale strategy resulted in payouts through both channels.

Case description: When the agree-to-refund RPC was called, the first call hit the tail-number grayscale and entered the combined channel. The call inside the combined channel threw an exception, but the channel retries on its own. During the retry the user clicked a second time, and by then the grayscale strategy had changed, so the second call entered the other channel, which made the payment successfully.

Case analysis: Payment idempotency was broken, because the two channels cannot see each other's payouts.

Case resolution: The request is marked when the refund is applied for, grayscale behavior depends entirely on that mark, and all subsequent processing follows it, which avoids inconsistent grayscale behavior across calls. The grayscale decision for all later refund steps was moved to the apply-for-refund stage: once the grayscale rule is hit there, the refund is tagged with a grayscale mark, and every later step follows the mark.

This is a classic problem: data consistency was not guaranteed during grayscale, so two payment strategies ran for the same refund and idempotency was destroyed. How can grayscale consistency be ensured? I think there are three principles to follow:

Principle 1: The grayscale hit is evaluated only once

In the case above, placing the grayscale check in "agree to refund" makes it very easy for successive calls to take different paths. Instead, the check can be moved forward to "apply for refund", where the result is recorded as a mark; the later "agree to refund" step simply follows the mark, so the grayscale hit is evaluated only once. "Agree to refund" is the function being grayscaled, but "apply for refund" is the actual grayscale object. The function under grayscale and the object on which grayscale is implemented therefore do not have to be the same.
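
The "mark once, follow the mark afterwards" idea can be sketched as below. Everything here is illustrative: `RefundGrayMarker`, the in-memory map standing in for a persisted tag on the refund document, and the channel names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: decide grayscale at "apply for refund", reuse the mark at "agree to refund". */
class RefundGrayMarker {
    // Stand-in for a mark persisted on the refund document (for example, a tag column).
    private final Map<String, Boolean> markByRefundId = new HashMap<>();

    /** Called at "apply for refund": evaluates the rule once and stores the result. */
    boolean applyRefund(String refundId, long buyerId, int grayPercent) {
        return markByRefundId.computeIfAbsent(
                refundId, id -> Math.floorMod(buyerId, 100L) < grayPercent);
    }

    /** Called at "agree to refund" and on every retry: reads the mark, never the live config. */
    String agreeRefund(String refundId) {
        Boolean mark = markByRefundId.get(refundId);
        if (mark == null) {
            throw new IllegalStateException("refund was never marked: " + refundId);
        }
        return mark ? "combined-channel" : "legacy-channel";
    }
}
```

Because `agreeRefund` only looks at the stored mark, a configuration change between the first call and a retry cannot route the same refund to a different channel.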

Principle 2: Ensure grayscale consistency across environments

Many applications with a secure production environment must be observed there for more than 1 hour before a release moves on to online production, and this applies to both code releases and configuration releases. The time difference in between can easily cause grayscale confusion. In the case described at the beginning, messages were delivered indiscriminately to both the secure production and online production environments, the application's configuration differed between the two, and messages were filtered out as a result.

This kind of problem cannot be avoided even by the pre-marking method of principle 1. The better approach is to build in a delay so that online production and secure production stay consistent. For example, push an effective time in the future, and make sure it is later than the moment the full release completes, so that both environments switch together.

Principle 3: Ensure grayscale consistency across applications

If the grayscale process spans multiple applications, the grayscale logic must be consistent across them. In short, for a link of the form "A-->B-->C", either make sure that system B ignores system A's grayscale conditions, or make sure that the grayscale logic is evaluated only in system A.
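
The second option, evaluating the grayscale logic only in system A, can be sketched by having A compute the decision once and pass it along the link. The classes below are hypothetical stand-ins for three applications, and `GrayContext` is an assumed carrier object (in a real system it might be a request header or message attribute).

```java
/** Hypothetical carrier for the grayscale decision made at the entry system. */
class GrayContext {
    final long userId;
    final boolean gray;

    GrayContext(long userId, boolean gray) {
        this.userId = userId;
        this.gray = gray;
    }
}

/** Entry system: the only place where the grayscale rule is evaluated. */
class SystemA {
    static GrayContext enter(long userId, int grayPercent) {
        boolean gray = Math.floorMod(userId, 100L) < grayPercent;
        return new GrayContext(userId, gray);
    }
}

/** Downstream system: follows the decision it receives and reads no grayscale config of its own. */
class SystemB {
    static String handle(GrayContext ctx) {
        return ctx.gray ? "B-new" : "B-old";
    }
}

/** Further downstream system: same rule, same decision. */
class SystemC {
    static String handle(GrayContext ctx) {
        return ctx.gray ? "C-new" : "C-old";
    }
}
```

Since B and C never consult their own configuration, a request cannot take the new path in one application and the old path in another.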

3.3 Front-end grayscale strategy

The grayscale strategies above are considered from the server-side perspective. There are also some commonly used grayscale techniques on the front end or the web, which are briefly covered here.

1. CDN resource traffic splitting

Front-end resources are placed on a CDN. Each time a new version is released, the resources are uploaded incrementally to the CDN with a unique version number. When requests are processed, different users are assigned different CDN versions according to the front-end policy and therefore see different styles. The corresponding back-end interface then needs to apply the grayscale policy based on request parameters and distinguish between the different front-end requests. This strategy is generally used when upgrading billing services: because both front end and back end need grayscale, the front end controls the grayscale policy and the back end stays compatible through parameters, so that the different kinds of bills are all supported.

2. Client traffic splitting

Client traffic splitting works the same way as CDN resource splitting: the client controls the grayscale split, and the server decides how to serve the request based on the parameters and version number passed by the client, combined with the current grayscale policy. More dimensions are available for client-side splitting, such as the device operating system, app version number, app installation channel, user ID, and device ID.

4. Closing remarks

Grayscale is very important, and the grayscale strategy needs to be adjusted flexibly to the actual situation. The strategies and opinions in this article are my own. I offer them as a starting point, and discussion is welcome.
