A 9-year-old low-level bug discovery process

Guide

A problem is often the result of the accumulation of multiple small irregularities or errors. This article records the author's whole process of discovering problems, analyzing phenomena, troubleshooting process, and finally solving problems.

Project Background

My project team is mainly responsible for shooting store signs, and I am responsible for the development of the App client. This project has a long history from the beginning of the project to the present.

Now there is a problem: when users take photos, there will be photo corruption, this problem has been in the online environment for a while, and when I took over, this problem has already appeared, and there is no in-depth investigation of the cause. The temporary solution strategy is to let the user manually delete the corrupted photos, and when the images are uploaded, the server will also perform a file corruption detection.

We will send out a variety of shooting task types, some tasks only need to take a few photos, and some tasks need to take thousands of pictures, and this problem will be more exposed. At the suggestion of a colleague, it was decided that you wanted to get to the root of the problem.

phenomenon

I only knew that there was this problem before, and I didn't study it carefully. After self-test + understanding, the following phenomena are preliminarily clarified:

Symptom 1: This issue occurs for different task types

Currently, the same photo storage module is shared for different task types within the project. This phenomenon can be made clear that the scope of error is in the underlying photo storage module, not in the upper-layer business logic.

Symptom 2: There is a stable probability of 1 in 200 that the image is damaged

Through the reproduction with colleagues, it was found that when more than 200 images were taken in a row, a damaged image appeared. We have repeated this many times, and the frequency of occurrence is very expected, and even a little weird, because the frequency of this bug is too stable, but it is a little abnormal. Faced with this phenomenon, two possible scenarios came to mind:

The probability is very close to 1/256 (the value of FF converted to decimal in hexadecimal, the 8th power of 2, the size of a byte), and it is not due to an anomaly when parsing to a certain byte.
Every time more than 200 shots are taken, the temperature of the heavy GC+ mobile phone is too high, resulting in frequency reduction, resulting in stuttering, resulting in a certain step execution timeout or failure.

The above is just speculation, there is no evidence at all, just the direction of thinking at that time.

Symptom 3: This issue occurs only for the webp format

Currently, there are two storage formats for images taken, namely JPEG and WebP format. The project used JPEG as the storage format before, but later in order to reduce the size of the image, it began to use the webp format for storage. This problem does not occur when we change the storage format to JPEG, and this problem occurs when we change to WebP format.

According to the overall time consumption of these two (from image byte streaming to storage to file), the time taken by WebP is about 5 times that of JPEG, and the storage size of JPEG is about 1.5 times that of WebP. Faced with this phenomenon, the idea at that time was that it took a long time to process the image, which led to fierce competition for locks (thread locks, IO locks), and a data conflict occurred at some point.

Troubleshooting process

First of all, let's familiarize yourself with the project code, and here's a flowchart of the entire stored procedure:

The whole process is still relatively simple and easy to understand, and according to the direction of my suspicions at the time, the following order of investigation was formulated:

There was an error when the camera generated the webp picture.
There was an error in the code call logic.
There are problems with the encryption algorithm itself.

Troubleshooting direction 1: An error occurred while suppressing the photo

The image output from the camera is damaged when it is compressed as a webp photo, but it is not damaged when it is pressed by JPEG. This problem is relatively simple, you only need to save the original unencrypted Webp image and compare it with the encrypted image that cannot be decrypted. After practice, it was found that the damaged encrypted image and the corresponding original webp photo can be displayed normally.

Therefore, the problem of mobile phone camera and suppression of webp pictures can be clearly ruled out.

Troubleshooting direction 2: An issue occurs in the encryption process

When calling the AES encryption algorithm, the call may be incorrect. For example, due to accidental circumstances, the same image is called twice in succession. To troubleshoot this issue, you'll need to read through the code in this section and sort it out.

First, I checked the relevant information about the AES encryption algorithm.

AES is the Advanced Encryption Standard, also known as Rijndael Cryptography in cryptography, which is a block encryption standard adopted by the U.S. federal government. This standard is used to replace the original DES, which is now widely used around the world, and AES has become one of the most popular algorithms for symmetric key cryptography. AES supports three key lengths: 128-bit, 192-bit, and 256-bit.

I summarized it myself:

The AES algorithm belongs to symmetric encryption, and only one same key is required for encryption and decryption;
When the AES algorithm encrypts the plaintext, it does not encrypt the entire plaintext into a whole ciphertext, but splits the plaintext into independent plaintext blocks, each of which is 128 bits.
In the absence of padding, the ciphertext and the original text are equal in length.

First, I focused on thread safety issues, checked around, and carefully looked at all the shared variables involved in this process, and found no problems.

The following combs through the encryption and decryption process and finds a very serious problem. This issue occurs in the preview image section with the following code:

public Bitmap decode(ImageDecodingInfo decodingInfo) throws IOException {

    // 如果是本地图片，OriginalImageUri会是'file:///xxx'以此来判断是否正在加载本地图片

    boolean isLoaclFile = decodingInfo.getOriginalImageUri().startsWith("file://");

    String imagePath = null;

    if (isLoaclFile) {

        imagePath = decodingInfo.getImageUri().replaceFirst("file://", "");

        if (!TextUtils.isEmpty(imagePath) && new File(imagePath).exists()) {

            ImageEncryptTool.decrypt(imagePath);

        } else {

            return null;

        }

    }




    Bitmap bitmap = super.decode(decodingInfo);

    if (isLoaclFile) {

        ImageEncryptTool.encrypt(imagePath);

    }

    return bitmap;

}

// 解密代码

public static void encrypt(String filePath) throws IOException {

    try {

        RandomAccessFile raf = new RandomAccessFile(file, "rw");

        byte[] buffer = new byte[ENCRYPTED_SIZE];

        raf.read(buffer);

        buffer = JniArithmetic.aesEncryptNoPadding(buffer);

        raf.seek(0);

        raf.write(buffer);

        raf.close();

    } catch (IOException e) {

        e.printStackTrace();

        throw e;

    }

}

When previewing a picture on your phone, you need to read the photo from the phone's disk, decrypt it, and display it on the screen as a bitmap. The process of the above code is as follows:

The logic here is very unreasonable, first read the disk file to the memory for decryption, and then write it back to the disk after decryption, at this time, the picture on the disk is a decrypted data, and then hand over the image loading frame to load the picture, and then encrypt it after the loading is completed, and write back to the file again after encryption.

This process requires multiple I/O operations and is inefficient to execute. This process is not guaranteed to be "atomic", and any problems in the process will result in the final image being corrupted, such as a crash after decryption is complete, resulting in no encryption. The unreasonable way greatly increases the probability of error.

In addition, there is a more serious problem, when the decrypted file overwrites the original file, another thread may call this file, will decrypt the decrypted file again, and then write it back after decryption, and finally the file will be "a mess", which will cause image damage.

Modify it to the following code:

@Override
protected InputStream getImageStream(ImageDecodingInfo decodingInfo) throws IOException {
    // 如果是本地图片，OriginalImageUri会是'file:///xxx'以此来判断是否正在加载本地图片
    boolean isLoaclFile = decodingInfo.getOriginalImageUri().startsWith("file://");
    if (!isLoaclFile) {
        return super.getImageStream(decodingInfo);
    }
    // 解密本地图片
    InputStream imageStream = super.getImageStream(decodingInfo);
    byte[] encodeByteArray = inputStreamToByteArray(imageStream);
    final int ENCRYPTED_SIZE = 1024;
    byte[] decodeBuffer = new byte[ENCRYPTED_SIZE];
    System.arraycopy(encodeByteArray, 0, decodeBuffer, 0, ENCRYPTED_SIZE);
    decodeBuffer = JniArithmetic.aesDecryptNoPadding(decodeBuffer);
    System.arraycopy(decodeBuffer, 0, encodeByteArray, 0, ENCRYPTED_SIZE);
    return new ByteArrayInputStream(encodeByteArray);
}

In the correct and sensible process, the decryption operation will only take place in memory and will not be written to disk, so that there will be no overwriting of all kinds.

Finally, the whole project was checked, and all the [decryption and re-encryption] processes were removed, and the whole project retained an encrypted place, which was encrypted when the picture was saved for the first time, and then written to the disk.

Successfully resolved?

Such an obvious mistake has been solved, and at this time, I think that there must be no problem. With great confidence, I did some testing, but I still had this problem. I was a little uncertain at first, but after trying it many times, I was probably 100% sure that the problem was still unresolved.

Troubleshoot Direction 3: Lock contention

At this time, I shifted my perspective to the direction of thread safety, and in order to make the time-consuming of the entire encrypted storage controllable, I decided to implement the encryption and decryption algorithm by myself. Of course, my own implementation of the encryption and decryption algorithm is simple. Here's the code:

private static final int N=100;




public static byte[] aesEncrypt(byte[] data) {

    for (int i = 0; i < N; i++) {

        data[i] = (byte) (data[i] + 1);

    }

    return data;

}




public static byte[] aesDecrypt(byte[] data) {

    for (int i = 0; i < N; i++) {

        data[i] = (byte) (data[i] - 1);

    }

    return data;

}

Just simply +1 the value of each byte, and then decrypt each byte -1 when decrypting, I can use the size of N to control the encryption and decryption time. I conducted another test, and no matter how I adjusted the encryption and decryption time, there was no damage to the picture.

With the available information, the encryption and decryption algorithm should be quite problematic. But I still believe that there is no problem with the encryption algorithm, the encryption code has been around for 9 years, and it uses a very mature AES encryption algorithm, so there should be no problem.

Troubleshoot Direction 4: Image issues

Take out the damaged picture from the mobile phone separately, and use the encryption algorithm and decryption algorithm to process the picture separately. The following data was obtained:

Figure 1: Unencrypted original photo, you can see that it starts with RIFF, which is a flag bit used to identify the webp file

Figure 2: Encrypt the result as follows the code flow

Case1: Run the original image separately with the encryption algorithm, and find that the content is consistent with Figure 2;
Case2: After trying to decrypt the encrypted image, the decryption algorithm was used to run it alone, and the content was found to be inconsistent with Figure 1.

Corresponding to this picture, the decrypted content is printed out every time, and it is found that sometimes it is correct, sometimes it is random, and the result may not be the same if the order of execution is modified. Since the encryption and decryption algorithm is written in C++, based on past experience, it is speculated that this situation is caused by the residual memory of C++.

The server also needs to decrypt the image when using it, because the server is not afraid of code leakage, so it directly uses the Java library to implement the AES decryption algorithm. With the cooperation of colleagues on the server, I uploaded the image separately, tried many times, and found that the server can decrypt it stably.

With the help of my colleagues on the server side, I got the server-side decryption algorithm, and I replaced the decryption algorithm on the side with the server-side decryption algorithm, and this damaged picture can be displayed correctly. Finally, after some testing, it was found that there was no more damage to the picture.

At this point, there is already 95% certainty that it can be proved that it is caused by the decryption algorithm. At this time, you can also get off work with peace of mind, and check the decryption algorithm the next day.

排查方向5:AES解密算法

First asked my colleague for the git address of the encryption and decryption repository, because this code has a long history, at the beginning of the project, SVN was used for management, and later the whole was packaged and put into the git repository during the migration, and the commit record can no longer be seen.

The architecture of the project itself is also relatively old, using Android.mk as a build tool, which has been abandoned for a long time now, and I have not touched it. With the help of ChatGPT, it automatically helped me convert to a modern CMakeLists building tool. Then you can trace the code normally.

The AES algorithm itself is relatively simple, but it keeps replacing the original text according to the password table. There are no third-party libraries used in the code, and the AES algorithm is implemented by itself.

The decryption algorithm is as follows:

void AES::InvCipher( BYTE *input, BYTE *output, int len)

{

    int arrLen = len;

    unsigned char *uch_input = new unsigned char[arrLen + 1];

    strToUChar((const char*)input, uch_input, len);

    for (int i = 0; i < arrLen / 16; i++) {

        InvCipher(uch_input + i*16);

    }

    ucharToStr((const unsigned char*)uch_input, (char *)output, len);

    delete[] uch_input;

}




int AES::strToUChar(const char *ch, unsigned char *uch, int len) {

    int tmp = 0;

    if (ch == NULL || uch == NULL)

        return -1;

    if (strlen(ch) == 0)

        return -2;

    while (len) {

        tmp = (int) *ch;

        *uch++ = tmp;

        ch++;

        len--;

    }

    *uch = (int) '\0';

    return 0;

}

When the line of IF (strlen(ch) == 0) is traced, it is found that -2 will be returned as an error code, and the above processing will directly ignore the error code and continue to decrypt. Of course, at this time, it will definitely not be possible to decrypt successfully.

Cause of the error

I've made a reasonable guess about this approach, and at first the encryption here is only for strings, so when entering parameters, it will be judged whether it is an empty string or not.

In C++, strings are usually represented as an array of characters, following the tradition of the C language. A string in C/C++ is an array of characters ending with a null character \0 (ASCII value of 0). This type of string is called a C-style string or a null-terminated string. The way to determine the end of a string is to check each character until \0 is encountered.

char str[] = "hello";
// 字符串实际存储为 {'h', 'e', 'l', 'l', 'o', '\0'}

IN JAVA, THE STRING ITSELF STORES THE LENGTH OF THE STRING. This length field is computed and stored when the String object is created, so it is possible to call the length() method very quickly to get the length of the string without having to iterate through the entire array of characters.

But then the binary image needs to be encrypted, and the binary image will also use the '0x00' byte in the storage process, but the meaning of the '0x00' in the string is completely different. If the first byte of the binary image is 0x00, processing it as a string will result in the encrypted content being empty, thus terminating the subsequent process.

I found a few more pictures of failed decryption, and found that they were all without exception, and the first byte at the beginning was 0x00, no wonder the probability of failure was 1/256. The first 16 bytes of all JPEG images (the first 500 bytes are visually inspected) are fixed, so the first byte is fixed after encryption. The first byte is random after the WEBP format is encrypted, so of course there will be no problem.

Problem solving

Since this part of the code has existed for 9 years, and I am not very familiar with this part of the code, adhering to the principle of minimal change, but the case of empty string verification has been removed, and the empty characters can be verified in advance.

int AES::strToUChar(const char *ch, unsigned char *uch, int len) {

    int tmp = 0;

    if (ch == NULL || uch == NULL)

        return -1;

//    if (strlen(ch) == 0)

//      return -2;




    while (len) {

        tmp = (int) *ch;

        *uch++ = tmp;

        ch++;

        len--;

    }

    *uch = (int) '\0';

    return 0;

}

Later, when I was sorting it out, I found that there was another charToHex method, which actually had the same code, and the same two lines had been commented out, and it seems that the previous people also found this problem, but they didn't solve the same problem in both places.

After the code was changed, it was packaged for testing and found that the issue did not occur again. In the end, I am very happy to have successfully solved this problem.

summary

From this incident, it is perfectly verified that "a problem is often the accumulation of multiple small irregularities or errors". Each time the code modifier didn't cause a big problem from his own point of view, and on his own, it wasn't completely unreasonable.

If the code is not written in a standardized manner, leaving security risks, although it will not be exposed at the time, when all the risk problems are put together, it will cause the final "disaster".

I have also gained the following lessons:

For the verification of the input parameters, the verification should be carried out in advance, and reasonable measures should be taken in the event of "illegal entry parameters". For example, by telling the caller in the form of a return value or flag bit that it really can't cause a crash, the problem can be exposed early.
The underlying module should be versatile, not just the situation at the time, and the module may be used in multiple situations in the future;
It is necessary to have a sense of risk, do not expand the risk problem, decrypt and encrypt the same file multiple times, and there will be mistakes on top of mistakes.
When solving a mistake, you need to see if there are any similar errors, which can be corrected as well.

Author: Jinzhi

Source-WeChat public account: Alibaba Cloud Developer

Source: https://mp.weixin.qq.com/s/lwJASuZ3BZldK6a-UHXK_g

A 9-year-old low-level bug discovery process