Flink's landing practice in Autohome's real-time computing scenarios

Guest | Wang Gang

Edit | Strict

Powerful enough to develop and run many different kinds of applications, Apache Flink is widely recognized as one of the highest-performing real-time computing engines for big data. Flink has been proven to scale to thousands of cores and terabytes of state while still maintaining high throughput and low latency. Globally, more and more companies have begun to use Flink, and well-known Internet companies in China such as Alibaba, ByteDance, JD.com, and Meituan are using Flink at scale as their distributed big data processing engine.

The industry originally positioned Flink mainly as a stream processor, or stream computing engine. Under the general trend of making big data real time, as practitioners we cannot help asking: what else can Flink do? How can we take full advantage of Flink and build solutions that cover a wider range of real-time problems? And in specific business scenarios, what challenges will we face, and what are the solutions?

With these questions in mind, we invited Wang Gang, a data engineer at Autohome's Smart Data Center, to share Apache Flink's core practices at Autohome. Wang Gang will also present "Flink-based real-time computing platform and real-time data lake ingestion practice" at the QCon+ case study session [Flink's landing practice in real-time computing application scenarios], which we hope will inspire you.

The following is the interview with Wang Gang:

InfoQ: Tell us about what you've been doing lately.

Hello everyone, my name is Wang Gang, and I am currently responsible for the design, development, and maintenance of Autohome's real-time computing platform, real-time access platform, and data lake. After continuous polishing in production, the platforms' usability and job stability have improved greatly; they now serve business lines across the whole company, and the daily processing volume has reached the scale of trillions of records.

InfoQ: What difficulties and pain points did you encounter in achieving these results? What effort did it take to solve them? What takeaways and insights did you gain?

I started working on the real-time computing platform at the end of 2018, and there were indeed quite a few small difficulties along the way. But looking back, even at Autohome's data volume, most of the problems we ran into with Flink arose from how we used it, or were minor troubles caused by misallocated resources; this is thanks to Flink's excellent design and the strong community behind it. For customization requirements, Flink's well-encapsulated compute engine meant we could support them with fairly simple changes, and the harder engine-level problems could be solved with the community's help. There is also a class of environment problems that caused us a lot of trouble, such as a glibc issue that led to a native memory leak in the JVM. We were new to Flink at the time and at one point suspected the Flink engine itself, taking many detours, until we noticed that the number of contiguous 64MB memory segments in the process kept growing over time, which pinpointed the problem.
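As an illustration, here is a minimal hedged sketch (not the exact tooling we used) that counts memory mappings close to 64MB in /proc/&lt;pid&gt;/maps; a count that keeps growing over time is the typical signature of this glibc arena behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/**
 * Hedged sketch: count memory regions close to 64 MB in /proc/<pid>/maps.
 * In our case, a steadily growing number of such regions pointed to glibc
 * malloc arenas rather than to the Flink engine itself.
 */
public class ArenaRegionCounter {
    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(Paths.get("/proc/" + pid + "/maps"));
        long suspicious = lines.stream()
                .map(line -> line.split("\\s+")[0])          // "start-end" address range
                .map(range -> range.split("-"))
                .mapToLong(r -> Long.parseUnsignedLong(r[1], 16)
                        - Long.parseUnsignedLong(r[0], 16))  // region size in bytes
                .filter(size -> size >= 60L * 1024 * 1024 && size <= 64L * 1024 * 1024)
                .count();
        System.out.println("Regions close to 64MB: " + suspicious);
    }
}
```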

Another part of the problems comes from the platform work itself. For example, as the number of users grows, the on-call pressure becomes very heavy, so you constantly have to reflect:

What can users currently not do by themselves that we have to do for them? Can the platform be made to empower users to do it on their own?

Can the platform automatically diagnose the problems users run into most often during on-call and suggest solutions?

How should we prepare for critical, high-assurance jobs?

InfoQ: Flink has been emphasizing stream-batch unification in recent years. What practices and explorations have you carried out in actual business scenarios?

In this regard, we have mainly explored two directions:

When users on our platform develop stream computing jobs with Flink SQL, they can reuse the SQL of an existing batch job with only slight changes. This not only greatly reduces users' learning and development costs, but also unifies the computation logic across stream and batch;

We introduced Iceberg as the unified Table Format at the storage layer, so the same storage can be read by both streaming and batch jobs, in full or incremental mode, to serve batch and stream processing alike (see the sketch after this list).
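To make the second direction concrete, here is a hedged sketch of reading one Iceberg table from Flink in both full (batch-style) and incremental (streaming) mode; the catalog, database, and table names are made up, and the read options follow the Iceberg Flink connector's documented hints.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/**
 * Hedged sketch: one Iceberg table, two read modes. Names and connection
 * details are placeholders; depending on the Flink version, OPTIONS hints may
 * require table.dynamic-table-options.enabled to be set to true.
 */
public class IcebergUnifiedRead {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Iceberg catalog backed by a Hive metastore.
        tEnv.executeSql(
                "CREATE CATALOG iceberg_catalog WITH ("
                        + " 'type'='iceberg',"
                        + " 'catalog-type'='hive',"
                        + " 'uri'='thrift://metastore-host:9083')");

        // Full read: scans the table's current snapshot and then finishes.
        tEnv.executeSql("SELECT * FROM iceberg_catalog.db.user_events").print();

        // Incremental read: keeps consuming newly committed snapshots (runs until cancelled).
        tEnv.executeSql(
                "SELECT * FROM iceberg_catalog.db.user_events "
                        + "/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */")
                .print();
    }
}
```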

For future planning, we will also try using Flink SQL as a batch compute engine, giving full play to the advantages of Flink's stream-batch unification to further empower users and reduce their development costs.
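As a hedged illustration of that plan, the sketch below runs the same aggregation SQL as either a streaming job or a batch job, switching only the EnvironmentSettings; the source and sink tables are placeholders assumed to be defined elsewhere with real connectors.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/**
 * Hedged sketch: identical business SQL executed in streaming or batch mode.
 * The tables ods_page_view and dws_pv_daily are hypothetical and assumed to be
 * registered elsewhere (e.g. via CREATE TABLE with real connectors).
 */
public class SameSqlStreamOrBatch {

    static void run(boolean batchMode) {
        EnvironmentSettings settings =
                batchMode ? EnvironmentSettings.inBatchMode() : EnvironmentSettings.inStreamingMode();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // The business logic stays the same; only the runtime mode differs.
        tEnv.executeSql(
                "INSERT INTO dws_pv_daily "
                        + "SELECT dt, COUNT(*) AS pv "
                        + "FROM ods_page_view "
                        + "GROUP BY dt");
    }

    public static void main(String[] args) {
        run(false); // continuous streaming job over the live source
        // run(true); // the same SQL replayed as a batch job over historical partitions
    }
}
```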

InfoQ: What are you focusing on right now? What are the new hot spots and trends in Flink? What more needs to be done to take full advantage of Flink?

Recently, I have been focusing on the fine-grained resource management feature for tasks in newer Flink versions. Before Flink 1.14, resource management was coarse-grained: we could make good use of resources through slot sharing, but in certain scenarios it still caused unnecessary waste. I think allocating resources at the granularity of a SlotSharingGroup is a good way to address that waste.
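For reference, here is a hedged sketch of what declaring per-group resources looks like with the fine-grained resource management API introduced around Flink 1.14 (FLIP-156); the group names, operator logic, and resource sizes are placeholders, and the cluster may need fine-grained resource management enabled in its configuration.

```java
import org.apache.flink.api.common.operators.SlotSharingGroup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Hedged sketch: declaring resource profiles per slot sharing group instead of
 * relying only on coarse-grained slots. Group names, operator logic, and sizes
 * are placeholders; the feature may need to be enabled in the cluster config.
 */
public class FineGrainedResourcesSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A lightweight group for simple parsing operators.
        SlotSharingGroup light = SlotSharingGroup.newBuilder("light")
                .setCpuCores(0.5)
                .setTaskHeapMemoryMB(256)
                .build();

        // A heavier group for the more expensive downstream operators.
        SlotSharingGroup heavy = SlotSharingGroup.newBuilder("heavy")
                .setCpuCores(2.0)
                .setTaskHeapMemoryMB(1024)
                .setManagedMemoryMB(512)
                .build();

        env.registerSlotSharingGroup(light);
        env.registerSlotSharingGroup(heavy);

        env.fromElements("a", "b", "c")
                .map(String::toUpperCase).slotSharingGroup("light")
                .map(s -> s + "!").slotSharingGroup("heavy")
                .print();

        env.execute("fine-grained-resources-sketch");
    }
}
```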

On the other hand, I am paying close attention to the Flink CDC (Change Data Capture) project. Before Flink CDC was released, we had already built a real-time access and distribution platform on Flink that synchronizes business database data from MySQL, SQL Server, TiDB, and so on. At this year's FFA (Flink Forward Asia) conference, Yun Xie's talk on Flink CDC gave me a lot of inspiration, and I think the usability improvements Flink CDC still needs (schema change handling, whole-database ingestion into the lake) are exactly the problems that our company's data warehouse and business-line users urgently need us to solve.
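For context, below is a minimal hedged sketch of consuming MySQL change data with the Flink CDC connector's MySqlSource; hostnames, credentials, and table names are placeholders, and this is not our platform's actual code.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

/**
 * Hedged sketch: capture MySQL changes (snapshot + binlog) with Flink CDC and
 * print them as Debezium-style JSON. Connection details and table names are
 * placeholders.
 */
public class MySqlCdcSketch {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("mysql-host")
                .port(3306)
                .databaseList("app_db")                     // databases to capture
                .tableList("app_db.orders")                 // tables to capture
                .username("cdc_user")
                .password("cdc_password")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);                    // checkpoints drive binlog offset commits

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
                .print();

        env.execute("mysql-cdc-sketch");
    }
}
```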

InfoQ: In your exploration of Flink, what problems remain unsolved for various reasons and objective constraints?

We encountered the following two obvious problems:

Flink iterates quickly, and compatibility between versions is limited, which makes it difficult for the platform to integrate new Flink versions;

Flink SQL allows users to complete real-time computation development purely in SQL, but SQL's expressiveness is limited. Sometimes users need to write a lot of SQL to build a single real-time dashboard, so the same data gets computed multiple times, which wastes resources. This is really more of a real-time OLAP scenario; we currently use StarRocks to support it, but we hope Flink will offer a one-stop solution that further reduces our maintenance costs.

InfoQ: Finally, what would you say to readers who are interested in Flink and want to learn more about applying it?

I think the way to approach any new technology stack in software development is universal: first find some scenarios to use it in, then dig into how it works with concrete problems in mind, and finally think about how you would implement it yourself. Beyond that, make a habit of summarizing and reflecting. Many students who are new to software development think that once a problem is solved, it is over; in fact, every problem, big or small, is an excellent opportunity to think. Once a problem is solved, we can look back and consider how to help users locate such problems faster, or avoid them entirely, next time. For example, reflect on whether the problem was so hard to locate because the program design was too complex and the processing chain too long.

Guest Profiles

Wang Gang

Autohome Smart Data Center Data Engineer

Graduated from Shenyang University of Aeronautics and Astronautics with a major in Computer Science and Technology. He joined Autohome in 2018, where he redesigned and developed the log collection platform and designed and built a real-time computing platform and real-time access platform based on Apache Flink from scratch. In 2020 he began exploring and implementing a lakehouse architecture, leading the integration and optimization of Apache Iceberg. He enjoys technical exploration, values user-centered thinking, and is good at locating and solving the various tricky problems encountered in his work.
