OPPO's big data diagnosis platform "Compass" is officially open source

1. Background

The OPPO big data platform currently runs more than 20 service components, stores more than 1 EB of data, and executes nearly one million offline tasks and thousands of real-time tasks for more than 1,000 data developers and analysts. This scale brings system complexity. On the one hand, users are often left scratching their heads over the status of their tasks: whether the issue is performance, parameter configuration, or even a common permission error, they have to ask the platform team for a specific solution. On the other hand, the platform team faces a wide variety of complex tasks, and O&M staff must locate and troubleshoot problems across long task links and many component logs, which puts them under heavy pressure. Real-time task monitoring and diagnosis is therefore urgently needed, both to help users quickly locate abnormal tasks and receive specific suggestions and optimization plans, and to bring "zombie" and otherwise unreasonable tasks under control, thereby reducing costs and increasing efficiency. Our survey found no mature open-source task diagnosis platform in the industry, so we developed our own big data diagnosis platform. Tasks optimized on the basis of its diagnoses now exceed 20,000 instances per week, and the results have been good.

"Compass" is an open-source project based on OPPO's internal big data diagnosis platform, which can be used to diagnose big data tasks running on scheduling platforms such as DolphinScheduler and Airflow. We hope to give back to the open source community through "Compass", and hope that more people will participate to jointly solve the pain points and problems of task diagnosis.

2. Core functions of Compass

Compass currently supports the following functions and features:

Non-intrusive, instant diagnosis: you can try out the diagnostics without modifying your existing scheduling platform.
Supports a variety of mainstream scheduling platforms, such as DolphinScheduler, Airflow, or self-developed schedulers.
Supports log diagnosis and parsing for multiple versions of Spark and for Hadoop 2.x and 3.x tasks.
Supports workflow-layer exception diagnosis, identifying various failures and baseline duration anomalies.
Supports engine-layer exception diagnosis, covering 14 exception types such as data skew, large table scans, and memory waste.
Supports custom log matching rules and adjustable exception thresholds, so diagnosis can be tuned to actual scenarios.

Overview of diagnostic types supported by Compass:

(1) Non-intrusive, instant diagnosis

Here we take the DolphinScheduler scheduling platform as an example.

Architecturally, MasterServer is mainly responsible for splitting DAGs into tasks, monitoring task submission, and persisting task instance data to the database, while WorkerServer is mainly responsible for executing tasks and providing log services, including remote log viewing in the UI. There are two ways to obtain the task metadata and related logs needed for diagnosis: listen for task status events in MasterServer, or subscribe to the MySQL binlog. To minimize modifications to DolphinScheduler, we took the second approach.

So we only need to create a workflow in DolphinScheduler, run it, and wait for the run to finish; task failures and other exceptions then appear in Compass.

Compass is decoupled from the scheduling platform, diagnoses tasks as soon as they finish running, and provides a rich UI. If you do not need the UI, you can query Compass's diagnosis metadata directly and display it wherever you need it.
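The binlog-subscription approach described in this section can be illustrated with a minimal sketch. It assumes that some binlog client has already decoded row-change events into plain dictionaries; the event layout, the table name `t_ds_task_instance`, and the terminal state codes are assumptions made for illustration, not Compass's actual implementation.

```python
from typing import Iterable, Iterator

# Assumed values: the table name and the terminal state codes are illustrative
# and should be checked against the scheduler's actual schema.
TASK_TABLE = "t_ds_task_instance"
TERMINAL_STATES = {6, 7}  # assumed: 6 = failure, 7 = success


def finished_tasks(events: Iterable[dict]) -> Iterator[dict]:
    """Yield the new row image of every update that moves a task into a terminal state."""
    for event in events:
        # Ignore changes to other tables and anything that is not a row update.
        if event.get("table") != TASK_TABLE or event.get("type") != "update":
            continue
        before, after = event["before"], event["after"]
        was_done = before.get("state") in TERMINAL_STATES
        is_done = after.get("state") in TERMINAL_STATES
        if not was_done and is_done:
            yield after
```

Because the diagnosis side only reads change events, nothing in the scheduler itself has to be modified, which is the point of choosing this approach.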

(2) Workflow layer exception diagnosis

For task instances in the workflow layer, common issues fall into two categories. The first is failed tasks: first-run failures, final-run failures, and long-term failures. The second is tasks with abnormal timing: baseline time anomalies, baseline duration anomalies, and long running times.

Diagnose failed tasks

Users often ignore a first failure, or simply increase the number of retries; left unattended, it can eventually become a final failure. Compass records and diagnoses the cause of every failure, which helps users locate problems quickly and makes it possible to find the root cause when a failure is traced back later. For tasks that have been failing for a long time, users need to be notified to fix or clean them up so that resources are not wasted.

Diagnose unusually time-consuming tasks

For tasks with SLA requirements, Compass checks two things. A baseline time anomaly means the task ended earlier or later than its historical normal end time. A baseline duration anomaly means the task ran much longer or shorter than its historical normal running time. For long-running tasks, such as large tasks that run for several hours, both the user and the platform need to analyze whether the cause lies in the task itself or in the platform.
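The two baseline checks can be sketched as follows. This is a simplified illustration, assuming the baseline is the median of recent historical runs; the function name, the tolerance, and the ratio threshold are hypothetical and are not taken from Compass.

```python
from statistics import median


def baseline_anomalies(history_end, history_duration, end, duration,
                       end_tolerance=1800.0, duration_ratio=2.0):
    """Compare one run against the median of its historical runs.

    End times are seconds since midnight; durations are in seconds.
    Returns a list containing "baseline_time" and/or "baseline_duration".
    """
    anomalies = []
    base_end = median(history_end)
    base_duration = median(history_duration)
    # Baseline time anomaly: the run ended much earlier or later than usual.
    if abs(end - base_end) > end_tolerance:
        anomalies.append("baseline_time")
    # Baseline duration anomaly: the run was much longer or shorter than usual.
    if duration > base_duration * duration_ratio or duration < base_duration / duration_ratio:
        anomalies.append("baseline_duration")
    return anomalies
```

A median is used here so that a single unusually slow historical run does not shift the baseline; other statistics would work equally well in this sketch.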

(3) Spark engine layer exception diagnosis

For Spark tasks, common problems fall into three categories: runtime errors, runtime efficiency problems, and resource usage problems.

Diagnose runtime errors

Common engine-layer errors include SQL failures, shuffle failures, and memory overflow. Such errors leave distinctive traces in the logs, so they can be extracted and classified by keyword, and the existing knowledge base can then provide users with specific solutions, improving user experience and efficiency.

Compass provides rules for analyzing SQL failure logs, which usually involve operation permissions, nonexistent databases or tables, and syntax errors. For permission errors, it can point users directly to applying for the missing permission.

Shuffle problems can seriously slow tasks down or even make them fail, so they deserve particular attention. If you do not have a better solution at the moment, you can refer to OPPO's open-source high-performance remote shuffle service.

Memory overflow is another frequent cause of task failure. Compass extracts the key log lines for analysis and advises users to optimize their memory configuration parameters.

Beyond these, Compass provides more than 40 log identification rules with suggestions, and you can add your own rules for your scenarios.
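Keyword-based log classification of this kind can be sketched with a small rule table. The rule names, patterns, and advice strings below are hypothetical examples built on commonly seen Spark and Hadoop error messages; they are not Compass's built-in rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRule:
    name: str
    pattern: re.Pattern
    advice: str


# Hypothetical rules for illustration only.
RULES = [
    LogRule("permission_denied",
            re.compile(r"Permission denied|AccessControlException"),
            "Apply for the missing permission."),
    LogRule("table_not_found",
            re.compile(r"Table or view not found"),
            "Check the database and table names."),
    LogRule("shuffle_failure",
            re.compile(r"FetchFailedException"),
            "Inspect shuffle stability on the cluster."),
    LogRule("memory_overflow",
            re.compile(r"java\.lang\.OutOfMemoryError"),
            "Review the driver and executor memory settings."),
]


def diagnose(log_lines, rules=RULES):
    """Return (rule name, advice) for each rule matching at least one line, in rule order."""
    hits = []
    for rule in rules:
        if any(rule.pattern.search(line) for line in log_lines):
            hits.append((rule.name, rule.advice))
    return hits
```

Keeping the rules as data rather than code is what makes it easy to extend the set for a particular environment.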

Diagnose runtime efficiency exceptions

When a task takes a long time or suddenly slows down, users cannot directly tell whether the cause is the task itself, the scheduling platform, or the computing engine. Troubleshooting the Spark engine generally requires expert analysis of the Spark UI, which is not intuitive. Compass comprehensively detects problems that affect engine execution efficiency, including large table scans, data skew, long-tail Tasks, global sorting, OOM risk, abnormally slow Jobs/Stages, HDFS stuttering, and too many speculative tasks.

Large table scanning

Compass counts the rows scanned from tables by each SQL statement and presents the result in a table. Without a partition filter, a query may scan the full table; in that case Compass reminds users to optimize the SQL, which avoids memory overflow and cluster impact and improves running efficiency.

Data skew

Compass examines the amount of data processed by each Task to determine whether the data is skewed. Data skew can cause Task memory overflow, low utilization of computing resources, and job execution times that exceed expectations.
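One common way to express such a check is to compare the largest Task input with the median Task input of the same Stage. The sketch below takes that approach; the function name and both thresholds are assumptions for illustration, and Compass's own rule may differ.

```python
from statistics import median


def is_skewed(task_bytes, ratio_threshold=10.0, min_bytes=64 * 1024 * 1024):
    """Flag a stage as skewed when its largest task dwarfs the median task.

    task_bytes: bytes processed by each task of one stage.
    min_bytes: ignore stages whose largest task is small anyway.
    """
    if not task_bytes:
        return False
    largest = max(task_bytes)
    if largest < min_bytes:
        return False
    mid = median(task_bytes)
    if mid == 0:
        # Most tasks processed nothing while one processed a lot.
        return True
    return largest / mid >= ratio_threshold
```

The `min_bytes` floor keeps tiny stages from being reported, since a large ratio between two very small inputs is rarely worth a user's attention.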

Task long tail

Compass measures the running time of every Task and displays it in a bar chart by Stage, so users can see which Stage has an abnormal execution time. The usual cause is that a Task reads too much data or reads it too slowly. If too much data is read because of data skew, handle it as data skew. If HDFS is stuttering at the same time, data reads will be slow and the cluster needs to be investigated.

Global sort exception

Users often use sort functions in SQL without a partitioning clause, which results in a global sort. If a single Task ends up processing all the data, Compass recommends repartitioning to avoid wasting resources and hurting running efficiency.

OOM early warning analysis

Compass detects the proportion of memory taken by SQL broadcasts. When the broadcast data is too large, the driver or executors are at risk of OOM; users should be reminded to disable broadcasting or remove forced broadcast hints, and to request more memory if necessary.

Job/stage takes an unusually long time

Compass calculates the actual computing time and the idle time of each Job/Stage. A large idle share generally appears when resources are insufficient, so the cluster's resources need attention.
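As a rough sketch of how computing time and idle time might be separated, the function below treats a Stage as busy whenever at least one of its Tasks is running and as idle otherwise. That definition, and the function name, are assumptions for illustration rather than Compass's actual formula.

```python
def idle_ratio(stage_start, stage_end, task_intervals):
    """Fraction of a stage's wall-clock time during which no task was running.

    task_intervals: iterable of (start, end) timestamps, in the same unit as
    stage_start and stage_end.
    """
    total = stage_end - stage_start
    if total <= 0:
        return 0.0
    busy = 0.0
    cursor = stage_start  # everything before the cursor is already accounted for
    for start, end in sorted(task_intervals):
        start = max(start, cursor)
        end = min(end, stage_end)
        if end > start:
            busy += end - start
            cursor = end
    return (total - busy) / total
```

Under this definition, a Stage that spends most of its time waiting for executors shows a high idle ratio even though each individual Task is fast.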

HDFS stuttering

When HDFS stutters, Tasks read data more slowly and execution efficiency suffers, so the running status of the HDFS cluster needs attention.

Too many speculative tasks

Speculative execution means that when a Task in a Stage runs noticeably longer than the other Tasks in the same Stage, a copy of that Task is launched on another Executor; whichever copy finishes first supplies the result, and the other copy is killed. When many tasks are speculated, you need to pay attention to the running status of the cluster.
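A simple way to quantify "too many" is the share of Task attempts in a Stage that were launched speculatively. The sketch below assumes each attempt is a dictionary with a boolean `speculative` field; that field name and the threshold are hypothetical.

```python
def speculative_ratio(task_attempts):
    """Share of task attempts in a stage that were launched speculatively."""
    if not task_attempts:
        return 0.0
    speculative = sum(1 for attempt in task_attempts if attempt.get("speculative"))
    return speculative / len(task_attempts)


def too_many_speculative(task_attempts, threshold=0.2):
    """Flag a stage whose speculative share exceeds the (assumed) threshold."""
    return speculative_ratio(task_attempts) > threshold
```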

Diagnose abnormal resource usage

Users are often unsure how much CPU and memory their tasks actually use and how much to request. Compass visualizes the proportion of CPU and memory in use, making it easy to tune resource configuration parameters and save resource costs.

Compass also provides GC log analysis to show whether GC caused performance problems during execution.
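The CPU and memory usage proportions mentioned above can be computed in several ways. The sketch below uses one plausible definition: CPU time consumed by Tasks divided by the core-seconds that were allocated, and peak memory divided by allocated memory. The function names and the waste threshold are assumptions for illustration.

```python
def cpu_utilization(task_cpu_seconds, executor_cores, executor_seconds):
    """CPU seconds consumed by tasks divided by core-seconds allocated."""
    available = executor_cores * executor_seconds
    if available <= 0:
        return 0.0
    return sum(task_cpu_seconds) / available


def memory_utilization(peak_bytes, allocated_bytes):
    """Peak memory actually used divided by memory allocated."""
    if allocated_bytes <= 0:
        return 0.0
    return peak_bytes / allocated_bytes


def is_wasteful(utilization, threshold=0.4):
    """Flag usage below the (assumed) threshold as a candidate for downsizing."""
    return utilization < threshold
```

For example, an application that holds 4 cores for 60 seconds but spends only 60 CPU seconds in Tasks has a CPU utilization of 0.25 under this definition.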

(4) One-click diagnosis, report overview and other functions

In addition to the functions above, Compass provides one-click diagnosis, which gives users a detailed diagnostic report, as well as report overview data and a whitelist function.

3. Compass technical architecture

Compass consists mainly of a module that synchronizes workflow-layer task metadata, a module that synchronizes Yarn/Spark App metadata, a module that associates workflow-layer and engine-layer App metadata, a workflow task anomaly detection module, an engine-layer anomaly detection module, and a Portal display module.

Overall architecture diagram

The overall architecture is divided into three layers:

The first layer connects to external systems, including the scheduler, Yarn, HistoryServer, and HDFS, and synchronizes metadata, cluster status, running environment status, logs, and so on to the diagnosis system for analysis;
The second layer is the architecture layer, including the data collection, metadata association & model standardization, anomaly detection, and diagnostic Portal modules;
The third layer is the basic component layer, including MySQL, Elasticsearch, Kafka, Redis, and other components.

Specific module process stages:

(1) Data collection stage: synchronize workflow metadata such as users, DAGs, jobs, and execution records from the scheduling system to the diagnosis system; periodically synchronize App metadata from the Yarn ResourceManager and Spark HistoryServer, and record the storage path of each job's running metrics, which later data processing stages build on;

(2) Data association & model standardization stage: the separately collected workflow execution records, Spark Apps, Yarn Apps, cluster running environment configuration, and other data are associated with one another through the ApplicationID. At this point the workflow-layer and engine-layer metadata are linked, yielding the standard data model (user, dag, task, application, clusterConfig, time);

(3) Workflow layer & engine layer anomaly detection stage: with the standard data model in hand, the workflow anomaly detection process runs against it. The platform maintains a data governance knowledge base accumulated over many years and loads it onto the standard model. Using heuristic rules, combined with the cluster state and the running environment state, it mines the model's metric data and logs for anomalies and produces the abnormal results for the workflow layer and the engine layer;

(4) Business view: store and analyze the data, and provide a user task overview, workflow-layer task diagnosis, and engine-layer job Application diagnosis. The workflow layer displays exceptions caused by the scheduler executing tasks, such as task failures, loopback tasks, and baseline deviation tasks; the computing engine layer displays the duration, resource usage, and runtime problems caused by Spark job execution.
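The association step in stage (2) can be sketched as a join on the application ID. The example below assumes the IDs are found by scanning the scheduler's task log for the usual Yarn `application_<timestamp>_<sequence>` pattern; the record layout and the function name are hypothetical, and only some fields of the standard model are shown.

```python
import re

# Yarn application IDs follow the pattern application_<clusterTimestamp>_<sequence>.
APP_ID = re.compile(r"application_\d+_\d+")


def associate(task_run, scheduler_log, yarn_apps, spark_apps):
    """Link one workflow task run to the Yarn/Spark apps named in its log.

    task_run: dict with "user", "dag", and "task" keys (assumed layout).
    yarn_apps / spark_apps: lists of dicts that each carry an "id" key.
    """
    yarn_by_id = {app["id"]: app for app in yarn_apps}
    spark_by_id = {app["id"]: app for app in spark_apps}
    records = []
    seen = set()
    for app_id in APP_ID.findall(scheduler_log):
        if app_id in seen:  # the same ID usually appears many times in a log
            continue
        seen.add(app_id)
        records.append({
            "user": task_run["user"],
            "dag": task_run["dag"],
            "task": task_run["task"],
            "application": app_id,
            "yarn": yarn_by_id.get(app_id),    # None if not synchronized yet
            "spark": spark_by_id.get(app_id),
        })
    return records
```

Each returned record ties one workflow-layer task run to one engine-layer application, which is the shape the later anomaly detection stage works on in this sketch.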

4. DolphinScheduler & Compass

DolphinScheduler is a distributed, extensible open-source workflow orchestration platform with a powerful visual DAG interface. It covers a rich set of usage scenarios, offers more than 30 task types such as Spark, Hive, and Flink, and is highly reliable and scalable. After years of practice and accumulation, DolphinScheduler has become a mature open-source project with a broad user base.

(1) Deployment experience

Here we use DolphinScheduler (version 2.0.6) as an example of how to integrate Compass quickly. If you have not deployed DolphinScheduler yet, refer to its official deployment guide. If you are already using DolphinScheduler, you only need to deploy Compass. Compass supports both stand-alone and cluster deployment; to try it out quickly, use the stand-alone mode. Compass depends on Kafka, Redis, ZooKeeper, and Elasticsearch, which must be installed in advance. Once the dependent services are ready, you can deploy Compass with the deployment script:

Code compilation

Modify the configuration

One-click deployment

(2) Examples of use

First, create a project in DolphinScheduler.

Then create a workflow for a Spark task.

Next, bring the workflow online and run it.

Open the Compass web UI (the default path is -) and log in with your DolphinScheduler account and password; Compass automatically synchronizes DolphinScheduler user information.

Finally, open the task running page to see the diagnosis information for all abnormal tasks.

5. Compass open source planning

Compass focuses on locating and analyzing problems in offline scheduling tasks and computing engines, and uses a rich knowledge base to provide users with optimization solutions that reduce costs and increase efficiency.
The part open-sourced so far mainly covers problem diagnosis for task workflows and the Spark engine layer; exception and resource diagnosis for Flink tasks will be released soon.
In the future, deeper algorithms and diagnostic models will be introduced to remove the reliance on hand-set rules and thresholds, making anomaly diagnosis more intelligent.

6. Participation and contribution

GitHub project name: cubefs/compass

Contributions are welcome. If you have requirements or suggestions, submit an issue on GitHub and we will respond promptly.

About Andes Smart Cloud

OPPO Andes Smart Cloud (AndesBrain) is a pan-terminal intelligent cloud for individuals, families, and developers, committed to "making terminals smarter". As one of OPPO's three core technologies, Andes Smart Cloud provides device-cloud collaborative data storage and intelligent computing services, and serves as the "digital intelligence brain" for connected devices.