
Standardization practice of buried point data at station B

author:DataFunTalk

User behavior data management is closely tied to behavioral data analysis. Applying them well at the product level presupposes that buried point data is managed and used in a standardized way, a challenge many companies and teams in the industry face. This article shares station B's (Bilibili's) practical experience with buried point standardization.

Full-text catalog:

  1. Background of buried point standardization
  2. Standardized practice strategies
  3. Follow-up prospects
  4. Q&A

Speaker | Li Kui, Senior Data Product Manager, Bilibili

Editor|lumiuu West Mountain Residence

Production Community | DataFun

01

Background of buried point standardization

1. Definition of a buried point


(1) What is a buried point?

Let's start with a practical example. Suppose a user clicks the recommendation button on a page of an APP at a certain moment. This action is recorded, reported in the form of a log, and stored on the server; such log information is what we call a buried point.

The structure of a buried point can be abstracted into five keywords: who, when, where, what, and how. It records a series of user behaviors in an APP, web page, or mini program. In fact, both user behavior on the client and change records in interface logs are types of buried points; these are the familiar client-side and server-side buried points.

(2) The role of buried points

In daily work, one very common category of data consists of counting the APP's daily active users, daily new users, the path flow of new users, and so on; this data is used for analysis. Another category serves the tuning of recommendation algorithms. These are common application scenarios for buried points, all of which rely on data produced by buried point processing.

As shown in the figure above, when a user clicks the recommendation button, a log in JSON format is reported. This log can be divided into two parts and is a typical reported buried point log format: it includes the user ID for locating the user, the timestamp of the operation, and the type of operation, plus parameters required by the business, such as the location of the click and the name of the page.
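
To make the two-part log structure concrete, here is a minimal sketch of what such a reported log line might look like. All field names are illustrative assumptions, not Bilibili's actual schema.

```python
import json

# A hypothetical buried point log for "user clicks the recommendation
# button": public fields (who/when/what) plus business private parameters.
log = {
    # Public fields: locate the user and the action
    "mid": "12345678",            # user ID (assumed field name)
    "ctime": 1670000000000,       # trigger timestamp in milliseconds
    "event_type": "click",        # type of operation
    # Private, business-customized parameters
    "extended_fields": {
        "position": 0,            # location of the click
        "page_name": "homepage",  # name of the page
    },
}

line = json.dumps(log)            # serialized as one JSON log line
parsed = json.loads(line)         # what the server would parse back
print(parsed["event_type"], parsed["extended_fields"]["page_name"])
```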

2. The buried point data link

The application of buried points involves a relatively long link. Here we take station B's buried point application link as an example and break it down briefly.


From left to right, the figure shows the whole process of buried point data from production to consumption and use; underneath, buried point testing and buried point metadata management are part of the supporting data application management.

From the perspective of production, the industry will abstract buried point collection into a reusable buried point data model and integrate it into the SDK, avoiding the need to redefine the format specification of collection every time the business is developed. This SDK is usually divided into iOS, Android, Web, server, etc., as well as offline data imported in batches in the form of backload, which is the production side of the data.

On the data flow side, data enters the transport stream through extraction, transformation, and loading (ETL), and there are two links: some services need to consume the real-time data stream, while others consume offline data. Real-time stream consumption may serve algorithmic recommendation, real-time data analysis, or real-time monitoring dashboards. Offline data flows through the data warehouse layers from ODS to DWD and on to ADS. For offline storage, different media are used, such as the common HDFS and Parquet. Query engines include ClickHouse, Presto, Hive, and other mainstream engines in the industry, and station B also provides a visual web interface for product managers, analysts, operations staff, and others to run analyses.

When a colleague at station B clicks through to view a buried point's common PV and UV data, the front end splices the operation's parameters into the query SQL passed to the query engine. Which query engine to adopt varies across the industry with each company's data volume. The common pattern: if the data volume is relatively small, query Hive directly; if it is larger, Presto may be used. For station B's daily increment of hundreds of billions of records, ClickHouse is currently used as the query engine.
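
As a rough illustration of "splicing parameters into query SQL", the sketch below builds a PV/UV aggregation for one buried point in a ClickHouse-style dialect. The table and column names are assumptions, and a production system would use parameterized queries rather than string formatting.

```python
def build_pv_uv_sql(event_id: str, log_date: str) -> str:
    """Splice front-end parameters into a PV/UV query (illustrative only).

    count() gives PV; uniq(mid) approximates UV by distinct user ID.
    """
    return (
        "SELECT count() AS pv, uniq(mid) AS uv "
        "FROM ods_buried_point_log "                       # assumed table
        f"WHERE event_id = '{event_id}' AND log_date = '{log_date}'"
    )

sql = build_pv_uv_sql("bili.homepage.top-tabbar.0.click", "2023-01-01")
print(sql)
```

In practice the query layer should escape or bind parameters to avoid SQL injection; the string splicing above only mirrors the data flow described in the text.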

To support the entire data link, buried point testing is used to ensure the quality of buried point data. Further down is buried point metadata management, which is the focus of this sharing.

3. Common business problems


In the course of serving and supporting the business, there are many pain points, which can be summarized into two aspects: production design, and consumption and use.

  • Production design

First, the most common problem is attribute naming: different business and development teams have different naming preferences. Some prefer camelCase, some prefer underscores as separators, and some prefer dashes, which leaves the buried points very messy; a unified naming specification is needed.
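
A unified naming specification can be enforced mechanically. The sketch below normalizes all three styles mentioned above to snake_case; the choice of snake_case as the target convention is an assumption for illustration.

```python
import re

def normalize_name(name: str) -> str:
    """Normalize camelCase, dash-case, or snake_case to snake_case."""
    name = name.replace("-", "_")                          # dashes -> underscores
    name = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)   # split camelCase humps
    return name.lower()

for raw in ("pageName", "page-name", "page_name"):
    print(raw, "->", normalize_name(raw))   # all map to "page_name"
```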

The second problem is that when a buried point is reported, some parameters record business attributes, and during actual management the mapped value of a parameter enumeration may not be found. For example, one business originally uses lowercase abc while another uses uppercase ABC; such confusion in the mapping of business values leads to confusion in buried point management.

The third problem is that buried point production involves data products, developers, testers, and online applications, and the more parties involved, the more likely it is that each party's buried point information gets out of sync.

The last problem is using Excel or documents to manage buried points. This works when the data volume is small, but once the volume grows and handovers become frequent, information distortion gets serious.

  • Consumption and use

The first question: operations colleagues often press the product manager, asking which ID corresponds to the buried point on a given page, saying they cannot find it in the data.

The second question: when querying buried point data, which table should be queried, which buried point parameters should be filtered, and what are the private parameters?

The third problem: when managing the data warehouse, storage pressure is very high, and not every business buried point will definitely be used. Some, such as exposure buried points, have relatively low cost-effectiveness, so graded (tiered) storage can be considered.

The fourth question concerns permissions: when operations needs to query the data of a certain buried point, should everything be opened up or only part of it? This requires fine-grained management.

02

Standardized practice strategies

In response to the above problems, station B proposed whole-lifecycle management from buried point production to its retirement from consumption, with buried point metadata management as the focus.

1. Current status and historical iteration of buried point data of station B


At present, there are 10,000+ client buried points in station B's online applications, and the volume of buried point metadata overall is very large. In addition, there are 100,000+ buried points across various web pages. Daily incremental reporting has reached the hundreds-of-billions level, and within a week reaches the trillion level; the data volume is very large.

Historically, the iteration of the buried point has gone through three and a half stages.

  • In the first stage, management was customized per business need, such as collecting views of the playback detail page, designing fields for each buried point, and saving them into a Hive table or log table. The obvious disadvantage is that management becomes chaotic, the data can only be used once, and there is no way to converge it.
  • In the second stage, after realizing that buried points needed abstraction and model design, the event model was adopted as a reference; but after its introduction there was no productized or tooling support, and buried points were still managed by each business.
  • In the third stage, on the basis of the event model, the characteristics of station B's business buried points were abstracted: public fields are uniformly defined and must always be reported, regardless of whether a business's private attributes are reported. This reduces the cost of repeated business development in the SDK, and together with business-customized events forms an abstract prototype.
  • Since 2019, a new stage has begun: the SPMID buried point model is gradually standardized, productized management is consolidated, and tools and model products assist in standardizing definitions.

2. Buried point design

In buried point standardization design, there are four important parts: buried point naming convention, buried point attribute management, tooling support, and process and specification.

(1) Naming convention for buried points

First, consider the naming of buried points. Many businesses used to name each buried point's eventID separately, so a low-threshold tool is needed to manage buried point eventIDs: no agonizing over naming, no arbitrary coding, but IDs with high business readability. In addition, IDs need continuity across versions so they are not confused after a few releases, with good readability on handover and a high degree of portability between versions. Finally, a tool is needed to ensure smooth handover between different maintainers.


In its standardization practice, station B introduced the SPMID (Super-Model) model. The buried point's eventID carries the actual business information: the meaning of each business is abstracted into the buried point ID, and the ID is then managed along its dimensions. The whole is divided into five parts: business ID, page ID, module, position, and buried point type. Standardized naming improves the readability of the whole service, makes problems easy to locate during buried point data governance, and reduces the cost of burying points. Under the same name, different buried point types can be reused.


A practical example is shown in the figure above: on the home page, how should the buried point for the recommendation button be named? It can be named bili.homepage.top-tabbar.0.click, which carries many business meanings. Disassembled, this buried point contains four business granularities plus the meta-information of the buried point type. From coarse to fine, the business granularities are business_id, page_id, model_id, and position_id.

For users, after getting this eventID they can quickly locate which module and position it is, which page (here, the homepage) it lives on, and which business line it belongs to, accurately pinpointing the corresponding business information.

Such a business buried point ID achieves very high business readability for locating and classifying, amortizes the cost of business buried points, and is highly reusable. The click buried point is named click; for the same module, the exposure buried point can be named show.
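
The five-part, dot-separated SPMID layout described above can be parsed and reused mechanically. This is a sketch under the assumption that the layout is exactly business.page.module.position.type as in the example; the segment names are illustrative, not Bilibili's implementation.

```python
from typing import NamedTuple

class Spmid(NamedTuple):
    business_id: str
    page_id: str
    module_id: str
    position_id: str
    event_type: str   # buried point type: click, show, pv, ...

def parse_spmid(event_id: str) -> Spmid:
    """Split a dot-separated SPMID eventID into its five parts."""
    parts = event_id.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 segments, got {len(parts)}: {event_id!r}")
    return Spmid(*parts)

spmid = parse_spmid("bili.homepage.top-tabbar.0.click")
print(spmid.page_id, spmid.event_type)   # homepage click

# Reuse: the same module's exposure buried point differs only in type.
show_event = spmid._replace(event_type="show")
print(".".join(show_event))              # bili.homepage.top-tabbar.0.show
```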

When building buried points, reporting is divided into client SDK reporting and server-side reporting. Client buried points are divided by type, including startup, browsing, exposure, click, playback, system, and other events. Server-side buried points include API request records and log-change information from business tables on the business side.

The above covers some of station B's experience with buried point naming.

(2) Buried point attribute management


When a buried point is reported, a very important part is recording the buried point's attribute parameters. In business terms, buried point attributes are customized information collected about the user. They are divided into three levels:

The first is global public fields, including the buried point event ID, APP information, trigger timestamp, network and carrier at trigger time, operating system version, and so on.

The second is type-general fields abstracted for each type of buried point, such as page-view (PV), playback, or business-content buried points with business characteristics.

These two parts are preset in the SDK and need no secondary processing during business development.

The third part is business-customized private parameters. For example, a poster carousel needs the carousel's bannerID, or the mid and other parameters of the uploader that the poster jumps to; this is parameter information the business customizes for its own use.

There are two other mainstream solutions in the industry: one reserves 10-20 tiled private param slots for collected parameters; the other only distinguishes public attributes from private field attributes. The problem with both is a lack of scalability: although they can quickly support business data collection early on, the later governance cost is relatively high. Long-term practice shows that the public field + type-general field + private field approach is a relatively general and extensible buried point attribute specification, ensuring both flexibility and extensibility.
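
The three-level structure above can be sketched as a simple data model. Field names here (event_id, ctime, and the carousel parameters) are illustrative assumptions following the examples in the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class BuriedPointEvent:
    # Level 1: global public fields, preset in the SDK
    event_id: str
    app_version: str
    ctime: int                                    # trigger timestamp (ms)
    # Level 2: type-general fields (e.g. shared by all click events)
    type_fields: Dict[str, Any] = field(default_factory=dict)
    # Level 3: business-customized private parameters
    private_fields: Dict[str, Any] = field(default_factory=dict)

# Example: a poster-carousel click with its business private parameters
event = BuriedPointEvent(
    event_id="bili.homepage.banner.0.click",
    app_version="7.0.0",
    ctime=1670000000000,
    type_fields={"page_name": "homepage"},
    private_fields={"banner_id": "b42", "mid": "888"},
)
print(event.private_fields["banner_id"])
```

The design choice is that levels 1 and 2 stay fixed in the SDK while level 3 remains an open dictionary, which is exactly what gives the scheme its extensibility.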


Regarding the buried point attribute specification: in a data warehouse, a Hive table, for example, has field-level data standards. In buried point data, a buried point is abstracted into a business table, and the buried point's attributes map to fields of that table; by extension, attributes also need standards.

The management specification falls into three categories: basic description information, attribute quality, and information used to assist attribute management.

The first category, basic properties, commonly covers whether the naming convention uses underscore, dot, or dash connectors, and the data type: string, numeric, or enumeration.

The second category is data quality, including whether the buried point may be null, its enumeration values, and whether defaults should be filled with null or a dash. These are used later in buried point testing; the test rules are based on this part of the buried point's attribute standards.

The third category is metadata set management, including the buried point's version, attribute priority, security level, and so on. At station B, security levels are divided into S, A, B, C, and D, of which S is the most important.
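
To show how attribute standards can drive the test and DQC checks mentioned above, here is a minimal rule-checking sketch. The rule schema (type, nullable, enum keys) is an assumption for illustration, not the actual tool's format.

```python
def validate_attr(value, rule: dict) -> list:
    """Return human-readable violations of one attribute standard."""
    errors = []
    if value is None:
        if not rule.get("nullable", False):
            errors.append("null not allowed")
        return errors
    if not isinstance(value, rule.get("type", object)):
        errors.append(f"expected {rule['type'].__name__}")
    enum = rule.get("enum")
    if enum is not None and value not in enum:
        errors.append(f"{value!r} not in enumeration {enum}")
    return errors

# A standard for one attribute: non-null string restricted to an enumeration
rule = {"type": str, "nullable": False, "enum": ["a", "b", "c"]}
print(validate_attr("a", rule))    # passes: empty violation list
print(validate_attr("X", rule))    # enumeration miss
print(validate_attr(None, rule))   # null violation
```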

(3) Tooling support


We hope to use tools to support the SPMID model and avoid manual work by business colleagues. Inside station B there is a buried point management tool called Polaris.

As the figure shows, this is the buried point event creation module, which abstracts a buried point's business, page, module, position, and type into dimension-table selections. Operations and product staff creating buried points only need drop-down selection and filtering rather than designing a buried point from scratch; if a historical buried point exists, they can quickly copy it and modify some parameter information.

The first part is the naming of the buried point. The second part standardizes the buried point's attributes, including attribute ID, attribute display name, attribute enumeration type, and so on. The third part covers what the business cares about most: reporting timing, whether the buried point needs sampled reporting, and whether collection can be stopped remotely.

Each link and module has a corresponding management list, which is structured and stored in the business table for downstream use.


Taking the module list in the figure as an example, the corresponding buried point modules have standardized names, with English IDs and Chinese meanings mapped to each other.

In use, a single query reveals which product and which business line a module is used in, achieving layer-by-layer drill-down and dimension-table reuse.

(4) Process and specification


Station B divides the entire buried point process and specification into six links and four key participants.

The four key participants work as follows: business colleagues raise requirements and hand them to data product managers; the data product managers abstract the business requirements into a buried point requirements document, called a DRD; they then review the feasibility plan with developers, evaluating by priority and cost; development is scheduled and the requirements go online; and once development passes testing, data analysts take over for analysis.

Beyond the links of these four participants, the six links also include data collection and verification. Developers build the buried points from the requirements document; interface logs of server or client behavior are collected and stored; finally the data product manager or tester performs buried point testing, with the help of a test module in the buried point management tool. After testing is completed, the buried points go into online use. Online scenarios include indicator analysis, algorithm recommendation, output of data warehouse intermediate tables, applications of the warehouse's ADS layer, and data dashboards.

3. Efficiency improvement application based on buried point standardization metadata

In its data standardization practice, station B also applies standardized buried point metadata to efficiency and storage. Standardization is not done for the sake of management norms alone; it has practical application scenarios that realize the value of buried points. These can be summarized into three scenarios: first, data is reported more accurately; second, storage costs become lower; third, queries become more convenient.

(1) More accurate reporting


To make reporting more accurate, there is one very important tool: buried point testing, which can quickly and accurately, semi-automatically or even fully automatically, find where a business's reporting problems are. In buried point design, business requirements are abstracted into a DRD, and this is entered into the structured buried point management tool, which generates rules for test verification or DQC verification, such as enumerations, null values, defaults, and value ranges.

At the same time, sampling is configured on the buried point, and the configuration metadata is sent to the client SDK. Through this link, a tester can scan a code in the test background; the buried point parameters are reported through the SDK, the server receives the buried points, and the buried point logs, including real-time data in Kafka or JSON format, are parsed into the test link.

The test link is divided into two parts: a summary display of reported logs, and analysis of detailed test data. Which buried point rules were triggered on the test machine, which verification rules were hit in each module, which passed and which did not, and why did they fail? Support across the entire link from production to testing improves the quality and efficiency of buried point reporting verification.


In practice, scanning a code with the mobile client APP connects it to the buried point test module, which receives the reporting terminal's buried point data in real time; the data can be mapped to the Chinese names, buried point attributes, DQC rules, and other metadata entered earlier, verified in real time, and summarized into a visual test report. Based on standardized buried point metadata, station B achieves near-real-time verification, covering buried point testing for APP, web, and server.

If buried point data is found missing in the test environment, this link allows it to be quickly backfilled into the buried point management link, achieving standardized, rapid supplementation. This makes reporting more accurate and testing easier and more intuitive.

(2) Storage costs become lower


The pressure of data storage is especially prominent for buried points. Without cost-reduction and efficiency measures for buried point storage, costs run very high, because many businesses take the attitude of "report it first, whether it is useful or not," which means buried point data tends to flood.

Therefore, in the downstream management of buried point source data, databases and tables are partitioned by business, making storage management easier. Intermediate tables are divided by business type: using the business ID, i.e., the first segment of the buried point ID, business sub-tables are built, which gives the DWD and DWS layers of the data warehouse a good basis for layering.

Beyond partitioning by business, the storage cycle can also be reduced. With eventID metadata, buried points are graded into S, A, B, and C levels; different levels correspond to different storage cycles and storage granularities. At the same time, different types of buried points, whether click, exposure, or page browsing, get targeted sampled reporting.

Exposure buried points often have relatively low business value; the focus is usually just the module's exposure PV and UV. Click buried points, by contrast, have higher business value, and buried point types actively triggered by the user, such as click and startup, are distinguished in detail. Different buried point types have different cost-effectiveness, and sampling can be set accordingly, for example 10%, 20%, or 1%; a buried point that has been retired can be switched off through remote configuration.
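
The grade-based retention and type-based sampling described above can be sketched as a small configuration plus a client-side decision. The specific retention cycles and sampling rates below are illustrative assumptions, not station B's actual values.

```python
import random

# Storage cycle per buried point grade (days), per the S/A/B/C grading
RETENTION_DAYS = {"S": 730, "A": 365, "B": 180, "C": 90}

# Sampling rate per buried point type: clicks fully reported,
# low-value exposure events sampled at 10%
SAMPLE_RATE = {"click": 1.0, "pv": 1.0, "show": 0.1}

def should_report(event_type: str, rng: random.Random) -> bool:
    """Client-side decision: report this event per the configured rate."""
    rate = SAMPLE_RATE.get(event_type, 1.0)   # unknown types: full reporting
    return rng.random() < rate

rng = random.Random(42)                        # seeded for reproducibility
reported = sum(should_report("show", rng) for _ in range(10_000))
print("exposure events reported out of 10000:", reported)   # roughly 1000
print("retention for grade B:", RETENTION_DAYS["B"], "days")
```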

(3) Inquiries become more convenient


When doing buried point analysis, we hope to lower the barrier as much as possible, providing tools and an interactive front-end UI for analysts and product managers; this is friendlier and more intuitive.

With buried point metadata prepared, a front-end query page is provided: the user's front-end operations are captured and combined with the metadata module of buried point management and the buried point metadata stored at the DB level; the query SQL is initiated, the result set is returned, and a visual BI display is produced, supporting line charts, bar charts, and other chart types.

In this process, queries on SQL or specific fields rely on up-front partitioning to accelerate the query, reducing cost and improving overall efficiency.

Station B already has products serving business teams. The upper part of the figure is the buried point analysis module, which reads buried point metadata for visual display, including the previously abstracted private attributes of buried points. Two kinds of analysis are possible here: quick screening and filtering, and grouped display. The analysis efficiency achieved depends on the standardized storage and management of buried point data.

03

Follow-up prospects

Looking ahead, station B's recent exploration is automatic distribution based on standardized buried point metadata. The data architecture link commonly built in the industry has two parts: consuming the real-time data stream, often with Kafka; and consuming offline Hive tables to build the ODS, DWD, and DWS warehouse layers. With buried point metadata already connected, the question is whether stream and batch can be managed, consumed, and used in a unified way, so that one distribution configuration takes effect for both real-time and offline, with the calibers on both sides aligned.

For the next step of business distribution, business intermediate tables can be preset: if a business wants to custom-consume certain buried points or certain business data, the division by buried point ID can be used to build an intermediate table, or to consume at the view level, reducing the downstream query cost of reading the entire table.

Finally, through the unified stream-batch link, high-quality real-time consumption of the buried point data stream can be realized for the business side's recommendation algorithms.

That's all for this sharing. Welcome to pay attention to the technical public account of station B for more exchanges.

04

Q&A

Q1: Regarding standardized management of buried points, how do you ensure compatibility between new and old data after going online?

A1: Suppose the web side has one set of buried point management standards while the APP has several different sets, so public parameters and private parameters follow different naming conventions in the reported buried points. There are two solutions. The first is an integration layer in the offline data warehouse: backfill the historically important buried points into the buried point warehouse through offline warehouse loading, add the buried point's eventID field, and perform compatibility processing. The second applies when a business's buried point naming is salvageable: the buried points are non-standard, but not severely so, and can be modified. For such incremental requirements, naming should follow the SPMID module standardization, and history can be made compatible through batch import. In general, the approach splits into two cases according to stock versus increment and the severity of the business's irregularity.

Q2: Will the practice of standardization of buried sites at station B carry out the productization service of ToB? Will there be a commercialization of ToB?

A2: At present, station B only provides these services internally and is unlikely to offer external SaaS services in the short term, but we are happy to exchange ideas with everyone.

Q3: How to understand the sampling report of exposure data? If the exposure sampling click is fully reported, is there a problem with the click rate calculation?

A3: The metadata records the sampling ratio and sends it to the client SDK. For example, with 10% sampling, if 10 events are triggered in total, the SDK reports 1. During analysis a conversion is needed: if reported PV is 10,000 at a 10% sampling rate, the actual PV is 10,000 divided by 10%, that is, the reported PV multiplied by 10. Calculations are converted in this way.
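
The conversion in the answer is a one-line calculation, shown here together with a click-rate computation against the scaled PV (the click count of 500 is a made-up example):

```python
def actual_pv(reported_pv: int, sample_rate: float) -> float:
    """Scale sampled PV back up: actual = reported / sampling rate."""
    return reported_pv / sample_rate

pv = actual_pv(10_000, 0.10)
print(pv)           # 100000.0, i.e. reported PV multiplied by 10

clicks = 500        # clicks are fully reported (sampling rate 1.0)
ctr = clicks / pv   # click rate must use the scaled-up PV, not the reported one
print(ctr)          # 0.005
```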

Q4: How to meet the timeliness of the online demand? How to keep pace with business requirements?

A4: This is the collaborative process described above: how to coordinate all parties to report buried points at the right time in a standardized, unified way. For example, if an operations requirement has already launched but the buried point requirement has not been supplemented, that is itself a process irregularity.

Within station B, when a business plan review is carried out, the business requirements document must contain the buried point requirements document, and the business review covers data collection. When a business module goes online, its buried point collection goes online with it. Managing the intake through process collaboration, with the review process as the gate, has solved these problems.

There is also the issue of synchronizing launches: a historical module may have no buried point collection, in which case a requirement is raised for a specific version and a centralized supplementary collection is done.

Q5: Will the specification design of SPMID be tedious?

A5: If buried points were designed purely by hand, SPMID design would indeed be cumbersome, but station B provides fast copy, one-click copy, and one-click import, so users do not need to design from scratch: click copy, then modify the corresponding module parameters. Currently the SDK can send the corresponding buried point parameter information, and public parameters are all collected automatically. The web side can report automatically; the APP side still needs cross-checks.

That's it for today's sharing, thank you.
