

Author: Everyone is a Product Manager
To break down data silos and create greater data value, Alibaba designed OneEntity to provide global data and services. This article examines why OneEntity was created and the value it brings, and then walks through the OneEntity system.

In the previous articles, you followed Straw Hat Boy through the construction plan of the Ali Data Center. Now let's unpack the OneEntity system together.

1. Data silos

Alibaba runs multiple business lines, including e-commerce, finance, advertising, culture, education, entertainment, equipment, and social networking. Its data covers both domestic and overseas markets, and its data scenarios span online and offline data about people, goods, and locations, as well as logistics, dining, consulting, film and television, travel, reading, music, and health.

Data about people alone includes business account information, PC cookies, mobile device identifiers such as IMEI and IDFA, and identity attributes.

As people's Internet behavior diversifies, when hundreds of billions of pieces of entity data are generated every day and belong to different business units, the data easily becomes siloed.


Straw Hat Boy's take: I never used to understand the talk about data islands. I had built a OneData system, connected the data of every business line, and fully taken over the data at the ODS layer. With all the data gathered in one place, why would anyone still talk about data islands?

It wasn't until I actually started building user portraits that I found the underlying indicator system is usually built for each business line directly, with little correlation between business lines. This is a consequence of business boundaries. For example, if you are an operator at Taobao, would you pay attention to DingTalk's indicator system?

The answer is obviously no.

This leaves a break in the data. Looking only at the underlying indicator layer, Taobao staff cannot see how users behave in DingTalk. So if I, as a Taobao employee, want to know not only a user's shopping behavior on Taobao but also their behavior and habits in DingTalk, Alipay, Youku, and so on, how can I find out?

2. Data can only truly generate value when it is integrated

In order to break down data silos and create greater data value, Alibaba designed OneEntity to provide global data and services. The OneEntity system mainly includes four categories: unified entities, global tags, global relationships, and global behaviors.


1. OneEntity

Several related entities are grouped together and called a OneEntity. OneEntities can be divided into general-quality, high-quality, and high-value OneEntities.
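How might several identifiers end up grouped into one OneEntity? The article does not describe Alibaba's actual method, so the following is only a minimal sketch that uses a generic union-find (disjoint-set) structure. The identifier strings and the `links` pairs are made up for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure over hashable identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(x, x)
        # Walk up to the root, shortening the path as we go (path halving).
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


# Hypothetical observations: each pair means two identifiers were seen together,
# for example an account logging in on a device.
links = [
    ("taobao_account:u1", "pc_cookie:c9"),
    ("taobao_account:u1", "imei:860000000000001"),
    ("dingtalk_account:d7", "imei:860000000000001"),
    ("youku_account:y3", "idfa:AAAA-BBBB"),
]

uf = UnionFind()
for a, b in links:
    uf.union(a, b)

# Collect every identifier under its root: each group is one candidate OneEntity.
groups = defaultdict(set)
for identifier in list(uf.parent):
    groups[uf.find(identifier)].add(identifier)

one_entities = sorted(sorted(group) for group in groups.values())
```

In this toy data, the Taobao and DingTalk accounts land in the same group because they share a device identifier, while the Youku account stays in a group of its own.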

2. GProfile (global tags)

GProfile labels OneEntities based on the aggregated data. In the OneEntity system, the most common problem is how to label OneEntities and find the high-quality, high-value ones.

This depends on the ability to extract labels, so how does Alibaba extract them?


(1) Effective

On the one hand, we took the initiative to consult professors in demography, sociology, and other disciplines to learn the theory related to "people".

On the other hand, we investigated the label classification systems of many industries to draw on their strengths.

Finally, the three-dimensional portrayal of "people" is divided into two parts: "people's core attributes" and "people's yearning and needs", which include four categories:

The core attributes of human beings can be divided into natural attributes and social attributes.

  • Natural attributes: the physical existence of a person and its characteristics. These are innate and generally do not change much through human intervention. For example, "gender", "zodiac", "age", "height", "weight", etc.
  • Social attributes: the sum of all social relations that people form through practical activities. Once a person enters society, they acquire social attributes. For example, economic status, family status, social status, political and religious affiliation, geographical location, values, etc.

People's yearning and needs can be divided into interest preference and behavioral consumption preference.

  • Interest preference: the inner psychological yearning for, and outward behavioral expression toward, non-material objects. It is an instinctive inner preference and has no necessary connection with material things. For example, longing for love, needing security, hating dirty environments, etc.
  • Behavioral consumption preference: people's demand for, and outward behavior toward, material objects. It spans many industries and is closely tied to the material world. For example, preferences in the maternal and infant, beauty, cleaning and care, and home improvement industries.

On the basis of these four categories, we further subdivide them into secondary and tertiary classifications according to different business forms.
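The two parts and four categories above can be written down as a small nested structure. This is only an illustrative sketch: the names `TAXONOMY` and `category_path` are hypothetical, the example labels are taken from the text, and the secondary and tertiary levels are left out because the article does not list them.

```python
# Two parts -> four first-level categories -> example labels from the article.
TAXONOMY = {
    "core attributes": {
        "natural attributes": ["gender", "zodiac", "age", "height", "weight"],
        "social attributes": [
            "economic status",
            "family status",
            "social status",
            "geographical location",
            "values",
        ],
    },
    "yearning and needs": {
        "interest preference": [
            "longing for love",
            "needing security",
            "hating dirty environments",
        ],
        "behavioral consumption preference": [
            "maternal and infant",
            "beauty",
            "cleaning and care",
            "home improvement",
        ],
    },
}


def category_path(label):
    """Return (part, category) for a known label, or None if it is not listed."""
    for part, categories in TAXONOMY.items():
        for category, labels in categories.items():
            if label in labels:
                return (part, category)
    return None


first_level_categories = [c for cats in TAXONOMY.values() for c in cats]
```

For instance, `category_path("age")` resolves to the natural attributes under the core attributes, while a label that has not been classified yet returns `None`.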

(2) High speed

Label extraction involves: collecting data; cleaning, denoising, and unifying it; repeatedly experimenting to determine the best algorithm and model; selecting the model's calculation factors and assigning a weight to each; and producing a label quality evaluation report to support acceptance.

We randomly sampled a number of tags in use, estimated the workload and work cycle, and found that extracting one valuable tag took 2 weeks on average.

The main reason for the slowness is the complexity of the extraction process: each label extraction works directly from the underlying base data and rarely reuses the middle-layer data already summarized from it.

The label extraction process is complex, so is there a reference process to follow?


First, at the data source level: build a complete set of data sources. With the OneEntity system at the core, connect all entities related to each OneEntity along with their behaviors, and use them as data sources together with the existing tags.

Second, at the label calculation level: the label extraction logic is distilled into two types, which correspond to the tool-based production processes for preference labels and for classification and prediction labels. These cover business rules such as calculation factors and weights, data sample selection, and model and algorithm selection.

Finally, at the label monitoring level: standardize quality assessment reports, production monitoring, release to production, and other management processes.
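To make the calculation and monitoring levels more concrete, here is a minimal sketch of a preference label computed from weighted calculation factors, followed by a tiny quality report. Everything in it is an assumption for illustration: the factor names, the weights, the threshold, and the two report metrics (coverage and accuracy on a verified sample) do not come from the article, which does not say what Alibaba's tools actually compute.

```python
# Hypothetical business rules: calculation factors and their weights.
WEIGHTS = {"views": 0.2, "favorites": 0.3, "purchases": 0.5}


def preference_score(factors, weights=WEIGHTS):
    """Weighted sum of calculation factors, each assumed normalized to [0, 1]."""
    return sum(weight * factors.get(name, 0.0) for name, weight in weights.items())


def preference_labels(factors_by_industry, threshold=0.5):
    """Return the industries whose score reaches the assumed threshold."""
    return sorted(
        industry
        for industry, factors in factors_by_industry.items()
        if preference_score(factors) >= threshold
    )


def quality_report(predicted, verified):
    """Summarize label quality.

    predicted: entity id -> label, or None when no label was produced.
    verified:  entity id -> manually confirmed label for a small sample.
    """
    total = len(predicted)
    labeled = sum(1 for label in predicted.values() if label is not None)
    checked = [eid for eid in verified if predicted.get(eid) is not None]
    correct = sum(1 for eid in checked if predicted[eid] == verified[eid])
    return {
        "coverage": labeled / total if total else 0.0,
        "sample_accuracy": correct / len(checked) if checked else 0.0,
    }


# Made-up factor values for one OneEntity.
user_factors = {
    "maternal and infant": {"views": 1.0, "favorites": 0.5, "purchases": 0.5},
    "beauty": {"views": 0.5, "favorites": 0.0, "purchases": 0.0},
}
labels = preference_labels(user_factors)

# Made-up predictions and a verified sample for a classification label.
predicted = {"e1": "female", "e2": "male", "e3": None, "e4": "female"}
verified = {"e1": "female", "e2": "female"}
report = quality_report(predicted, verified)
```

The point of the sketch is that once factors, weights, and thresholds are plain configuration, producing another label of the same type means changing the rules rather than writing new code.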

Once a complete set of tool-based products is in place, it takes only about 2 days to mass-produce more than a dozen tags of the same type. This is because much of the code development and model training work is eliminated from the steps of supplementing data sources, determining business rules, selecting data samples, and selecting algorithms and models.

In this process, the participating roles have also changed: work that used to be led by data product managers, data warehouse engineers, and data scientists is now led by business personnel and data analysts, who are more familiar with the business.
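As a rough back-of-the-envelope check on these figures, the per-tag time before and after can be compared directly. Two assumptions are made here that the article does not state: "2 weeks" is read as 14 calendar days, and "more than a dozen" is taken at its lower bound of 12 tags.

```python
from fractions import Fraction

DAYS_PER_TAG_BEFORE = 14   # assumption: "2 weeks" read as 14 calendar days
TAGS_PER_BATCH = 12        # assumption: lower bound of "more than a dozen"
DAYS_PER_BATCH_AFTER = 2   # "about 2 days" from the article

# Average time per tag when a batch of same-type tags is produced with the tools.
days_per_tag_after = Fraction(DAYS_PER_BATCH_AFTER, TAGS_PER_BATCH)

# How many times faster the per-tag production becomes under these assumptions.
speedup = DAYS_PER_TAG_BEFORE / days_per_tag_after

print(f"per-tag time after: {float(days_per_tag_after):.2f} days")
print(f"speedup: about {int(speedup)}x")
```

Under these assumptions the per-tag time drops from 14 days to about a sixth of a day. The comparison only holds for batches of tags of the same type, as the article notes.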

3. GRelation (global relationships)

GRelation finds the relationships between objects. When a OneEntity represents a person, you can find that person's relatives, friends, alumni, colleagues, and so on; when it represents a product, you can find its upstream and downstream products.
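One simple way to picture this is a graph whose edges carry a relation type. The sketch below is illustrative only: the `RelationGraph` class, the entity ids, and the relation names are hypothetical and say nothing about how GRelation is actually stored.

```python
from collections import defaultdict


class RelationGraph:
    """Typed relations between OneEntities, stored as adjacency sets."""

    def __init__(self):
        # entity id -> set of (relation, other entity id)
        self._edges = defaultdict(set)

    def add(self, a, relation, b, inverse=None):
        """Record a relation from a to b, plus the reverse edge from b to a.

        For symmetric relations (colleague, friend) the reverse edge reuses the
        same relation name; otherwise pass the inverse name explicitly.
        """
        self._edges[a].add((relation, b))
        self._edges[b].add((inverse or relation, a))

    def related(self, entity, relation=None):
        """Return related entities, optionally filtered by relation type."""
        return sorted(
            other
            for rel, other in self._edges.get(entity, ())
            if relation is None or rel == relation
        )


graph = RelationGraph()
graph.add("person:alice", "colleague", "person:bob")
graph.add("person:alice", "alumni", "person:carol")
graph.add("product:flour", "downstream", "product:bread", inverse="upstream")

alice_colleagues = graph.related("person:alice", "colleague")
bread_upstream = graph.related("product:bread", "upstream")
```

The same structure serves both cases in the text: people linked by social relations, and products linked to their upstream and downstream products.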

4. GBehavior (global behaviors)

Actions and behaviors related to a OneEntity are linked together to form a user behavior system. For example:

  • Name, email, address, and so on: these are your unique identifiers in the real world, just as OneEntity is your unique identifier in the big data world.
  • Origin, age, political outlook, religious beliefs, and so on: these make up your label portrait in the real world, which corresponds to GProfile in the big data world.
  • Relationships such as parent and child or husband and wife, whether innate or acquired, correspond to GRelation in the big data world.
  • When and where you attended university, when and where you started your first job, when and where you received a certain award and who your referee was, and so on: these correspond to GBehavior in the big data world.

In the big data world, siloed data can thus be integrated and extracted around a single subject.
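As a closing illustration of the GBehavior idea, behaviors recorded by different business lines can be lined up on one timeline once they all carry the same OneEntity id. This is a minimal sketch under assumptions: the `Behavior` record, the ids, the actions, and the dates are all invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Behavior:
    entity_id: str   # the OneEntity the behavior belongs to
    source: str      # business line that recorded it
    action: str
    at: datetime


def timeline(events, entity_id):
    """Return one entity's behaviors across all sources, oldest first."""
    return sorted(
        (event for event in events if event.entity_id == entity_id),
        key=lambda event: event.at,
    )


# Made-up events from three business lines.
events = [
    Behavior("oe_1", "Youku", "watched a film", datetime(2021, 3, 2, 21, 0)),
    Behavior("oe_1", "Taobao", "bought a book", datetime(2021, 3, 1, 10, 30)),
    Behavior("oe_2", "Taobao", "browsed shoes", datetime(2021, 3, 1, 11, 0)),
    Behavior("oe_1", "DingTalk", "joined a meeting", datetime(2021, 3, 2, 9, 0)),
]

oe_1_sources = [event.source for event in timeline(events, "oe_1")]
```

With a shared id, the Taobao operator from the opening example can see the same user's DingTalk and Youku behavior in one place, which is exactly what the indicator layer of a single business line could not provide.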


Straw Hat Boy. Public account: A Data Person's Own Place. Columnist at Everyone is a Product Manager and author of the book "The Road to Big Data Practice: Data Middle Platform + Data Analysis + Product Application", focusing on the field of user portraits.

This article was originally published on Everyone is a Product Manager. Reproduction without permission is prohibited.

The title image is from Unsplash and is licensed under CC0.

The views in this article are the author's own. The Everyone is a Product Manager platform only provides information storage services.