10,000 attributes + 10 billion rows + 100,000-level throughput: designing this architecture is not difficult

　　Some businesses have no fixed storage schema yet hold an enormous number of rows. How should the architecture store and retrieve data for such a business?

　　10,000 attributes, 10 billion rows, throughput on the order of 100,000 requests per second: today I will walk through the architecture design practice for this kind of "classified information" business.

　　I. Background and business introduction

　　1. What is the core data of the classified information platform?

　　A classified information platform has many vertical categories: recruitment, real estate, second-hand goods, used cars, yellow pages, and so on. Each category has many subcategories, but whatever the category, the core data is the "post".

　　2. What are the characteristics of posts in each category?

　　Anyone who has browsed a classified information platform will recognize these characteristics of post data:

　　Attributes differ widely between categories: recruitment posts and second-hand posts have completely different attributes, as do second-hand mobile phones and second-hand home appliances, and there are currently nearly 10,000 attributes;

　　The data volume is huge, on the order of 10 billion rows;

　　Every attribute may be queried, and any combination of attributes may be queried together: recruitment posts are searched by position/experience/salary range, second-hand mobile phones by color/price/model, second-hand goods by refrigerator/washing machine/air conditioner;

　　Throughput is high, hundreds of thousands of requests per second.

　　How do we solve the technical problems of 10 billion rows, 10,000 attributes, multi-attribute combined queries, and 100,000-level query throughput? Let's go step by step.

　　II. The first solution that comes to mind

　　Every company grows from small to large. Setting concurrency and data volume aside for now, let's first look at:

　　How to make attributes extensible;

　　How to support multi-attribute combined queries.

　　Early on, a company's concurrency and data volume are small, so the business problem must be solved first.

　　1. How to meet the storage needs of the business?

　　At the beginning, the business has only one category, recruitment, so the post table might be designed like this:

　　tiezi(tid, uid, c1, c2, c3);

　　2. How do we support combined queries across these attributes?

　　The most obvious approach is to use composite indexes:

　　index_1(c1, c2)

　　index_2(c2, c3)

　　index_3(c1, c3)

　　3. As the business grows, a real estate category is added. How do we solve the storage problem?

　　Several columns can be added to meet the storage needs, so the post table becomes:

　　tiezi(tid, uid, c1, c2, c3, c10, c11, c12, c13);

　　Where:

　　c1, c2, c3 are recruitment attributes;

　　c10, c11, c12, c13 are real estate attributes.

　　Adding columns solves the storage problem.

　　4. How do we meet the query needs?

　　First, attributes from different businesses generally do not need to be queried together, so all we can do is create several more composite indexes to serve queries on the real estate category.

　　It is hard to imagine how many indexes would be needed to cover every two-attribute and three-attribute query.

　　As more and more businesses are added, doesn't this approach stop working?
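　　To put a rough number on the index explosion, here is a minimal sketch that counts attribute combinations. It assumes, pessimistically, one composite index per unordered set of attributes; in practice a composite index can also serve queries on its leftmost prefix, so treat the result as an upper bound. The function name is illustrative and not from the original design.

```python
from math import comb


def composite_index_count(num_attributes: int, max_width: int = 3) -> int:
    """Count unordered attribute combinations of width 2..max_width.

    Assumption: every combination needs its own composite index.
    """
    return sum(comb(num_attributes, k) for k in range(2, max_width + 1))


if __name__ == "__main__":
    # Recruitment only (c1..c3), then recruitment plus real estate (7 columns).
    print(composite_index_count(3))
    print(composite_index_count(7))
    # Pairs alone across 10,000 attributes.
    print(composite_index_count(10_000, max_width=2))
```

　　Three columns already need 4 combinations, seven columns need 56, and pairs alone over 10,000 attributes come to 49,995,000, which is why composite indexes cannot keep up.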

　　III. Vertical splitting is one idea

　　Adding columns is one way to scale and adding tables is another; vertical splitting is a common storage scaling scheme.

　　1. How do we split vertically by business?

　　It can be done like this:

　　tiezi_zhaopin(tid, uid, c1, c2, c3);

　　tiezi_fangchan(tid, uid, c10, c11, c12, c13);

　　2. With different businesses, huge data volume, and high throughput, what problems does vertical splitting run into?

　　These tables and their corresponding services are maintained by different departments. That looks flexible, with each business forming a closed loop, but it is the beginning of a tragedy:

　　How is tid standardized?

　　How are attributes standardized?

　　How do we query by uid (all posts I have published)?

　　How do we query by time (the latest posts)?

　　How do we run cross-category queries (e.g. the homepage search box)?

　　The technology stack sprawls: some data is stored in Mongo, some in MySQL, and some in self-developed storage;

　　Many components are developed more than once;

　　Maintenance costs become excessive;

　　…

　　Think about it: an e-commerce product table cannot be one table per category.

　　IV. Industry best practice: three central services

　　1. Unified post center service

　　A platform startup may have many categories, each with a lot of heterogeneous data to store. There is no need to agonize over splitting versus merging: unifying basic data behind basic services is good practice.

　　Note that this applies to platform-based businesses.

　　How do we store heterogeneous data from different categories in a unified way?

　　Attributes common to all categories are stored uniformly;

　　Attributes unique to a single category are stored using the category type plus a generic JSON attribute column.

　　More specifically:

　　tiezi(tid, uid, time, title, cate, subcate, xxid, ext);

　　Common fields are extracted into their own columns;

　　cate, subcate, xxid, etc. define what ext means;

　　ext stores the individual needs of different lines of business.

　　For example:

　　For a recruitment post, ext is:

　　{"job":"driver","salary":8000,"location":"bj"}

　　For a second-hand post, ext is:

　　{"type":"iphone","money":3500}
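　　The unified post row can be sketched as follows. This is a minimal illustration, not the platform's actual code: the column names follow the tiezi table above, while the `Post` class, the helper names, and the concrete cate/subcate/xxid values are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Post:
    # Common columns shared by every category.
    tid: int
    uid: int
    time: int
    title: str
    cate: int
    subcate: int
    xxid: int
    # Category-specific attributes as JSON text; the meaning of the
    # keys depends on cate/subcate/xxid.
    ext: str


def make_post(tid, uid, time, title, cate, subcate, xxid, attrs: dict) -> Post:
    """Serialize category-specific attributes into the ext column."""
    ext = json.dumps(attrs, separators=(",", ":"))
    return Post(tid, uid, time, title, cate, subcate, xxid, ext)


def read_ext(post: Post) -> dict:
    """Decode ext back into a dict of category-specific attributes."""
    return json.loads(post.ext)


# Hypothetical ids: cate=1 recruitment, cate=3 second-hand.
job_post = make_post(1, 42, 1700000000, "Driver wanted", 1, 100, 0,
                     {"job": "driver", "salary": 8000, "location": "bj"})
phone_post = make_post(2, 43, 1700000100, "iPhone for sale", 3, 200, 0,
                       {"type": "iphone", "money": 3500})
```

　　Both posts live in the same table shape; only the contents of ext differ.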

　　The post data, 10 billion rows, is split across 256 databases. Heterogeneous business data is stored in ext, MySQL is the storage engine, a post center service sits on top, and memcache serves as the cache. This fairly simple architecture solves the business's big problem. It is the classified information platform's core post center service, IMC (Info Management Center).
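　　A minimal sketch of such a post center service follows. The article gives the shard count (256) but not the routing rule, so routing by tid modulo 256 is an assumption, as is the cache-aside read path; plain dicts stand in for the MySQL shards and memcache, and all class and method names are hypothetical.

```python
NUM_SHARDS = 256  # from the article: post data is split across 256 databases


def shard_for(tid: int) -> int:
    """Assumed routing rule: tid modulo the shard count."""
    return tid % NUM_SHARDS


class PostCenter:
    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]  # stand-in for MySQL
        self.cache = {}                                    # stand-in for memcache
        self.db_reads = 0                                  # counts cache misses

    def put(self, tid: int, row: dict) -> None:
        """Write to the owning shard and invalidate the cached copy."""
        self.shards[shard_for(tid)][tid] = row
        self.cache.pop(tid, None)

    def get(self, tid: int):
        """Cache-aside read: try the cache, fall back to the shard."""
        if tid in self.cache:
            return self.cache[tid]
        self.db_reads += 1
        row = self.shards[shard_for(tid)].get(tid)
        if row is not None:
            self.cache[tid] = row
        return row
```

　　Because every lookup is keyed by tid, the service only ever touches one shard per request, which is what keeps this layer simple.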

　　It solves the storage problem for massive heterogeneous data, but new problems arise:

　　Every record stores the ext keys again, which takes a lot of space. Can they be stored more compactly?

　　The category ID alone is no longer enough to describe what is in ext: categories are hierarchical, with no fixed depth. Can ext be made self-describing?

　　Attributes must be addable at any time, so extensibility has to be guaranteed.

　　With the storage of massive heterogeneous data solved, the next step is to solve the extensibility of categories.

　　2. Unified category attribute service

　　How many attributes each business has, what they mean, what constraints their values carry, and so on: coupling all this into the post service is clearly unreasonable. So what should we do?

　　Abstract a unified category and attribute service to manage this information separately, and represent the JSON keys in the post database's ext field as numbers to reduce storage space.

　　The post table stores only meta information and is unaware of business meaning.

　　As shown in the figure above, the keys in the JSON are no longer long strings such as "salary", "location", and "money"; they are replaced by the numbers 1, 2, 3, 4. What these numbers mean, which subcategory they belong to, and the validation constraints on their values are all stored in the category and attribute service.

　　The category table stores business information and constraint information, decoupled from the post table.

　　This table explains the numeric keys in the post center service's ext field:

　　Key 1 represents job, belongs to subcategory 100 under the recruitment category, and its value must be a string of [a-z] characters shorter than 32;

　　Key 4 represents type, belongs to subcategory 200 under the second-hand category, and its value must be a short;

　　With this, the ext attributes in the original post table become:

　　{"1":"driver","2":8000,"3":"bj"}

　　{"4":"iphone","5":3500}

　　Both keys and values now have uniform constraints.
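　　The key compression and value validation described above can be sketched like this. Only the rule for key 1 (job, subcategory 100, [a-z] shorter than 32) comes from the article; the checks for keys 2 and 3 are assumptions made for illustration, and every name in the code is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Attribute:
    key: int                          # numeric key stored in ext
    name: str                         # human-readable attribute name
    subcate: int                      # subcategory that owns the attribute
    check: Callable[[object], bool]   # value constraint


def is_lower_word(value) -> bool:
    """[a-z] only, shorter than 32 characters (the article's rule for job)."""
    return isinstance(value, str) and re.fullmatch(r"[a-z]{1,31}", value) is not None


def is_positive_int(value) -> bool:
    """Assumed constraint for salary; not specified in the article."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


ATTRIBUTES = {a.key: a for a in [
    Attribute(1, "job", 100, is_lower_word),
    Attribute(2, "salary", 100, is_positive_int),   # assumed constraint
    Attribute(3, "location", 100, is_lower_word),   # assumed constraint
]}
NAME_TO_KEY = {(a.subcate, a.name): a.key for a in ATTRIBUTES.values()}


def encode_ext(subcate: int, attrs: dict) -> dict:
    """Replace long attribute names with their compact numeric keys."""
    return {str(NAME_TO_KEY[(subcate, name)]): value for name, value in attrs.items()}


def validate_ext(subcate: int, ext: dict) -> bool:
    """Check that each key is known, owned by subcate, and has a legal value."""
    for key, value in ext.items():
        attr = ATTRIBUTES.get(int(key))
        if attr is None or attr.subcate != subcate or not attr.check(value):
            return False
    return True
```

　　The post service would call something like `validate_ext` before writing, while the post table itself never needs to know what key 1 means.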

　　In addition, if a key's value in ext is validated not by a regular expression but against enumerated values, an enumeration table is needed to restrict the allowed values:

　　This enumeration check says that the value for key=4 (the second-hand mobile phone type field in the attribute table) must not only pass the "short" type check but also be one of a fixed set of enumeration values:

　　{"4":"iphone","5":3500}

　　This ext is invalid: for key=4, value=iphone is not allowed because the attribute is enumerated. The valid form is:

　　{"4":"5","5":3500}
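　　A minimal sketch of the enumeration check, assuming that enumeration value 5 stands for iphone as the corrected example suggests. The other enumeration values and all names in the code are hypothetical.

```python
# key -> {enumeration value -> label}. Value 5 follows the article's
# corrected example; 6 and 7 are hypothetical entries for illustration.
ENUM_VALUES = {
    4: {5: "iphone", 6: "xiaomi", 7: "samsung"},
}

SHORT_MIN, SHORT_MAX = -32768, 32767  # range of a 16-bit short


def check_enum(key: int, value) -> bool:
    """Value must parse as a short and be a registered enumeration value."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return False
    if not SHORT_MIN <= number <= SHORT_MAX:
        return False
    return number in ENUM_VALUES.get(key, {})
```

　　With this check, `check_enum(4, "iphone")` is rejected while `check_enum(4, "5")` is accepted, matching the two ext examples above.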

　　In addition, the category attribute service can record hierarchical relationships between categories:

　　The first-level categories are recruitment, real estate, second-hand...

　　Under second-hand there are second-level categories: second-hand furniture, second-hand mobile phones...

　　Under second-hand mobile phones there are third-level categories: second-hand iPhone, second-hand Xiaomi, second-hand Samsung...

　　…　　
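　　One simple way to record a hierarchy of uncertain depth is a parent pointer per category. The sketch below assumes that representation; the category ids are made up for illustration, and only the category names come from the text.

```python
# cate_id -> (name, parent_id); parent_id is None for first-level categories.
# All ids here are hypothetical.
CATEGORIES = {
    1: ("recruitment", None),
    2: ("real estate", None),
    3: ("second-hand", None),
    30: ("second-hand furniture", 3),
    31: ("second-hand mobile phones", 3),
    310: ("second-hand iPhone", 31),
    311: ("second-hand Xiaomi", 31),
    312: ("second-hand Samsung", 31),
}


def category_path(cate_id: int) -> list:
    """Walk parent pointers up to the root; works for any depth."""
    path = []
    current = cate_id
    while current is not None:
        name, parent = CATEGORIES[current]
        path.append(name)
        current = parent
    return list(reversed(path))


def children(cate_id) -> list:
    """Direct subcategories of cate_id (pass None for first-level categories)."""
    return sorted(cid for cid, (_, parent) in CATEGORIES.items() if parent == cate_id)
```

　　Adding a fourth or fifth level needs no schema change, only new rows with the right parent.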

　　The category service interprets post data, describes the category hierarchy, keeps category attributes extensible, and validates every attribute value. It is the classified information platform's other unified core service, CMC (Category Management Center).

　　Don't the category and attribute services resemble the SKU extension service in an e-commerce system?

　　(1) Category hierarchy relationship, corresponding to the category hierarchy system in e-commerce;

　　(2) Attribute expansion, corresponding to the attributes of each category of commodity SKUs in e-commerce;

　　(3) Enumeration value check, corresponding to the enumeration value of the attribute, such as color: red, yellow, blue.

　　The category service solves key compression, key description, key extension, value validation, and category hierarchy. One problem remains: posts in each category have different attributes and different query needs, so how do we support retrieval and combined retrieval over 10 billion rows and 10,000 attributes?

　　3. Unified search service

　　When the data volume is large and queries hit many different attributes, composite indexes cannot satisfy every query. "External index, unified retrieval service" is a very common practice:

　　The database serves queries by "post id";

　　All other, personalized retrieval needs go through the external index.

　　Operations on metadata and index data work as follows:

　　Forward lookups of a post by tid go directly to the post service;

　　When a post is modified, the post service notifies the retrieval service, which updates the index;

　　Complex queries on posts are served by the retrieval service.
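　　The three rules above can be sketched with two in-memory classes. This is a simplified illustration under stated assumptions: the notification is a direct synchronous call (a real system might deliver it asynchronously), attribute values are assumed hashable, and all class and method names are hypothetical.

```python
class SearchService:
    """External index: (attribute, value) -> set of tids."""

    def __init__(self):
        self.index = {}
        self.indexed = {}  # tid -> attributes as last indexed

    def on_post_changed(self, tid, attrs):
        """Drop the old index entries for tid, then add the new ones."""
        for item in self.indexed.pop(tid, {}).items():
            self.index[item].discard(tid)
        if attrs is not None:  # None means the post was deleted
            for item in attrs.items():
                self.index.setdefault(item, set()).add(tid)
            self.indexed[tid] = dict(attrs)

    def search(self, **conditions):
        """Return tids matching every attribute=value condition."""
        sets = [self.index.get(item, set()) for item in conditions.items()]
        return sorted(set.intersection(*sets)) if sets else []


class PostService:
    """Source of truth; serves tid lookups and notifies the index on writes."""

    def __init__(self, search_service):
        self.rows = {}
        self.search_service = search_service

    def save(self, tid, attrs):
        self.rows[tid] = dict(attrs)
        self.search_service.on_post_changed(tid, attrs)

    def get(self, tid):
        return self.rows.get(tid)
```

　　A caller reads a single post with `PostService.get`, and runs anything more complex, such as type plus color, through `SearchService.search`, then fetches the matching posts by tid.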

　　This search service carries 80% of the classified information platform's requests (whether from PC or app, and whether for the homepage, city page, category page, list page, or detail page, they are ultimately converted into search requests).

　　A brief explanation of this search engine architecture:

　　To handle 10 billion rows, hundreds of thousands of requests per second, and the business lines' varied and complex retrieval queries, scalability is the design focus:

　　A unified proxy layer serves as the entry point; because it is stateless, adding machines scales system performance;

　　A unified result aggregation layer is also stateless, so adding machines scales it as well;

　　In the search kernel retrieval layer, the service and its index data are deployed on the same machine. The service loads the index into memory at startup and serves requests from memory, so access is fast:

　　For data capacity scalability, the index data is horizontally sharded, and increasing the number of shards lets the system keep scaling;

　　For performance scalability on a single shard, the same data is replicated, and in theory adding machines scales performance without limit.

　　System latency: for retrieval over 10 billion posts, including request splitting and merging and posting-list ("zipper") intersection, the aggregation layer returns in about 10 ms.
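　　The retrieval layer can be sketched as sharded inverted indexes with posting-list intersection and a simple scatter-gather merge. This is a toy, single-process illustration: sharding by tid modulo the shard count is an assumption, replicas are not modeled, the fan-out is sequential rather than parallel, and all names are hypothetical.

```python
import bisect
from collections import defaultdict


def intersect_sorted(a, b):
    """Two-pointer ("zipper") intersection of two sorted tid lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


class IndexShard:
    """One horizontal slice of the index, held entirely in memory."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> sorted list of tids

    def add(self, tid, terms):
        for term in terms:
            plist = self.postings[term]
            pos = bisect.bisect_left(plist, tid)
            if pos == len(plist) or plist[pos] != tid:
                plist.insert(pos, tid)  # keep the list sorted and unique

    def search(self, terms):
        # Intersect shortest lists first to keep intermediate results small.
        lists = sorted((self.postings.get(t, []) for t in terms), key=len)
        if not lists:
            return []
        result = lists[0]
        for plist in lists[1:]:
            result = intersect_sorted(result, plist)
        return list(result)


class SearchCluster:
    """Proxy plus aggregation: fan a query out to every shard, merge results."""

    def __init__(self, num_shards):
        self.shards = [IndexShard() for _ in range(num_shards)]

    def add(self, tid, terms):
        self.shards[tid % len(self.shards)].add(tid, terms)  # assumed routing

    def search(self, terms):
        merged = []
        for shard in self.shards:  # a real proxy would query shards in parallel
            merged.extend(shard.search(terms))
        return sorted(merged)
```

　　Each shard only intersects its own posting lists, so adding shards shrinks the lists any one machine has to walk.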

　　For the post business, consistency is not the main concern. The retrieval service periodically rebuilds the full index, so even if the data becomes inconsistent, the inconsistency does not last long.
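　　The periodic full rebuild can be pictured as building a fresh index from the source-of-truth rows and then swapping it in. The sketch below shows only the rebuild step, with a hypothetical (attribute, value) to tid-set index layout; scheduling and the swap itself are left out.

```python
def rebuild_index(rows: dict) -> dict:
    """Build a fresh (attribute, value) -> set-of-tids index from all rows.

    rows maps tid -> attribute dict and stands in for a full scan of the
    post database.
    """
    fresh = {}
    for tid, attrs in rows.items():
        for item in attrs.items():
            fresh.setdefault(item, set()).add(tid)
    return fresh


# Source of truth: post 2 was repainted black, but the live index missed it.
rows = {1: {"color": "black"}, 2: {"color": "black"}}
live_index = {("color", "black"): {1}, ("color", "white"): {2}}  # stale

# The rebuild replaces the stale index wholesale, repairing any drift.
live_index = rebuild_index(rows)
```

　　Any update the index missed is repaired at the next rebuild, which bounds how long an inconsistency can survive.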

　　V. Summary　　

　　This article has run long, so here is a brief summary. Faced with business requirements of 10 billion rows, 10,000 column attributes, and 100,000-level throughput, you can use a metadata service, an attribute service, and a search service:

　　One solves the storage problem;

　　One solves the category decoupling problem;

　　One solves the retrieval problem.

　　Any complex problem is solved step by step. The approach matters more than the conclusion, and I hope you found this useful.