Data annotator: training AI, replaced by AI|Jiazi Lightyear

Growth and obsolescence go hand in hand.

Author|Ma Hui

Editor|chestnut

Promise and peril coexist, and data labeling practitioners have never felt so conflicted.

Dai Yan, a 30-year-old from Inner Mongolia, started his own business earlier this year, setting up an online labeling team of nearly 30 people. Before that, he did data annotation on a crowdsourcing platform for two years. He could be called a "skilled worker", and he is both hopeful and nervous about the current situation.

He has been following ChatGPT since the beginning of the year. In the surge of AI company registrations, Dai Yan sees an AI industry boom and an entrepreneurial opportunity in data annotation. Tianyancha data shows that 170,000 new AI-related enterprises were registered in the first quarter of this year alone, bringing the current total to 2.67 million.

He imagined that he could ride the industry's growth and build the company up to 100 people. But the current reality hardly supports that expectation: data labeling has quickly gone mainstream. Labeling demand, labeling workers and middlemen have all poured in, and unit prices have fallen.

Just as a construction crew cannot reach the client (Party A) that actually needs the work done and can only take jobs from a contractor, pay falls further each time a project changes hands. He has turned down labeling projects that paid only 30 yuan a day.

At the same time, Dai Yan faces the predicaments of the labeling industry: no career advancement, no contract protection, and nowhere to complain about unpaid wages. He laughed at himself: "We are the data migrant workers of the new era."

But that's not the whole story. The bigger problem is that automated annotation is eating away at the few projects they have. The AI trained by data labelers like Dai Yan is learning to label by itself under human supervision.

Automated labeling will greatly reduce the cost of enterprises and become the most promising direction in the data labeling market.

Dai Yan has had to prepare for the possibility that "AI could completely replace people." He has the team work on both teaching-material annotation, which is a text category, and 3D point cloud annotation. One is text, and the other is images and video. If one kind of project is upended by AI, Dai Yan will immediately move the team over to the other field.

In addition, the team needs to be streamlined. Dai Yan has crossed out the 100-person company he once imagined. He thinks that in the end only a skilled team of about 20 people may remain.

These AIs, trained by data labelers, make them dream of earning more while forcing them to be ready to be disrupted.

1. Annotation, let AI open its eyes to see the world

To make machines understand text, speech, and pictures the way humans do, humans have created a machine learning chain: collecting images and sounds from the physical world, annotating and cleaning the data, and converting it into strings of code that are fed to machines.

AI scholars believe that a three-year-old child has already "shot" hundreds of millions of pictures with their eyes, recognizing the world over and over. So as long as enough data is fed to a machine, it too can learn to read words, parse sentences, and finally understand the deeper meaning behind language.

The annotated image dataset ImageNet contains 15 million images, and it has helped countless AI companies achieve breakthroughs in computer vision, such as face recognition and image search.

To build ImageNet, nearly 50,000 data annotators from 167 countries around the world, all from the crowdsourcing platform Amazon Mechanical Turk (MTurk), worked together for two and a half years.

The labeling requirements are very simple. Common jobs on MTurk are to identify the color of a photo, classify the animals that appear in an image, or draw a box around an object and name it: this is a cake, this is a car, this is a cloud, and so on.

Image/Integer Intelligence

The platform's 200,000 gig workers are spread across Africa and Southeast Asia, where labor costs are low and dedicated "data labeling villages" have even formed. The data they label underpins tech companies' pursuit of AI.

China's millions of labelers are spread across second- and third-tier cities in Guizhou, Shanxi, Shandong, Henan and other provinces, and are gradually moving into counties with even lower labor costs. They either rely on online crowdsourcing platforms or join offline data labeling companies and labeling bases.

Annotation work is divided by scenario into text, image, and speech, corresponding to helping the machine learn to read, recognize images, and understand sound.

Early annotation projects came mainly from Internet companies and focused on speech and text. Now the work is shifting to autonomous driving companies, annotating 3D scenes captured by lidar scanning, such as point cloud annotation, or to more vertical text and speech annotation: supplying teaching-material annotation data for education companies' large models, or proofread medical data for medical institutions' large models.

As AI enters the 2.0 era, ChatGPT has amazed investors, founders and business leaders, and everyone now expects more than rigid recognition of text, speech and image information. They also hope AI can truly understand the connections between things as humans do, recognize subtle differences and the emotions behind actions, and actively distinguish and collect information.

For example, a self-driving car should be able to tell that the object in front of it is an empty, flattened plastic bag rather than a stone of similar color and size. A camera beside a swimming pool should no longer just record what is happening, but understand it and sound the alarm when someone is drowning.

These still rely on data labeling, and they place higher demands on it: more vertical, more accurate, and more economical.

These still rely on data labeling, and higher requirements are placed on labeling—more vertical, more accurate, and more economical.

This is where the craze in the labeling market began.

2. "Too many orders to do"

It is hard to find data that directly shows the surge in new annotation demand, but the surge is not hard to infer. In the first quarter of 2023 alone, China added 170,000 artificial intelligence companies, and any company that uses AI is bound to need data annotation.

The demand quickly passed through to the data labeling market. On the Tieba forum where data labeling practitioners gather, more than a dozen new project recruitment posts appear each day, covering annotation work from text to images and video, including but not limited to text annotation, question review, drone sales video annotation, 2D detection pole and 3D point cloud.

A data annotation worker with many years in the industry noticed that unmanned vehicle labeling projects have increased this year, and that the wave of vertical large-model startups spawned by the AI 2.0 fever has split the once-declining text annotation business into different tracks, increasing demand for niche data annotation.

Driven by demand, Dai Yan is not the only one setting up a new team to pan for gold. Zhang Wei of Dongying, Shandong, also threw himself into a data labeling startup at the end of last year and grew it into a small team of more than a dozen people within half a year. Thanks to subsidies and support from the local government, Zhang Wei's company not only got a free office; the government also helped connect it with Party A clients.

Orders are plentiful, from the first project, in the hundreds of thousands, to a recent order of 400,000. Urgent delivery deadlines have made Zhang Wei more active in looking for annotators: a few days ago, he added 6 computers in a single day.

In Zhengzhou, Henan Province, a data labeling crowdsourcing platform is moving into a two-story office building that can accommodate 100 people. The company's positioning is written on the signboard at the door and on the office walls: "AI artificial intelligence big data research and development base" and "repeated data cleaning makes your AI smarter".

"There are too many project orders to do," its head said.

The site of the housewarming ceremony of a data labeling company (provided by interviewee)

Hot money has long since flowed into labeling companies. Data shows that the share price of industry leader Haitian Ruisheng rose fourfold between March and May this year.

According to 36Kr, since the beginning of this year more than ten data labeling platforms at Series B or earlier have collectively seen their valuations rise by nearly 100%. Since the second half of last year, automatic labeling companies have secured new financing one after another.

In September 2022, Boden Intelligent obtained tens of millions of yuan in financing. In December, Stardust Data completed a 50 million yuan financing round, four and a half years after its previous round in June 2018.

In April 2023, data annotation solution company "Kaiwang Data" obtained a new round of strategic financing. In June, AI data company "Integer Intelligence" received tens of millions of yuan in Pre-Series A financing.

Their slogans take direct aim at manual labeling: "reconstruct data label production", "automated production line + large-scale manpower", and "break the manual mode of autonomous driving annotation".

Clearly, capital markets are also refocusing on this emerging area.

3. More volume, but also stricter

The chain of data annotation consists of three parts.

Upstream: data labeling companies of 1~150 people, online freelancers and small workshops.

Midstream: data service providers, of two kinds. One is crowdsourcing platforms that act as intermediaries between upstream and downstream. The other is enterprises that build their own labeling bases in order to work in the industry stably over the long term.

Downstream: technology companies, industry enterprises, AI companies, and scientific research institutions. Internet companies dominated around 2018, and demand is now shifting to carmakers and autonomous driving companies.

The industry generally uses a subcontracting model: the Party A enterprise issues a tender, third-party service providers bid, and a winning bidder enters the enterprise's supplier tier, where core suppliers enjoy priority in choosing tasks and receive more orders.

To qualify as a core supplier, a company must have a delivery team of at least 30 people, mature delivery experience, and an established training system, and must control the quality and quantity of what it delivers. A stable production team ultimately allows a company to offer lower, more competitive quotes.

However, the low-cost advantage of these established teams has been disrupted. "This year's bidding is fierce!" a service provider told Jiazi Lightyear. "We quote 200 yuan for a project, and some people quote 80 yuan a day."

In the end, the project was won by the team with the low bid, but it still ended up in the hands of the more mature team. "They couldn't finish and Party A transferred it back to us, but the price couldn't go up."

Because Dai Yan's online team has no direct contact with Party A, the chaos of multi-level subcontracting, with the price cut at every layer, puts them under pressure.

Data labeling is a resource-based industry, and whoever can work directly with Party A has the advantage. Dai Yan revealed that some individuals register a company, falsely claim to have a professional team of 40-50 people, and bid at a very low price. After winning a project, they split it into 4-5 parts and hand them to different teams, and those small teams split the work further. Each layer takes a commission, the middlemen pocket the difference, and the piece rate that reaches the data annotators gets lower and lower.

As long as someone is willing to take the work, the price keeps spiraling down.

A price list obtained by "Jiazi Lightyear" shows that from 2D annotation to 3D laser point cloud annotation, the unit price of annotation work is generally 0.5~1.5 yuan per box. Dai Yan was once offered a per-box price that had been cut in half, on a project that had "changed hands at least four or five times."
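The arithmetic behind that halving can be sketched roughly. The snippet below assumes, purely for illustration, that every middleman takes the same proportional cut; the article does not report actual commission rates, and the function names are hypothetical. Under that assumption, a price that halves after four or five hand-offs implies a cut of roughly 13% to 16% per layer.

```python
def per_layer_cut(remaining_fraction: float, layers: int) -> float:
    """Equal proportional cut per layer that leaves `remaining_fraction`
    of the original price after `layers` hand-offs (illustrative assumption)."""
    return 1.0 - remaining_fraction ** (1.0 / layers)


def price_after_layers(unit_price: float, cut: float, layers: int) -> float:
    """Per-box price that reaches the annotator after `layers` equal cuts."""
    return unit_price * (1.0 - cut) ** layers


if __name__ == "__main__":
    # The price range of 0.5~1.5 yuan per box comes from the article;
    # 1.0 yuan is just a convenient value inside that range.
    starting_price = 1.0
    for layers in (4, 5):
        cut = per_layer_cut(0.5, layers)
        final = price_after_layers(starting_price, cut, layers)
        print(f"{layers} hand-offs: about {cut:.1%} per layer, final price {final:.2f} yuan/box")
```

Real commissions are unlikely to be uniform across layers, so this only gives a sense of scale.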

The race to the bottom on unit prices directly shrinks labelers' pay. Dai Yan and his team work semi-full-time. Most members are mothers, college students, freelancers and vocational high school students, and they draw boxes for 6 hours a day. Working this way, Dai Yan earned 4,000~5,000 yuan a month during the pandemic in 2022.

"If you have a computer and electricity, you can do the job" is a common pitch on data labeling recruitment posters. In the past, this low barrier was the industry's most significant advantage. Today it has sent the whole industry into a downward spiral. Dai Yan's monthly income is now only 2,000~3,000 yuan.

While income is down, the workload is not. On the contrary, data labeling work has become more complex and detailed.

Veteran practitioners miss the annotation market of the Internet era: the per-box price was 3 times higher and projects were plentiful. A team of 60~70 people could bring in 300,000 yuan a month. "Now the market is full of projects with an output value (the value one person generates by labeling for a day) of less than 100 yuan, which used to be several hundred a day," one practitioner said.

Back then, projects were simple to do and had few requirements. In 2D scene annotation for unmanned vehicles, for example, the annotator drew a box around each vehicle in the picture, and as long as the vehicle was inside the box, that was enough.

Now it is different: "fit" is the acceptance criterion Party A values most. "Last year, the error was required to be 5~7 mm, and this year it is 3~5 mm. The error tolerance keeps getting smaller," Dai Yan said.

Artificial intelligence scholar Andrew Ng has repeatedly emphasized that high-quality data unlocks the value of artificial intelligence, and that the more high-quality data there is, the faster artificial intelligence will develop.

In unmanned vehicle labeling data, quality shows up as the fit between the rectangular box and the labeled object: the tighter the fit, the more accurate the algorithm, and the more precisely the algorithm controls the vehicle.
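As a rough illustration of how "fit" might be checked, the sketch below compares a labeled box with a reference box in two ways: intersection over union (IoU), a widely used overlap measure, and the largest offset of any box edge. The article does not say which metric Party A actually uses, so both measures, the `Box` type and the sample coordinates are assumptions for illustration only.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; all coordinates share one unit (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for no overlap."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def max_edge_offset(a: Box, b: Box) -> float:
    """Largest distance between corresponding edges of the two boxes."""
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    reference = Box(100, 100, 200, 180)  # made-up "true" outline of the object
    labeled = Box(103, 98, 204, 181)     # made-up annotator's box
    print(f"IoU: {iou(reference, labeled):.3f}")
    print(f"Largest edge offset: {max_edge_offset(reference, labeled)}")
```

A tolerance such as "3~5" would then be a threshold on the edge offset, in whatever unit the project specification defines.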

For text annotation, quality shows up in the correctness of semantic understanding and the accuracy of answers to questions. The higher the accuracy, the smarter the large model being trained.

Skilled hands can deliver data both quickly and well. Dai Yan once had a novice help verify whether math solutions produced by ChatGPT were complete, whether the logic was correct, and whether the language could be understood by elementary school students. Party A sent back the 7,500 items the novice had labeled because the accuracy was too low, and it took Dai Yan and his colleagues more than ten days to correct them.

Data labeling is less and less a job without barriers to entry. Complex speech annotation and the production of professional datasets in medicine, law, finance and other fields increasingly require professionals with subject knowledge.

Taking unmanned vehicle projects as an example, Dai Yan believes a newcomer needs 3 months of practice to become proficient at 2D annotation, and 4~6 months to become proficient at 3D.

That practice means training the accuracy of box drawing: using the mouse to draw a rectangular box on the labeling page that covers the object precisely, without overlapping its outline, without missing any points, and fitting tightly against its edges.

Image/Data annotation experts point out problems in labeling

However, when machines begin to teach themselves and machine annotation replaces human annotation, are the skills people spent so much time training still meaningful?

4. Substitution crisis

It was an image annotation project some time ago that made Dai Yan realize AI was closing in.

This was an old project Dai Yan had been doing for two years: recognizing text in images. The annotator had to read the text in a picture and type it out, at 8 mao (0.8 yuan) per image. The data Dai Yan labeled was fed into an image text recognition model. Now the model is skilled at recognizing text in images, and Dai Yan's annotation work has been reduced to revision and review. The difficulty has dropped, and so has the unit price.

AI trained on human labeling is replacing human labeling work. In a study from the University of Zurich, researchers found that ChatGPT outperformed crowdsourced workers in 15 annotation tasks. Large models are also making their way into crowdsourcing work faster. A subsequent study by EPFL found that more than 30% of crowdsourced annotators already use large models when handling text labeling.

AI undoubtedly saves more time and effort than human labor: the researchers say that the unit cost of ChatGPT is only 1/20th that of MTurk.

Dai Yan is also ready to be replaced by "better AI" at any time. He is betting his future on autonomous driving annotation, which demands more skill.

But autonomous driving annotation is also being invaded by AI. Compared with drawing boxes by hand, automatic annotation only requires a built-in large model: once the parameters are set, the rectangular boxes that used to be drawn manually are generated automatically. For now, the only problem is that the generated boxes have quality issues, such as overlapping the object's outline or fitting too loosely, and must be manually inspected one by one.

The efficiency gains have surprised car companies. Li Auto is using large models for automatic annotation, at 1,000 times the efficiency of humans. Tesla has been actively advancing automatic annotation: in June 2022, it cut 200 US employees whose job was labeling videos to improve its driver assistance system, because its automatic labeling capabilities had greatly improved. Labeling 10,000 videos of less than 60 seconds each now takes a large model only a week of running time, rather than months of manual annotation.

Lin Qunshu, founder of AI data company Integer Intelligence, said that more and more car companies and AIGC companies are using large model products for automated labeling, and revenue is growing significantly. Their latest move is to establish a research and development branch in Singapore.

But third-party service providers are less optimistic about the growth of automated annotation. The project manager of a crowdsourcing platform in Henan said that automated labeling cannot replace more than 60% of labeling needs, and can only be used as an auxiliary labeling tool to process single or specific data and improve human efficiency.

A product manager at another data labeling company believes that automatic labeling can only filter simple basic data, and cannot accurately identify objects from complex and controversial scenes like humans. This is also the reason why the data labeling market is still dominated by autonomous driving labeling data.

However, everyone also agrees that the future trend of data labeling will shift from heavy labor to heavy technology.

In short, a company is either "rolled to death" by peers in cutthroat competition or "rolled to death" by technology. Sitting still is certainly not an option, and third-party data labeling companies are looking for a way out.

Dai Yan's plan is to keep up with the market, stay vigilant, be ready to cut staff at any time, and at the same time move toward building automated labeling tools. The founder of a crowdsourcing platform told peers that in the future companies must not simply pile up manpower but must have research and development capabilities.

What about individuals? The career path that circulates in the industry runs from novice annotator to skilled annotator, then annotation project administrator/manager, then data analyst at a Party A company, eventually reaching a monthly salary in the tens of thousands.

None of the data annotators Dai Yan knows has followed this path. They either stay where they are or quit. The best case is to set up their own labeling team, as Dai Yan did, but he has not found that any easier.

On one side is the growth in project demand brought by the AI boom. On the other are more chaotic bidding, lower per capita output value, and fast-improving AI. The two feelings are intertwined: AI will bring unlimited opportunities, and AI will also eliminate "us".

(At the request of the interviewee, the names in the text are pseudonyms)

Cover: Data annotation practitioners explain data annotation, and pictures are provided by interviewees
