News about merchants and platforms exploiting big data (that is, charging different prices to new and existing customers or to customers in different regions, or using big data and very complex calculation methods to discriminate against consumers on price) has appeared frequently. Other big data news has also kept fraying people's nerves, making everyone feel that they have "nowhere to hide" in the face of big data.
For ordinary people, big data is close at hand but not well understood. What exactly is big data? How is it connected to the numbers and mathematics we know well? How exactly has big data changed our lives? Tu Zipei, a big data expert and a pioneer of big data in China, explains the subject clearly and intuitively by tracing the history of data and its wide-ranging use in major historical events in China and abroad. The following is an excerpt from "Big Data for Children", published with the authorization of the publishing house.

"Big Data for Children", by Tu Zipei, edited by Tongqu Publishing Co., Ltd., July 2020 edition of People's Posts and Telecommunications Publishing House.
Author 丨 Tu Zipei
Excerpt 丨 An Ye
The emergence of big data has reshuffled statistical science and data science
This is a story about the retail empire Walmart.
Walmart, the world's largest retailer, has more than 11,000 stores and more than 2 million employees. Its sales revenue exceeded $500 billion in 2018, exceeding the GDP (gross domestic product) of many countries.
Walmart's database is one of the largest commercial databases in the world, and Walmart was one of the first companies to use data mining technology on a large scale. Its chief information officer is Rollin Ford, and data analytics is at the heart of his job. Ford once remarked: "Every morning when I wake up, I have to ask myself how I can make data flow better, manage it better, and analyze it better."
After a routine data analysis, researchers suddenly found that the product most often sold together with diapers was beer! Diapers and beer sound like an unlikely match. Hardly anyone would think to link the two, yet this was the result of mining historical data, and it reflected a pattern at the data level. It was puzzling: was this a real pattern? The answer, again, was in the data.
A follow-up investigation finally revealed the reason: some young fathers often had to go to the supermarket to buy diapers for their babies, and 30% to 40% of these "daddies" would also buy some beer to treat themselves. Even the most imaginative person would hardly have guessed this. Walmart then bundled diapers and beer, and sure enough, sales of both increased. This is a classic example of applied data science.
Stills from the movie Platinum Data (2013).
How exactly did Walmart discover this pattern? This brings us to the heart of data science: data mining. Data mining means analyzing a large amount of data with specific algorithms to discover new knowledge for people to use. It is called "mining" as a metaphor: finding knowledge in massive data is like digging for gold in a mine. You can think of data mining as an algorithm-controlled excavator and the database as the mine.
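The kind of pattern described above is often expressed with three simple measures from association analysis: support, confidence, and lift. The following is a minimal sketch, not Walmart's actual method; the function name `pair_stats` and the five sample baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_stats(baskets):
    """Return {(x, y): (support, confidence, lift)} for every ordered item pair.

    support    = share of baskets containing both x and y
    confidence = share of baskets with x that also contain y
    lift       = confidence divided by the overall share of baskets with y
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))

    stats = {}
    for pair, both in pair_counts.items():
        a, b = sorted(pair)
        for x, y in ((a, b), (b, a)):
            support = both / n
            confidence = both / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            stats[(x, y)] = (support, confidence, lift)
    return stats


# Hypothetical shopping baskets, made up for this example.
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["diapers", "bread"],
    ["milk", "bread"],
    ["beer", "chips"],
]
stats = pair_stats(baskets)
support, confidence, lift = stats[("diapers", "beer")]
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two products appear together more often than they would if shoppers chose them independently, which is the signal a miner would then investigate further.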
Before 1989, data mining went by a longer name: knowledge discovery in databases. The database, which is the basis for mining, did not appear at the same time as the computer; after the computer's advent, it gradually grew out of software and became independent of it.
In 1948, when Truman and Dewey ran for president of the United States, Gallup predicted from a sample survey that Dewey would be elected. The press was so convinced by this prediction that newspapers such as The New York Times printed the news of Dewey's victory a day in advance, hoping to get a head start. The result surprised everyone: it was Truman who was elected in the end! The newspapers announcing Dewey's election had to be destroyed.
Gallup failed because a sample survey has to go through multiple steps, such as questionnaire design, information collection, and data analysis, so its data lag behind a rapidly changing reality. Gallup had to stop polling in the final two weeks before the election results came out, and it was at this last moment that Truman turned the tide.
In the era of big data, there are new ways to predict presidential elections: before and after voting, mining opinion data on social media can predict more accurately who will be elected. In the 2008 and 2012 US presidential elections, some people accurately predicted the results by mining data on Twitter and Facebook.
Mining Internet data in this way requires no questionnaire design and no one-by-one interviews, so the cost is very low. Such an analysis can be completed by one person rather than the large team that a questionnaire survey needs. More importantly, the analysis is real-time and has no lag.
Therefore, more and more scientists believe that, because of the emergence of big data, statistical science and data science will be reshuffled and enter a new era. In this new era, data mining will become an increasingly important tool for analysis and prediction, while sampling will decline in importance and become an auxiliary tool. Although data mining is on the rise, something else is stealing the limelight: machine learning. The chess computer "Deep Blue", which beat all comers, and AlphaGo, which left many famous Go players helpless, both use machine learning technology.
Machine learning also relies on computer algorithms. Unlike data mining, its algorithm is not fixed: it automatically adjusts its parameters as the number of calculations and mining rounds increases, so that the results of mining and prediction become more accurate.
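The idea that an algorithm adjusts its own parameters as it runs more rounds can be illustrated with a very small sketch. It fits a single parameter `w` in the model y = w * x by gradient descent; the function name `fit_slope`, the learning rate, and the sample points are assumptions made for this example, not something taken from the book.

```python
def fit_slope(points, learning_rate=0.01, rounds=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # initial guess for the parameter
    for _ in range(rounds):
        # Gradient of the mean of (w * x - y) ** 2 with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * grad  # adjust the parameter against the gradient
    return w


# Hypothetical data generated from y = 2 * x.
points = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]

w_few = fit_slope(points, rounds=5)     # few rounds: still far from 2
w_many = fit_slope(points, rounds=200)  # many rounds: very close to 2
print(w_few, w_many)
```

With only a few rounds the estimate is still rough; with more rounds it settles close to the true slope, which mirrors the point that more rounds of calculation make the prediction more accurate.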
Big data has stimulated the formation of a professional market for data visualization
In 1853, the Crimean War broke out. The war, which resulted in the deaths of more than 500,000 people, was extremely brutal. Britain, as one of the belligerents, naturally suffered heavy casualties.
Florence Nightingale (1820–1910) was a British field nurse and a self-taught statistician. After examining the casualties of British soldiers, she found that the number of deaths due to poor medical and health conditions greatly exceeded the number of people killed directly on the front line.
Nightingale turned her statistics into a chart. The chart clearly showed the disparity between the number of combat deaths and the number of non-combat deaths. Its strong visual effect triggered a heated discussion throughout British society and led to the British government's decision to set up a field hospital. Thus, the first official field hospital in human history was established.
Nightingale was later hailed as the mother of modern nursing. Her chart was the first "polar area diagram" in history and an early attempt by a statistician to present data graphically.
It is no exaggeration to say that one chart changed a system. Human beings are by nature creatures of the senses, and visual impact moves people far more strongly than simple association does.
What Nightingale did is now known as data visualization: using graphics, images, maps, animations, and other vivid, easy-to-understand forms to show the magnitude of data and to explain the relationships and trends within it, so that the results of data analysis can be better understood and used.
The case of Nightingale amply demonstrates the value of data visualization, especially in the public domain. Physiology has also shown that 40% of the human cerebral cortex is devoted to visual response, and that the human nervous system is inherently most sensitive to pictorial information. Through images, information is expressed and transmitted more intuitively, quickly, and effectively. Moreover, human creativity depends not only on logical thinking but also on thinking in images. Data visualization can further stimulate people's visual thinking and spatial imagination, drawing users in and helping them gain insight into the hidden relationships and patterns in data.
In the 1970s, with the rise of computer technology, a group of visionary scholars saw great potential in this field. Some argued: "The computer of the future must not only be able to calculate, but also be able to turn the results of calculations into intuitive graphics. We should study both outcomes because each helps us understand the problem."
In 1983, Yale professor Edward Tufte became the leading figure in the discipline of data visualization. Tufte systematically examined the origins of the human use of "graphics" to express "data" and "ideas", sorted through the graphic treasures in historical books, took into account the revolution that computers had brought to statistics, and published the book "The Visual Display of Quantitative Information". The book was later recognized as the founding work of "data visualization" as a discipline.
Tufte emphasizes that the key to data visualization is "design": "Information overload does not exist. The problem is bad design. If the graphics you use to express data leave people confused, then revise your design." The president's annual announcement of the government budget is a major event in the United States. Under Tufte's guidance, the White House used a graph to visualize the annual budget published by Obama. As shown in the figure below, the thickness of each line represents the size of an item of revenue or expenditure, with revenue on the left and expenditure on the right, and the red part in the middle is the deficit. The design is apt: how much revenue the Obama administration collected and what it planned to spend it on are clear at a glance.
A visual representation of Obama's budget spending in 2010. (Source: The Washington Post, February 1, 2010)
In the 21st century, the explosion of big data has increased people's need for tools to display, understand, and reason about data. This demand has stimulated the formation of a professional market for data visualization, whose products have multiplied rapidly in a dazzling variety. The field has developed from the earliest simple graphs, such as line charts, bar charts, pie charts, and network diagrams, to dashboards and scoreboards that mainly monitor business performance, and then to interactive three-dimensional maps, dynamic simulations, animation techniques, and more.
As an emerging industry, data visualization has potential that should not be underestimated. Data visualization engineers understand both data analysis and the art of composition, and they combine the traits of a storyteller and an artist. By transforming complex data into intuitive graphics, they bring the results of data analysis to the general public, and they can be described as the navigators of the big data era.
"Evidence cloud" is the application of big data in the police system
The "big" in big data refers not only to its large volume but also to its great potential value.
The most fundamental reason why human beings can enter the era of big data is that human data technology has made major breakthroughs. Through a series of technologies with data mining as the core, human beings have discovered new knowledge and created new value in data, thus bringing great opportunities such as "big knowledge", "big technology", "big profit" and "big intelligence" to society.
In this new era, data is wealth and the ability to analyze data is the core of competitiveness. One industry after another will enter an era of "data competition", in which those who command data prosper. It is also a competition among data scientists, and data mining and machine learning, whose mission is to discover new knowledge, are the most eye-catching competitive weapons of this era.
I lived in Hangzhou for four years. During those four years, I loved reading the "Qianjiang Evening News". One day, I saw a big piece of news: the Zhijiang Garden case had been solved!
The Zhijiang Garden villa case once caused a sensation throughout Hangzhou. In 2003, someone sneaked into Zhijiang Garden on a rainy night, robbed and killed the occupants, and then fled without a trace. That year, the Chinese detective Li Changyu came to Hangzhou for the first time. Someone asked him about the case, and at the time he could not offer a solution, but he left one remark: "As long as the right moment comes, the case will be solved sooner or later." That moment took 13 years to arrive. What nobody expected was that the thing everyone had been waiting for was big data.
In the 1990s, the Hangzhou police began to popularize the concept of "biological traces" and introduced a physical evidence management system. In 2008, a standardized collection instrument, the "Trace Search Instrument", was rolled out in Hangzhou police stations; it can collect and record data such as portraits, DNA (deoxyribonucleic acid), fingerprints, palm prints, footprints, and shoe sole patterns. In 2012, these data began to move to the cloud, forming a "physical evidence cloud", in which the data of any suspect can be compared with all the other data. The "physical evidence cloud" is the application of big data in the police system. It played a key role in solving the Zhijiang Garden case.
In September 2015, a man surnamed Yu injured someone in a quarrel at a noodle restaurant in Zhuji. After the local police subdued him, they collected his DNA and other data and entered them into the "physical evidence cloud". Because cross-regional data comparison had been implemented, the police unexpectedly found that his data closely matched the traces left in the Zhijiang Garden case, and Yu's identity was quickly confirmed. The Hangzhou police had searched high and low for more than a decade to solve the Zhijiang Garden case, yet in the end it was solved with almost no effort. This was due first of all to the broad coverage of the "physical evidence cloud", which connects scattered pieces of data, and the opportunity to solve the case emerged from that comparison.
Between 1975 and 1986, a series of crimes took place in the United States, and the suspect became known as the "Golden State Killer." Investigators pursued him for more than 20 years and checked thousands of suspects, but were never able to catch him.
Stills from the movie Source Code (2011).
In December 2017, an investigator came up with a new way to apply big data. He uploaded the suspect's DNA to a genealogy website, which analyzes uploaded genetic data and provides clues for people tracing their ancestors. As a result, a person whose DNA partially matched the suspect's was found. With this important finding, police narrowed the pool of suspects from millions to a single family. After further investigation, the police caught the perpetrator, Joseph James DeAngelo. By then, the "Golden State Killer" was 72 years old. Justice may be late, but it is never absent, and the biggest contributor here was again big data.
A friend of mine in the police told me this: "Big data and new technology are so powerful that we now solve every new case, clear the backlog of old cases, wait for cases to come to us, and end up with no cases left to solve. If you do something bad today, don't run, because you can't get away at all. Just sit at home and wait for the police to come and find you." This may be an exaggeration, but it shows how extraordinarily confident today's police are. Why are they so confident? Precisely because of big data analysis.
Today, almost all human behavior leaves data behind. Whoever passes by leaves a trace, and through the analysis of those traces, a person has almost no secrets. This is true of ordinary people, and it is true of criminals too, unless they hide deep in remote mountains and forests and never come out, which is almost impossible in modern society.
Use data and "clouds" to solve the rescue problem of "empty nest" elderly people
No two leaves in nature are identical, because their veins distinguish one leaf from another. Nor are any two people's voiceprints or fingerprints the same. Similarly, in the data space, a person or an object is a unique "number body", which can be defined, supported, and vouched for by countless pieces of data. Each set and each piece of data has its own characteristics, just like voiceprints and fingerprints; these are the textures of data, or "number patterns" for short. Everyone's face shape, fingerprints, heartbeat, blood pressure, and other physiological data are different, and so are everyone's social activities. As a person's data are continuously collected and integrated, a unique number pattern emerges. With these number patterns, it is possible to define a person clearly and to distinguish that person from any other.
This degree of individual distinction has never been seen before in human history, but today's governments are acquiring it. I call it "single-grain governance", meaning that each person is becoming an atom under the microscope. Think of snow falling from the sky: the snowflakes look highly similar, but each forms its own unique structure because the water vapor conditions differ, and each follows a path full of variables as air currents push it around, so no two are alike. Today's technology is like an eye in the heavens, able to lock onto, track, and distinguish the trajectory of every snowflake in the air.
In August 2018, in a garden community in the south, an elderly couple died at home and were not found until many days later. According to news reports, both were retired teachers. The husband was more than 70 years old and suffered from Alzheimer's disease, and the wife also suffered from a variety of illnesses. The two were usually amiable, and no one expected that they would pass away and be discovered by neighbors only many days later, which was truly sad. The couple had a son who lived in a neighborhood across the road.
The community was full of people, and the son lived in the neighborhood next door, yet such a thing still happened right under everyone's eyes. What went wrong? You may think of indifference among neighbors, or of the son's neglect of his parents. Even in a busy urban area, how was the couple's situation any different from living in isolation?
Every family has elderly members. In today's China, the problem of elderly people living alone, the so-called "empty nesters", has become very serious. According to statistics, by the end of 2018 there were about 250 million people over the age of 60 in China, accounting for 18% of the total population, of whom about 170 million were over the age of 65, accounting for 12% of the total population. China is about to enter the peak stage of population aging.
I once read a report about a man working far from home who called his parents for several days and got no answer. He was so worried that he dropped his work and went straight home, pushed open the door, and found that tragedy had struck. A few days earlier, his father had died of a heart attack in the bathroom, and his mother, paralyzed in bed, had starved to death because no one was there to take care of her. It is hard to keep seeing such tragedies. I believe that with so many sensors and smart bracelets available today, this problem can certainly be solved if they are used well. Mobile technology means that people are never offline, and a wave of sensor adoption is coming.
If the data collected by a smart bracelet can be connected to a hospital in real time, then data such as the wearer's heartbeat and body temperature can be transmitted continuously to the hospital's database. When the wearer's heartbeat becomes abnormal, an algorithm can push the data to a doctor as a reminder to provide timely treatment.
At present, the Geyuan community in Yangzhou City, Jiangsu Province, is trying out "intelligent care" sensors for the elderly, mainly mattress sensors, toilet sensors, gas leak alarms, and indoor infrared sensors. These sensors send the collected data to the system terminal on schedule every day, and community staff and the residents' children can learn how the elderly person's day went simply by opening their mobile phones.
Japanese society also has an aging problem, one that is more serious than China's. According to statistics, about 6 million elderly people live alone in Japan, and 40,000 die alone every year. To watch over them, the Japanese have also turned to big data: through utility providers, they monitor whether a household's taps have gone unused for several days, whether the lights have been turned on, and whether the gas has been used or left on. Water meters in Japanese households are generally installed outside the residence, so they are easy to retrofit.
By installing an electronic meter that records water use in real time, the idea of watching over elderly people living alone can be put into practice. Children living far away can see their parents' energy use data every day, and from these data they can infer when the elderly cook and bathe. When they notice an anomaly in the numbers, they can immediately contact the local community and ask community staff to check on the household. As a result, the number of cases in which elderly people living alone in Japan are found several days after dying at home has decreased by 30 percent.
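Spotting such a "numerical anomaly" in meter readings can be sketched very simply. The example below flags any day whose water use falls far below the average of the preceding week. The function name `flag_low_usage_days`, the 7-day baseline, the 20% threshold, and the sample readings are all assumptions chosen for illustration; a real monitoring service would tune them and combine several signals.

```python
from statistics import mean


def flag_low_usage_days(daily_litres, baseline_days=7, ratio=0.2):
    """Return indexes of days whose usage is below ratio * recent average.

    The baseline for each day is the mean of the preceding `baseline_days`
    readings, so the first `baseline_days` days are never flagged.
    """
    flagged = []
    for day in range(baseline_days, len(daily_litres)):
        baseline = mean(daily_litres[day - baseline_days:day])
        if daily_litres[day] < ratio * baseline:
            flagged.append(day)
    return flagged


# Hypothetical daily water use in litres for one household.
usage = [120, 110, 130, 125, 115, 120, 118, 122, 5, 0]
alerts = flag_low_usage_days(usage)
print(alerts)  # days 8 and 9 fall far below the household's usual level
```

In this sketch, a flagged day would be the trigger for contacting the community and asking someone to check on the household.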
If data and the "cloud" are put to good use, the problem of rescuing "empty nest" elderly people will change greatly.
Knowing the cold and knowing the heat is the ultimate pursuit of big data
In July 2013, a female student at East China Normal University received a text message from the school: "Hello, student. We noticed that you spent little on meals last month. Are you having any financial difficulty?"
This warm text message also came about thanks to big data. By mining the spending data from campus meal cards, the school found that the student spent very little on each meal, so it sent her a message of concern.
Stills from the movie "Moneyball" (2011). The film tells how a team uses data modeling to find potential star players.
With the help of data analysis, East China Normal University quietly designated students who ate more than 60 meals in the canteen each month and whose total spending was less than 420 yuan as aid recipients. Without any review or public announcement, the school added subsidies of varying amounts directly to these students' meal cards. Over years of management, the school had found that many students from poor families were reluctant to apply for hardship bursaries for fear of losing face, and that public evaluation and announcement would inevitably hurt some students' self-esteem. The method is clearly well-intentioned. Of course, there is the occasional well-meant mistake! The student mentioned above spent so little simply because she was on a diet.
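The screening rule described above (more than 60 canteen meals in a month and less than 420 yuan spent in total) can be written as a simple filter. This is a minimal sketch of that rule only; the record type `MonthlyMealRecord`, the function `select_for_subsidy`, and the sample data are hypothetical and do not describe the university's actual system.

```python
from dataclasses import dataclass


@dataclass
class MonthlyMealRecord:
    student_id: str
    meals: int          # number of canteen meals in the month
    total_yuan: float   # total canteen spending in the month


def select_for_subsidy(records, min_meals=60, max_total_yuan=420.0):
    """Return IDs of students with more than min_meals meals and
    total spending below max_total_yuan."""
    return [
        r.student_id
        for r in records
        if r.meals > min_meals and r.total_yuan < max_total_yuan
    ]


# Hypothetical records, made up for this example.
records = [
    MonthlyMealRecord("s001", 75, 380.0),  # eats often, spends little
    MonthlyMealRecord("s002", 80, 650.0),  # eats often, spends normally
    MonthlyMealRecord("s003", 20, 150.0),  # rarely eats in the canteen
    MonthlyMealRecord("s004", 60, 300.0),  # exactly 60 meals, not "more than"
]
selected = select_for_subsidy(records)
print(selected)
```

The meal-count condition matters: it separates students who rely on the canteen but spend very little from those who simply eat elsewhere. As the dieting student shows, the rule can still misfire when spending is the only signal.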
The misunderstanding arose not because big data does not work, but because the data were insufficient and not comprehensive enough. Besides sheer volume, another characteristic of big data is "multiple sources". If data from other sources had supplemented the meal card data, the judgment would have been more accurate.
My friend Professor Zhou Tao works at the University of Electronic Science and Technology of China. A well-known big data expert in China, he has led a project called "Finding the Loneliest Person on Campus". The project collected more than 200 million behavioral records from 30,000 students. These data cover course selection, library card swipes, dormitory access control, canteen spending, and shopping at the school supermarket, all of which are generated when students swipe their cards.
By analyzing card swipes at different locations, the research team found that more than 800 students at the University of Electronic Science and Technology of China spent most of their time at school alone. Whenever they queued, no classmate or friend stood before or after them; they were the "loneliest people". These lonely students are at high risk of mental illness, and if parents and schools pay attention early, tragedies can be avoided. This is also the warm side of big data. "Knowing the cold and knowing the heat", that is, caring for people attentively, should be the ultimate pursuit of big data.
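One way to picture this kind of analysis is to count how often two students swipe their cards at the same place within a short time of each other. Pairs who do so repeatedly are treated as companions, and students with no such pair are candidates for the "loneliest" list. The sketch below rests on assumptions: the project's actual algorithm is not described in the text, and the function name, the 60-second window, the repeat threshold, and the sample swipes are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def find_students_without_companions(swipes, window_seconds=60, min_repeats=2):
    """swipes: list of (student_id, location, timestamp_in_seconds).

    Two students count as companions if they swipe at the same location
    within window_seconds of each other at least min_repeats times.
    Returns the sorted IDs of students who have no companion.
    """
    by_location = {}
    for student, location, ts in swipes:
        by_location.setdefault(location, []).append((ts, student))

    pair_counts = Counter()
    for events in by_location.values():
        # Compare every pair of swipes at this location (fine for a sketch).
        for (t1, s1), (t2, s2) in combinations(events, 2):
            if s1 != s2 and abs(t2 - t1) <= window_seconds:
                pair_counts[frozenset((s1, s2))] += 1

    with_companion = set()
    for pair, count in pair_counts.items():
        if count >= min_repeats:
            with_companion.update(pair)

    all_students = {student for student, _, _ in swipes}
    return sorted(all_students - with_companion)


# Hypothetical swipe log: amy and ben keep turning up together; cat does not.
swipes = [
    ("amy", "canteen", 0), ("ben", "canteen", 20), ("cat", "canteen", 30),
    ("amy", "library", 3600), ("ben", "library", 3630), ("cat", "library", 9000),
    ("amy", "canteen", 86400), ("ben", "canteen", 86410), ("cat", "canteen", 90000),
]
lonely = find_students_without_companions(swipes)
print(lonely)
```

Requiring repeated co-occurrence is the key design choice here: standing next to a stranger in a queue once proves nothing, while turning up together again and again suggests a real relationship.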
This article is excerpted from "Big Data for Children" and has been abridged and modified from the original text. The subheadings were added by the editor and are not part of the original text. It is published with the authorization of the publishing house.
Editor 丨 Liu Yaguang
Proofreading 丨 Wu Xingfa