
How did Google's three published papers affect the big data industry? Contents:

- Why did Google publish these three papers?
- Big data development in 2004: budding genius programmer Doug Cutting debuts
- Software value points
- The 2007-2008 development period: Facebook releases Hive
- The scheduling engine Yarn leads the way
- Spark emerges
- Big data stream computing
- A surprisingly similar history: the development of NoSQL
- At the end

Google's earliest profitable product was its search engine, and how to keep that technology developing became a problem for Google.

A search engine mainly does two things: one is web crawling, the other is index building, and in this process a large amount of data needs to be stored and computed.

The three papers Google published around 2004, the "troika" we often hear about, cover the distributed file system GFS, the big data distributed computing framework MapReduce, and the distributed database BigTable.

This "troika" was built precisely to solve that problem: a file system, a computing framework, and a database system.

Today, when you hear words like "distributed" and "big data", they certainly no longer sound unfamiliar.


In 2004, the entire Internet industry was still in its infancy, and the papers Google released genuinely shook it up; everyone suddenly realized things could be done this way.

In that period, most companies focused on the single machine, thinking about how to improve its performance and looking for ever more expensive and better servers. Google's idea was different: deploy a large-scale server cluster, store massive data on that cluster in a distributed manner, and then use all the machines in the cluster for data computation. This way Google did not need to buy lots of expensive servers; it simply organized ordinary machines together, which turned out to be very powerful.
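
The core of that idea, spreading data over a cluster of ordinary machines, can be sketched in a few lines of Python. This is only an illustration of hash partitioning; the node count and record keys are made-up values, not anything from Google's papers:

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Pick the node responsible for a key (simple hash partitioning)."""
    # Use a stable hash so the mapping is reproducible across runs.
    digest = hashlib.md5(key.encode()).digest()
    return digest[0] % NUM_NODES

# Place some records onto the cluster: each record lands on exactly
# one node, so both storage and computation can be spread out.
cluster = {n: {} for n in range(NUM_NODES)}
for key, value in [("page1", "html..."), ("page2", "html..."), ("page3", "html...")]:
    cluster[node_for(key)][key] = value
```

Real systems add replication and rebalancing on top of this, but the principle of dividing data by key across cheap machines is the same.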

At that time, a programmer named Doug Cutting was developing the open-source search engine Nutch. After reading Google's papers he was very excited, and he implemented functions similar to GFS and MapReduce following the principles in the papers. Two years later, in 2006, Doug Cutting split these big-data-related functions out of Nutch and launched a separate project dedicated to developing and maintaining big data technology. This became the famous Hadoop, consisting mainly of the Hadoop Distributed File System (HDFS) and the big data computing engine MapReduce.
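
The MapReduce model named here can be illustrated with the canonical word-count example, simulated in plain Python rather than on a real Hadoop cluster (the function names are my own, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return (key, sum(values))

documents = ["big data big cluster", "big cluster"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
# counts == {"big": 3, "data": 1, "cluster": 2}
```

The power of the model is that map and reduce calls are independent, so the framework can run them in parallel across the whole cluster.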

When we look back at the history of software development, including the software we develop ourselves, you will find that some software goes unnoticed after it is built, or is used by only a few people, and such software actually accounts for the majority of all software ever developed. And a small amount of software creates an industry worth tens of billions of dollars and millions of jobs every year: the list once read Windows, Linux, Java, and now it has to add the name Hadoop. If you have time, simply browse the Hadoop source code: this pure-Java software contains no profound technical difficulty and uses some of the most basic programming techniques, nothing surprising at all, yet it has had a huge impact on society, even driving a profound technological revolution and promoting the development of artificial intelligence.


Think about it: where is the value point of the software we develop? Where do you really need software to deliver value? You should pay attention to the business, understand it, stay value-oriented, and use your technology to create real value for the company, thereby realizing the value of your own life, instead of burying your head in requirements documents all day and being a coding robot that never thinks.

After Hadoop was released, Yahoo quickly adopted it. About a year later, in 2007, Baidu and Alibaba also began using Hadoop for big data storage and computation. In 2008, Hadoop officially became a top-level Apache project, and Doug Cutting himself later became chairman of the Apache Software Foundation. Since then, Hadoop has risen to stardom in the software world. In the same year, Cloudera, a commercial company specializing in Hadoop, was founded, and Hadoop received further commercial support. Around this time, some people at Yahoo felt that programming big data jobs directly in MapReduce was too troublesome, so they developed Pig. Pig is a scripting language: developers describe the operations to be performed on a large data set in a Pig script, which is compiled into MapReduce programs and then run on Hadoop. Writing Pig scripts is easier than programming MapReduce directly, but you still have to learn a new script syntax.

So Facebook released Hive. Hive supports big data computation using SQL syntax: for example, you can write a SELECT statement to query data, and Hive converts the SQL statement into a MapReduce program. This way, data analysts and engineers familiar with databases can use big data for analysis and processing without barriers. After Hive appeared, it greatly reduced the difficulty of using Hadoop and quickly became popular with developers and enterprises.
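
Conceptually, the translation Hive performs can be sketched in plain Python: the GROUP BY key becomes the map output key, and the aggregate function becomes the reduce step. This is a simplified illustration of the idea with an invented table, not Hive's actual code generation:

```python
from collections import defaultdict

# The query being "translated" (illustrative):
#   SELECT page, COUNT(*) FROM visits GROUP BY page

visits = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]

def map_row(row):
    # Map: emit (GROUP BY key, 1) for each table row.
    return (row["page"], 1)

groups = defaultdict(list)   # Shuffle: group values by key.
for row in visits:
    key, value = map_row(row)
    groups[key].append(value)

# Reduce: apply the aggregate function (COUNT(*)) per key.
result = {page: sum(ones) for page, ones in groups.items()}
# result == {"/home": 2, "/docs": 1}
```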


By 2011, 90% of the jobs running on Facebook's big data platform came from Hive. Subsequently, many Hadoop peripheral products began to appear, and a big data ecosystem gradually formed, including: Sqoop, which imports and exports data between relational databases and the Hadoop platform; Flume, for distributed collection, aggregation, and transmission of large-scale logs; and Oozie, a workflow scheduling engine for MapReduce jobs. In the early days of Hadoop, MapReduce was both an execution engine and a resource scheduling framework: the resource scheduling and management of the server cluster was done by MapReduce itself. But this was not conducive to resource reuse and also made MapReduce very bloated, so a new project was started to separate the MapReduce execution engine from resource scheduling; this became Yarn. In 2012, Yarn became an independent project and was subsequently supported by the various big data products, becoming the mainstream resource scheduling system on big data platforms.

Also in 2012, Spark, developed at UC Berkeley's AMP Lab (short for Algorithms, Machines, and People), began to emerge. Matei Zaharia, then at the AMP Lab, found that the performance of machine learning computations on MapReduce was very poor, because machine learning algorithms usually need many iterations, and MapReduce has to restart a job for every map and reduce pass, causing a lot of unnecessary overhead. Another point is that MapReduce mainly uses disk as its storage medium, while by 2012 memory had broken through its capacity and cost constraints and become the main storage medium during computation. As soon as Spark was launched, it was sought after by the industry and gradually replaced MapReduce in enterprise applications. Generally speaking, the business scenarios handled by computing frameworks such as MapReduce and Spark are called batch computation, because they usually run a computation over data accumulated over days and take tens of minutes or even longer to produce the desired results. Because the data being computed is not real-time data obtained online but historical data, this type of computation is also known as big data offline computation.
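
The iteration problem described above, reloading data from disk on every pass versus keeping it cached in memory, can be sketched in Python. The classes and numbers are invented for illustration; this shows the idea behind Spark's in-memory caching, not its implementation:

```python
class DiskDataset:
    """Simulates a dataset that must be re-read from disk on every access."""
    def __init__(self, records):
        self._records = records
        self.loads = 0  # count how many times we "hit the disk"

    def read(self):
        self.loads += 1
        return list(self._records)

def iterate_mapreduce_style(dataset, iterations):
    """Each iteration is a fresh job: the data is reloaded every time."""
    total = 0
    for _ in range(iterations):
        total = sum(dataset.read())  # reload + recompute
    return total

def iterate_spark_style(dataset, iterations):
    """Load once, cache in memory, then iterate over the cached copy."""
    cached = dataset.read()  # single load
    total = 0
    for _ in range(iterations):
        total = sum(cached)
    return total

d1, d2 = DiskDataset(range(100)), DiskDataset(range(100))
iterate_mapreduce_style(d1, 10)  # d1.loads == 10
iterate_spark_style(d2, 10)      # d2.loads == 1
```

With ten iterations, the MapReduce-style loop pays the load cost ten times while the cached version pays it once; for real machine learning workloads with hundreds of iterations over large data, that difference dominates the runtime.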


In the field of big data there is another type of application scenario that requires real-time computation over large amounts of data as they are generated, such as face recognition and suspect tracking on surveillance cameras across a city. This type of computation is called big data stream computing, and stream computing frameworks such as Storm, Flink, and Spark Streaming emerged to meet these scenarios. The data that stream computing processes is generated online in real time, so this type of computation is also called big data real-time computation. In a typical big data business scenario, the most common practice is to use batch processing technology to process the full historical data and stream computing to process newly arriving real-time data. Computing engines like Flink can support both stream and batch computation.
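
The batch-plus-stream pattern just described can be sketched with a toy running count in Python (the function names and events are invented for illustration):

```python
# Batch: compute once over the full historical data.
def batch_count(history):
    counts = {}
    for event in history:
        counts[event] = counts.get(event, 0) + 1
    return counts

# Stream: update the same result incrementally as each new event arrives,
# instead of recomputing over all data.
def stream_update(counts, event):
    counts[event] = counts.get(event, 0) + 1
    return counts

history = ["view", "click", "view"]   # historical data, processed as a batch
counts = batch_count(history)         # {"view": 2, "click": 1}

for event in ["click", "view"]:       # new events, processed one at a time
    counts = stream_update(counts, event)
# counts == {"view": 3, "click": 2}
```

Real stream frameworks add windowing, fault tolerance, and distributed state, but the core contrast is the same: batch recomputes over everything, streaming maintains results incrementally.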


Besides big data batch processing and stream processing, NoSQL systems also mainly handle the storage and access of large-scale massive data, so they too are classified as big data technology. NoSQL was very popular around 2011, with many excellent products such as HBase and Cassandra emerging; HBase is an HDFS-based NoSQL system that was split out of Hadoop.

If we look back at the history of software development, we will find that software with almost identical functions tends to appear at nearly the same time: Linux and Windows both date from the early 1990s, the various Java MVC frameworks appeared in basically the same period, and Android and iOS came out one right after the other.

When you are in a trend, you must seize its opportunity and find a way to stand out; even if you do not succeed, you will gain more insight into the pulse of the times and harvest precious knowledge and experience. But if the trend has already receded, working hard in that direction will only bring confusion and depression, helping neither the times nor yourself. Google seized its opportunity, and so it succeeded and led the way.