
How big data platforms can be transformed into cloud natives

Author: InfoQ

Enterprises today face rapidly growing data volumes and the need to process many kinds of data in real time and intelligently. Cloud-native big data platforms are attractive here for their elastic scaling, multi-tenant resource management, massive storage, support for heterogeneous data types, and low-cost computing and analysis. But how should an enterprise go about the cloud-native transformation of its big data platform?

To explore this, we spoke with Dr. Peng Feng, co-founder and CEO of Zhiling Cloud, about how big data platforms can go cloud native. The following is edited from the live conversation; cuts have been made that do not change the original meaning, and the full content is available as a playback video.

InfoQ: What are the specific technology categories of the cloud-native technology ecosystem? What do things like IaaS, PaaS, and SaaS have to do with cloud native?

A: Although "cloud native" has only caught fire in recent years, the industry began working on it in the early 2000s. After finishing my Ph.D. in 2005 I joined Ask.com, where I also worked on cloud platforms; the team at the time was called middleware. Amazon did not have the concept of cloud native when it first launched IaaS. The forerunner was Heroku, which formulated the twelve-factor app principles: applications could be pushed straight to the web without managing servers, an early form of cloud-native application. Kubernetes, which is so hot now, did not exist yet, so cloud native is not just K8s. Likewise, the concept of containers predates Docker, so containers are not just Docker either.

The barrier to entry for cloud native used to be high. At Ask.com, almost everyone working on the cloud platform had a doctorate, and at Twitter very few people could use it at first. It has taken more than twenty years for the cloud-native concept to go from its beginnings to today's popularity. The most critical pieces are containers, microservices, and declarative APIs; CI/CD, which we often mention, is not unique to cloud-native architectures. The key to cloud native is resource-oriented programming: after requesting the resources it needs from the system, an application does not have to manage scheduling details, and release, fault tolerance, migration, and so on are all handled automatically by the system. Resource-oriented programming brings great benefits to the development, management, and ease of use of an entire distributed system.
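The idea of resource-oriented programming can be sketched in a few lines: the developer declares what the application needs, and a scheduler decides where it runs. The sketch below is illustrative only; the field names loosely mirror a Kubernetes pod spec, but this is a plain Python dict and a toy bin-packing scheduler, not a real API.

```python
# A minimal sketch of "resource-oriented programming": the application
# declares what it needs; the platform (e.g. Kubernetes) decides where
# and how to run it. Field names mimic a K8s pod spec for familiarity,
# but everything here is illustrative.

def resource_request(name, image, cpus, memory_gb, replicas=1):
    """Build a declarative description of an application's needs."""
    return {
        "name": name,
        "image": image,
        "replicas": replicas,
        "resources": {
            "requests": {"cpu": str(cpus), "memory": f"{memory_gb}Gi"},
        },
    }

# The developer states only the desired end state...
spec = resource_request("etl-job", "myrepo/etl:1.0", cpus=2, memory_gb=4)

# ...and a scheduler (sketched here as a trivial first-fit packer)
# picks a node with enough free capacity.
def schedule(spec, nodes):
    """Return the first node with enough free CPU for the request."""
    needed = int(spec["resources"]["requests"]["cpu"])
    for node, free_cpu in nodes.items():
        if free_cpu >= needed:
            return node
    return None

print(schedule(spec, {"node-a": 1, "node-b": 8}))  # node-b has capacity
```

The point of the sketch is what is absent: no machine names in the application code, no install scripts, no restart logic. That is all delegated to the platform.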

InfoQ: Will the cloud-native transformation of big data platforms make some of these existing technologies obsolete?

A: Definitely. In fact, Alibaba used cloud-native technology when it built Feitian (Apsara). Hadoop has three components: the file system HDFS, the compute engine MapReduce, and the resource manager Yarn. MapReduce has now been largely replaced by Spark. HDFS still serves as storage for many applications, but Yarn's position is awkward because it competes with K8s for resource management. So I think MapReduce and Yarn are on their way out: Spark and most other data applications can now run directly on Kubernetes, so big data systems no longer need Yarn. HDFS, for its part, will get a cloud-native retrofit.

Hadoop originally included an object storage system, Ozone, but Ozone has since been moved out and is no longer part of Hadoop. The original Hadoop stack is likely to be phased out entirely within three to five years. This transformation is inevitable, and big data platforms built on the original Hadoop ecosystem will certainly migrate to cloud-native platforms.

InfoQ: When did Twitter start building a cloud-native big data platform? Why did you do it then, and what were the results? What lessons does this hold for big data platforms in China?

A: Very early, around 2011. Once Mesos went into production, every application except Hadoop ran on it. At the time, Mesos could support a cluster of more than 8,000 machines inside Twitter. Mesos had its own resource manager and was actually ahead of K8s.

Why did we find this so powerful? Because in the past, to release a system you had to requisition machines, buy them, install them, and then test after installation. Even if testing went well, you still had to worry about conflicts between the third-party libraries of different systems.

Big data systems are all open source. One benefit of open source is that it is cheap; the downside is that there is no coordination between systems, so the third-party libraries that two open source systems depend on can conflict. For example, open source project A uses a third-party library C, and another open source project B also uses library C, but the two projects depend on different versions of C. Installed on the same machine, they frequently break. This is a technical problem that plagued everyone for many years.
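The conflict described above can be made concrete with a small simulation. All names here are hypothetical; the sketch just shows why a single shared install of library C cannot satisfy two apps that pin different versions, while per-container environments can.

```python
# Sketch of the shared-library conflict: projects A and B both depend on
# library C, but at incompatible versions. On one shared host there is a
# single installed version; in containers, each app carries its own.
# All names are illustrative.

def check_conflicts(installed, requirements):
    """Return apps whose pinned version differs from what is installed."""
    return [app for app, (lib, ver) in requirements.items()
            if installed.get(lib) != ver]

requirements = {
    "project_a": ("lib_c", "1.2"),
    "project_b": ("lib_c", "2.0"),
}

# One shared machine: only one version of lib_c can be installed,
# so at least one project is always broken.
shared_host = {"lib_c": "1.2"}
print(check_conflicts(shared_host, requirements))  # ['project_b']

# Containerized: each app's environment pins its own lib_c version,
# so nothing conflicts.
containers = {app: {lib: ver} for app, (lib, ver) in requirements.items()}
conflicts = [app for app, env in containers.items()
             if check_conflicts(env, {app: requirements[app]})]
print(conflicts)  # []
```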

With cloud-native capabilities, each application runs in its own container. We were not using Docker yet; we used a generic container built by Mesos itself. The advantages are containerization and isolation: applications no longer interfere with one another, and releases can happen in seconds. The productivity gain is revolutionary. Cloudification is therefore an inevitable trend, and big data platforms will follow it.

If software built in the future is not cloud-native, no one will use it. All new software now runs in a cloud-native way; old software will gradually fail to keep up, and it cannot expect to live on a separate cluster forever. Applications that cannot run on a cloud platform will be eliminated.

InfoQ: What types of enterprises are using cloud-native big data platforms today? How has the number of enterprises changed?

A: Mainly internet companies and the big tech firms. Cloud-native big data platforms are not yet mature. For one thing, it is easier for businesses to recruit people familiar with Hadoop. For another, the cloud-native maturity of the big data components themselves is not particularly high. Spark's Kubernetes support only reached general availability (GA) with version 3.1, released in March; Kafka announced a Kubernetes GA in May of this year but has not yet open sourced it. Both facts show that mainstream big data vendors are now leaning toward Kubernetes. The core components involved, such as Spark and Hive, and the components they depend on must all be upgraded, and system-level upgrades are troublesome for the typical enterprise.

Companies abroad such as Twitter, Uber, and Airbnb are doing this, but some of their solutions are not ideal. Uber, for example, moved all of Yarn onto Kubernetes, which I think is not a great approach; Yarn should simply be removed and the other components made cloud-native directly. Components like MongoDB, for instance, have gradually gained Kubernetes releases.

Taken together, everyone is moving forward, but the field has not yet reached a particularly mature stage.

InfoQ: What are the current problems with big data platforms? Why can cloud-native technologies solve these problems?

A: Current big data platforms can meet basic data needs, but two big problems remain. The first is resource management. Yarn's resource management granularity is coarse: it is limited in multi-tenant isolation and resource preemption, applications like Spark cannot be co-located with other workloads, it cannot achieve cloud-native separation of storage and compute, and compute and storage cannot make full use of each node's resources. Most importantly, other applications will no longer be upgraded for Yarn, so future operations and upgrades will be a problem.

Second, it is not that the current big data platform cannot solve today's problems, but the migration of the community and ecosystem is already unmistakable. As mentioned earlier, MongoDB can run on Kubernetes, while Hadoop may require a separate dedicated cluster in the future; with cloud-native storage there is no need for a separate cluster. Task co-location, resource isolation, and support for new applications are therefore the Hadoop stack's major shortcomings.

InfoQ: In practice, how are cloud-native technologies adapted and upgraded to address these issues? How should developers do technology selection?

A: All resource management and orchestration can now rely on Kubernetes, letting enterprises focus on their own business logic and management. For example, the Container Storage Interface (CSI) is becoming more and more mature: as long as a storage system meets the interface requirements, applications from any provider can use it. Dynamic dependencies, releases, and fault tolerance can all rely on Kubernetes, which is much better than running two different clusters at the same time.

In addition, the original Hadoop applications most likely do not need to be rewritten, because there is now dedicated HDFS-compatible storage; once the data is copied into it, the applications run as before. What we are doing now is letting Hive run directly on Spark and Spark run on K8s, so Hive programs can move to K8s without extensive migration work, allowing a smooth migration to the K8s cluster.
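Running Spark on K8s instead of Yarn is, at the submission level, mostly a change of master URL plus a few `spark.kubernetes.*` settings. The sketch below assembles such a command in Python; the API server URL, image name, jar path, and class name are placeholders, and real deployments also need service accounts and image registries that are omitted here.

```python
# Sketch: pointing spark-submit at Kubernetes instead of a Yarn
# ResourceManager. The --master, --deploy-mode, and spark.kubernetes.*
# options are standard spark-submit flags; the concrete values below
# (API server URL, image, jar, class) are placeholders.

def spark_on_k8s_cmd(app_jar, main_class, image, executors=2,
                     api_server="https://k8s-apiserver:6443"):
    """Assemble a spark-submit command line for cluster mode on K8s."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # K8s API server, not Yarn
        "--deploy-mode", "cluster",         # driver runs inside the cluster
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]

cmd = spark_on_k8s_cmd("local:///opt/app/etl.jar", "com.example.Etl",
                       "myrepo/spark:3.1")
print(" ".join(cmd))
```

The application jar and business logic are unchanged; only the target of the submission moves, which is what makes the migration path described above relatively smooth.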

InfoQ: What are the technical difficulties of cloud-native transformation of big data platforms?

A: There are still many technical difficulties, mainly because the big data components' support for K8s is not yet particularly mature, which is an open source problem. For example, Spark itself depends on many other open source components, some of which do not yet support K8s, and those that do support it at different versions. Each open source component claims K8s support, but put all of these K8s-enabled components together and you find version conflicts everywhere. Another problem is that K8s upgrades too quickly: it now ships roughly quarterly, so each big data component ends up supporting a different K8s version. In short, the ecosystem as a whole cannot yet support K8s in a coordinated way.

Although many big data vendors are working on K8s support, production use is still experimental; everyone is in an early stage. On the K8s side, support for stateful applications and for CSI storage has only matured in the past two years. These are the biggest technical difficulties, and also why everyone has begun migrating to K8s only in the past year or two.

InfoQ: How do developers choose technology?

A: If you already run Hadoop and want to stay ahead of the curve, you can start experimenting now. Run some non-critical workloads on the cloud-native stack first, gradually improve its stability, and then expand the cloud-native cluster step by step. Along the way, enterprises can use the K8s management system to keep refining the process. I do not recommend replacing Hadoop all at once, but you should have at least one test cluster to validate your business processes.

If you are a new enterprise, I strongly recommend building the cluster directly on K8s. New clusters are generally not too large, so big data components with cloud-native support generally cause few problems and work well. Adopting Hadoop and migrating later would be a hassle, and with cloud vendors' products a cloud-native big data platform can now be stood up very quickly.

InfoQ: What are the challenges beyond technology?

A: I think it is mainly a talent challenge. As a technologist you need to be able to spot industry trends. It is not about chasing the latest technology, but about choosing the right technology at the right time. As I said recently, if a company is still asking Hadoop questions in interviews, its technology can basically be considered a bit outdated.

Before DevOps there were dedicated operations engineers; with DevOps, developers control the entire process from development through testing to release. In the future, data scientists and data analysts will likewise control the whole process of data exploration, analysis, and presentation. The traditional ETL engineer's role was narrower, and demand for it is shrinking; companies will lean toward data analysts and data scientists, because the underlying layer has been standardized.

From our customers I also sense a shortage of data analysts who understand the business, and especially of analysts who understand both business and technology. Many companies are still building their big data platforms, so this may not yet be obvious. But as standardization progresses, the operational bar for big data will keep dropping, and enterprises will become more willing to adopt cloud-native big data platforms.

InfoQ: What are the main aspects of the cloud-native transformation of big data platforms?

A: Components are one part of the cloud-native transformation, but there are other tasks: CI/CD, log management, user management, monitoring, and so on. In the big data field, data quality, metadata, and the like also need management systems in the K8s environment. The benefit of K8s is that all applications are now released the same way, under one resource management system; but things like metadata management, data quality management, and workflow scheduling are not provided by K8s itself.

Previously Spark ran on Yarn, ETL ran on Hive, and SQL ran on MySQL; now all of these must move to K8s. K8s has become very important, and it requires declarative APIs to manage the whole cluster.

Many Silicon Valley companies have already turned ETL into code-managed artifacts, and Airflow has turned scheduling into code management. A broader trend is that K8s turns all cluster management into code management: once a big data platform migrates to K8s, it too can be managed as code, down to the pipeline level, and versioned with Git. So the cloud-native transformation of big data platforms is not just about components; it is also a great change in how development and management are done.
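The "cluster management as code" idea rests on reconciliation: desired state lives in version control as declarative specs, and a control loop moves the running cluster toward it. The sketch below is a toy version of what Kubernetes controllers (and the GitOps tools built on them) do; the service names and replica counts are made up.

```python
# Sketch of declarative, Git-managed cluster state: compare the desired
# state (as committed to a repo) against the actual running state and
# emit the scale-up/scale-down actions needed to converge. This mimics
# the reconciliation loop of a Kubernetes controller; names are
# illustrative.

def reconcile(desired, actual):
    """Compute (name, delta) actions bringing actual state to desired."""
    actions = []
    for name, replicas in desired.items():
        diff = replicas - actual.get(name, 0)
        if diff:
            actions.append((name, diff))        # +n scale up, -n scale down
    for name in actual:
        if name not in desired:
            actions.append((name, -actual[name]))  # deleted from the repo
    return sorted(actions)

# Desired state, as it might be committed to Git:
desired = {"hive-metastore": 1, "spark-history": 1, "airflow-web": 2}
# What is currently running:
actual = {"hive-metastore": 1, "airflow-web": 1, "old-etl": 3}

print(reconcile(desired, actual))
# [('airflow-web', 1), ('old-etl', -3), ('spark-history', 1)]
```

Because the desired state is plain text in Git, every change to the cluster gets a commit, a review, and a rollback path, which is exactly the management shift the paragraph above describes.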

InfoQ: Is there any comparison between traditional big data platforms and cloud-native big data platforms, which can be described in detail?

A: Snowflake is the most typical example of a cloud-native big data platform. Snowflake itself owns no storage or compute: its compute runs on K8s, its storage uses the various cloud vendors, and it sits dynamically in between, delivering a proprietary MPP engine to users through K8s.

Most cloud-native systems can separate storage and compute. With CSI, for example, the application on top can be killed while the CSI-backed storage remains, which naturally achieves storage-compute separation. When an application has no traffic it is stopped; when users arrive, resources are reallocated. This enables off-peak resource sharing and elastic scaling. A single resource pool can allocate resources uniformly, improving utilization, management efficiency, and overall operations, so the system runs more sensibly. A dozen years ago, a moderately large cluster needed dozens of PhDs to run; now an undergraduate can do it. That productivity gain comes from standardization.

InfoQ: What is the need for DataOps to do cloud-native transformation?

A: DataOps needs cloud-native transformation for two main reasons. The first is the standardization just mentioned: DataOps has to manage all the components, and if the components are not standardized, that is hard to do. Second, cloud native unifies the various products, such as Spark and Flink. DataOps spans five areas: data development and CI/CD, automated scheduling, data quality, data portals, and security with audit compliance, and all of these can only be achieved on a foundation of cloud-native standardization.
