Interviewee | Guan Tao
Editor | Tina
The rapid development of artificial intelligence is changing our world, especially for big data companies.
Driven by large language models, the futures of data platform leaders Databricks and Snowflake are being rewritten. Both companies emphasized the importance of large language models and AI capabilities at their recent launch events, aiming to meet users' data processing needs in an integrated way. At the same time, the arrival of large language models confronts enterprises with a new challenge: how to realize the full potential of LLMs on existing data platforms. Driven by this trend, traditional data platforms need corresponding extensions and optimizations, and Lakehouse, Cloud Device Technology's data platform, has emerged at just this moment.
On the data analysis side, Cloud Device Technology's integrated platform unifies stream computing, batch processing, and interactive analysis by introducing a new computing paradigm, incremental computing; across different analysis scenarios its performance is nine times faster than the batch engine Spark, while also surpassing the interactive analysis product ClickHouse. On the AI support side, the platform stores semi-structured and unstructured data and provides the corresponding AIOps, achieving BI+AI integration.
We had an in-depth conversation with Guan Tao about the evolution of well-known platforms such as Databricks and Snowflake, and about where computing platforms are heading. With AI becoming a first-class citizen, how does he see the impact of LLMs on big data enterprises? Has there been a fundamental change in how data is managed and processed? As an authority on databases and computing platforms, his insights will prompt us to think deeply and explore some of the possibilities for future computing platforms.
Interviewee:
Guan Tao (Tony), co-founder/CTO of Cloud Device Technology, is an expert in distributed systems and big data platforms. He was formerly a researcher in Alibaba's Cloud Computing Platform Division and head of MaxCompute and DataWorks, Alibaba's general-purpose computing platforms, responsible for Alibaba's mainline big data platform. He also led the computing platform field team of the Alibaba and Ant Group Technology Committee and the big data team of the Alibaba Cloud Architecture Group. Before returning to Alibaba Cloud, he worked in Microsoft's cloud computing and enterprise business unit for nine years, where he led and took part in building multiple hyperscale distributed storage and computing platforms, including Azure Data Lake, Cosmos/Scope, and Eririn. He has authored numerous conference papers and patents at home and abroad.
Guan Tao is the producer of the QCon Beijing 2023 track "From BI to BI+AI: Big Data Platforms under the New Computing Paradigm". The conference will be held at the Renaissance Beijing R&F Hotel from September 3 to 5, 2023.
What is the impact of AI on the big data industry?
InfoQ: In your 2021 interview, you gave a series of trend predictions for the data platform field, for example that "lakehouse integration is an emerging direction, but it is expected to become a new industry standard". Looking back two years later, which of those predictions have come true? Which haven't? And why?
Guan Tao: Two years ago, we made trend forecasts in four directions: a full spectrum from offline to real-time; the lakehouse as a new architecture; IoT data becoming a new growth point; and AI becoming a first-class citizen of databases and data platforms.
Now we can look back at which of those predictions were relatively accurate. First, the full spectrum from offline to real-time is a fairly clear direction (it is also Cloud Device's direction). Whether it is Delta, Hudi, and Iceberg in the storage field, or Databricks and Snowflake in real-time data processing, everyone is pursuing this direction, supporting more comprehensive coverage of stream, batch, and interactive capabilities rather than optimizing for a single mode.
The second is lakehouse integration. Two years ago it may have been an exploratory direction, but there is now much more practice in China combining the advantages of data lakes and data warehouses, and the integrated direction has been widely recognized. Especially with the rise of artificial intelligence, the advantages of data lakes have been amplified to stand on an equal footing with data warehouses, and pursuing the strengths of both has become a guiding principle.
Then there is IoT becoming the new hotspot, which I think was half right. With the rise of intelligent manufacturing and smart vehicles, large-scale data in these fields is becoming a new development direction and the largest increment in data generation. However, IoT data collection and processing is still in its early stages; after all, most enterprises have not deployed many IoT devices, and the related applications remain immature.
Finally, there is a clear trend: AI becoming a first-class citizen of databases and data platforms. Before the explosion of large language models, this trend was not yet obvious, and analytics was still the mainstream direction for data platforms. With the advent of large language models, people now see BI and AI as parallel capabilities, and even hold higher expectations for AI's potential. As a result, many platforms, whether Snowflake or Databricks, are claiming AI capabilities. In the near future, all platforms are likely to fill in these capability gaps, such as semi-structured data storage and vector retrieval; this direction is already clear.
InfoQ: LLMs are a hot topic at the moment. What changes do you think their arrival can bring to big data enterprises?
Guan Tao: The scope of the changes large models bring to enterprises is very wide. They can take over many jobs, such as data development, data tuning, and database administration (DBA). In customer service, roughly 70 to 80 percent of jobs have already been replaced by machines. Many repetitive tasks, such as basic budget management, preliminary technical verification, and even audit and financial work, can be completed to a certain extent with the assistance of large models, improving work efficiency.
On the other hand, the success of a large model depends on three elements: the model, data, and computing power. At present models are relatively homogeneous, computing power comes down to funding, and data has become the crucial factor. High-quality, domain-specific data makes a model more accurate. So an enterprise with a unique data advantage will have extra competitiveness as the era of large models arrives.
For example, Bloomberg previously released a large model called BloombergGPT focused on news and finance. Thanks to the rich data it has accumulated in this field, the resulting model is superior in knowledge depth and logical structure, allowing Bloomberg to provide more valuable services to its customers, which in turn brings more revenue.
In addition, large models excel at interaction. What we call a large model is really a language model, and what it is best at is interacting in natural language: you ask questions in language, and it answers in language. Interaction may well be large models' main application area, but within it they can make a great deal of difference.
So my answer to what changes large models bring to enterprises is that there are three main ones. First, they can greatly improve efficiency and can become a core strategy for big data enterprises. Second, high-quality, unique data combined with large models can bring additional core competitiveness. Third, they significantly lower the threshold for using a data platform (through natural language interaction with it), letting the platform break through its original limitations and open up to everyone. Executives, for example, may not write SQL or code, but through large models they can easily converse with the system. This shift can take an enterprise from only around 20 percent of people being able to use the data platform to everyone having access, and the efficiency gains are enormous, if not disruptive.
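To make that last point concrete, here is a minimal sketch of the natural-language-to-SQL pattern described above: a user asks a question in plain language, an LLM translates it into SQL against a known schema, and the platform executes the result. The schema, prompt, and call_llm function are all hypothetical stand-ins, not any specific vendor's API.

```python
# Minimal NL-to-SQL sketch. `call_llm` is a hypothetical stand-in for any
# chat-completion API; the schema and prompt are illustrative only.

SCHEMA = """
CREATE TABLE orders (order_id INT, region TEXT, amount REAL, order_date DATE);
"""

PROMPT_TEMPLATE = """You are a SQL assistant. Given this schema:
{schema}
Translate the user's question into a single SQL query. Return only SQL.
Question: {question}
"""

def question_to_sql(question: str, call_llm) -> str:
    """Build the prompt and ask the LLM for a SQL translation."""
    prompt = PROMPT_TEMPLATE.format(schema=SCHEMA, question=question)
    return call_llm(prompt)

# An executive with no SQL knowledge could then ask:
#   question_to_sql("What was total revenue per region last quarter?", call_llm)
# and the platform would run the returned query and render the result.
```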
InfoQ: Now that the GPT wave has arrived, for an ordinary enterprise that wants to combine its data with LLMs and unlock the data's value, what is the most important work the traditional data platform needs to complete? Why?
Guan Tao: Logically, AI needs to be integrated into the data platform as a core capability. Many data platforms used to have analytics, or BI, as their sole design goal; they now need to treat AI over data as a first-class citizen. This is a big shift: the data platform architecture needs a further upgrade as it extends from BI to BI+AI.
Specifically, the first aspect is storage. It requires additional support for processing semi-structured and unstructured data; in data management, it needs so-called "heterogeneous" data management capabilities, covering the unified management of unstructured and semi-structured data; and it must keep data open, supporting multiple engines.
The second aspect is computation, which needs to support the basic functions large models require. This involves detailed techniques such as vector storage and vector retrieval. It also involves processing unstructured data, such as image recognition and data cleaning, as well as large-scale fine-tuning at the computation level.
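As a concrete illustration of the vector retrieval primitive mentioned here, the sketch below finds the stored embeddings most similar to a query embedding by cosine similarity. It is a toy in-memory version of what a vector store does at scale; the array shapes are assumptions.

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, index: np.ndarray, k: int = 5):
    """Return indices of the k stored vectors most similar to the query,
    by cosine similarity: the core primitive behind vector retrieval."""
    # Normalize rows so the dot product equals cosine similarity.
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = index_norm @ q
    return np.argsort(scores)[::-1][:k]

# Toy usage: 1000 embeddings of dimension 384 (a common embedding size).
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 384))
query = rng.normal(size=384)
print(top_k_similar(query, index, k=3))
```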
The last point is to keep the architecture open, with a good plug-in system. The AI stack is still evolving rapidly and changing constantly, so enterprise infrastructure needs to remain flexible. The plug-in system itself can be realized through UDFs, function compute, or a dedicated pipeline management system.
For LLMs in particular, there are many components for building LLM applications, such as LangChain, vector databases, and LLM runtimes, and combining them makes it easy to build an end-to-end LLM service chain. Many new, easier-to-use LLMOps components are emerging, such as Lepton.ai and Xinference.
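The sketch below shows, under stated assumptions, how such components combine into the retrieval-augmented pattern: embed documents, retrieve the most relevant ones for a question, and pass them to the model as context. The embed and call_llm functions are hypothetical stand-ins for whatever embedding model and LLM runtime the stack actually uses.

```python
# Hypothetical end-to-end sketch: vector index + retrieval + LLM answer.
import numpy as np

def build_index(docs, embed):
    """Embed every document into a row of the vector index."""
    return np.stack([embed(d) for d in docs])

def answer(question, docs, index, embed, call_llm, k=3):
    """Retrieve the k most relevant docs and ask the LLM with that context."""
    q = embed(question)
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(scores)[::-1][:k])
    return call_llm(f"Answer using this context:\n{context}\n\nQ: {question}")
```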
Why do you need a new system?
InfoQ: What are the technical differences between Cloud Device and the popular open source products Spark/Flink/ClickHouse, or the SaaS-based Snowflake?
Guan Tao: Lakehouse covers the three typical scenarios of batch, stream, and interactive analysis with a single engine based on the incremental computing paradigm, and serves customers through a SaaS model similar to Snowflake's.
The three open source products in the question represent the three mainstream forms of computing in data analysis: batch processing, stream processing, and interactive analysis. The three are usually combined to form a more complete data analysis platform. This combination is the typical form in the open source world, known as the Lambda architecture.
The Lambda architecture has many problems, such as architectural complexity and inconsistencies in data storage, management, and semantics. Cloud Device's technical advantage is to break this composite architecture with a single system, unifying data storage, data management, user semantics, and the development experience, thereby improving efficiency and reducing costs.
Compared with Snowflake, the similarity comes first: both of us are based on a SaaS model, serving customers out of the box on the cloud. Unlike the open source model, users do not need to handle hardware purchases, deployment, or operations; SaaS users never worry about these things.
The difference is that Snowflake is still primarily a data warehouse, with relatively weak support for data lakes, though it has done some work on federated queries. Cloud Device is designed from the ground up as a native Lakehouse architecture, suited not only to data analysis but to other workloads as well. Snowflake, meanwhile, is mainly batch processing, with interactive analysis secondary and little to no streaming capability. Cloud Device is committed to unifying the three lines of stream processing, batch processing, and interactive analysis.
InfoQ: So did Cloud Device develop a new system from scratch?
Guan Tao: Yes, the whole system was developed from scratch. We introduced a new computing paradigm called incremental computing.
Integration has been the design direction we pursued from day one. Analyzing the three existing computing paradigms of batch processing, stream computing, and interactive analysis, each has its own optimization direction and design patterns, with different storage and compute representations, so none can substitute for another. The specific differences can be seen in the table below.
Therefore, we propose a fourth approach: incremental computing. We hope to unify the three traditional computing modes through incremental computing and eventually form an integrated engine.
InfoQ: Isn't there already an incremental lake-ingestion solution based on Flink?
Guan Tao: Yes, Flink was an early attempt at an integrated solution and put forward the slogan of "stream-batch unification", but there are not many production cases so far. The underlying reason is that stream processing and batch processing compute differently and rely on different storage systems.
In Cloud Device's solution, we unify the stream, batch, and interactive modes through one common computing method, and then support the entire storage layer with one set of common storage. This storage takes the form of the lakehouse's incremental storage, a kind of general-purpose incremental storage. It and the computing engines above it support each other: general-purpose incremental storage serves not only the unified data analytics engine built on incremental computing that we discussed earlier, but other AI engines as well. That is what we are aiming for and what sets us apart from other products.
InfoQ: Can you explain more specifically where the performance gains come from?
Guan Tao: The first is the capability of the base engine. The architectural choices for data analysis engines are now relatively settled: a vectorized engine, fully columnar storage, storage-compute separation, a cost-based optimizer, native code, and so on. All of this is reflected in our product (whose main implementation language is C++). These features guarantee a high baseline of performance for our engine.
But we regard the techniques above as state of the art rather than innovation; they are our baseline capability. The innovation comes from the following directions.
The first is the incremental computing just mentioned, which I consider a key direction. With incremental computation, we can remember the parts that have already been computed instead of computing them again. These previously computed parts can often be represented as materialized views or result caches, which users' queries can reference transparently. For example, when a user issues a query, if the already-computed results are preserved and only the newly added parts are computed, engine performance improves greatly.
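As a toy illustration of this idea (the names are illustrative, not Cloud Device's actual API), the sketch below maintains a materialized running aggregate and folds in only newly arrived rows, instead of rescanning the full table on every query.

```python
class IncrementalSum:
    """Materialized view of SUM(amount), maintained incrementally."""
    def __init__(self):
        self.total = 0.0      # result computed so far
        self.rows_seen = 0    # high-water mark into the source table

    def refresh(self, table):
        # Only the delta since the last refresh is processed.
        new_rows = table[self.rows_seen:]
        self.total += sum(r["amount"] for r in new_rows)
        self.rows_seen = len(table)
        return self.total

table = [{"amount": 10.0}, {"amount": 5.0}]
view = IncrementalSum()
print(view.refresh(table))        # 15.0: the full scan happens only once
table.append({"amount": 2.5})     # new data arrives
print(view.refresh(table))        # 17.5: only the one new row is read
```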
In addition, we have a technology called "AI4D", which optimizes data storage and computation through learned methods. For example, if you often join two tables and those computations repeat, they can be precomputed; when the precomputed results match your query, they are returned directly. This is also a form of incremental computation, but with an intelligent data computation and preparation process added: a learning-based process, automatically optimized by AI. With this automated optimization, performance can also improve greatly, and the optimization is transparent to the user. You can think of it as the data platform's autopilot.
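The sketch below is a rough, hypothetical rendering of that precomputation idea: watch the workload, notice a join signature that recurs, cache its result, and answer matching queries from the cache. A real system would learn these decisions from workload statistics; here a simple frequency threshold stands in for the learned policy.

```python
from collections import Counter

class JoinCache:
    """Toy stand-in for AI4D-style workload-driven precomputation."""
    def __init__(self, threshold=3):
        self.freq = Counter()   # how often each join signature has been seen
        self.cache = {}         # signature -> precomputed result
        self.threshold = threshold

    def execute(self, signature, compute_join):
        self.freq[signature] += 1
        if signature in self.cache:
            return self.cache[signature]      # served transparently
        result = compute_join()               # fall back to real computation
        if self.freq[signature] >= self.threshold:
            self.cache[signature] = result    # "precompute" hot joins
        return result
```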
To sum up, I think our engine implementation has almost reached the best level in the industry today, and that is the foundation. The greater improvement potential is concentrated in two areas: first, incremental computing, an innovation in the computing paradigm; and second, AI4D's automatic optimization. Both can greatly improve performance and hold good potential.
InfoQ: Is such an architecture already in production, and what is the actual effect?
Guan Tao: Yes, we have already applied it to some customers.
Our product has several main selling points that customers recognize. First, many enterprises consider a lighter SaaS architecture a good choice. Customers believe that the open source self-built architecture no longer brings technical advancement or differentiation, and self-building requires heavy asset investment in hardware and teams. In contrast, Cloud Device's lightweight, multi-cloud, cloud-neutral SaaS model is more attractive, and many customers choose us precisely for this reason.
The second is performance. Whether batch, interactive, or stream processing, we achieve significantly higher performance than existing systems. In batch processing, for example, we are nine times faster than Spark. In interactive analysis, we may also be somewhat faster than ClickHouse, the best product on the market. These gains are critical for many customers, especially when they reach whole multiples.
Finally, many customers are interested in our attempt to solve the series of problems that come with Lambda's assembled architecture; this is our core breakthrough, and they consider it a genuine technical innovation. They know firsthand, from their current architectures, the problems of combining several different compute engines. Cloud Device unifies the data analysis platform through an integrated engine, so users can switch flexibly between computing paradigms when their business needs change, which helps them greatly. For example, at a well-known domestic manufacturer of intelligent new energy vehicles, POC test results showed that the Cloud Device platform achieves full-link real-time processing at very low cost, and they were very satisfied with the result.
The data platform in the era of large language models
InfoQ: BI and AI/ML are gradually converging, and some enterprises want to provide a one-stop service. But starting from the database perspective is advantageous for data management, while starting from the lakehouse is more conducive to machine learning. What are the main challenges in combining the advantages of both in one platform?
Guan Tao: I think the main challenges come from the following aspects.
The first is the balance between system decoupling/openness and high performance. As I mentioned earlier, many data warehouses are one-to-one storage-and-compute systems whose storage is specially optimized for the compute layer above it to achieve high performance. But if we want to support many different types of workloads, such as a storage system that serves both analytics engines and AI engines, the decoupling and openness of storage and compute become critical.
The challenge is to decouple and open up while keeping performance high. Achieving this decoupling between modules while pursuing high performance is a genuinely difficult balance. That is the first aspect.
Another challenge is linking the two computing modes. SQL is the mainstream language of data analysis, Python the most popular in AI, and programming conveniently across the two systems is a key challenge. SQL ML, SQL with embedded Python UDFs, Python's SQLAlchemy library, and native Python interfaces are all options.
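One concrete form of this SQL-Python linkage is registering a Python function as a SQL UDF. The sketch below uses the standard library's SQLite driver purely as a self-contained illustration; a lakehouse engine would expose its own UDF mechanism, and the sentiment function is a made-up stand-in for a Python ML model.

```python
import sqlite3

def sentiment(text: str) -> float:
    """Stand-in for a Python ML model scoring a text column."""
    return 1.0 if "good" in text.lower() else -1.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (body TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?)",
                 [("Good product",), ("Terrible support",)])

# Register the Python function so SQL can call it by name.
conn.create_function("sentiment", 1, sentiment)

# SQL drives the scan; Python supplies the per-row logic.
for row in conn.execute("SELECT body, sentiment(body) FROM reviews"):
    print(row)
```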
The final challenge is the new AI-oriented data pipeline. The full pipeline for data analysis and BI is relatively mature, with clear patterns for data integration, ETL/ELT, modeling, analysis, and BI. The AI pipeline is being rebuilt, with components and patterns that differ from BI's. This part is new to the industry, and many frameworks and platforms are currently being tried.
InfoQ: To support BI+AI/ML and even LLMs, does the data platform need to gradually support OLAP, OLTP, streams, graphs, and vectors? With so many kinds of compatibility, what do you think a better solution would look like?
Guan Tao: Integration has the advantage of natural simplicity and is the "holy grail" of this technology; the industry has never stopped exploring it.
If we divide the data domain into three broad directions, OLTP, OLAP, and AI, I think the typical scenarios in the OLAP data analysis field are basically fixed, there is a clear industry consensus on the problems of the Lambda architecture, and an integrated architecture unifying all analytical workloads is the future direction. This is also what Cloud Device is working toward. From our exploration and practice so far, the three computing paradigms of stream processing, batch processing, and interactive analysis can indeed be unified.
The integration of OLTP and OLAP, HTAP, is also an industry direction. Some products are working on it, and there are many customer deployment scenarios.
OLAP+AI integration is currently a hot spot; the overlap between these two types of data and the demand for linking them are strong enough. Databricks has always focused on this direction, consistently adhering to its Data+AI strategy. Snowflake, starting from the OLAP field, has recently accelerated its efforts to support AI as well, for example with Snowpark.
As for Cloud Device Technology's own positioning, it is to unify the three computing paradigms within OLAP with a single engine, while supporting AI capabilities through the Lakehouse architecture, SQL and Python hybrid programming, and pluggable AIOps support.
InfoQ: As data platforms grow more complex under the requirement of "AI compatibility", what are the main aspects to consider when judging the merits of a platform?
Guan Tao: Personally, I think it can be evaluated in the following ways.
The first is full-spectrum data: whether the storage platform can store and manage global data. Like the lakehouse integration mentioned earlier, this is a clear direction: integrating the data lake and the data warehouse, plus unified global data access, while staying open.
Second, whether it can support data analysis and other computing paradigms at the same time, serving both the SQL engine and the AI engine well.
Then there is the system's scalability. By scalability I mean that, in the face of future changes, other modules can be integrated quickly in a plug-in way, which tests resource scheduling and the overall system design. One suggestion here is to design with cloud patterns, which makes this goal easier to reach, because the cloud model excels in resource elasticity and module richness.
InfoQ: Looking two years ahead, what do you see changing in the computing platform landscape? What are the trends?
Guan Tao: I think there will be the following trends.
First, the accelerating explosion of data. IoT data, plus agent data, will become new drivers of data growth. The background is this: the first wave of data growth came from databases, such as billing and report data, small in volume but valuable to institutions such as banks. The second wave occurred mainly in the big data field, as much of people's behavioral data came to be recorded, such as what you buy on Taobao and what content you view; this behavioral data was ultimately turned into services such as user profiling and personalized recommendations, and it has been the core driver and source of data growth over the past 20 years. The third wave of growth comes from human behavioral data plus device data, such as cameras on vehicles and smart switches in homes. With the rise of AI, many intelligent robots will also emerge and be widely used across industries, and the data they generate will be collected automatically, forming the third wave of data growth.
In parallel, there is another growth point: the major enhancement of semi-structured data processing brought by large models and deep learning, and this kind of data will keep emerging. So the explosion of data remains an important trend.
Second, data analysis architectures will tend toward unification. In the data analysis field, everyone may eventually move toward incremental computing, gradually breaking the limitations of the Lambda architecture; the integrated architecture is the future. Just as we predicted two years ago that lakehouse integration would become the future, we hope the integrated architecture will truly land within two years.
Third, large language models bring a significant boost to semi-structured and unstructured data processing. Working with this data used to be very difficult but has now become relatively easy; a PDF whose contents were once hard to extract is now much easier to handle. Where we could previously process only structured data, two more categories, semi-structured and unstructured data, are now within reach. This leap in processing ability will drive a significant increase in demand for storage and compute.
Fourth, with the arrival of large language models, data exchange and privacy protection will receive more investment. Security and privacy requirements for data have risen further, and the need for data sharing has become more urgent, because data is essentially knowledge, and that knowledge can increase intelligence and turn into implicit value. The balance between privacy protection and data sharing therefore becomes an important issue, especially as large models are applied, which may bring significant changes; it is not yet clear how to respond to this challenge. For example, many enterprises, especially in the United States, do not allow the use of public large language model services, mainly out of concern that interacting with the models may leak private information from inside the enterprise. Yet once a model is deployed privately, its knowledge may be limited, without access to anything learned from interacting with the outside world. Data exchange and privacy protection have therefore become particularly critical and may be a development trend in the future.
Fifth, BI+AI has become a mandatory option for data platforms, which need built-in or pluggable support for heterogeneous data, fine-tuning, vector retrieval, and other AIOps technologies. AI makes every platform intelligent, so the intelligence of data platforms is inevitable. Data platforms that significantly lower the threshold for use will be used by more people, which adds a further requirement: making the platform foolproof for non-experts.
Further reading:
Incorporating large models into every aspect of work: data giant Databricks democratizes generative AI (an interview with Li Xiao)
On the eve of a great change in the computing paradigm, Cloud Device Technology released Lakehouse, a multi-cloud, integrated data platform
Cloud Device Technology announced the completion of hundreds of millions of yuan in financing to build a multi-cloud, integrated data platform
"Revisiting Data Architecture" and Cloud Device Technology Product Launch
Event Recommendations
With "Set Sail· AIGC Software Engineering Transformation", the QCon Global Software Development Conference Beijing will be held on September 3-5 at Renaissance Beijing R&F Hotel, which plans from BI to BI+AI, big data platform under the new computing paradigm, big front-end new scenario exploration, big front-end integration and efficiency improvement, large model application landing, AI-oriented storage, R&D efficiency improvement under the AIGC wave, LLMOps, heterogeneous computing, microservice architecture governance, business security technology, Programming languages for building future software, FinOps and more with nearly 30 exciting topics.
For inquiries about ticket discounts, please contact the ticketing manager at 18514549229 (WeChat and mobile number). The full QCon Beijing schedule is available online; we look forward to meeting developers on site.