Chinese and British Database Experts on the Past, Future, and Present of Databases

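Summary: What is a database? Will future data be stored in DNA? What is a data lake? On January 16, the Sweeper Monk hosted a live stream with my colleague Fengshen, a senior database expert, and Thomas Heinis, a senior lecturer at Imperial College London. The two had a fairly in-depth discussion about databases, and what impressed me most were frontier ideas such as storing big data in DNA. Below is the full conversation. Because the British scholar spoke in English, the whole text has been translated between Chinese and English. I hope this transcript helps fellow enthusiasts of frontier science.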

主持人:大家好,歡迎來到阿裡達摩院掃地僧的直播間,我是掃地僧的小助理。今天,我們的自動駕駛機器人小蠻驢帶大家逛了一下阿裡雲飛天園區,一路沒人講話,不知道盆友們有沒有看急了,那接下來我們就聊聊天。

Moderator: Hello everyone, and welcome to the live stream studio of the Sweeper Monk at Alibaba DAMO Academy. I am the Sweeper Monk's assistant. Today, our autonomous robot Xiaomanlv (Little Competent Donkey) took you on a tour of the Alibaba Cloud Apsara Park. Nobody spoke along the way, so I wonder whether you were getting impatient. Now let's have a chat.

這次陪我們聊天的是掃地僧的老同僚封神。

Joining us for the chat today is Fengshen, a longtime colleague of the Sweeper Monk.

封神(阿裡雲智能 雲原生資料湖分析DLA 技術負責人):大家好,我叫封神,來自阿裡雲資料庫團隊,09年加入阿裡,目前主要做資料庫資料湖分析方向,主要負責雲原生資料湖分析DLA的技術,之前也做了10年左右的資料庫與大資料相關的事。

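Fengshen (Technical Lead, Cloud-Native Data Lake Analytics (DLA), Alibaba Cloud Intelligence): Hello everyone, I am Fengshen from the Alibaba Cloud database team. I joined Alibaba in 2009. Currently I work mainly on data lake analytics and am responsible for the technology behind cloud-native Data Lake Analytics, known as DLA. Previously, I spent about 10 years working on databases and big data.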

主持人:直播間還請到了1位來自遠方的客人,Mr. Thomas Heinis,他是帝國理工學院的講師,請客人自我介紹。

Moderator: We are also honored to have with us here Mr. Thomas Heinis, lecturer at Imperial College London. Mr. Heinis, please kindly introduce yourself.

托馬斯·海尼斯(帝國理工資料庫專業進階講師):是的,當然。 我叫托馬斯·海尼斯,現在是帝國理工學院的進階講師。我在研究小組做研究,我們研究小組基本上負責一切與資料有關的研究工作。我從事大量的資料分析和資料可視化工作,目前也負責資料存儲工作,包括所有的新技術,確定我們未來能夠有效地分析和了解資料。

Thomas Heinis (Senior Lecturer in Databases, Imperial College London): Yes, sure. My name is Thomas Heinis. I am a senior lecturer at Imperial College London at the moment. I do research here in a research group that basically takes care of everything to do with data. So I do a lot of data analytics, also data visualization, and at the moment also data storage, including all the new technologies, so that we can analyze data efficiently and understand it in the future.

主持人:既然請到2位資料庫領域的專家,我們今天讨論的主題肯定離不開資料庫。資料庫對我們這些非專業人士而言,可能最常聽的詞是删庫跑路,而對資料庫最直覺的了解就是Excel表格。先給我們的觀衆介紹一下什麼是資料庫。請問Mr. Heinis,你給學生上第一堂課時如何介紹?

Moderator: Since we have invited two experts in the field of databases, today's discussion will naturally revolve around databases. For non-professionals like us, the phrase we hear most often is probably "delete the database and run away", and our most intuitive picture of a database is an Excel spreadsheet. First, please introduce to our audience what a database is. Mr. Heinis, how do you introduce databases to your students in their first class?

托馬斯·海尼斯:我通常會給學生稍微介紹一下資料的由來,這與銀行密不可分。上個世紀六七十年代期間,銀行儲存了大量的資料。它們需要将這些資料組織起來,關系型資料庫應運而生。資料庫本質上是很多資訊和資料的集合,這些資料被組織起來,進而實作高效的分析、通路、管理和更新的目的。是以,計算機資料庫通常來自于資料檔案的導航,從傳統上而言,或者就曆史淵源而言,資料庫确實來自銀行,有很多關于銀行客戶的資訊以及他們賬戶餘額的資訊。

I usually explain a little bit of history about it. That's all to do with banks. Banks had a lot of data in the 60s and 70s. And they needed to organize that data. That's kind of where relational databases come from.

Essentially, what I tell students is that a database is a collection of lots of information, lots of data that are organized so that it can be analyzed, accessed, managed, updated very efficiently, right?

So computer databases typically grew out of navigating data files. Traditionally, or historically, databases do come from banks: there was a lot of information about bank customers and their account balances.

And within databases, we have this big branch of technology called relational databases, where we organize data into tables, rows, and columns that contain information about customers, clients, transactions, sales, and all kinds of very well structured information.

而在資料庫裡面,有一個龐大的技術分支,叫做“關系型資料庫”,其實就是我們把客戶、交易、銷售等資訊以高度結構化的形式組織到表、行、列當中。而有些同學一開始就已經聽說過SQL,也就是用于查詢資料庫的查詢語言。最近幾年,情況有所變化。但總的來說,關系型資料庫确實是在上世紀六七十年代期間由銀行的應用案例所驅動的。過去二十年間,情況發生了巨大的變化。因為我們有了需要組織資料的新型應用,比如科學應用,或者是社交網絡等類型的應用。

基本上,這些應用需要略微不同的資料庫。是以,後來我們轉向了NoSQL資料庫或非關系型資料庫,有了不同的用例,也開始管理更多的資料,也就是現在所說的大資料。簡單地說,我們收集了海量的資料,需要用資料庫來分析和存儲這些資料。

And some of the students have, from the start, already heard about SQL, which is the query language used to query a database.
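
To make the idea of tables, rows, columns, and SQL concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `accounts` tables, their columns, and the sample values are hypothetical; they were chosen only to echo the banking use case mentioned above and are not taken from the talk.

```python
import sqlite3


def build_bank_db() -> sqlite3.Connection:
    """Create a tiny in-memory relational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " balance_cents INTEGER NOT NULL)"  # integer cents avoid float rounding
    )
    # Each tuple becomes one row; each position maps to one column.
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Alice"), (2, "Bob")],
    )
    conn.executemany(
        "INSERT INTO accounts (customer_id, balance_cents) VALUES (?, ?)",
        [(1, 12000), (1, 3000), (2, 7500)],
    )
    conn.commit()
    return conn


def total_balances(conn: sqlite3.Connection) -> list:
    """SQL query: total balance per customer, joining the two tables."""
    return conn.execute(
        "SELECT c.name, SUM(a.balance_cents)"
        " FROM customers c JOIN accounts a ON a.customer_id = c.id"
        " GROUP BY c.id, c.name ORDER BY c.name"
    ).fetchall()


if __name__ == "__main__":
    for name, cents in total_balances(build_bank_db()):
        print(f"{name}: {cents / 100:.2f}")
```

The query describes what result is wanted (a total per customer) and leaves it to the database to decide how to compute it, which is a large part of why SQL became the standard interface to relational systems.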

In recent years, things have changed a little bit. Relational databases do come from the 60s and 70s, driven by this banking use case, this banking application.

And in the last 20 years, things have changed drastically, right?

Because we have new applications that need to organize the data such as scientific applications, or the types of applications like social networks, and etc.

Basically, these applications require slightly different databases. That is where we went to NoSQL, or non-relational, databases, and we moved to different use cases. We also moved to managing much, much more data, what is nowadays called Big Data. Basically, we collect tremendous amounts of data, and then we need databases to analyze and store this data.

主持人:想問從事這個行業的封神老師,進入到現代,資料庫為何越來越重要?

Moderator: I have a question for Fengshen. Why have databases become more and more important in the modern era?

其實資料庫一直很重要,最為簡單的一條就是 資料不能丢。企業如果丢了最為核心的資料庫,則企業可能直接面臨破産。

Fengshen: Actually, databases have always been important. The simplest reason is that data must not be lost. If an enterprise loses its core database, it may face bankruptcy outright.

為什麼資料庫越來越重要 那是因為資料還蘊藏着寶藏,之前資料存着也就存着,一般存核心的資料,如交易、客戶、商品的資料,一些日志資料存就是了查詢問題。也不會去埋點,更加談不上去爬取或者購買資料了。

Why are databases more and more important? Because data also hold hidden treasure. In the past, data were simply stored and left there. Generally only core data were kept, such as transaction, customer, and product data, and some log data were kept just for troubleshooting. Nobody instrumented applications with tracking points, let alone crawled or purchased data.

網際網路也經曆了好幾個階段,從開始的新聞門戶時代,到可以有互動的類似BBS、淘寶購物時代,到all in無線,到智能時代,産生的資料也越來越多。

The Internet has gone through several stages: from the news portal era in the beginning, to the era of interactive platforms such as BBS forums and Taobao shopping, to the "all in" on mobile era, and then to the era of intelligence. More and more data have been generated along the way.

-從資料量來看,IDC統計,2005年全球資料是130EB,2019年為41ZB,漲了322倍;

-According to statistics from IDC, the global volume of data was 130 EB in 2005 and 41 ZB in 2019, roughly a 322-fold increase;

-從資料應用來看,越來越多的公司也使用資料做出了出色的成績,阿裡、頭條、百度、滴滴等Top100知名網際網路企業都是資料驅動的企業。我們再來看傳統的産業,有智慧園區、城市大腦、縣域大腦、智慧農業、智慧城市、智慧醫療、工業4.0等等,也都在使用資料技術在賦能各個産品,幫助這些産業數字化轉型,提升效率。在産業應用看,要充分發揮想象的空間,使用資料賦能産業轉型成長。

-From the perspective of data application, more and more companies have achieved outstanding results with data. Well-known Top 100 internet companies such as Alibaba, Toutiao, Baidu, and DiDi are all data-driven enterprises. In traditional industries, data technology is used in smart parks, city brain, county brain, smart agriculture, smart cities, smart healthcare, and Industry 4.0 to empower products and help these industries achieve digital transformation and higher efficiency. In industrial applications, there is plenty of room for imagination in using data to drive transformation and growth.

-從國内形勢看,國家也提出了新基建,核心是以 雲計算、大資料、人工智能、5G、區塊鍊為核心,這些核心中的核心是資料的應用,再過10年,真的是萬物互聯的時代,資料量增長的速度會更加快;在疫情時代,有相關機構研究表明,疫情讓數字化轉型快了5年左右。

-Looking at the domestic situation, the Chinese government has put forward the concept of new infrastructure, whose core technologies are cloud computing, big data, artificial intelligence, 5G, and blockchain. At the core of these core technologies is the application of data. In another 10 years we will truly be in the era of the Internet of Everything, and data volume will grow even faster. Research by some institutions also suggests that the COVID-19 pandemic has accelerated digital transformation by about 5 years.

-從高校和研究機構看,國内高校專業增設最多是 大資料技術、人工智能的專業;總之,21世紀除了人才,還有什麼最貴,那就是資料,資料相當于20實際的石油,是21世紀整個社會效能運轉的潤滑劑。因為資料越來越重要,是以資料庫越來越重要,資料庫是這一切的核心載體。

-From the perspective of universities and research institutions, the programs most often added by Chinese universities are big data technology and artificial intelligence. In short, what is most valuable in the 21st century besides talent? Data. Data are the equivalent of oil in the 20th century, the lubricant that keeps the whole of 21st-century society running efficiently. Because data are increasingly important, databases are increasingly important: databases are the core carrier of all of this.
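
The growth figure quoted in the first point above (130 EB in 2005, 41 ZB in 2019, "322 times") can be checked with a short calculation. The sketch below assumes only those two numbers. The talk does not say which unit convention was used, so both are shown; the quoted figure appears to match the binary convention of 1 ZB = 1024 EB.

```python
EB_PER_ZB_BINARY = 1024   # 1 ZB = 1024 EB (binary convention)
EB_PER_ZB_DECIMAL = 1000  # 1 ZB = 1000 EB (SI convention)


def growth_factor(start_eb: float, end_zb: float, eb_per_zb: int) -> float:
    """Ratio of the later volume to the earlier one, both expressed in EB."""
    return end_zb * eb_per_zb / start_eb


if __name__ == "__main__":
    print(growth_factor(130, 41, EB_PER_ZB_BINARY))   # about 322.95
    print(growth_factor(130, 41, EB_PER_ZB_DECIMAL))  # about 315.38
```

Either way, the volume of data grew by a factor of more than 300 in 14 years.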

主持人:中國和英國的對大資料的定義有什麼不同?這會導緻雙方程式員對資料庫的了解不同嗎?

Moderator: What is the difference in the definition of big data between China and the U.K.? Will such differences lead to different perceptions of databases among programmers in the two countries?

封神:大資料其實我認為沒有準确的定義。比如,如果資料量比較小,但是訓練用的機器比較多,也可以認為是用到了大資料技術。我一般認為用資料驅動業務發展就屬于使用了大資料相關的技術。之前國内提Big Data(中文指:大資料)比較多,現在國外提Data Lake(中文指:資料湖)的概念比較多,主要還是雲公司在主導,資料很多存在了對象存儲上。阿裡資料庫團隊提的是庫、倉(Data Warehouse)、湖(DataLake)、多模(Multi-Model),并且我們還專門做了一個 雲原生資料湖分析DLA的産品。另外,我們看到著名的咨詢公司Gartner把大資料報告合并到了資料庫報告裡面。

Fengshen: I don't think there is a precise definition of big data. For example, if the amount of data is small but many machines are used for training, that can still be considered a use of big data technology. I generally consider any use of data to drive business growth to be a use of big data-related technology. Big Data used to be the term heard most often in China; nowadays the Data Lake concept is heard more often abroad, mainly led by cloud companies, with much of the data sitting in object storage. Alibaba's database team talks about the database, the Data Warehouse, the Data Lake, and Multi-Model, and we have built a dedicated product, cloud-native Data Lake Analytics (DLA). In addition, the well-known advisory firm Gartner has merged its big data report into its database report.

我認為資料庫包括傳統的資料庫技術(如MySQL、PG); 也包括資料倉庫、資料湖的技術,如開源的Spark、Hadoop,阿裡的ADB、DLA等,也包括最近流行的LakeHouse技術。

In my view, databases include traditional database technologies (such as MySQL and PG) as well as data warehouse and data lake technologies, such as open-source Spark, Hadoop, Alibaba’s ADB and DLA. They also include the Lakehouse technology which has been quite popular these days.

我跟英國的程式員交流比較少,跟北美有一些交流,整體應該了解差不多。技術傳播的速度也比較快,大家了解應該比較類似。由于中國的市場比較大,由于某一些原因,資料也比較多,這些會加快對資料的應用的發展。

I haven't had much contact with programmers in the UK, but I have had some with programmers in North America. Overall, our understanding should be about the same: technology spreads quickly, so people's understanding tends to be similar. Because the Chinese market is large and, for various reasons, there is also more data, the development of data applications will be accelerated.

Thomas Heinis: On some level, yes. I think this is a long story. But if you look at it from a technical perspective, the types of data are going to be the same, or very similar. What I do think is that the scale is massively different, just because there is so much more data available in China.

某種程度上,會的。這一點說來話長。但我認為,從技術角度而言,資料的類型可能相同,或者非常相似。但資料的規模卻差異巨大,這主要是因為在中國,可擷取的資料更多。

And personally, I think it's because the society is a little bit more technologically advanced, or... I think China adopts technology more easily, which means that, for example, you have more sensors everywhere for traffic measurements. And Fengshen mentioned DiDi, right, which produces the same kind of data as Uber, for example, but on a different scale. And that applies to everything.

我個人認為,這是因為相比之下,中國社會的科技更為發達,并且中國更易采用技術。這就意味着,例如,中國的路面上會有更多的傳感器來監測交通情況。又如,滴滴打車軟體所産生的資料和Uber所産生的資料一樣,但是規模卻不同。其他領域也是如此。

And what he said about the pandemic being a catalyst for this transformation is also absolutely true. We make way more electronic payments now, everything is just more digital, and I think a lot of it will stay digital. That produces data that we need to analyze.

剛剛他提到說新冠肺炎疫情是推動這一轉型的催化劑,這一點毋庸置疑。比如,我們現在使用電子支付的頻率比以前高了很多,生活的方方面面都更加數字化了,我認為數字化的趨勢會持續下去,這就産生了需要我們分析的資料。

So what I mentioned about China having a bit more technological affinity means that we basically have more places where we collect data: many more sensors, and people interacting online, which produces more data.

我剛提到說中國更親近技術,這就意味着在中國,資料的來源更多,有更多的傳感器,網絡使用者互動所産生的資料也更多。

And then also, you know, China's population is huge. So that also means that more data is being produced. So it's, I would say, you know, there are two technical challenges when it comes to big data or databases in general.

此外,中國人口衆多,因而産生的資料也更多。是以,我認為,大資料或資料庫主要面臨兩大技術挑戰。

One is the data formats. That is changing: as mentioned, it is also going towards data lakes, with loads of different formats. But I don't think that differs much between China and the rest of the world.

一是資料格式上的挑戰,其發展趨勢是走向資料湖等不同類型的格式。但在這一點上,中國和世界其他地方的差別不大。

But what is really different is the amount of data. That means that whatever we develop (the analysis, the visualization, the storing of the data) needs to happen efficiently on a much, much larger scale, which then again brings in a lot of technical challenges as well. So I think that's the difference. But I think the definition as such is roughly the same.

中外真正的差別在于資料量。具體而言,中國需要更大規模和更高效地進行資料分析、可視化和存儲資料等任務。這又帶來了很多技術挑戰。我認為中外在資料庫領域的差別就在于此,但兩者對資料庫的定義基本相同。

I also have the feeling that the Chinese people have a different understanding of personal data as well.

此外,我認為中國人對個人資料的了解與外國也不同。

It's very, very difficult for us to get data from companies here; there's always a perception that this breaches data privacy. Whereas in China (a peer of mine works a lot with Chinese universities as well) we get a lot of data. I don't think it's from DiDi, but some similar data sets, and it's quite easy. So I feel like China really has this greater affinity towards technology, and more of a "let's see what we can do with the data, let's see if we can improve things" attitude. We're also collaborating on a traffic optimization project where they collect massive amounts of data about which vehicle passes through which road, where, and at what speed. What's the congestion level? Can we remove traffic? All these kinds of things. It's really a very pragmatic approach to using data: what can we do to improve, you know, everything really. That's what data does these days.

在英國或者其他國家,從公司擷取資料非常困難,因為外國企業總是擔心侵犯資料隐私。相對而言,從中國企業或者院校擷取資料容易得多。是以,我們擷取了很多資料,可能不是直接來自滴滴公司,但資料集也比較類似。總地來說,在中國,資料的擷取更為容易。我感覺中國更親近技術,更願意使用技術,更願意利用技術來改善現狀。我們正在與中方合作一個交通優化的項目,他們收集了大量的資料,包括經過道路的車輛資訊,車輛的位置,車輛的行駛速度,路面的擁堵程度,是否可以消除擁堵等。這種資料使用真的非常務實,相當于使用資料來改善一切可改善之處。這便是資料在當今社會所發揮的作用。

主持人:随着大資料這個概念的出現,資料庫是怎麼進化的,請封神老師講講?

Moderator: With the emergence of the concept of big data, how have databases evolved over the years? Fengshen, please share your view on this question.

封神:資料庫怎麼進化,先看看資料庫怎麼來的。從廣義看,資料記錄的曆史早就有了。在5000年前,人類開始用繩結計數;在2000年前有紙張,到1946第一台計算機的誕生;計算機的誕生後,才有了現代意義的資料庫。為了形象說資料庫是什麼,好比 你有一個管家,管家有一個記賬本,你每天花費多少錢,收入多少都會告知管家,管家記錄下來。你就可以知道你目前多少錢,每個月花費多少錢;資料庫就是管家+賬本;管家提供計算力,賬本提供存儲;

To figure out how databases have evolved, let’s first look at how they came into being. In a broad sense, data recording dates back to a long time ago. Back 5000 years ago, humans began to count with knots; 2000 years ago, paper started to be used. And in 1946, the first computer was invented. After the invention of the first computer, the databases in the modern sense came into being. Let me use an analogy to explain what a database is. Suppose you have a housekeeper, and the housekeeper has a ledger. You inform the housekeeper how much you spend and how much you earn every day. The housekeeper records your income and expenditure in the ledger accordingly. You can then know how much money you currently have and how much you spend each month. In this case, the database is the housekeeper + the ledger: The housekeeper provides the computing power, while the ledger provides the storage.
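
The housekeeper and ledger analogy maps neatly onto code. In the minimal sketch below, the `Ledger` class only stores entries (storage) and the `Housekeeper` class answers questions by reading them (compute). Both class names, the month strings, and the amounts are hypothetical illustrations, not anything described in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Storage: an append-only list of (month, amount) entries."""
    entries: list = field(default_factory=list)


class Housekeeper:
    """Compute: records entries and answers questions by reading the ledger."""

    def __init__(self, ledger: Ledger):
        self.ledger = ledger

    def record(self, month: str, amount: int) -> None:
        # Positive amounts are income, negative amounts are spending.
        self.ledger.entries.append((month, amount))

    def balance(self) -> int:
        """How much money you currently have."""
        return sum(amount for _, amount in self.ledger.entries)

    def spending_by_month(self) -> dict:
        """How much you spent in each month."""
        totals = defaultdict(int)
        for month, amount in self.ledger.entries:
            if amount < 0:
                totals[month] += -amount
        return dict(totals)


if __name__ == "__main__":
    keeper = Housekeeper(Ledger())
    keeper.record("2021-01", 5000)
    keeper.record("2021-01", -1200)
    keeper.record("2021-02", -800)
    print(keeper.balance())
    print(keeper.spending_by_month())
```

Keeping the two roles in separate classes also hints at the separation of storage and compute that comes up later in the conversation.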

資料庫也發展經曆了很多階段,為了“記好賬”,資料庫也在不斷演進。我們一般根據把資料庫發展分為了4個階段:

The development of databases has gone through several stages. And in order to “keep good accounts”, databases have also been evolving. We generally divide the development of databases into 4 stages:

  • 1970~1990商業資料庫時代 收費時代
  • The two decades from 1970 to 1990 were the era of commercial databases, when databases were paid products.
  • 1990~2000 開源資料庫時代 開源時代
  • The decade from 1990 to 2000 was an era of open-source databases.
  • 2000~2015 網際網路浪潮 大資料時代(大資料計算、存儲、NoSQL)
  • The period from 2000 to 2015 was an era of Internet and big data (big data computing, storage, NoSQL)
  • 2015~現在 雲的浪潮 雲原生時代+AI
  • Finally, the era from 2015 to the present is an era of cloud technology and the widespread application of Cloud Native and AI technologies.

資料庫的發展跟幾個因素有關, 硬體的發展,需求; 硬體主要指 存儲、網絡、記憶體、CPU。存儲就是存資料,記憶體與CPU關系到計算力,網絡就是傳輸。

The development of databases is related to several factors, including the development of hardware and the market demand; hardware mainly refers to storage, network, memory and CPU; storage refers to data storage; memory and CPU are related to computing power; and network concerns transmission.

大資料這個詞語大概在10年前開始流行,大資料系統開始獨立于資料庫系統發展的,随着最近5年的發展,大資料相關技術又慢慢與資料庫技術結合回歸到資料庫的大家庭。比如,2020年,著名的咨詢公司Gartner把大資料報告合并到了資料庫報告裡面。最為典型的是 DalaLake的發展,融入了事務&MVCC的概念,NewSQL的發展,NewSQL也融合了分布式的理論,并且還有一個HTAP的方向在探索。目前資料庫領域分為TP、NoSQL、AP等領域。TP一般有單機、分布式、事務型的資料庫;NoSQL就相對散一些:寬表、圖、文檔、時序、時空等;AP有Data Warehouse、DalaLake領域。

The term big data became popular around 10 years ago, when big data systems started to develop independently of database systems. Over the last 5 years, big data technologies have gradually merged with database technologies and returned to the database family. For example, in 2020 the well-known research and advisory firm Gartner merged its big data report into its database report. The most typical examples are the development of the data lake, which has absorbed the concepts of transactions and Multi-Version Concurrency Control (MVCC), and the development of NewSQL, which has absorbed distributed systems theory; HTAP is a further direction under exploration. Currently, the database field is divided into segments such as TP, NoSQL, and AP. TP generally covers standalone, distributed, and transactional databases. NoSQL is more scattered: wide-column, graph, document, time-series, and spatio-temporal databases. AP covers the data warehouse and data lake fields.

在企業界,肯定是做看得見,并且在5年内能落地的事情。未來5年,資料庫領域核心發展方向是雲原生+分布式,具體講:Serverless、資料庫與大資料一體化、智能化、安全可信、軟硬體一體化、離線上一體化、多模資料處理。舉個例子,我負責做的雲原生資料湖分析DLA就是傳統大資料、Hadoop、Spark的更新,需要融合傳統資料庫技術,并且基于存儲與計算完全分離的雲原生架構。我們選用對象存儲,支援常見的消息、TP和NoSQ 資料庫系統資料的歸檔。我們一般歸檔到DalaLake裡面。還支援了一些事務、版本的東西,并且把Spark、Presto等元件做成雲原生的彈性、随時可用,即開即用,按需計費,分離後帶寬的損耗通過引入本地的Cache解決。

In industry, we naturally work on things that are tangible and can be put into production within 5 years. In the next 5 years, the core development direction of the database field is cloud native plus distributed, specifically: serverless, integration of databases and big data, intelligence, security and trustworthiness, hardware and software integration, offline and online integration, and multi-model data processing. For example, the cloud-native Data Lake Analytics (DLA) that I am responsible for is an upgrade of traditional big data technologies such as Hadoop and Spark. It needs to integrate traditional database technology and is built on a cloud-native architecture in which storage and compute are fully separated. We chose object storage and support archiving data from common message systems as well as TP and NoSQL database systems, generally into the data lake. DLA also supports some transaction and versioning features, and we have made components such as Spark and Presto cloud native: elastic, available at any time, ready to use immediately, and billed on demand. The bandwidth cost that comes with the separation is addressed by introducing a local cache.
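
The last point, a local cache in front of remote object storage, can be illustrated with a small sketch. The talk does not describe how DLA's cache actually works, so everything below is an assumption made for illustration: `ObjectStore` is a stand-in for a remote service, `CachedReader` is a hypothetical compute-side reader, and least-recently-used eviction is just one common policy.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for remote object storage; counts reads to show network trips."""

    def __init__(self, objects: dict):
        self._objects = dict(objects)
        self.remote_reads = 0

    def get(self, key: str) -> bytes:
        self.remote_reads += 1
        return self._objects[key]


class CachedReader:
    """Compute-side reader with a small least-recently-used (LRU) cache."""

    def __init__(self, store: ObjectStore, capacity: int = 2):
        self.store = store
        self.capacity = capacity
        self._cache = OrderedDict()

    def read(self, key: str) -> bytes:
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        data = self.store.get(key)           # cache miss: go to remote storage
        self._cache[key] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return data


if __name__ == "__main__":
    store = ObjectStore({"a": b"1", "b": b"2", "c": b"3"})
    reader = CachedReader(store, capacity=2)
    for key in ["a", "a", "b", "a"]:
        reader.read(key)
    print(store.remote_reads)  # only the first read of each key is remote
```

Repeated reads of hot objects are served locally, so only cold data pays the bandwidth cost of crossing the storage and compute boundary.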

托馬斯·海尼斯:其實封神的分享已經很詳細,并無太多可補充之處。如果非要補充的話,如您方才所言,開源資料庫大大推動了資料在社群内的擴散。我認為相較以前,資料庫的使用也變得容易得多了。

Thomas Heinis: Well, there's not much more to add to what Fengshen has already said. But what I do want to add, maybe, is that, like you said absolutely correctly, open source databases have done a tremendous service to the community in getting databases everywhere. I think databases have also become much, much easier to use. Years back, the first-year students had never seen a database. Today, if I ask, they've all seen MongoDB or other technologies, kind of easy-to-use databases: not necessarily relational databases, but easy-to-use technology.

比方說,若幹年前,大一的學生從未見過資料庫。而現在的大一新生。幾乎都見過MongoDB或者其他一些易于使用的資料技術。他們未必見過關系型資料庫,但基本都見過易于使用的資料科技。

And that really has made a difference in terms of training people. It has also changed people's understanding of, and approach to, using databases. Back in the day, people were storing data just in raw files. Nowadays, they know that if they want efficient access to the data, they need to use a database, and they know how to do so.

這對人員的教育訓練起到了巨大的助推作用。這也改變了人們對資料庫使用方法的了解。以前,人們隻是把資料存儲在原始檔案中。現在,他們知道,需要使用資料庫才能有效地通路資料,并且他們也知道如何使用資料庫。

So databases have really changed, or kind of database have become much more pervasive. They're used everywhere these days. So that has definitely changed.

是以,資料庫真的發生了變化,或者說資料庫的使用變得更加普遍,現在各地的人們都在使用資料庫,這是一個很明顯的變化。

And now I unfortunately, forgot your question, which I didn't answer.

不好意思,我忘記你提的問題了,是以沒有回答。

So essentially, what really changed? I said this initially already: relational databases were designed for banking applications that revolved around transactions, which were really the centerpiece of banking applications. And that has made a lot of design decisions difficult.

至于資料庫發生了什麼樣的演變。之前我有提到,資料庫,或者說關系型資料庫是為銀行的應用而設計的,是圍繞交易設計的,而交易是銀行應用的真正核心。這也使得許多設計決策變得困難。

And then in recent years, like Fengshen mentioned, new use cases for databases emerged. All of a sudden, the data we have no longer fits nicely in a table; we actually have a graph.

然而,最近幾年,正如封神剛提到的,出現了一些資料庫的新用例。突然間,我們獲得的已經不再是表格資料,而是圖形資料。

So we have graph databases, or we realized a lot of data is natively very structured in a document, XML or similar. So we develop document databases.

于是我們有了圖形資料庫,或者說我們意識到很多資料天生就是高度結構化的,類似于文檔資料或XML資料。是以我們開發了文檔資料庫。
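
A small sketch may help show why such data does not fit comfortably into a single table. The example below uses plain Python structures rather than a real document or graph database, and all names and values are hypothetical: a nested record stands in for a document, and a dictionary of sets stands in for a social graph.

```python
import json

# Document model: one self-contained, nested record (hypothetical data).
user_doc = {
    "name": "Alice",
    "emails": ["alice@example.com"],
    "address": {"city": "London", "country": "UK"},
}

# Graph model: people are nodes, "follows" relationships are edges.
follows = {
    "Alice": {"Bob"},
    "Bob": {"Alice", "Carol"},
    "Carol": {"Dave"},
    "Dave": set(),
}


def friends_of_friends(graph: dict, start: str) -> set:
    """People two hops away, excluding direct contacts and the start node."""
    direct = graph[start]
    two_hops = set()
    for person in direct:
        two_hops |= graph[person]
    return two_hops - direct - {start}


if __name__ == "__main__":
    print(json.dumps(user_doc, indent=2))         # the document travels as one unit
    print(friends_of_friends(follows, "Alice"))   # traversal follows edges
```

In a relational schema the nested address and the list of emails would be split across several tables, and the two-hop question would need a self-join. Document and graph databases store and query these shapes directly.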

So maybe around 2000, or a little bit after 2000, we came to this understanding that one size doesn't fit all, so we need to have different types of databases. I mentioned graph databases and document databases. But there have also been other, very customized databases for scientific applications, right?

是以,大約在2000年左右,具體說是剛過2000年的時候,我們意識到不能一刀切,而是需要不同類型的資料庫。我剛提到了圖形資料庫和文檔資料庫。但是也有其他類型的資料庫,比如用于科學應用的高度客制化的資料庫。

They produce massive amounts of data, like physics experiments, like astronomy, DNA experiments in biological experiments, they have all kinds of their own database technology these days,

這些科學應用資料庫會産生大量的資料,比如實體實驗、天文學、生物實驗中的DNA實驗等。這些領域都有各自不同類型的資料庫技術。

Back in the day, we tried to fit everything into a relational database, and it didn't work really well. So each one of those now has its own type of database.

以前,我們試圖把所有資料都裝進關系型資料庫,但效果并不理想。是以,現在不同領域都有各自不同類型的資料庫。

At the same time, more and more data has been produced in different formats. And this is really where this notion of a data lake that Fengshen has been mentioning comes from: we have tons of data in different formats, and we still want to analyze the data as a whole. So we need some sort of integration between them, or some way of analyzing heterogeneous data types. And that has also changed. We now have this capability to just produce data, throw it in a database, put simply, and then analyze it efficiently at scale. So that's really how I believe things have changed.

同時,越來越多資料也在以不同的格式産生,是以我們不斷提到資料湖引擎的概念。之是以引出這個概念,是因為我們有大量的不同格式的資料,但仍然希望将資料作為一個整體來分析。是以我們需要某種整合技術,或者說需要采用某種方式來分析不同類型的資料。是以我們現在有這樣的能力:即把資料生産出來,簡單地扔到資料庫裡面,然後高效地進行大規模的分析。我認為這就是資料庫所發生的演變。

There are also other trends like cloud computing in general, which has made it possible, particularly for smaller businesses, to have their own database, their own data solution, because they no longer need to own the resources. If they have a big analysis to run on their data, they can just use cloud resources temporarily, without the hassle of owning them. So that has also helped.

還有其他的趨勢,比如雲計算。總地來說,雲計算幫助小微企業擁有自己的資料庫和自己的資料解決方案,因為它們不再需要擁有資源,隻需要掌握資料的分析能力,或者隻是暫時使用雲端資源來進行分析,而不需要擁有資源。是以雲計算對小微企業起到了助推作用。

Then we also have a huge trend in terms of hardware. Obviously we have better hardware. And every now and again, the database community tries to really optimize the database for new hardware, be it multi-core processors, which are not particularly new, or all kinds of other hardware aspects: new CPUs and new types of memory, non-volatile memory for example, change a little bit how we organize and analyze data. So a lot of hardware trends have also changed or shaped database technology.

此外,硬體方面也有很大的發展趨勢。顯然,我們有更好的硬體。而且每隔一段時間,資料庫社群就會嘗試真正的優化資料庫,針對新的硬體采用多核處理器,這也并不新奇,但是硬體各個方面的優化,比如使用新的中央處理器、新的儲存器,非易失性存儲器等,在一定程度上改變了我們組織和分析資料的方式。是以說,硬體的很多發展趨勢也改變或者塑造了資料庫技術。

And then finally, what has happened in the last couple of years is really the use of machine learning or artificial intelligence in and around data. That has driven a lot of research and has also produced a lot of products. And when I talk about AI, or artificial intelligence, and databases, the database research community has really taken up the approach and has done a lot of different things.

最後,過去幾年間,機器學習或人工智能技術開始應用于資料領域或與資料相關的領域,這推動了很多研究,也催生了很多産品。談到人工智能或人工智能資料庫,資料庫研究社群做了很多不同的事情。

For example, artificial intelligence and machine learning require a lot of learning, which requires a lot of data, and for that we need data that is clean, has been processed, and has been brought into the right format. So that's a database task.

例如,人工智能和機器學習的發展需要大量的學習,這就需要大量的資料。為此,我們需要有幹淨的資料,經過處理的資料,和經人工處理為正确格式的的資料。這是資料庫層面的任務。

And then the learning itself is also, to some degree, a database task. Right? And so we have worked on that, and that has had a tremendous impact in recent years.

此外,從一定程度上而言,學習本身也是一個資料庫。我們在這方面也下了不少功夫,這在近幾年産生了巨大的影響。

We also use artificial intelligence within the database itself to accelerate the database: to accelerate query execution and analysis. And then we also use artificial intelligence to organize the database itself.

我們還在資料庫内部使用了人工智能技術來加速資料庫的索引、執行和分析。我們也使用了人工智能技術來組織資料庫本身。

So I would say that artificial intelligence is a mega trend, of course. We all know it has touched all aspects of life, but interestingly enough it has also touched databases, and not just touched but profoundly changed how we design and use databases.

是以,我認為,人工智能當然是大趨勢。衆所周知,人工智能已經觸及到我們生活的方方面面,但更有趣的是,它也觸及到了資料庫領域。準确的說,不僅是觸及,而且深刻地改變了我們設計和使用資料庫的方式。

主持人:剛剛兩位老師說到了很多關于資料庫的基礎知識,如果我現在給這場直播起個名字,我會叫它“資料庫入門必看”。開玩笑,我們實際上是個前沿學術分享的直播。其實任何一門學科在學界和工業界都有2種形态,在工業界落地很重要,你能在工業界為一門學科找到很多應用場景,比如剛剛封神老師講到的雙十一、工業大腦,而學界的探索往往非常有想象力。我們很想請2位來展望一下,資料庫的未來會如何發展?比如5年内、10年内、50年内、100年内?

Moderator: Our two guests have just shared a lot of database basics with us. If I were to give this live stream a name, I would call it "Database Essentials". Just kidding. This is actually a live stream for sharing cutting-edge research. In fact, any discipline exists in two forms, one in academia and one in industry. Putting technology into practice matters in industry, where you can find many application scenarios for a discipline, such as the Double 11 Shopping Festival and the Industrial Brain that Fengshen mentioned just now, whereas exploration in academia is often very imaginative. We would love to ask you both to look ahead: how will databases develop in the future, say in 5 years, 10 years, 50 years, or even 100 years?

封神:我關注的一些方向,未來5年,資料庫領域核心發展方向是雲原生+分布式,具體講:Serverless、資料庫與大資料一體化、智能化、安全可信、軟硬體一體化、離線上一體化、多模資料處理,這個會對每個資料庫的每個子領域都有影響。具體在學術界研究的,我看的還相對模糊一些。按照人類發展來看,發展應該是越來越快。不過,計算機還是馮諾依曼架構,未什麼時候會颠覆,目前我也沒有概念。10年是什麼樣,我其實壓根不知道。目前唯一的就是保持敬畏之心,保持學習。

Fengshen: Let me talk about some of the directions I follow. In the next 5 years, the core development direction of the database field is cloud native plus distributed, specifically: serverless, integration of databases and big data, intelligence, security and trustworthiness, software and hardware integration, offline and online integration, and multi-model data processing. These will affect every subfield of databases. As for what academia is researching, my picture is vaguer. Judging from human history, development should only get faster. However, computers are still based on the von Neumann architecture, and I have no idea when that will be overturned. I really do not know what things will look like in 10 years. For now, the only thing to do is to stay humble and keep learning.

Moderator: Academia is Mr. Heinis's area of research, so Mr. Heinis, please continue.

Thomas Heinis: Yeah, well, what's the future? It's difficult to predict, right? But in terms of a five-year perspective, the only thing I would add that I think will make a difference in the short term is probably, like I mentioned, AI: artificial intelligence helping us a little bit to organize the data, to accelerate analysis, etc.

未來會如何?我們很難預測。但就未來5年而言,短期内可能出現的進展,就是我剛提到的:人工智能将在一定程度上幫助我們組織資料和加速資料分析。

So I think we're kind of lucky that this is the case, because a lot of students want to work on AI. If you can combine it with database technology, we get a lot of talented students involved. So in the short term, I think AI will also have an impact on databases.

從這點來看,我認為我們很幸運。因為很多學生想研究人工智能。如果能把資料庫技術結合起來,我們就能吸引大量人才參與進來。是以,我認為在短期内,人工智能也會對資料庫産生影響。

I think also that visualization will become important. And we move there to virtual reality, right, kind of which, which offers us a much more, much more kind of, you know,

可視化技術也會變得更加重要,以及與之相關的虛拟現實技術。

we can kind of interact with the data, we can touch the data to some degree, you know,

它能幫助我們與資料互動,在某種程度上,我們将能“觸摸”到資料。

I used to do research with gloves with haptic feedback, where we can touch the data. I think this kind of thing will become more important, not for an individual analyzing data, but for collaborative analysis: analyzing data together to understand it together. I think that's where we also need to put in some research, to find easier ways for people to understand the impact of things.

我曾做過相關研究,戴上觸覺回報手套,我們可以“觸摸”到資料。我認為這類技術會變得更加重要,不是針對個人分析資料而言,而是針對資料協同分析,即團隊共同分析和了解資料。這也是我們需要投入研究的地方,以便找到更簡單的方法,幫助公衆了解資料及其帶來的影響。

And then, like Fengshen said, one important thing that's going to happen fairly soon, probably in five to 10 years, maybe a little bit more, is going to be quantum.

另外,正如封神所言,不久的将來數字領域将發生重大突破,那便是量子科技。這也許會發生在5到10年之後,也許更久一點。

And quantum, you know, it's difficult to fathom what it's going to do to our databases. But one thing is for sure: I believe that with quantum sensing, with quantum sensors, we will just have so much more data to deal with. And that will challenge database technology, or big data technology in itself, right?

我們很難弄清楚量子科技會對資料庫造成什麼影響。但有一點是肯定的,我相信随着量子傳感器的應用和普及,我們将有更多的資料需要處理。而這将對資料庫技術或大資料技術帶來挑戰。

Then, when we go a little bit beyond that, 20, 30, 50 years, or maybe a bit later than 50 years, there is one of my favorite topics: DNA storage, basically storing information, storing data, within synthetic DNA. And this is interesting because, as Fengshen was saying with the numbers at the start, we know how much data we have;

再過20、30、50年,甚至超過50年之後,就得談到我最喜歡探讨的話題之一,DNA存儲,也就是在合成DNA中存儲資料。這個話題很有意思,因為封神剛剛一直在談論我們目前擁有海量資料。

a lot of this data we don't look at every day, right? We store it for the long term because we need to, because the law says we have to keep records around for hundreds of years.

很多資料我們并不是每天檢視,隻是長期儲存而已,因為法律規定我們必須儲存數百年的資料記錄。

And we do this with traditional technology, with tape and disk. They don't last forever; they last maybe 10 to 15 years. And then we need to copy the data onto a new disk or a new tape, etc. So there is always data migration, which, as much as it is a hassle, is also quite expensive. And a lot of companies don't want to afford this anymore, or can't afford to do this anymore.

我們使用磁帶和磁碟等傳統技術儲存資料。但是磁帶和磁碟無法永久儲存,可能頂多儲存10到15年。接着,我們就需要把資料複制到新的磁碟或磁帶上。資料遷移耗時耗力耗财,很多公司要麼不想再承擔這樣的成本,要麼承擔不起這樣的成本。

So what we're looking at with DNA storage, for example, is really to store data for tens of years, maybe hundreds of years, such that we can retrieve it later.

通過DNA存儲技術,我們希望将資料存儲幾十年,甚至幾百年,以便日後檢索。

So we really can take the data, convert it to strings of nucleotides, then synthesize them and store them in the fridge, essentially. And when we need the data, we sequence the DNA and get it back. So anyway, I think that's what is going to happen.

我們可以把資料轉換成核苷酸串,然後合成并儲存在冰箱裡。需要時,我們再進行測序,并取回資料。這就是我所設想的未來。
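The conversion step described above, from data to strings of nucleotides and back, can be illustrated with a minimal sketch. The mapping used here (two bits per base, in the order A, C, G, T) and the function names `encode` and `decode` are assumptions chosen for illustration; they are not the scheme used by the speaker's group. Practical DNA storage encodings also add error correction and avoid problematic sequences such as long runs of the same base, which this sketch ignores.

```python
# Illustrative mapping only (an assumption): 2 bits per nucleotide,
# 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Turn each byte into four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Inverse of encode: read four nucleotides back into one byte."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # Shift in the 2-bit value of each base.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


strand = encode(b"DB")
print(strand)          # nucleotide string that would be synthesized
print(decode(strand))  # bytes recovered after sequencing
```

In this toy scheme every byte costs exactly four bases, so the round trip `decode(encode(x)) == x` holds for any byte string.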

So generally, I don't want to focus too much on DNA storage itself; I think the underlying technology will change drastically.

總體來講,我不想過多地關注DNA存儲本身,我認為其底層技術将會發生翻天覆地的變化。

In the past, when we looked at storage, at the storage medium, there was a lot of collaboration between computing and electrical engineering. Now I think we're getting to a point where we move to collaboration between computing and biology, or chemistry, etc.

過去研究存儲媒體的時候,我們與計算和電氣工程領域有很多合作。而現在,我們開始在計算和生物或化學等等領域之間進行協作。

It doesn't have to be DNA; it can be another kind of storage medium. But I think that's what's going on.

不一定是DNA,也可以是另一種存儲媒體。但我的設想大概就是這樣。

And what's also quite interesting there, I think, when we look a little bit beyond 20 years, is that with DNA storage we can also implement some data processing, some data analytics, on top of the DNA using biological processes,

同樣有意思的是,展望20年之後,在DNA存儲方面,我們還可以通過生物過程在DNA之上實作資料處理和資料分析。

which is extremely energy efficient, and also very, very fast.

這種做法非常節能,而且速度極快。

There are limits to this technology, but we'll find them out over the next couple of years, or maybe the next couple of decades.

這項技術存在局限性,我們将在未來幾年或幾十年内找到答案。

But I think generally the whole field of computing will expand into other fields and collaborate more with them. And that also has implications for databases and for data analytics too: we can use biological processes or chemical processes or anything similar to do computations, right? I think that's what's definitely going to happen. But, you know, the future is very difficult to predict.

總體而言,整個計算領域将會擴充到其它領域,與其它領域開展更多的協作。而這也将對資料庫和資料分析産生影響,我們可以利用生物過程、化學過程或其它類似的方法進行計算。我想這絕對是一個趨勢。但未來是很難預測的。

Take the implications of quantum, for example. Like I mentioned, quantum sensing will deliver tons of data. But there have got to be other implications.

例如量子技術的影響。如我之前所言,量子傳感技術的應用和普及将給我們提供大量資料。但除此之外,肯定還會産生其它影響。

For example, there's one tiny operation in a database, query optimization, where you give the database a query and it figures out how to execute it efficiently. And it takes a lot of time to compute how to execute that query efficiently.

例如,資料庫中有這樣一個小操作,即查詢優化,也就是說,你在資料庫裡進行一項查詢,它會找到高效執行的辦法。這一操作需要花費大量的時間。

And we've already seen in the community that somebody took query optimization and implemented it on a quantum computer, showing that it would be massively faster to optimize the query on the quantum computer. So I don't think I understand all the implications of quantum computing, but it will definitely also have an impact on databases.

而我們已經在社群裡目睹了這樣一個案例,有人在一台量子計算機上進行查詢優化,結果表明,在量子計算機上優化查詢,速度要快得多。我無法了解量子技術的所有影響,但量子計算肯定會對資料庫産生影響。
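To give a sense of why query optimization is expensive to compute, here is a minimal classical sketch of one part of it, choosing a join order by exhaustive search. Everything in it is hypothetical: the table names and row counts, the single uniform `SELECTIVITY`, and the cost model (sum of intermediate result sizes for a left-deep plan). It is not the quantum implementation mentioned in the talk, and real optimizers use far richer statistics and pruning.

```python
from itertools import permutations

# Hypothetical table sizes (row counts) and a made-up uniform join selectivity.
TABLE_ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20}
SELECTIVITY = 0.0001


def plan_cost(order):
    """Cost of a left-deep plan, modeled as the sum of intermediate result sizes."""
    rows = TABLE_ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Estimated size of joining the running result with the next table.
        rows = rows * TABLE_ROWS[table] * SELECTIVITY
        cost += rows
    return cost


def best_join_order(tables):
    """Try every join order; n tables give n! candidate plans."""
    return min(permutations(tables), key=plan_cost)


best = best_join_order(TABLE_ROWS)
print(best, plan_cost(best))
```

With three tables there are only 6 orders to compare, but 10 tables already give 3,628,800, which is why the search itself becomes a significant computation.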

So in the short term, adding to what Fengshen said, I think AI is having a tremendous impact. In the somewhat longer term, I think we really have to think about interfaces to data, virtual reality being one of them, right? Augmented reality being another. We need to think about how we can make it easy for people to interact with data and understand it. Typing a query works for an analyst, but that's not going to work for everyone, for the broad class of people who need to work with data. I think we all need to deal with, interpret and analyze data, and I think we need to make it easy for everyone. That's a little bit more medium term. And in the long term, I think hardware will change dramatically, with quantum, with DNA storage, with other types of storage media, etc. But 100 years? I'm not going to make a prediction here; that's too far out.

短期而言,接着封神剛剛的觀點講,我認為人工智能将在短期内産生巨大影響,而更長遠來講,我們必須思考資料界面,比如虛拟現實和增強現實。我們必須思考如何找到更簡單的方法,幫助公衆與資料互動,了解資料。輸入查詢對資料分析師而言是可行的,但并不适用于所有人。是以,我們需要解釋和分析資料,降低資料的門檻。這是針對中期而言。長遠來看,在量子技術、DNA存儲和其它類型的存儲媒體影響下,硬體将發生巨大的變化。但100年後我就不做預測了,那太遙遠了。

主持人:感謝2位朋友,本期節目的最後,我們也為資料庫團隊和Dr. Heinis打個招聘廣告。

Moderator: I would like to thank our two friends. To close this episode, I'd like to take this opportunity to share a recruitment ad for the database team and Dr. Heinis.

封神:對資料庫技術有熱情的,有技術理想,且技術過硬的同學。具體資料庫TP、NoSQL、AP各個方向都在招聘。目前我在阿裡雲資料庫重點做資料湖分析,歡迎大家聯系我。封神:[email protected]

We welcome those who are passionate about database technology, who have technical aspirations, and who are technically proficient. All the database subfields, including TP, NoSQL and AP, are currently recruiting. At present I focus on data lake analytics (DLA) at Alibaba Cloud. I look forward to hearing from you. Fengshen: [email protected]

托馬斯·海尼斯:Absolutely, of course I do. Imperial has had very good collaborations with China in general. And, for some reason, I don't know why, we have a big share of Chinese students, and they are very talented. So if any of them would like to work on this: we offer everything, you know, internships, studentships for PhDs, and postdocs as well. If anybody wants to work to change database technology in the future, of course you can go to Alibaba, or you can come to Imperial. You know, seriously, I'm always recruiting interested students. We look at all kinds of aspects of databases, like I mentioned. Some of the topics that I work on with my team are AI, virtual reality and DNA storage, but we also have other aspects. So if anybody wants to learn these technologies, work with these technologies and contribute to this research, please do get in touch.

當然了。帝國理工學院和中國一直保持着非常良好的合作。出于某種原因,我也不知道是為什麼,我們的學生中有很大一部分是中國學生,他們才華橫溢。假如他們有興趣參與研究,我校提供各種實踐機會,招收實習生、博士和博士後等等。如果你想要改變未來的資料技術,你可以選擇加入阿裡巴巴集團,或者成為帝國理工學院的一份子。我一直在尋找有志于此的學生。正如我一開始提到的,我們關注資料庫的方方面面。我們團隊正在研究的課題包括人工智能、虛拟現實和DNA存儲等技術,但也包括其它方面。如果有人想學習這些技術、使用這些技術并為這類研究做出貢獻,歡迎聯系我們。

原文連結:

https://developer.aliyun.com/article/781472?
