The "rebels" of the database industry: Big data is "dead", and MotherDuck is on the move

Author: CnosDB
"Big data" is dead – the most important thing for us today is not to worry about the size of the data, but to focus on how we will use it to make better decisions.

The database industry has seen repeated waves of acceleration and change at the data layer, most notably the explosive growth of cloud data warehouses over the past few years. Cloud data warehouses have become the cornerstone of the enterprise data stack, and organizations of all sizes now routinely use them to analyze business data. The rapid rise of Snowflake is a prime example of this trend.

But if we break big data down into its three classic dimensions, velocity, volume, and variety, we find that the dimension that gets the most attention is still velocity. When we revisit the definition of "big data" with data assets in mind, the most pressing need is for microservices to consume, at low latency, data assets derived from OLTP [1] databases.

Meanwhile, many big data teams have bought all the new tools and migrated off legacy systems, only to find that they still cannot make sense of their data. Perhaps data size was never the problem. The world's data volumes keep growing, but so does hardware capacity, and vendors keep pushing hardware capabilities further. Today we look at MotherDuck, a database startup with a somewhat "different" idea, and at how DuckDB, the open source database it builds on, sees the world.

History: A commercialization built on European and American cooperation

Any account of MotherDuck's history has to start with DuckDB. DuckDB is a purpose-built, in-process online analytical processing (OLAP) database management system designed for efficient data analysis. Between the release of the first open source version in 2019 and 2021, weekly downloads of DuckDB grew rapidly. At that point the project, which originated at CWI (Centrum Wiskunde & Informatica), the Dutch national research institute for mathematics and computer science, was spun off to operate independently, and project researchers Hannes Mühleisen and Mark Raasveldt founded DuckDB Labs.

So where is MotherDuck in this story? We are still missing one protagonist: Jordan Tigani, a founding engineer of Google BigQuery, who had also taken an interest in DuckDB and had been looking to bring a lightweight database product to market. After talking to Mühleisen, co-founder of DuckDB Labs, and securing his support, Tigani began working on commercializing the open source DuckDB. The new company, MotherDuck, was born and received $12.5 million in seed funding led by Redpoint Ventures, followed by $35 million in Series A funding led by Andreessen Horowitz (a16z), valuing the company at $175 million.

For such a young startup, winning this level of recognition from investors is a notable success. Because DuckDB is not MotherDuck's own open source project, the support of the project's founding team is essential to building stable, long-term services on top of it.

The DuckDB team participates in MotherDuck, and MotherDuck in turn is a member of the DuckDB Foundation, the nonprofit organization that owns most of DuckDB's intellectual property. DuckDB Labs, the commercial arm of the DuckDB project, is a shareholder of MotherDuck. Tigani's partnership with DuckDB Labs is a smart move: it aligns the interests of both parties.

Positioning: SQLite in the OLAP space

To talk about DuckDB, let's start with SQLite, arguably the world's most widely used relational database system. It is found on almost every phone, browser, and operating system, and it even runs on airplanes.

Because SQLite is embedded, it needs no external server to manage. It also has bindings for almost every programming language. These characteristics make it very easy to use, and SQLite's achievement deserves recognition. But its limitations are just as prominent. SQLite is designed for OLTP, uses row-oriented storage, does not exploit memory to speed up computation, and has a very limited query optimizer, so it is poorly suited to analytical workloads.
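
To make "embedded" concrete, here is a minimal sketch using the sqlite3 module from Python's standard library. The orders table and its values are hypothetical, chosen only for illustration. The database engine runs inside the Python process itself, so there is nothing to install or start separately.

```python
import sqlite3

# An in-process database: the engine runs inside this Python process,
# so there is no server to install, start, or connect to over a network.
conn = sqlite3.connect(":memory:")  # ":memory:" keeps the database in RAM

# Hypothetical "orders" table, used only for illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("EU", 10.0), ("EU", 15.0), ("US", 20.0)],
)

# A typical OLTP-style point lookup: fetch one row by its key.
order = conn.execute(
    "SELECT region, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# An analytical aggregate also works, but SQLite evaluates it row by row.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(order)
print(totals)
conn.close()
```

The point lookup is the kind of query SQLite is built for. The aggregate is the kind of query that becomes slow on a row store as tables grow.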

This is the opportunity DuckDB saw. Simply put, DuckDB is SQLite for analytics (the OLAP domain [2]): an in-process database that lets developers, data scientists, data engineers, and data analysts power their code with extremely fast analytics using plain SQL. It can also analyze data wherever it lives, whether on a laptop or in the cloud.

DuckDB uses a columnar, vectorized query engine. Queries are still interpreted, but each operation processes a large batch of values (a "vector") at once. This reduces the per-row overhead of traditional systems such as PostgreSQL, MySQL, or SQLite, which process rows one at a time, and so improves query performance.
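
The difference between row-at-a-time and vectorized processing can be sketched in a few lines of Python. This is a toy illustration of the idea only, not DuckDB's implementation: the function names, the data, and the batch size of 1024 are assumptions made for the example, and a real engine runs each batch in compiled code over contiguous memory.

```python
from array import array

VECTOR_SIZE = 1024  # hypothetical batch size, chosen only for illustration


def sum_row_at_a_time(rows, col_index):
    """Row store: visit every row and pick one field out of each tuple."""
    total = 0.0
    for row in rows:
        total += row[col_index]
    return total


def sum_vectorized(column, vector_size=VECTOR_SIZE):
    """Column store: walk one contiguous column in fixed-size vectors."""
    total = 0.0
    for start in range(0, len(column), vector_size):
        vector = column[start:start + vector_size]  # one batch of values
        total += sum(vector)  # one operation applied to the whole batch
    return total


# The same hypothetical data in both layouts.
rows = [(i, "EU" if i % 2 else "US", float(i)) for i in range(5000)]
amounts = array("d", (row[2] for row in rows))  # only the "amount" column

print(sum_row_at_a_time(rows, 2), sum_vectorized(amounts))
```

Both functions return the same total. The columnar version touches only the one column it needs and pays the interpretation overhead once per batch instead of once per row.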

SQLite is a small relational database designed for in-process deployment.

The quadrant in which DuckDB is located

Cognition: "Non-consensus" in the database industry

MotherDuck's view of the industry differs from that of most companies in it.

First, Tigani believes that most customers and organizations store modest amounts of data, not large ones. Customer data sizes also follow a power-law distribution: the largest customer has twice as much storage as the second-largest, the second-largest has twice as much as the third-largest, and so on. So while some customers hold hundreds of petabytes of data, sizes fall off quickly down the ranking.
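
A quick way to see how fast sizes fall off is to take the halving pattern literally. The sketch below assumes 20 customers and a largest customer with 1,000 TB. Both numbers are hypothetical and only the ratios matter.

```python
def storage_shares(num_customers, largest_tb=1000.0):
    """Each customer's share of total storage when sizes halve by rank."""
    sizes = [largest_tb / (2 ** rank) for rank in range(num_customers)]
    total = sum(sizes)
    return [size / total for size in sizes]


shares = storage_shares(20)  # 20 customers is an arbitrary example
top3 = sum(shares[:3])
print(f"largest customer: {shares[0]:.1%}, top three: {top3:.1%}")
```

Under this assumption the largest customer alone accounts for about half of all stored data and the top three for close to seven-eighths, so the typical customer is far smaller than the headline figures suggest.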

Second, once storage and compute are separated, usage skews toward storage: data size grows faster than compute needs. Even if a business is static, neither growing nor shrinking, its data grows linearly over time, while its compute needs barely change because most analysis runs on recent data. This storage bias means that many workloads may not need distributed processing at all. Many users also want quick and easy answers to their questions, and they don't want to wait for the cloud.

Finally, most data is rarely queried. A significant portion of the data that gets processed is less than 24 hours old. By the time data is a week old, it may be 20 times less likely to be queried than data from the most recent day. Because historical data is so rarely queried, the working set is far easier to manage than we might expect. If you have a petabyte table holding 10 years of data, you might rarely touch anything older than the current day, and that day's data might be less than 50 GB compressed. As a result, the focus of many cloud vendors on query performance at the 100 TB scale is not only irrelevant to most users, it also distracts those vendors from providing a great user experience.
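
As a back-of-the-envelope check on the scale involved, the sketch below estimates how large one day's slice of such a table is. It assumes the petabyte is spread evenly over the 10 years and uses a compression ratio of 6, which is a hypothetical value for columnar storage and not a figure from the article.

```python
PB_IN_GB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def daily_slice_gb(table_gb, years, compression_ratio):
    """One day's data, raw and compressed, assuming even growth over time."""
    days = years * 365
    raw_gb = table_gb / days
    return raw_gb, raw_gb / compression_ratio


# compression_ratio=6 is an assumed, hypothetical value.
raw_gb, compressed_gb = daily_slice_gb(PB_IN_GB, years=10, compression_ratio=6)
print(f"one day: {raw_gb:.0f} GB raw, {compressed_gb:.0f} GB compressed")
```

Under these assumptions a single day of a petabyte table is a few hundred gigabytes raw and a few tens of gigabytes compressed, which fits comfortably on one machine.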

MotherDuck's point, therefore, is that big data is real, but most people probably don't need to worry about it. "Big data" is dead: the most important thing for us today is not to worry about the size of the data, but to focus on how we will use it to make better decisions. Every organization should ask itself: do we really generate a lot of data? If so, do we really need to use a lot of it at once? And is the data really too big to fit on one machine? Different organizations will give different answers.

Big data is dead

The future: No "silver bullet", no one-size-fits-all option

We live in rapidly changing times, which have produced many database management systems. There is currently no one-size-fits-all database management system. Each makes different trade-offs to suit specific use cases, and DuckDB is no exception. Sometimes we need to serve many concurrent users, and sometimes we need an embedded database that is very fast for single-user workloads.

Will DuckDB succeed? The answer is not yet certain. But a vibrant open source community is clearly emerging, and although MotherDuck has not yet commercialized a product, we should be patient with this Series A company, because the story is just beginning.

Changes in the number of stars DuckDB has on GitHub

Notes:

[1] OLTP: Online Transaction Processing, also known as transaction-oriented processing.

[2] OLAP: Online Analytical Processing, a software technology that enables analysts to see information from all sides quickly, consistently, and interactively for a deeper understanding of data.

About the author

Zheng Bo, aka Harbour Habo, is a middle-aged B2B infrastructure entrepreneur and the initiator of the CnosDB cloud-native time series database open source community.

Introduction to CnosDB

CnosDB is a high-performance, easy-to-use open source distributed time series database, which is now officially released and fully open source.

Follow our community site: https://www.cnosdb.com