Postgres is encroaching on the database world

Reading guide: This article will talk to you about the future of Postgres and its development history, product features.

Developers and database administrators, PostgreSQL is more than just a relational database; it's a data management framework that threatens to engulf the entire database landscape.

The "Anything Goes, Postgres" trend is no longer confined to a handful of elite R&D teams, but is becoming a mainstream best practice.

History of Postgres

The origins of Postgres began with the Ingres project. It first started in the form of Ingres (Interactive Graphics RE trieval System) under the direction of Michael Stonebraker at the University of California, Berkeley.

The team changed and optimized the database curriculum and funded a relational database project.

The original language used was QUEL, not yet standard SQL. ANSI formalized its preference for SQL in 1986, and many relational database projects moved in this direction. SQL support was added when Postgres95 was released in 1995.

In 1996, when the first version of PostgreSQL, 6.0, was released, the development team was formed, and the PostgreSQL Global Development Team was formed, which is the group currently leading the project.

Nowadays, when it comes to using reliable databases, many things that people take for granted are actually developing little by little.

In 2000, Postgres began its journey of growth. From that year onwards, Postgres began using foreign key and connection support, which was on par with other databases.

Around 2002 there were more critical foundational parts, primarily focused on becoming a reliable database and SQL, including write-ahead logs (WALs), external joins, schemas, and the ability to potentially support 400 million transactions in the real world.

In addition, Postgres has been working on IPv6 for a long time.

Pillars in the Postgres community include Tom Lane, Josh Berkus, Bruce Momjian, Hiroshi Inoue, and Peter Eisentraut.

One of them said: "Everyone involved in Postgres is very happy with its prospects. They see a broader market opportunity here. And Oracle is already one, so it's not easy to imagine. But the open-source database wave is starting to reach its peak. ”

By the late 2000s, and especially before 2005, Postgres could already be considered a fairly reliable database. With richer transaction support, extensive SQL support, and improvements such as WAL and VACUUM. If you're an early adopter of it, you'll start trusting it to handle production workloads.

While it's trustworthy, there's still some way to go when it comes to ease of use. At this point, we start to see a combination of features with several different themes, including the following:

Concurrent index creation
Hot standby server
Query language improvements
All data types - array, UUID, ENUM, XML
Phase 2 Submission
A richer system of user roles

From the early days of the system, we've seen a group of people get involved, but other people are starting to jump in and contribute features.

In 2009, Postgres 8.4 provided window functions and Common Table Expressions (CTEs), which I personally would never have to go back to another database.

With Postgres as a solid foundation, it was around this time that we started to see it start to make its mark on the broader data ecosystem.

Because of the solid codebase, as well as its broad license, many companies have adopted Postgres and started forking it.

In the early to late 2000s, the first thing most companies did was add MPP support to Postgres so that it could target more OLAP-centric workloads. When people combine that with support for things like window functions and CTEs, they have something new, powerful without having to build it from scratch.

This type of product can reduce the time it takes for a database to mature. Many of the original forks no longer exist today, but some still exist in other products......

Daisies Data → 被 Teradata 收购
Truvis → 被思科收购
Netezza → 被 IBM 收购
Greenplum → 被 EMC 收购
ParAccel → was never acquired, but actually became RedShift

Despite the postgres fork that saw some fragmentation, Postgres still continues to do what it has been doing – keep going.

The release of Postgres 9.0 and 9.1, in 2010, was the beginning of how cool Postgres began. Started support for features such as listening/notifying (publish-subscribe in the database) and hstore (key/value data type) to break the old relational database pattern. With pg_upgrade the advent of GIN and GiST indexes, people are starting to get more than just standard B-tree indexes.

Extensions have always been a part of Postgres, but with some refactoring in the integration, they've become more accessible to users. We've seen the development of external data wrappers for Postgres, so it's possible to connect to different Postgres databases.

It's not just the old boring data types, columns, and relationships. However, all of them are built on the same ACID-compliant, trustworthy foundation.

If the world hasn't noticed Postgres by now, you've been sounded the alarm.

As the big data wave began to cool down, but with the rise of NoSQL databases (Mongo and Couchbase), it became clear that developers wanted a different way of working with their data. Postgres also listened to the opinion, but was later deceived because of its JSON support. JSON validation was provided in 9.2 but was thrown into a text field.

In fact, we'll have to wait another two years to get reliable JSON support in Postgres.

But these are all necessary paths to growth, not a big problem, with the rise of Heroku for easy application deployment – Heroku, Heroku Postgres as the default database, and the wave from shared hosting for MySQL databases and VPS for applications to PaaS and more dedicated database infrastructure is growing.

Later Postgres 9.3 releases were great, we had horizontal joins, updatable external tables, checksums, and so on. In 9.4, we got better JSON support in the JSONB data type. This is a binary representation of JSON on disk, which means that GIN indexing can make it easier to index data without having to index on very specific JSON functions.

JSONB is a data type and function that is astounding when people first hear about it.

Postgres isn't just for app developers who need shiny features. Logical decoding lays the foundation for easier capture of change data (CDC) in Postgres for years to come. Refreshing materialized views enables richer reporting use cases. Back-office workers enable more functionality and creative use cases, especially for extensions.

Amazon then announced support for PostgreSQL on RDS at Re:Invent. People remember attending the only time the audience gave a standing ovation to the Re:Invents event room.

In 2016, the group released 9.5, 9.6, 10. At this point, we started to get to the point where the title was not fully featured, and instead, we saw steady performance improvements and themes that continued to enhance existing features. JSONB has gained support for inline updates and is starting to see more support for parallel execution emerge. But it's not all minor updates, depending on people's needs, there may be major features here that you've been craving. Some of the highlights include:

Row-level security
Logical replication
Table partitioning

For several years now, the progress of PostgreSQL has largely remained the result of individual contributors. We've also seen companies working together in specific areas such as EnterpriseDB (focused on Oracle compatibility), 2ndQuadrant (focused on replication), Postgres Pro (focused on JSONB), and Crunchy Data (focused on security and cloud native).

A new challenger to OLAP

At a database administrator meetup in 2016, I thought a major gap in the PostgreSQL ecosystem was the lack of a columnar storage engine that was good enough for OLAP workloads.

While PostgreSQL itself offers a wealth of analytics capabilities, it doesn't fully perform the full analysis of larger datasets to the level of a dedicated real-time data warehouse.

We considered ClickBench (https://benchmark.clickhouse.com/), an analytics performance benchmark where we documented the performance of PostgreSQL, its ecosystem extensions, and derived databases.

Among them: Untuned PostgreSQL has poor performance (x1050) but can be achieved with optimization (x47). In addition, there are three extensions related to data analysis: columnar storage Hydra (x42), timescale DB (x103), and distributed Citus (x262).

Clickbench c6a.4xlarge, 500gb gp2 results comparison relative time

The above performance is not bad, especially compared with pure OLTP databases such as MySQL and MariaDB (x3065, x19700), however, its Layer 3 performance is not "good" enough, lagging behind Layer 1 OLAP components such as Umbra, ClickHouse, Databend, SelectDB (x3~x4) by an order of magnitude.

This is where it gets tricky – it's not satisfying to use, but it's too good to discard.

However, the arrival of ParadeDB and DuckDB has changed the rules of the game.

ParadeDB's native PG extension pg_analytics achieve Layer 2 performance (x10), narrowing the gap to only 3-4 times from the top layer. This level of performance difference is generally acceptable given the added advantages.

There's no need for ETL's ACID, freshness, and real-time data, no additional learning curve, no need to maintain a separate service, not to mention its ElasticSearch-grade full-text search capabilities.

With a focus on pure OLAP, DuckDB pushes analytics performance to the extreme (x3.2) – excluding Umbra, a closed-source database focused on academics, DuckDB is arguably the fastest performing of practical OLAPs. It's not a PG extension, but PostgreSQL can take advantage of DuckDB's analytics performance gains as an embedded file database through projects such as DuckDB FDW (address: https://github.com/alitrack/duckdb_fdw) and pg_quack (address: https://github.com/hydradatabase/pg_quack).

The advent of ParadeDB and DuckDB has pushed PostgreSQL's analytics capabilities to the top of OLAP, filling the last critical gap in its analytics performance.

Pendulum in the database space

The difference between OLTP and OLAP existed or did not exist in the early days of databases. Due to the difficulty of traditional OLTP databases to support the query patterns and performance requirements of analytics scenarios, the separation of OLAP data warehouse and database began to appear in the 90s of the 20th century.

For a long time, best practices for data processing involved using MySQL/PostgreSQL for OLTP workloads and synchronizing data to specialized OLAP systems such as Greenplum, ClickHouse, Doris, Snowflake, etc., through an ETL process.

Dia C3:OLTP分析王国

As with many "specialized databases", the advantage of a dedicated OLAP system is often performance—1-3 orders of magnitude better than native PostgreSQL or MySQL.

The elements compared here include dimensions from cost, redundant data, excessive data movement, lack of data consistency between distributed components, additional labor expenses for specialized skills, additional licensing costs, limited query language capabilities, programmability and scalability, limited tool integration, poor data integrity and usability compared to a full DMBS.

As hardware has improved over three decades following Moore's Law, performance has grown exponentially while cost has plummeted. By 2024, a single x86 server can have hundreds of cores (512 vCPUs, EPYC 9754 x2), terabytes of RAM, a single NVMe SSD can hold up to 64TB / 3M 4K rand IOPS / 14GB/s, and a single all-flash bay can reach petabytes, and object storage like the S3 offers virtually unlimited storage.

I/O and bandwidth double every 3 years

Advances in hardware have solved data volume and performance issues, while database software developments (PostgreSQL, ParadeDB, DuckDB) have solved the challenges of access methods. This brings the underlying assumptions of the analytics industry, the so-called "big data" industry, under scrutiny.

As DuckDB's manifesto "Big Data is Dead" suggests, the era of big data is over. Most people don't have that much data, and most of the data is rarely queried. With the development of hardware and software, the craze for big data is fading, and "big data" is no longer needed in 99% of scenarios.

If 99% of the use cases can now be handled on a single machine with a standalone PostgreSQL/DuckDB (and its replicas), what's the point of using a dedicated analytics component?

What's the point of pagers if every smartphone can send and receive texts freely?(It's important to note that pagers are still being used in hospitals across North America, suggesting that there may be less than 1% of scenarios where "big data" is really needed.) ）

A shift in fundamental assumptions is leading the database world from a pluralistic phase back to a convergence phase, from the Big Bang to mass extinction.

In this process, a new era of unified, multi-model, hyper-converged databases is about to emerge, and OLTP and OLAP are reunified. But who will lead this daunting task of reconsolidating the database landscape?

PostgreSQL: The Devourer of the Database World

There are many technical niches in the database space: time series, geospatial, document, search, graph, vector databases, message queues, and object databases. PostgreSQL is present in all of these areas.

A classic example is the PostGIS extension, which sets the de facto standard for geospatial databases, the TimescaleDB extension awkwardly locates a "generic" time series database, and the vector extension PGVector turns the dedicated vector database niche into a joke.

This is not the first time; we are witnessing this again in the oldest and largest subfield: OLAP analysis. But PostgreSQL's ambitions don't stop at OLAP; it's now looking at the entire database world!

PostgreSQL ecosystem

What makes PostgreSQL so powerful? Sure, it's advanced, but so is Oracle, but it's open source, and so is MySQL. PostgreSQL's advantage comes from its advanced and open source nature, which allows it to compete with Oracle/MySQL.

But what makes it truly unique is its extreme scalability and thriving ecosystem of extensions.

Users choose PostgreSQL for reasons that are open-source, reliable, and extensible

The magic of extreme scalability

PostgreSQL is more than just a relational database; it's a data management framework capable of engulfing entire database systems. In addition to being open source and advanced, its core competitiveness stems from scalability, that is, the reusability of infrastructure and the composability of extensions.

PostgreSQL allows users to develop extensions that leverage the common infrastructure of the database to provide functionality at the lowest cost. For example, the vector database extension pgvector has only a few thousand lines of code, which is negligible in complexity compared to the millions of lines of code in PostgreSQL.

pgvector address:

hatps://github.com/pgveector/pgveector

However, this "insignificant" extension enables full vector data types and indexing capabilities, and outperforms many specialized vector databases.

Because the creators of pgvector don't need to worry about the additional complexities of databases: ACID, recovery, backup and PITR, high availability, access control, monitoring, deployment, third-party ecosystem tools, client drivers, etc., all of which require millions of lines of code. They focus only on the essential complexity of the problem.

For example, ElasticSearch was developed on top of the Lucene search library, and the Rust ecosystem has an improved, next-generation full-text search library, Tantivy, as an alternative to Lucene.

Tantivy地址:

https://kithub.com/QuickQuit-As/DanTV

ParadeDB only needs to wrap and connect to PostgreSQL's interface to provide a search service comparable to ElasticSearch.

More importantly, it can stand on the shoulders of PostgreSQL and leverage the combined power of the entire PG ecosystem (e.g. hybrid search with pgvector) to compete "fairly" with another dedicated database.

Take a look at the diagram below:

Pigsty & PGDG has 234 extensions available, with more than 1000 more in the ecosystem

Scalability brings another huge advantage: the composability of extensions, allowing different extensions to work together, creating synergies that are greater than 1+1 2.

For example, TimescaleDB can be combined with PostGIS for spatiotemporal data support, and the BM25 extension for full-text search can be combined with the PGVector extension to provide hybrid search capabilities.

In addition, Distributed Scaling Citus transparently transforms independent clusters into horizontally partitioned distributed database clusters.

Citus Address:

hattapus://vv.kitusadata.com/

This feature can be combined with other product functions, such as making PostGIS a distributed geospatial database, PGVector a distributed vector database, ParadeDB a distributed full-text search database, and so on.

What's even more powerful is that extensions are developed independently and don't require cumbersome main branch merging and coordination. This technology makes PG more scalable, allowing teams to explore the database in parallel, all of which are optional without compromising the reliability of core functionality. Those mature, robust features have the opportunity to be stably integrated into the main branch.

PostgreSQL achieves foundational reliability and agile capabilities through the "magic" of extreme scalability, making it an outlier in the database world and a game-changer in the database landscape.

DB Arena is a game-changer

The advent of PostgreSQL has changed the paradigm in the database space: teams working to build a "new database kernel" now face a daunting challenge: how to stand out from the open-source, feature-rich Postgres.

So, what is their unique value proposition?

Until there is a revolutionary hardware breakthrough, the emergence of a practical, new type of general-purpose database kernel seems unlikely. No single database has yet matched PG's overall strength, and with all its extensions, not even Oracle can match PG's open source and free trump card.

If a niche database product is an order of magnitude better than PostgreSQL in a particular area (which is usually performance), then it may carve out a niche for itself. However, it usually doesn't take long for the PostgreSQL ecosystem to produce open-source extension alternatives. Choosing to develop a PG expansion instead of an entirely new database can give the team a crushing speed advantage when playing catch-up!

Following this logic, the PostgreSQL ecosystem will snowball, accumulate advantages, and inevitably move towards a monopoly, reflecting the Linux kernel's place in server operating systems within a few years. This trajectory is also confirmed in developer surveys and database trend reports.

StackOverflow 2023 年调查:PostgreSQL,十项全能选手

StackOverflow's database trends over the past 7 years

PostgreSQL has long been the most popular database in HackerNews and StackOverflow. Many new open-source projects default to PostgreSQL as the primary, if not the only, database choice.

One of the most fundamental differences between PostgreSQL and most other relational databases is its core design.

Most relational databases are best described as relational database management systems (RDBMS). An RDBMS is software specifically designed to work with relational databases, where data is stored in a tabular structure with predefined columns and data types. Data can be queried, modified, and retrieved using relational algebra-based techniques, typically through Structured Query Language (SQL).

PostgreSQL, on the other hand, is technically an object relational database management system (ORDBMS). This means that it has the same relational capabilities as an RDBMS, but also has some object-oriented features.

In practice, this means that PostgreSQL allows you to:

Define your own complex data types
Overload functions to handle different parameter data types
Define the inheritance relationships between tables

These features are powerful tools to help you work with databases and data using some of the same technologies that you may be familiar with while programming. The added flexibility allows you to model different types and relationships within the database system rather than outside the program. This can help maintain consistency and enforce expected behavior that is closer to actual data.

PostgreSQL is an open source project managed by the PostgreSQL global development team. It is licensed under the PostgreSQL license, which is recognized by the Open Source Initiative.

While there are many other open-source relational databases, PostgreSQL is developed and managed without a business owner or business peer. This helps contributors chart their own path and work on the features that the community cares about most. Professional services for PostgreSQL are provided by companies that often contribute to projects but do not control the development process.

This focus on community-driven development has led to a lot of engagement from PostgreSQL users. A large number of high-quality extensions and applications are available to enhance the functionality of the core PostgreSQL software. Community-developed software can help you manage PostgreSQL servers, compile business intelligence reports, manage new types of data, and use PostgreSQL across a variety of programming languages and platforms.

And now, many next-generation companies are going all-in on PostgreSQL.

As "Radical Simplicity: Just Use Postgres" says, "Just Use Postgres" simplifies the tech stack, reduces components, accelerates development, reduces risk, and adds more functionality.

Postgres can replace many back-end technologies, including MySQL, Kafka, RabbitMQ, ElasticSearch, Mongo, and Redis, making it easy to serve millions of users. Just Use Postgres is no longer confined to a handful of elite teams, but is becoming a mainstream best technical practice.

Ladies and gentlemen, your choices will shape the future of the database world. I hope this article has helped you make better use of the world's most advanced open-source database kernel: PostgreSQL.

https://medium.com/@fengruohang/postgres-is-eating-the-database-world-157c204dcfc4

Postgres is encroaching on the database world