
Stop using offset and limit pagination, the performance is too bad!

Author: Prose thinks with the wind

Gone are the days when you didn't need to worry about database performance tuning.

As times have progressed and ambitious businesses aspire to become the next Facebook, the idea of collecting as much data as possible for machine-learning predictions has taken hold.

As developers, we need to constantly polish our APIs so that they provide reliable and efficient endpoints to effortlessly navigate through massive amounts of data.

If you've done backend development or designed a database schema, you've probably implemented pagination like this:

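(A sketch of the typical query; the table and column names are only illustrative, not taken from the article.)

-- Classic OFFSET/LIMIT pagination: skip the first 40 rows, return the next 10
SELECT *
FROM users
ORDER BY id
LIMIT 10 OFFSET 40;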

If that really is how you paginate, then I'm sorry to say you've been doing it wrong.

You don't think so? Slack, Shopify, and Mixmax all paginate using the approach we're going to talk about today.

I guess you'd be hard-pressed to find someone who doesn't use OFFSET and LIMIT for database pagination. For simple small applications and scenarios where the amount of data is not very large, this method can still "cope".

If you want to build a reliable and efficient system from the ground up, get it right from the start.

Today we're going to explore the problems with the already widely used pagination methods and how to achieve high-performance pagination.

What's wrong with OFFSET and LIMIT?

As mentioned in the previous paragraphs, OFFSET and LIMIT are no problem for projects with small amounts of data.

However, problems arise once the database holds more data than the server's memory can fit and you still need to paginate through all of it.

In order to achieve pagination, the database needs to perform an inefficient full-table scan every time it receives a paging request.

A full table scan (also known as a sequential scan) reads every row of a table one by one and then checks whether each row meets the query criteria.

This type of scan is known to be the slowest because of the large amount of disk I/O required and the transfer overhead from disk to memory.
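If you want to see this behaviour for yourself, most databases can show you the query plan. A minimal sketch, assuming a users table and PostgreSQL-style syntax (MySQL 8 has a similar EXPLAIN ANALYZE):

-- Depending on the available indexes, the plan will show either a sequential scan
-- or an index scan that still has to walk past every skipped row.
EXPLAIN ANALYZE
SELECT * FROM users ORDER BY id LIMIT 20 OFFSET 50000;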

This means that if you have 100 million users and the OFFSET is 50 million, the database needs to fetch all of those records (including an enormous amount of data you don't need at all), load them into memory, and only then return the 20 results specified by LIMIT.

That is, in order to get one page of data:

rows 50,000 to 50,020 out of 100,000

you first need to fetch 50,000 rows. If you're not convinced, take a look at this example:

https://www.db-fiddle.com/f/3JSpBxVgcqL3W2AzfRNCyq/1?ref=hackernoon.com

The Schema SQL on the left will insert 100,000 rows of data, and on the right there is a poorly performing query and a good solution.

Just click on Run at the top to compare their execution times. The first query takes at least 30 times longer to run than the second query.

The more data, the worse it gets. Take a look at my PoC on 100,000 rows of data:

https://github.com/IvoPereira/Efficient-Pagination-SQL-PoC?ref=hackernoon.com

Now you should know what's going on behind the scenes: the higher the OFFSET, the longer the query takes.

Alternatives

Here's what you should do instead:

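(A minimal sketch of the idea, assuming an indexed integer id column; the names are illustrative.)

-- Start from the last id received on the previous page instead of counting rows
SELECT *
FROM users
WHERE id > 10       -- 10 = last id from the previous page
ORDER BY id
LIMIT 20;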

This is a form of cursor-based pagination.

Instead of using OFFSET and LIMIT, you keep the last primary key you received (usually an ID) and the LIMIT locally, so every query ends up looking something like the one above.

Why? Because by explicitly telling the database the last row it has seen, it knows exactly where to start searching (using a suitable index) without having to touch records outside the target range.

Compare this query:

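(A hedged reconstruction of the slow query; the table name and the exact offset are illustrative.)

-- The database has to walk past 50 million rows before it can return anything
SELECT *
FROM users
ORDER BY id
LIMIT 20 OFFSET 50000000;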

with the optimized version:

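(Again a hedged reconstruction rather than the article's exact SQL; the table name and ID value are illustrative.)

-- Jump straight to the last id seen on the previous page via the index
SELECT *
FROM users
WHERE id > 50000000
ORDER BY id
LIMIT 20;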

Both return the same results, but the first query took 12.80 seconds, while the second took only 0.01 seconds.

To use this cursor-based pagination, you need a unique, sequential column (or combination of columns), such as a unique integer ID or a timestamp, but in some specific cases this condition may not be met.
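If no single column fits the bill, one common workaround (not from the original article, just a hedged sketch) is to combine a timestamp with the ID into a composite cursor:

-- The cursor is the (created_at, id) pair of the last row on the previous page
SELECT *
FROM events
WHERE (created_at, id) > ('2024-01-01 00:00:00', 12345)
ORDER BY created_at, id
LIMIT 20;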

My advice, in any case, is to weigh the pros and cons of each solution and consider what kind of queries you need to run.

If you need to do queries based on large amounts of data, Rick James' article provides more in-depth guidance:

http://miscl.rjweb.org/doc.php/lists

If a table has no primary key, such as a many-to-many join table, then we have to fall back to the traditional OFFSET/LIMIT approach and accept the potential for slow queries.

I'd recommend adding an auto-incrementing primary key to any table that needs to be paginated, even if it's only used for pagination.
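If you need to retrofit such a key onto an existing table, a hedged sketch in MySQL syntax (the table name is made up; on a very large table this rewrite can take a while):

-- Add a surrogate auto-incrementing key purely so the table can be keyset-paginated
ALTER TABLE order_items
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;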