
Optimizing slow Django REST Framework performance

The Django REST Framework allows Django developers to build simple yet robust standards-based REST APIs for their applications. We've used it successfully on a number of Django web design projects. However, even seemingly simple, straightforward usage of the Django REST Framework and its nested serializers can kill performance of your API endpoints. And that matters: if your web server is wasting its time inefficiently responding to a REST API call, it will drag the rest of the server's responsiveness down with it.

At its root, the problem is called the "N+1 selects problem": the database is queried once for data in a table (say, `Customers`), and then queried again, one or more times per customer inside a loop, to get, say, `customer.country.Name`. Using the Django ORM, this mistake is easy to make. Using DRF, it is hard not to make.
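
To make the shape of the problem concrete, here is a minimal sketch in plain Django ORM terms (the `Customer` model, its `country` foreign key, and the `name` field are assumptions for illustration):

# Classic N+1: one query for the customers, plus one query per customer
# when the loop follows the `country` foreign key.
customers = Customer.objects.all()                       # 1 query
for customer in customers:
    print(customer.country.name)                         # +1 query per customer

# The eager-loaded version fetches the related rows up front with a JOIN:
customers = Customer.objects.select_related('country')   # still 1 query
for customer in customers:
    print(customer.country.name)                         # no extra queries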

Luckily, there is a solution that can be used to fix this common Django REST Framework performance problem, without any major restructuring of the code. It requires use of the underutilized `select_related` and `prefetch_related` methods on the Django ORM (and the newer `Prefetch` object as well) to perform what is called "eager loading".

This approach can have a big effect. On the most recent project we applied this to, important API calls were taking 5-10 seconds to return results. After applying appropriate eager loading, the same calls took well below one second. Speedups of 20x or more are typical.

Why does Django REST Framework cause this issue so readily?

When you build a DRF view, you often want the response to include data from more than one related table. Writing this is straightforward and covered in depth in the DRF docs. Unfortunately, as soon as you use a nested relationship in your serializer, you risk crushing your performance, and like so many performance problems, it often only shows itself in production with larger, real-world data sets.

This happens because the Django ORM is lazy; it only fetches the minimum amount of data needed to respond to the current query. It does not know you're about to ask a hundred (or ten thousand) times for the same or very similar data.

And these days, when talking about database-backed websites, the number of round-trips to the database is generally the most important metric in determining site responsiveness.

In DRF, we run into trouble whenever a serializer has a nested relationship, such as either of these:

class CustomerSerializer(serializers.ModelSerializer):
    # This can kill performance!
    order_descriptions = serializers.StringRelatedField(many=True)
    # So can this, same exact problem...
    orders = OrderSerializer(many=True, read_only=True)

The code inside DRF that populates either field of `CustomerSerializer` does this:

  1. Fetch all `customers`. (Requires a round-trip to the database.)
  2. For the first returned customer, fetch its `orders`. (Requires another round-trip to the database.)
  3. For the second returned customer, fetch its `orders`. (Requires another round-trip to the database.)
  4. For the third returned customer, fetch its `orders`. (Requires another round-trip to the database.)
  5. For the fourth returned customer, fetch its `orders`. (Requires another round-trip to the database.)
  6. For the fifth returned customer, fetch its `orders`. (Requires another round-trip to the database.)
  7. For the sixth returned customer, fetch its `orders`. (Requires another round-trip to the database.)
  8. ... you get the idea. Let's hope you don't have too many customers!
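
For reference, the nested `OrderSerializer` used above isn't shown; a minimal hypothetical version (assuming an `Order` model with a `description` field) might look like this:

class OrderSerializer(serializers.ModelSerializer):
    """ Hypothetical nested serializer; every field it exposes is fetched per customer. """
    class Meta:
        model = Order                      # assumed Order model
        fields = ('id', 'description')     # assumed fields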

And it can quickly get worse. If your `OrderSerializer` itself has a nested relationship, you have a loop inside a loop, and you're quickly in trouble, even for smallish amounts of data. As a rule of thumb, these days, on a modest-traffic website, you can probably afford about 50 trips to the database before you start getting into real trouble.

The basic approach to solving Django's "laziness"

Our approach to fixing this problem is called "eager loading". Essentially, you warn the Django ORM ahead of time that you're going to ask it the same inane question over and over, "so get ready". In the above example, simply do this before DRF starts fetching:
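
A minimal sketch of that call, assuming the `Customers` model used in the examples below:

# Tell the ORM up front to fetch every customer's orders in one extra query.
customer_qs = Customers.objects.all().prefetch_related('orders')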

Then, when DRF makes the same call as above to serialize customers, this happens instead:

  1. Fetch all `customers`. (Makes TWO round-trips to the database. The first fetches the customers. The second fetches all orders related to any of the fetched customers.)
  2. For the first returned customer, fetch its `orders`. (Does NOT require a trip to the database; we already fetched the needed data in step 1.)
  3. For the second returned customer, fetch its `orders`. (Does NOT require a trip to the database.)
  4. For the third returned customer, fetch its `orders`. (Does NOT require a trip to the database.)
  5. For the fourth returned customer, fetch its `orders`. (Does NOT require a trip to the database.)
  6. For the fifth returned customer, fetch its `orders`. (Does NOT require a trip to the database.)
  7. For the sixth returned customer, fetch its `orders`. (Does NOT require a trip to the database.)
  8. ... you get the idea. You can have LOTS of customers and not have to keep waiting on trips to the database.

In short, the Django ORM "eagerly" asked for the data in step 1, then could supply the data requested in steps 2+ from its local data cache. Fetching data from the local cache is essentially instantaneous compared with a database round-trip, so we just got an enormous performance speedup whenever there are many customers.
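
If you want to verify the round-trip counts yourself, here is a quick sketch using Django's `CaptureQueriesContext` test utility (the model and serializer names are the hypothetical ones from above):

from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    customer_qs = Customers.objects.prefetch_related('orders')
    data = CustomerSerializer(customer_qs, many=True).data

# With eager loading: roughly 2 queries.
# Without prefetch_related(): 1 + N queries for N customers.
print(len(ctx.captured_queries))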

Standardizing a pattern to fix the Django REST Framework performance problem

We have settled on a common pattern to optimize this Django REST Framework performance problem. Whenever a serializer will query nested fields, we add a new `@staticmethod` called `setup_eager_loading` to the serializer, like so:

class CustomerSerializer(serializers.ModelSerializer):
    orders = OrderSerializer(many=True, read_only=True)

    @staticmethod
    def setup_eager_loading(queryset):
        """ Perform necessary eager loading of data. """
        queryset = queryset.prefetch_related('orders')
        return queryset

And then, wherever that serializer is going to be used, simply call `setup_eager_loading` on the queryset before the serializer is invoked, like so:

customer_qs = Customers.objects.all()
customer_qs = CustomerSerializer.setup_eager_loading(customer_qs)  # Set up eager loading to avoid N+1 selects
post_data = CustomerSerializer(customer_qs, many=True).data
           

...or, if you have an `APIView` or a `ViewSet`, you can call `setup_eager_loading` in the `get_queryset` method:

def get_queryset(self):
    queryset = Customers.objects.all()
    # Set up eager loading to avoid N+1 selects
    queryset = self.get_serializer_class().setup_eager_loading(queryset)  
    return queryset
           
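
In context, that method lives on the view class itself; a minimal hypothetical `ModelViewSet` (the view and serializer names are assumptions) could look like this:

from rest_framework import viewsets

class CustomerViewSet(viewsets.ModelViewSet):
    serializer_class = CustomerSerializer

    def get_queryset(self):
        # Set up eager loading to avoid N+1 selects
        queryset = Customers.objects.all()
        return self.get_serializer_class().setup_eager_loading(queryset)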

How do I write `setup_eager_loading`?

The hard part of solving this Django performance problem is becoming adept with how `select_related` and its friends work. Here, we'll detail how each is used in the context of the Django ORM and the Django REST Framework.

  • `select_related`: The simplest eager loading tool in the Django ORM, for one-to-one or many-to-one relationships where you need data from the "one" parent object, such as a customer's company name. This translates into a SQL join, so the parent rows are fetched in the same query as the child rows. (See Official Documentation)
  • `prefetch_related`: For more complex relationships where there are multiple rows per result (i.e., `many=True`), like many-to-many or one-to-many relationships, such as a customer's orders as above. This translates to a second SQL query on the related table, usually with a long `WHERE ... IN` clause to select only the relevant rows. (See Official Documentation)
  • `Prefetch`: Used for complex `prefetch_related` queries, such as filtered subsets. It can also be used to nest `setup_eager_loading` calls, as sketched below. (See Official Documentation)
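
For example, nesting `setup_eager_loading` calls via `Prefetch` might look like the sketch below (it assumes the nested `OrderSerializer` defines its own `setup_eager_loading`, and that an `Order` model exists):

from django.db.models import Prefetch

# Inside CustomerSerializer.setup_eager_loading(): delegate to the nested
# serializer so it can add its own select_related/prefetch_related calls.
order_qs = OrderSerializer.setup_eager_loading(Order.objects.all())
queryset = queryset.prefetch_related(Prefetch('orders', queryset=order_qs))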

An example model with the appropriate eager loading

For our example, let's optimize the Django REST Framework-related performance problems of an imaginary event-planning website (which surprisingly parallels our ongoing project getfetcher.com). We have a simple database structure:

from django.contrib.auth.models import User
from django.db import models

class Event(models.Model):
    """ A single occasion that has many `attendees` from a number of organizations. """
    creator = models.ForeignKey(User, on_delete=models.CASCADE)
    name = models.TextField()
    event_date = models.DateTimeField()

class Attendee(models.Model):
    """ A party-goer who (usually) represents an `organization`, and who may attend many `events`. """
    events = models.ManyToManyField(Event, related_name='attendees')
    organization = models.ForeignKey('Organization', null=True, on_delete=models.SET_NULL)

class Organization(models.Model):
    name = models.TextField()

For this example, to fetch all events, our eager loading code would look like this:

from django.db.models import Prefetch

class EventSerializer(serializers.ModelSerializer):
    creator = serializers.StringRelatedField()
    attendees = AttendeeSerializer(many=True)
    unaffiliated_attendees = AttendeeSerializer(many=True)

    @staticmethod
    def setup_eager_loading(queryset):
        """ Perform necessary eager loading of data. """
        # select_related for "to-one" relationships
        queryset = queryset.select_related('creator')
        # prefetch_related for "to-many" relationships
        queryset = queryset.prefetch_related(
            'attendees',
            'attendees__organization')
        # Prefetch for subsets of relationships: stash the filtered attendees
        # on each Event as `unaffiliated_attendees` via to_attr
        queryset = queryset.prefetch_related(
            Prefetch('attendees',
                     queryset=Attendee.objects.filter(organization__isnull=True),
                     to_attr='unaffiliated_attendees'))
        return queryset

When we make sure to invoke `setup_eager_loading` before using the `EventSerializer`, we will only issue a handful of larger queries (the base query plus one per prefetched relation) instead of N+1 smaller queries, and our performance will usually be MUCH better!

Conclusion

Eager loading is a common performance optimization that has application well beyond the Django REST Framework.

Any time you are querying nested relationships via an ORM, you should think about setting up the proper eager loading. In my experience, it is the most common performance-related problem in small- and mid-sized web development today.

In a followup blog post, I'll cover some debugging strategies for figuring out elusive queries spawned by more complex serializers, as well as some more advanced usages of `Prefetch`.

References

  • Django REST Framework documentation
  • Github issue to automatically perform this eager loading: "Automatically determine `select_related` and `prefetch_related` on ModelSerializer."
  • Tom Christie, the author of DRF, touches on the issue treated above in his blog post about DRF performance, "Get your ORM lookups right."

Thank you for reading!