Some of the benefits of a simple software architecture

Author | Dan Luu

Translated by | Sambodhi

Planned by | Tina

This article was originally published on the Wave website and is translated and shared by InfoQ with the permission of the original author, Dan Luu.

Wave is a $1.7 billion company with 70 engineers whose product is a CRUD app that adds and subtracts numbers. Consistent with that, our architecture is a standard CRUD app architecture: a Python monolith on top of Postgres. Starting with a simple architecture and then solving problems in the simplest way possible has allowed us to scale the business to this size while our engineers mostly stay focused on delivering value to users.

Stack Overflow scaled up a monolith to good effect (2013 architecture / 2016 architecture) and was eventually acquired for $1.8 billion. If we measure by traffic rather than market cap, Stack Overflow is one of the top 100 highest-traffic sites on the internet (for examples of many other valuable companies built on monoliths, see the replies to this Twitter thread: https://twitter.com/danluu/status/1462607028585525249). We don't have a lot of web traffic ourselves because we're a mobile app, but Alexa still ranks our site in the top 75,000, even though the site is basically just a way for people to find the app, and most people don't get the app from our site.

Some applications have demands that would make a simple monolith on top of a boring database a non-starter, but for most applications, even at top-100-website levels of traffic, computers are fast enough to serve requests from a simple architecture, and a simple architecture is usually cheaper and easier to build than a complex one.

Despite the unreasonable effectiveness of simple architectures, most press goes to complex ones. For example, at a recent generalist tech conference, there were six talks on how to build or deal with the fallout of complex, microservice-based architectures and zero on how to build a simple monolith; there were more talks on quantum computing (one) than on monoliths. The same holds at larger conferences: a recent enterprise-oriented conference in San Francisco had a double-digit number of talks on dealing with complex architectures and not a single one on how to build a simple monolith. The last time I attended that conference, I was struck by how many attendees worked at companies whose applications were small enough to be served by a simple architecture, yet were using the latest, hottest technologies that are trendy on the conference circuit and on the internet.

Our architecture is so simple that I won't even bother with an architecture diagram. Instead, I'll discuss a few boring things we do that help keep everything boring.

We're currently using boring, synchronous Python, which means that our server processes block while waiting for I/O, such as network requests. We previously tried Eventlet, an async framework that would, in theory, let us get more efficiency out of Python, but we ran into so many bugs that we decided the CPU and latency cost of waiting for events wasn't worth the operational pain of dealing with Eventlet. The stories for other well-known Python async frameworks are similar: users who operate them at scale tend to report serious fallout from doing so. Using synchronous Python is expensive in the sense that we pay for CPUs that do nothing but wait during network requests, but since we're only handling a few billion requests a month, the cost of this is low even with a slow language like Python at retail public cloud prices. The cost of our engineering team completely dominates the cost of the systems we operate.
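To make that tradeoff concrete, here's a minimal sketch of what a boring, synchronous request handler looks like; Flask, the endpoint, and the upstream URL are illustrative assumptions, not Wave's actual code:

```python
import requests
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/balance/<user_id>")
def balance(user_id):
    # The worker process simply blocks here until the upstream service
    # responds: no event loop, no callbacks, no async/await.
    resp = requests.get(
        f"https://upstream.example.com/accounts/{user_id}", timeout=5
    )
    return jsonify(resp.json())

if __name__ == "__main__":
    app.run()
```

The cost of this style is a worker that sits idle during the network call; the benefit is that the control flow is exactly what it looks like.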

We offload long-running tasks (ones we don't want to block a response on) to a queue rather than taking on the complexity of making our monolith asynchronous.
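As a sketch of that pattern, here's what enqueueing work with Celery over RabbitMQ (both discussed below) might look like; the task, phone number, and broker URL are hypothetical:

```python
from celery import Celery

# RabbitMQ as the broker; the URL is an assumption for illustration.
app = Celery("tasks", broker="amqp://localhost")

@app.task
def send_sms(phone_number: str, message: str) -> None:
    """Slow telecom work runs in a worker process, not the web server."""
    ...

# In a request handler, we enqueue and return immediately:
#   send_sms.delay("+221700000000", "Your transfer has arrived")
```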

One place where we can't be as boring as we'd like is with our on-premises datacenters. When we operated only in Senegal and Côte d'Ivoire, we ran entirely in the cloud, but as we expand into Uganda (and more countries in the future), we have to split our backend and deploy it to local on-premises datacenters to comply with local data-residency laws and regulations. That's not a simple operation, but as anyone who has done the same thing with a complex service-oriented architecture knows, it's much simpler than doing it with a complex service-oriented architecture.

Another such area is software we build instead of buy. When we first started, we had a strong preference for buying software over building it, because a team with only a handful of engineers couldn't afford the time cost of building everything. The "buy" option usually gets you tools that don't really work (https://danluu.com/nothing-works/), but it was the right choice at the time. When we can't convince a vendor to fix a showstopper bug that's a critical blocker for us, it does make sense to build more of our own tools and retain in-house expertise in more areas (https://danluu.com/in-house/), even though this runs counter to the standard advice that companies should only "build" their core competency. Much of this complexity is something we'd rather not take on, but for some categories of products, even after fairly extensive research, we couldn't find a vendor who could provide a product that works for us. To be fair to our vendors, the problems they need to solve are much more complex than the ones we need to solve: a vendor takes on the complexity of solving the problem for every customer, whereas we only need to solve it for one customer, ourselves.

A showstopper bug is a hardware or software defect that halts execution and renders the product essentially unusable; such a critical bug must be fixed before development can proceed.

In our first few months of operation, we made the mistake of not carefully defining the boundaries of our database transactions, and we're still paying the cost today. In Wave's codebase, a SQLAlchemy database session is a request-global variable; it implicitly begins a new database transaction whenever an attribute of a DB object is accessed, and any function in the codebase can call commit on the session, committing all pending updates. This makes it difficult to control when database updates happen, increases the likelihood of subtle data-integrity bugs, and makes it hard to lean on the database to build mechanisms like idempotency keys or a transactionally staged job drain. It also increases the risk of accidentally holding long-running database transactions open, which can make schema migrations operationally difficult.
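To illustrate the alternative, here's a sketch of explicit transaction boundaries in SQLAlchemy (1.4+ style); the Account model and the transfer scenario are hypothetical, not Wave's actual schema:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Account(Base):  # hypothetical model, for illustration only
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    balance = Column(Integer)  # money stored as integer minor units

engine = create_engine("postgresql://localhost/wave")  # assumed DSN

def transfer(src_id: int, dst_id: int, amount: int) -> None:
    # One explicit transaction per logical operation: both updates
    # commit atomically when the block exits, or roll back together
    # on an exception. No helper several calls away can commit a
    # half-finished debit, which is the hazard with a request-global
    # session that any function is allowed to commit.
    with Session(engine) as session, session.begin():
        src = session.get(Account, src_id)
        dst = session.get(Account, dst_id)
        src.balance -= amount
        dst.balance += amount
```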

Some choices we're less sure about (in the sense that we're considering changing them, or would suggest that teams starting from scratch consider alternatives) are: using RabbitMQ (for our purposes, Redis would probably work just as well as a task queue, and using only Redis would reduce our operational burden); using Celery (it's more complex than our use case requires and has failed on us several times, e.g., with backwards-compatibility issues during version upgrades); using SQLAlchemy (it makes it hard for developers to understand what database queries their code will emit, which leads to all sorts of hard-to-debug situations and unnecessary operational pain, especially in connection with the database transaction boundaries discussed above); and using Python (it was the right initial choice given our founding CTO's technical background, but its concurrency support, performance, and pervasive dynamism make us question whether it's the right choice for a large-scale backend codebase). None of these are major mistakes, and for some (e.g., Python) the downsides are minor enough that the ongoing maintenance cost is lower than the cost of migrating to a theoretically better alternative, but if we were writing a similar codebase from scratch today, we would think hard about whether they were the right choices.
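As an illustration of the "Redis would probably suffice" point, a task queue can be as simple as a Redis list; the queue name and payload shape here are assumptions:

```python
import json
import redis

r = redis.Redis()

def enqueue(task: dict) -> None:
    r.lpush("tasks", json.dumps(task))

def handle(task: dict) -> None:
    print("processing", task)  # stand-in for real task dispatch

def worker_loop() -> None:
    while True:
        # BRPOP blocks until a task arrives and returns (queue, payload).
        _, payload = r.brpop("tasks")
        handle(json.loads(payload))
```

This gives up Celery's retries, scheduling, and result tracking, but for simple fire-and-forget jobs there is correspondingly less to operate and less to go wrong.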

There are a few cases where we're arguably not making the simplest feasible choice: our API, where we use GraphQL; our transport protocol, where we used a custom protocol for a while; and our host management, where we use Kubernetes. For our transport protocol, we used a custom UDP-based protocol with SMS and USSD fallbacks, for performance reasons discussed in this talk. With the rollout of HTTP/3, we've been able to replace our custom protocol with HTTP/3, and we generally only need USSD for events like the recent internet shutdown in Mali.

As for GraphQL, we believe its advantages outweigh its disadvantages for us:

Pros:

Self-documentation of exact return types;

Code generation of exact return types, which makes clients safer;

The GraphiQL interactive explorer is a productivity win;

Our various apps (user app, support app, Wave agent app, etc.) can mostly share a single API, reducing complexity;

Its composable query language lets clients fetch exactly the data they need in a single packet round trip, without our having to build lots of special-purpose endpoints (see the sketch after these lists);

Avoids the pointless debate about what counts as a RESTful API.

Cons:

When we adopted GraphQL, the GraphQL libraries weren't very good (the basic Python library was a port of the JavaScript one and therefore not Pythonic, Graphene required a lot of boilerplate, and Apollo-Android generated poorly optimized code).

The default GraphQL encoding is redundant, and since many of our users have low-bandwidth connections, we care a lot about limiting payload size.
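For a concrete look at the composability point from the pros list, here's a small self-contained Graphene example; the Wallet type and its fields are hypothetical and not Wave's actual schema:

```python
import graphene

class Wallet(graphene.ObjectType):
    balance = graphene.Int()
    currency = graphene.String()

class Query(graphene.ObjectType):
    wallet = graphene.Field(Wallet)

    def resolve_wallet(self, info):
        return Wallet(balance=12500, currency="XOF")

schema = graphene.Schema(query=Query)

# The client names exactly the fields it wants; nothing else crosses
# the wire, and the shape of the result is known statically.
result = schema.execute("{ wallet { balance } }")
print(result.data)  # {'wallet': {'balance': 12500}}
```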

As for Kubernetes: we chose it because we knew that if the business succeeded (and it has), we would eventually expand into countries that require us to operate our service in-country. The specifics vary from country to country, but we've already expanded into major African markets that require us to run our "primary datacenter" in-country, along with other regulations such as a requirement that we be able to fail over to a datacenter inside the country.

Telecom integration is one area of complexity we can't avoid. In principle, we could use SaaS SMS providers for everything, but the major SaaS SMS providers don't have operations across Africa (https://youtu.be/6tb8ALAvodM?t=196), so relying on them everywhere we operate would be prohibitively expensive. If we did use SaaS SMS providers for all of our SMS needs, the earlier claim that engineer compensation dominates the cost of our systems would no longer hold; the team that builds our telecom integrations pays for itself many times over.

By keeping our application architecture as simple as possible, we can spend our complexity budget (and our headcount) on areas that actually benefit the business. Doing things as simply as possible unless there's a compelling reason to add complexity has let us build a business of no small size with a relatively small number of engineers, even though the African financial services business we operate in is generally considered a hard one to break into; we'll say more about that in a future article. (One of our earliest and most helpful advisors, whose advice was critical to Wave's success, initially argued that Wave was a bad business idea and that the founders should pick another one, because he foresaw so many potential difficulties.)

https://www.wave.com/en/blog/simple-architecture/index.html