
Why is it so difficult to leverage the elasticity of the public cloud?


Author | Wang Xiaorui

Planning | Tina

Xiaorui Wang, Co-Founder & CEO of AutoMQ

Cloud computing achieves a better cost per unit of resource through resource pooling, letting enterprises outsource IDC construction, basic software R&D, and operations to cloud vendors so they can focus on business innovation. The resource pool includes not only servers but also talent: cloud vendors have gathered excellent engineers who serve many enterprises through cloud services, so that specialized work is handled by the people best equipped to do it.

After so many years of cloud computing, elasticity remains the technical capability practitioners care about most. Yet when it comes to concrete cases, few customers use elasticity well; it has become a slogan, an ideal architecture rather than a reality.

Cloud vendors retain customers through subscription discounts, which works against elastic scenarios

The rules of the game can be seen by comparing a typical subscription (Reserved) EC2 price with the pay-as-you-go (On-Demand) price for the same instance type:

A subscription offers roughly 50% cost savings compared to pay-as-you-go.
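To make the trade-off concrete, here is a minimal sketch using hypothetical hourly prices (illustrative numbers, not actual AWS list prices) to compute the break-even utilization at which pay-as-you-go becomes cheaper than a subscription:

```python
# Hypothetical hourly prices for one instance type (illustration only,
# not actual AWS list prices).
on_demand_per_hour = 0.0770   # pay-as-you-go
reserved_per_hour = 0.0385    # 1-year subscription, ~50% cheaper

# A reserved instance is billed for every hour of the month, used or not.
hours_per_month = 730
reserved_monthly_cost = reserved_per_hour * hours_per_month

# Pay-as-you-go is billed only for the hours actually used, so it wins
# whenever utilization falls below the price ratio.
break_even_utilization = reserved_per_hour / on_demand_per_hour

print(f"Reserved cost per month: ${reserved_monthly_cost:.2f}")
print(f"Break-even utilization:  {break_even_utilization:.0%}")
# With a ~50% discount, pay-as-you-go is only cheaper if the instance is
# needed less than ~50% of the time -- which is exactly the elastic scenario.
```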

That is why most enterprises buy EC2 resources on a subscription basis. From the cloud vendor's perspective, this design is entirely reasonable: vendors decide how much spare capacity to keep in a region by forecasting usage across all customers. If On-Demand and Reserved instances cost the same, predicting a region's utilization would be very difficult; it could even swing dramatically between day and night, which would directly affect supply-chain procurement decisions. Cloud vendors run what is essentially a retail-like business: idle machines in each region are the inventory, and the higher the inventory ratio, the lower the profit margin.

Spot Instances happen to be both cheap and pay-as-you-go

For stateless applications, Spot Instances notify the application before being reclaimed, and most cloud vendors give a reclamation window of a few minutes, so as long as the application goes offline gracefully there is no impact on the business. Overseas startups [1] that specialize in managing compute resources on top of Spot Instances offer a wealth of product features to help users make good use of them, and AutoMQ also has extensive experience with Spot Instances [2]. For stateful applications, however, the bar for using Spot Instances is much higher: state must be transferred away before the instance is forcibly reclaimed. For data-centric infrastructure software such as Kafka, Redis, and MySQL, deploying directly on Spot Instances is generally not recommended.
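As a rough illustration of how a stateless service can react to that reclamation notice, the sketch below polls the EC2 instance metadata endpoint for a Spot interruption notice and then triggers a graceful shutdown. Here drain_traffic_and_exit is a placeholder for whatever graceful-offline logic the application already has, and newer instances that enforce IMDSv2 would additionally need a session token.

```python
import time
import urllib.error
import urllib.request

# EC2 exposes a pending Spot interruption at this instance-metadata path;
# it returns 404 until a reclamation is scheduled (typically ~2 minutes ahead).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def drain_traffic_and_exit():
    """Placeholder: deregister from the load balancer, finish in-flight work, exit."""
    print("Interruption notice received, draining traffic...")


def watch_for_interruption(poll_seconds: int = 5) -> None:
    while True:
        try:
            with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
                # Body contains the action ("terminate"/"stop") and its time.
                print(resp.read().decode())
                drain_traffic_and_exit()
                return
        except urllib.error.HTTPError as e:
            if e.code != 404:  # 404 simply means "no interruption scheduled yet"
                print(f"metadata query failed: {e}")
        except urllib.error.URLError as e:
            print(f"metadata endpoint unreachable: {e}")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    watch_for_interruption()
```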

The rules of this game are reasonable, but they are also worth optimizing. The author believes at least the following could be done better:

Provide an SLA for the Spot reclamation mechanism

To encourage more users to adopt Spot Instances, the Spot reclamation mechanism needs a well-defined SLA, so that critical businesses can use Spot Instances at scale.

Provide an SLA for the create-instance API

After a Spot Instance is reclaimed, the application typically plans to provision new resources (new Spot Instances or new On-Demand Instances). The API for creating new instances therefore also needs a well-defined SLA, because its latency directly affects the availability of the application.

Provide an SLA for detaching disks

Detaching an EBS volume also needs a well-defined SLA, so that users can automate moving application state off a Spot Instance when it is forcibly reclaimed.
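To make concrete why each of these three SLAs matters, here is a minimal boto3 sketch (the volume ID, AMI ID, region, and device name are placeholders, not values from the article) of the recovery path after a Spot instance is reclaimed: detach the EBS volume, launch a replacement instance with an On-Demand fallback, and re-attach the state. Every step is an API call whose latency and success rate directly determine the application's unavailability window.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder identifiers for illustration only.
VOLUME_ID = "vol-0123456789abcdef0"
AMI_ID = "ami-0123456789abcdef0"


def recover_after_spot_reclaim(use_spot_first: bool = True) -> str:
    # 1. Detach the data volume from the reclaimed instance.
    #    How long this takes is exactly the "detach SLA" discussed above.
    ec2.detach_volume(VolumeId=VOLUME_ID, Force=True)
    ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

    # 2. Launch a replacement, trying Spot first and falling back to On-Demand.
    #    The latency and success rate of RunInstances is the "create SLA".
    params = dict(ImageId=AMI_ID, InstanceType="m6g.large", MinCount=1, MaxCount=1)
    try:
        if use_spot_first:
            params["InstanceMarketOptions"] = {"MarketType": "spot"}
        resp = ec2.run_instances(**params)
    except Exception:
        params.pop("InstanceMarketOptions", None)  # fall back to On-Demand
        resp = ec2.run_instances(**params)

    instance_id = resp["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # 3. Re-attach the state so the stateful application can resume.
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=instance_id, Device="/dev/sdf")
    return instance_id
```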

(Figure: AWS US East m6g.large)

It is hard for programmers to do resource reclamation well

C/C++ programmers put a great deal of effort into wrestling with memory, yet still cannot guarantee that memory will never leak. The reason is that precise resource reclamation is genuinely hard: if a function returns a pointer, C/C++ has no convention about who is responsible for freeing that object, and once multithreading is involved it becomes a nightmare. C++ therefore introduced smart pointers, which manage objects with thread-safe reference counting. Java solves object reclamation with its built-in GC, detecting unreachable objects at runtime, at the cost of some runtime overhead. The recently popular Rust language takes an approach that is essentially similar to C++ smart pointers, but innovatively moves the memory-reclamation checks to compile time, greatly improving efficiency and avoiding the memory mistakes C/C++ programmers routinely make. The author believes Rust will be an excellent replacement for C/C++.

Returning to the cloud as an operating system: programmers can create an ECS instance, a Kafka instance, or an S3 object through an API, and behind each of those APIs sits a billing meter. Creating is easy; reclaiming is hard. For example, creating a Kafka cluster may start with twenty machines, because scaling up or down later is difficult, so it seems better to over-provision in one go.

Although cloud computing provides elasticity, programmers find it hard to manage resources on demand, and reclaiming them is harder still. This pushes enterprises to set up cumbersome approval processes for creating resources in the cloud, much like traditional IDC resource management. The end result is that cloud resources are used the same way IDC resources were: tracked in a CMDB and guarded by manual approvals to avoid waste.

We have also seen some good elasticity practices. At one large enterprise, for example, no EC2 instance ID lives longer than one month; anything older is flagged as an "elderly EC2" and lands on the team's blacklist. This is an excellent immutable-infrastructure practice: it stops engineers from keeping state (configurations, data, and so on) on servers, making it feasible for applications to move toward elastic architectures.
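A rule like this is easy to automate. Below is a minimal sketch (the 30-day threshold, region, and tag name are assumptions for illustration, not part of the original example) that lists running EC2 instances older than one month and tags them for recycling:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
MAX_AGE = timedelta(days=30)  # assumed threshold matching the "one month" rule


def find_long_lived_instances():
    """Return IDs of running instances older than MAX_AGE."""
    old = []
    now = datetime.now(timezone.utc)
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                if now - inst["LaunchTime"] > MAX_AGE:
                    old.append(inst["InstanceId"])
    return old


if __name__ == "__main__":
    for instance_id in find_long_lived_instances():
        # Tag rather than terminate, so the owning team gets to act first.
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": "lifecycle", "Value": "overdue-for-recycling"}],
        )
        print(f"{instance_id} exceeded {MAX_AGE.days} days, tagged for recycling")
```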

At present, cloud computing is still at its C/C++ stage: there is no good resource-reclamation solution, so enterprises fall back on heavy approval processes, which in essence prevents them from exploiting the cloud's biggest advantage, elasticity. This is one of the main reasons enterprises' cloud spending is so high.

I believe that as long as the problem exists, better solutions will follow: a Java- or Rust-style answer to cloud resource reclamation will surely emerge in the near future.

Neither the underlying software nor the application layer is ready for elasticity

In 2018, the author began designing elasticity projects for the thousands of applications behind Taobao and Tmall [3]. At that time, Taobao and Tmall already colocated online and offline workloads to improve deployment density, but online applications still ran in reserved mode and could not scale on demand: an application depends on the SDKs of various middleware (database, cache, MQ, business cache, and so on), and the application itself takes a long time to start.

To take Java applications from cold starts measured in minutes down to milliseconds, we built a Snapshot capability for Docker at the time [3], putting it into production about four years before AWS announced the Lambda SnapStart feature at re:Invent 2022 [4][5]. Starting an application from a snapshot can bring up a working compute node in hundreds of milliseconds, which lets an application scale its compute up and down with traffic just like Lambda, without being rewritten as a Lambda function; this is exactly the pay-as-you-go capability we see Lambda provide.
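For comparison, SnapStart on today's Lambda is turned on with a single configuration setting; a minimal boto3 sketch (the function name is a placeholder) might look like this:

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Placeholder function name; SnapStart applies to published versions,
# which Lambda then restores from a snapshot instead of cold-starting the JVM.
lambda_client.update_function_configuration(
    FunctionName="my-java-service",
    SnapStart={"ApplyOn": "PublishedVersions"},
)

# Publish a version so that invocations of it benefit from snapshot restore.
lambda_client.publish_version(FunctionName="my-java-service")
```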

If elasticity at the application layer is already this complex, it is even more challenging for basic software such as databases, caches, MQ, and big data products. Their requirements for distribution, high availability, and high reliability mean they must store multiple copies of data; once the data volume grows, elasticity becomes very difficult, and migrating data affects business availability. Solving this in the cloud requires a cloud-native approach. We designed AutoMQ (a cloud-native implementation of Kafka) with elasticity as the top priority, and the core challenge was to offload storage to a pay-as-you-go cloud service such as S3 rather than building our own storage system. The following figure shows traffic and node count for an AutoMQ production deployment: AutoMQ adds and removes machines fully automatically according to traffic, and if those machines are Spot Instances, it saves enterprises a great deal of money and truly achieves pay-as-you-go.

(Figure: AutoMQ traffic and node count over time; AWS US East m6g.large)

How enterprises can use elasticity to reduce costs and increase efficiency

In 2019, Google launched Cloud Run [6], a fully managed computing platform: an application that communicates over HTTP only has to hand Cloud Run a listening port and a container image, and all infrastructure management is fully automated. Compared with AWS Lambda, the biggest advantage of this approach is that it does not tie you to a single cloud vendor, making it easier to migrate to other computing platforms later. AWS and Azure soon followed with similar offerings, Azure Container Apps [7] and AWS App Runner [8].
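The contract such platforms impose on the application is deliberately thin. A minimal sketch of a Cloud Run-compatible service is just an HTTP server that listens on the port passed in through the PORT environment variable; the deploy command in the trailing comment is one common way to ship it, though details vary:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello from an elastic service\n")


if __name__ == "__main__":
    # Cloud Run (and similar platforms) tell the container which port to
    # listen on via the PORT environment variable; everything else --
    # scaling to zero, scaling out, TLS, load balancing -- is managed.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()

# Deployed, for example, with: gcloud run deploy my-service --source .
```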

Elasticity is genuinely hard, so applications in the cloud are advised to rely as much as possible on hosting frameworks like Cloud Run that do not bind code to a vendor, so that the compute an application consumes can be paid for per request.

Basic software shows the same trend: every category is evolving toward an elastic architecture, for example Amazon Aurora Serverless and MongoDB Serverless [9]. From cloud vendors to third-party open-source software vendors, there is a consensus on moving toward thoroughly elastic architectures.

When choosing open-source basic software of this kind, enterprises should prefer products with real elastic capability; a good litmus test is whether the product can run on Spot Instances and whether doing so is cost-effective. It is also worth checking how well such products run across multiple clouds, which determines how portable the enterprise will be if it later moves to a multi-cloud or even hybrid-cloud architecture.

Original link: "Why is it difficult to give full play to the elasticity of the public cloud?", Wang Xiaorui, InfoQ Selected Articles (Cloud Native)