Background

Time flies: in the blink of an eye it is 2024. I still remember that when I started working in 2015, knowing the SSM/H stack (Spring/Struts2/MyBatis/Hibernate) was enough to get through most interviews.

New CS majors have probably never heard of SSM.

It so happens that I have lived through all of it: from the mobile Internet boom when I started working, through e-commerce, the sharing economy, and the to-B boom, up to the present. Over the same period the technology stack has evolved from monolithic applications on physical machines to today's Kubernetes-based cloud-native architecture.

Of course, there were several major stages along the way: SOA (service-oriented architecture) -> microservices -> cloud native -> service mesh -> serverless, and so on.

My most recent job is mainly about infrastructure, so I think I have a fairly complete picture. In this article I will share, from my own perspective, which cloud-native technology stacks we should use in 2024. Because many technical components are involved, I won't go into too much detail on any of them.

What I can guarantee is that I have used every technology mentioned here myself, and I will cover both strengths and weaknesses, with a focus on real experience.

Operating system

First comes the operating system. This does not mean a traditional operating system (Linux/Windows Server/macOS) but the cloud-native operating system, where there is little room for choice: it is Kubernetes.

Maintaining Kubernetes, however, is hard. I still remember the Didi outage in the second half of last year, which was rumored online to have been caused by a Kubernetes upgrade.

In our experience, small teams should simply use a Kubernetes service hosted by a cloud vendor. Maintaining Kubernetes is very complex, people in small teams usually wear several hats, and self-maintained clusters are more likely to run into problems.

A large team, of course, is better off having dedicated people maintain it, so that problems get a quick response, provided the team can cover that risk.

Because we are a small team, for cost and stability reasons we use only the cloud vendor's Kubernetes service and maintain the remaining components that we can control ourselves (more on this later).

Pros and cons of multi-cloud

Since everything runs on a cloud vendor's container service, we also have to consider what happens when the vendor itself fails, as in last year's Alibaba Cloud outage.

For this reason, some mid-sized and large companies choose a multi-cloud setup: the same code is deployed on several cloud providers, so that traffic can be switched quickly when one of them has a problem.

In practice, however, this brings many challenges. The hardest and most critical one is data consistency: how do we guarantee it?

One option is to use databases or middleware that support distributed deployment and replicate data natively. Among message queues, for example, Pulsar can be deployed across regions and replicate messages between them.

Multi-cloud deployment also costs more, which has to be weighed carefully at a time when everyone is "cutting costs and improving efficiency". So there is a compromise:

Our architecture needs to be able to migrate quickly to another cloud provider. For example, we have internal tools that regularly back up resources such as MySQL binlogs and the metadata of some middleware, and that can quickly restore the business from those backups.

Switching cloud providers is normally needed only in extreme situations, so losing some runtime data is acceptable, as long as core data is not lost and the business is not affected.

This is easy to say, but it also takes time to run simulation drills. Whether to go ahead depends on whether the company would rather bear the cost of a cloud outage or the cost of the drills.

For now, we are able to restore metadata, but we would lose some runtime data.

DevOps

Now that we have chosen Kubernetes as our cloud-native operating system, our continuous integration and release process has to revolve around Kubernetes.

[Figure: GitOps flow using GitLab and ArgoCD]

The figure above shows a Git-based flow built on GitLab and ArgoCD. We manage source code in GitLab and use its pipelines for continuous integration, and finally use Argo to carry the release through to Kubernetes.

This is what we often call GitOps.

Rolling back to earlier versions and scaling are both capabilities that Kubernetes itself provides, so our DevOps platform only needs to call the Kubernetes API.
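As a rough illustration of "only calling the Kubernetes API", the sketch below builds (but does not send) the HTTP request that scales a Deployment through its `scale` subresource. The API server address, namespace, Deployment name, and token are made-up placeholders, and a real platform would more likely use an official client library than raw HTTP; this is only a minimal sketch, not our platform's code.

```python
import json
import urllib.request


def build_scale_request(api_server: str, namespace: str, deployment: str,
                        replicas: int, token: str) -> urllib.request.Request:
    """Build a PATCH request against the Deployment scale subresource.

    The request is only constructed here, never sent.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    url = (f"{api_server.rstrip('/')}/apis/apps/v1/namespaces/{namespace}"
           f"/deployments/{deployment}/scale")
    # A JSON merge patch that changes only the desired replica count.
    body = json.dumps({"spec": {"replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/merge-patch+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    req = build_scale_request("https://k8s.example.internal:6443",
                              "shop", "order-service", 5, "<token>")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

A platform's "scale" button can then be a thin wrapper around a call like this, with authentication and error handling added.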

FinOps is also popular now. As I understand it, it is mainly about managing and optimizing cloud costs. In my own job this means reclaiming unused resources and scaling down some configurations where that does not affect the business.
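To make "scaling down configurations without affecting the business" concrete, here is a small sketch that compares each workload's CPU request with its observed peak and suggests a lower request when the saving is large enough. The workload names, the 30% headroom, the 100m floor, and the 20% saving threshold are all illustrative assumptions, not values from our environment.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_request_millicores: int  # what the pod asks the scheduler to reserve
    cpu_peak_millicores: int     # observed peak over some sampling window


def suggest_cpu_request(w: Workload, headroom_percent: int = 30,
                        floor: int = 100) -> int:
    """Peak usage plus headroom, never below a floor, never above the current request."""
    target = w.cpu_peak_millicores * (100 + headroom_percent) // 100
    return min(w.cpu_request_millicores, max(floor, target))


def downsizing_report(workloads, min_saving: float = 0.2):
    """Return (name, current, suggested) for workloads worth shrinking."""
    report = []
    for w in workloads:
        if w.cpu_request_millicores <= 0:
            continue  # nothing reserved, nothing to reclaim
        suggested = suggest_cpu_request(w)
        saving = 1 - suggested / w.cpu_request_millicores
        if saving >= min_saving:
            report.append((w.name, w.cpu_request_millicores, suggested))
    return report


if __name__ == "__main__":
    fleet = [
        Workload("gateway", 2000, 400),  # heavily over-provisioned
        Workload("billing", 500, 450),   # already tight, left alone
        Workload("cron", 1000, 0),       # idle, shrinks to the floor
    ]
    for name, current, suggested in downsizing_report(fleet):
        print(f"{name}: {current}m -> {suggested}m")
```

In practice the peak values would come from the metrics system, and any change should be rolled out gradually.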

Service Mesh

Next is what I consider the most important part: the service mesh. There is a lot of history behind it, and in essence I think it all stems from RPC (Remote Procedure Call) and distributed systems.

It all starts from the original single-machine local function call:

local+------>remote +------> micro-service+----->service-mesh
               +                  |                    +
               v                  v                    v
           +---+----+       +-----+------+        +----+----+
           | motan  |       | SpringCloud|        | Istio   |
           | dubbo  |       | Dubbo3.0   |        | Linkerd |
           | gRPC   |       | SOFA       |        |         |
           +--------+       +------------+        +---------+

The evolution went through the three main stages shown above: from RPC frameworks to microservices, and now to the service mesh.

RPC frameworks simplify distributed communication so that developers can focus on the business itself.
Microservice frameworks help us manage large numbers of services with features such as throttling, routing, and degradation, which makes distributed applications more robust.
Today's service mesh makes applications more cloud native: teams focus on business development instead of maintaining a microservice framework, these basic capabilities sink into the infrastructure layer, and the functionality is no weaker than that of a microservice framework.
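Throttling and degradation, two of the governance features mentioned above, can be sketched in a few lines. The example uses a token bucket for throttling and a fallback function for degradation; it is a toy illustration with names of my own choosing, not how Dubbo, Spring Cloud, or Istio implement these features.

```python
import time


class TokenBucket:
    """Allow at most `capacity` calls in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def call_with_degradation(bucket: TokenBucket, primary, fallback):
    """Run `primary` if not throttled and it succeeds; otherwise degrade to `fallback`."""
    if not bucket.allow():
        return fallback()
    try:
        return primary()
    except Exception:
        return fallback()


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=1, capacity=2)
    for _ in range(3):
        print(call_with_degradation(bucket, lambda: "live result", lambda: "cached result"))
```

With a framework or a mesh, this logic moves out of business code into the framework or the sidecar proxy, which is exactly the "sinking" described above.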

However, Istio has a high technical barrier to entry. I would recommend it mainly when the following conditions are met:

The application is connected to the Kubernetes platform
The gRPC communication framework is used between applications
The API gateway can be moved to Istio Gateway
The company has at least one person who can maintain Istio (maintaining here does not necessarily mean understanding the code, but it does require solid knowledge of Istio's features and documentation)

Otherwise, using a microservice framework such as Spring Cloud, Dubbo, Kratos, or go-zero is also a reasonable choice.

I have written two articles about Istio before, which may also serve as references:

Implement gRPC load balancing in a Kubernetes environment
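As background for why gRPC and a mesh are mentioned together: gRPC multiplexes requests over long-lived HTTP/2 connections, so balancing per connection (as a plain Kubernetes Service does) can pin all of a client's requests to one pod, while a sidecar proxy can balance per request. The toy simulation below illustrates the difference; the pod names are hypothetical and the code is not taken from the article listed above.

```python
from collections import Counter
from itertools import cycle

PODS = ["pod-a", "pod-b", "pod-c"]  # hypothetical backend pods


def connection_level(requests_per_client: int, clients: int) -> Counter:
    """Each client picks a pod once, when it connects, and reuses that connection."""
    hits = Counter()
    picker = cycle(PODS)
    for _ in range(clients):
        pod = next(picker)                 # chosen at connection time only
        hits[pod] += requests_per_client   # every request follows the connection
    return hits


def request_level(requests_per_client: int, clients: int) -> Counter:
    """Every single request is balanced independently (round robin here)."""
    hits = Counter()
    picker = cycle(PODS)
    for _ in range(clients * requests_per_client):
        hits[next(picker)] += 1
    return hits


if __name__ == "__main__":
    print("per connection:", dict(connection_level(300, clients=1)))
    print("per request:   ", dict(request_level(300, clients=1)))
```

With a single client, connection-level balancing sends all 300 requests to one pod, while request-level balancing spreads them evenly.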

Observability

Observability is becoming more and more important. Personally, I think the most important measure of a technical team is how good its observability system is.

A good observability system gives a clear view of how the system is running, makes troubleshooting efficient, and raises timely alerts on failures.

To meet these goals, an observability system needs three core signals:

Metrics let us draw intuitive dashboards in Grafana and get a fuller picture of the system's health.

Traces help us build a complete picture of the calls between systems: from a single trace we can tell which systems a request passed through and where it went wrong.

Logs are the easiest to understand: they are what the application prints. One difference from the older development model is that in a cloud-native system it is recommended to write directly to standard output and standard error, which makes collection by third-party components easier.
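The recommendation to log to standard output and standard error can be sketched with Python's standard `logging` module. The JSON field names and the split (INFO and WARNING to stdout, ERROR and above to stderr) are my own assumptions for illustration; the idea is simply that the application writes no log files and leaves collection to the platform.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so collectors can parse it easily."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below `max_level`."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_logger(name="app", out=None, err=None):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False

    # INFO and WARNING go to stdout, ERROR and above go to stderr.
    stdout_handler = logging.StreamHandler(out if out is not None else sys.stdout)
    stdout_handler.addFilter(MaxLevelFilter(logging.WARNING))
    stderr_handler = logging.StreamHandler(err if err is not None else sys.stderr)
    stderr_handler.setLevel(logging.ERROR)

    for handler in (stdout_handler, stderr_handler):
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order service started")
    log.error("payment callback failed")
```

Because nothing is written to local files, the container runtime captures both streams and a collector can pick them up without any application-specific configuration.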

Our own observability stack has gone through one iteration. The previous stack was:

Metrics: VictoriaMetrics, a time series database that is fully compatible with Prometheus but uses fewer resources.
Trace: SkyWalking, a popular tracing solution in the Java world.
Logs: Filebeat collects the logs and ships them to Elasticsearch, also a classic solution.

At the end of last year we did a major overhaul, mainly replacing SkyWalking with OpenTelemetry, which has a more open community and has gradually become the standard for cloud-native observability.

This gives us more flexibility, because we are not tied to a specific technology stack. We have not switched logs yet: the community's log support is still in beta, and once it matures we can use OpenTelemetry to collect logs directly.

I also wrote an article on migrating from SkyWalking to OpenTelemetry, which interested readers can refer to:

Hands-on: How to gracefully switch from SkyWalking to OpenTelemetry

Message queues

I bring up message queues because maintaining the company's internal message queue is currently my main job. Message queues also become very important once business volume grows: they usually act as the bridge between business lines or as the channel for synchronizing data with MySQL. In short, they are very useful.

We use Pulsar. Thanks to its separation of storage and compute, it can scale quickly by building on Kubernetes and is easier to maintain than Kafka. The community is also very active, both in fixing bugs and in supporting new features.

Pulsar's officially supported clients are also comprehensive:

Language       Documentation        Release note   Code repo
Java           User doc, API doc    Standalone     Bundled
C++            User doc, API doc    Standalone     Standalone
Python         User doc, API doc    Standalone     Standalone
Go             User doc, API doc    Standalone     Standalone
Node.js        User doc, API doc    Standalone     Standalone
C#/DotPulsar   User doc             Standalone     Standalone

Another question is how to deploy the Pulsar cluster: run it ourselves or buy a cloud service? (StreamNative, the commercial company behind Pulsar, and Tencent Cloud in China both offer such services.)

We asked about pricing earlier, and deploying it ourselves turned out to be more cost-effective. As mentioned above, we use only the cloud vendor's Kubernetes service and deploy our own services on top of it.

Thanks to the active Pulsar community, you can get timely feedback even when you hit problems running it yourself, and you can also give back by sharing the pitfalls you run into.

I have written a series of articles about Pulsar, so check them out if you are interested:

How to scale Pulsar gracefully in a Kubernetes environment
Pulsar BookKeeper's storage model in plain language

Business framework

Finally, the choice of business framework. The precondition is deciding which language will be the main business language.

Although the language makes no difference to Kubernetes, let's look at Java and Golang, the two I know best.

Java

Java offers many options. If we run on Kubernetes without a service mesh, we can simply use Spring Boot to develop HTTP interfaces, which is as easy as developing a monolithic application.

However, this approach lacks some service governance capabilities, so it suits small and medium-sized teams better.

If you have a large team and are not using a service mesh, I recommend the microservice frameworks mentioned above, such as Dubbo and Spring Cloud.

When there is a dedicated cloud-native team, a service mesh is the more recommended choice, since it combines the advantages of the two options above:

The code stays concise: you only need to replace HTTP with gRPC.
With Istio, you also get the capabilities that a microservice framework provides.

Golang

Golang is much the same as Java: a small team can develop with just an HTTP framework such as Gin.

For medium-sized and large teams, the Golang ecosystem also has frameworks comparable to Dubbo and Spring Cloud, such as Kratos and go-zero.

Thanks to Golang's simplicity, I find developing business code with it simpler and more of a "no-brainer" than with Java.

When a service mesh is adopted, the team should already be fairly mature. It can build its own development scaffolding on top of gRPC, or use Kratos or go-zero with their service invocation modules removed.

Summary

The above is my personal understanding of today's popular technical solutions, with recommendations for different team sizes. There is no perfect technical solution, only the most suitable one. Don't follow trends into a technology stack that you cannot control, or it may end up costing you.

Resources

https://levelup.gitconnected.com/gitops-in-kubernetes-with-gitlab-ci-and-argocd-9e20b5d3b55b
https://grpc.io/

Author: crossoverJie

Source: WeChat official account crossoverJie (ID: crossoverJie)
