Higress × OpenKruiseGame: Best Practices for Game Gateways

author:Alibaba Cloud Cloud Native

Authors: Zhao Weiji, Li Ming, Cheng Tan

OpenKruiseGame (hereinafter OKG) is a multi-cloud, open-source Kubernetes workload for game servers. It is the gaming sub-project of OpenKruise, the CNCF open-source workload project. Gaming is a typical traffic-intensive scenario and places high demands on the ingress gateway in terms of throughput, latency, performance, elasticity, and security.

Higress is a next-generation cloud-native gateway built on open-source Istio and Envoy and grounded in more than two years of Envoy gateway practice at Alibaba. It combines the security gateway, traffic gateway, and microservice gateway into a single layer, which can significantly reduce gateway deployment and O&M costs. Higress can serve as the ingress gateway of a K8s cluster and is compatible with a large number of K8s Nginx Ingress annotations, so you can migrate from K8s Nginx Ingress to Higress quickly and smoothly. It also supports the K8s Gateway API standard, so users can migrate smoothly from the Ingress API to the Gateway API.

In this article, we demonstrate how Higress integrates seamlessly with OKG game servers and what features it brings to them.

Connecting Higress to OKG seamlessly

Prerequisites:

1. Install OpenKruiseGame[1].

2. Install Higress[2].

OKG provides many useful features, such as hot updates and scaling of game servers, that help game O&M engineers manage the entire life cycle of game servers. Unlike stateless services, the network traffic of a player's battle session cannot be load balanced across servers, so each game server needs its own access address.

When native workloads (such as Deployment or StatefulSet) are used, O&M engineers have to configure the access-layer network for each game server one by one, which slows down server launches, and manual configuration also increases the chance of failure. The GameServerSet workload provided by OKG manages the access network of game servers automatically, greatly reducing the burden on O&M engineers.

For TCP/UDP online games, OKG provides network models such as HostPort, SLB, and NATGW. For H5/WebSocket online games, OKG also provides Ingress network models such as Higress, Nginx, and ALB.

This article uses an open-source game, Posio, to build a demo game server. In the configuration below, IngressClassName="higress" specifies Higress as the network layer of the game server. With this configuration, Higress connects seamlessly to the Posio game server, and the advanced traffic management and other features that Higress defines can be enabled through annotations. The example YAML is shown below; the access domain name of each game server generated by the GameServerSet is derived from the game server ID.

In this example, the access domain of game server 0 is game0.postio.example.com and that of game server 1 is game1.postio.example.com. Clients use these domains to reach different game servers.

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: postio
  namespace: default
spec:
  replicas: 1
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
  network:
    networkType: Kubernetes-Ingress
    networkConf:
    - name: IngressClassName
      value: "higress"
    - name: Port
      value: "5000"
    - name: Path
      value: /
    - name: PathType
      value: Prefix
    - name: Host
      value: game<id>.postio.example.com
  gameServerTemplate:
    spec:
      containers:
        - image: registry.cn-beijing.aliyuncs.com/chrisliu95/posio:8-24
          name: postio           
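The mapping from the Host template to per-server domains can be illustrated with a small Python sketch. OKG performs this substitution itself; the helper names `expand_host` and `hosts_for` are hypothetical, and the sketch assumes game server IDs run contiguously from 0, which may not hold if IDs are reserved or servers are deleted out of order.

```python
def expand_host(template: str, server_id: int) -> str:
    """Replace the <id> placeholder in the Host template with a game server ID."""
    return template.replace("<id>", str(server_id))


def hosts_for(template: str, replicas: int) -> list[str]:
    """Hostnames for servers 0..replicas-1 (assumes contiguous IDs; illustration only)."""
    return [expand_host(template, i) for i in range(replicas)]


if __name__ == "__main__":
    for host in hosts_for("game<id>.postio.example.com", 2):
        print(host)
```

With two replicas this prints game0.postio.example.com and game1.postio.example.com, matching the domains described above.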

OKG horizontal scaling[3] provides automatic scaling as well as scale-in based on the game server's OpsState, on DeletionPriority, and on the game server's serial number, to support the business requirements of game O&M. While horizontal scaling is convenient for game developers, it also raises the bar for the ingress gateway: the gateway must support hot configuration updates and deliver route configurations smoothly.

The reason is that when OKG scales out game servers, it creates the Ingress and other network resources at the same time so that the new game servers go live automatically. If the ingress gateway cannot hot-update its configuration, online players will run into problems such as disconnections during scale-out, which hurts the gaming experience.

Nginx reload cannot provide graceful hot updates

When game servers are scaled out or a defined routing policy changes, the Nginx configuration change triggers a reload, which disconnects both upstream and downstream connections and forces clients to reconnect.

Let's take the Posio game server as an example and simulate the problem that occurs when game servers are scaled out with Nginx + OKG. The Posio server relies on socket connections to communicate with clients. Scaling out the game servers triggers the creation of the corresponding Ingress resource; the Nginx ingress controller watches the Ingress change and triggers its own reload mechanism, and the existing connections to the game server (the socket connection in this example) are disconnected. The player who is in the middle of a game experiences abnormal stuttering.

To show the impact of an Nginx Ingress reload clearly, we change one of Nginx's default configuration parameters:

kubectl edit configmap nginx-configuration -n kube-system
data:
  ...
  worker-shutdown-timeout: 30s # a setting that is hard to balance

The worker-shutdown-timeout parameter sets the timeout for the graceful shutdown of Nginx worker processes. A worker first stops accepting new connections and waits for existing connections to close on their own; once the timeout is reached, it forcibly closes all remaining connections and exits.

If this value is too small, a large number of active connections are dropped at once. If it is too large, long-lived WebSocket connections keep the old Nginx workers alive; with frequent reloads, many workers pile up in the shutting-down state, the memory held by old workers is not released for a long time, and the resulting OOM can cause an online outage.
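The trade-off can be made concrete with a back-of-the-envelope model in Python. It is only a sketch under stated assumptions: every reload leaves one generation of old workers behind, each generation holds at least one long-lived connection until the timeout fires, and reloads arrive at a fixed interval. The function names, worker counts, and memory figures are hypothetical and are not measured Nginx values.

```python
import math


def lingering_worker_generations(shutdown_timeout_s: float, reload_interval_s: float) -> int:
    """Upper bound on old worker generations still shutting down at the same time.

    Assumes every generation stays alive for the full shutdown timeout.
    """
    if reload_interval_s <= 0:
        raise ValueError("reload_interval_s must be positive")
    return math.ceil(shutdown_timeout_s / reload_interval_s)


def lingering_memory_mib(shutdown_timeout_s: float, reload_interval_s: float,
                         workers_per_generation: int, mib_per_worker: float) -> float:
    """Rough memory held by shutting-down workers (all inputs are illustrative)."""
    generations = lingering_worker_generations(shutdown_timeout_s, reload_interval_s)
    return generations * workers_per_generation * mib_per_worker


if __name__ == "__main__":
    # Hypothetical load: one reload per minute, 4 workers, 200 MiB per worker.
    for timeout_s in (30, 600, 3600):
        print(timeout_s, lingering_memory_mib(timeout_s, 60, 4, 200))
```

Under these assumptions, a short timeout keeps memory bounded but cuts connections sooner, while a long timeout lets old generations accumulate in proportion to the reload rate.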

The test runs as follows: the client connects to the game server and plays normally; during play, OKG scales out the game servers, and we observe how the client responds. The web developer tools show two socket connections: one established when the browser first accessed the game server, and one created by the reconnection after the Nginx reload.

The last packet received on the original connection is timestamped 15:10:26.

The new connection receives its first normal game packet at 15:10:37, and the disconnection between the web page and the game server lasts about 5s.
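The two timestamps can be compared directly. The short Python sketch below computes the interval between the last packet on the old connection and the first normal packet on the new one; the helper name `gap_seconds` is hypothetical. This interval is the longest window in which the player could have received no game data, and the disconnection of about 5s reported above falls inside it.

```python
from datetime import datetime


def gap_seconds(last_old: str, first_new: str, fmt: str = "%H:%M:%S") -> float:
    """Seconds between two same-day wall-clock timestamps."""
    delta = datetime.strptime(first_new, fmt) - datetime.strptime(last_old, fmt)
    return delta.total_seconds()


if __name__ == "__main__":
    print(gap_seconds("15:10:26", "15:10:37"))  # prints 11.0
```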

Besides hurting the player's experience, this mechanism also plants a hidden risk for the overall stability of the business. Under high concurrency, the simultaneous disconnection and reconnection of a large number of clients causes Nginx's CPU to spike. The back-end game servers, which handle more business logic and generally need more resources than the gateway, also have to absorb this wave of reconnections.

How Higress implements graceful hot updates

Higress supports exposing the game server's external IP and port through K8s Ingress for players to connect to. When game servers are scaled or a defined route configuration changes, Higress hot-updates the route configuration so that players' connections stay stable.

Higress builds on Envoy's fine-grained configuration change management to achieve truly dynamic hot updates. In Envoy, the downstream side corresponds to the listener configuration, which is discovered through LDS, and the upstream side corresponds to the cluster configuration, which is discovered through CDS. Updating and rebuilding a listener only disconnects downstream connections and does not affect upstream connections; downstream and upstream configurations can be changed independently without affecting each other. Furthermore, the certificates, filters, and routes under a listener can each be changed independently, so downstream connections are no longer dropped when a certificate, plug-in, or route configuration changes.
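The difference between a process-level reload and resource-scoped updates can be sketched with a toy Python model. This is not Envoy or Nginx code: the `Session` type, the function names, and the sample listener and cluster names are all hypothetical, and the rule that a cluster change only touches sessions bound to that cluster is a simplifying assumption made here by symmetry with the listener case described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    player: str
    listener: str  # downstream side: the listener the player connected through
    cluster: str   # upstream side: the game server behind the gateway


def interrupted_by_reload(sessions):
    """Process-level reload: every session on the old workers is eventually cut."""
    return set(sessions)


def interrupted_by_scoped_update(sessions, kind, name):
    """Resource-scoped update: only sessions bound to the changed resource are cut."""
    if kind == "listener":
        return {s for s in sessions if s.listener == name}
    if kind == "cluster":  # simplifying assumption, see the lead-in
        return {s for s in sessions if s.cluster == name}
    if kind in ("route", "certificate", "filter"):
        return set()  # changed independently of live connections
    raise ValueError(f"unknown resource kind: {kind}")


if __name__ == "__main__":
    sessions = [Session("alice", "http-80", "postio-0")]
    # Scaling out adds a new cluster and a new route for game server 1.
    print(interrupted_by_scoped_update(sessions, "cluster", "postio-1"))  # set()
    print(interrupted_by_scoped_update(sessions, "route", "game1"))       # set()
    print(len(interrupted_by_reload(sessions)))                           # 1
```

In this model, adding a game server leaves the existing player's session untouched under scoped updates, whereas a reload interrupts it regardless of what changed.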

Besides enabling true hot updates, this fine-grained change mechanism also makes Envoy's architecture more reliable. Envoy's configuration management was designed from the start around the separation of the data plane (DP) and the control plane (CP): configuration is pulled dynamically from a remote source over gRPC, and proto definitions standardize the configuration fields and keep versions compatible. This design isolates the security domains of the data plane and the control plane and strengthens the security of the architecture.

With OKG connected to Higress, we repeat the test: the client connects to the game server and plays normally; during play, OKG scales out the game servers, and we observe how the client responds. The web developer tools show that the connection between the client and the game server remains stable and unaffected throughout.

In addition, in large-scale scenarios each game server has its own Ingress, so a large number of Ingress resources are generated. In our tests, when the scale reaches 1k, it takes minutes for a new game server to take effect with Nginx Ingress, whereas with Higress it takes seconds. Sealos also ran into this Nginx Ingress problem and eventually solved it by switching to Higress; if you are interested, see the article "Which Cloud Native Gateway is Stronger: The History of Sealos Gateway".

Join OKG and Higress open source communities

You are welcome to join our DingTalk groups by searching for the group number. The DingTalk group number of the cloud native game exchange group is: , and the DingTalk group number of Higress Community Exchange Group 2 is: .

Related Links:

[1] OpenKruiseGame

[2] Higress

[3] OKG horizontal scaling
