
Who says databases shouldn't be placed in containers?

Author: DBAplus Community

About the Authors:

Wang Xingyi is a senior development engineer at Bilibili

Tian Wei is a senior development engineer at Bilibili

Zhong Shanhao is a senior development engineer at Bilibili

Jian Chen is a senior development engineer at Bilibili

Preface

In the cloud-native era, Kubernetes has become the de facto standard for container orchestration, making automated O&M and orchestration of infrastructure applications possible. For stateless services, the industry already has mature, well-proven solutions. Stateful services, however, bring extra complexity: for example, the dependencies among the multiple instances of a distributed application (master-slave / active-standby), and a database's dependency on the data stored on its local disk (if an instance is destroyed and the binding between the instance and the data on the local disk is lost, the instance cannot be rebuilt).

For these and other reasons, stateful applications were once a taboo topic in container circles, and even now there is still debate over whether stateful services, such as databases in production, are suitable for running in containers under K8s orchestration. Based on the production rollout of Elasticsearch/ClickHouse containerization and K8s orchestration at Bilibili, this article explains why we containerized and moved onto K8s, the challenges we encountered and how we solved them, the technical details of the implementation, and the benefits we obtained.

1. Current status

Promoting a project in an enterprise-grade production environment has to center on cost, quality, and efficiency, so we first analyze how ES/CK is currently used at Bilibili.

Taking Elasticsearch as an example, ES is used in a wide range of scenarios at Bilibili, including various 2B/2C online businesses, in-site search, internal audit, enterprise efficiency, risk control, and other platforms. We built several public ES clusters for this purpose and divided them by function, such as the 2C public cluster, the 2B public cluster, and the internal-system cluster.

At first, the system was relatively stable and O&M was fairly simple: we only needed to watch the resource utilization of the public clusters and scale them accordingly.

However, as more services were onboarded, the defects gradually surfaced. The write/query QPS of different services fluctuates, and when a cluster's resource watermark peaks at some moment (for example, an instantaneous CPU spike), the query latency of every service in the cluster may rise or even time out. In addition, in the public-cluster mode all services share the cluster's query cache and the operating system's page cache, and the cache eviction policy (usually LRU) cannot be set by service priority, so the query latency of some services becomes unstable and the user experience suffers.

To address this, we split the clusters further: for services that directly serve C-end users, we built independent clusters on bare metal servers, isolating them from other services and eliminating the query timeouts.

Because ES is distributed storage, to prevent split-brain and maintain high availability we use at least three bare metal servers per cluster. Yet most services use very few resources, and the overall resource utilization of some independent clusters is below 5%, which is unacceptable in an era when the Internet industry is focused on cutting costs and improving efficiency.

Likewise, ClickHouse (hereafter 'CK') is currently used at Bilibili mainly for interactive query and analysis, reporting, logs, traces, and other observability scenarios, with 30+ clusters and 500+ physical machine nodes. The multi-tenancy and low-resource-utilization problems described above for ES clusters apply to CK as well.

Could the problem be solved purely with O&M measures, i.e., setting cgroup and namespace limits for the nodes of each cluster and packing as many clusters as possible onto a limited number of bare metal servers to raise utilization? This looks feasible in theory, but as the number of clusters grows, O&M complexity grows exponentially: creating, destroying, and scaling clusters with different configurations would become a nightmare for daily operations, so we quickly rejected this option. The decision proved correct. Taking ES as an example, there are 100+ clusters across the test and production environments, and the bloated O&M cost of that model would have been unacceptable; for CK, whose components are even harder to operate than ES, it is obviously a non-starter.

Open-source ES and CK do not handle multi-tenancy well, so would secondary development on top of the source code be feasible? After evaluation we rejected this path as well. Supporting multi-tenant isolation in the application layer is indeed a solution, but achieving good isolation that way requires very high technical complexity and development cost, and resource isolation is not the kind of capability worth a massive rewrite in the application layer: it would be using a cannon to kill a mosquito, with benefits out of proportion to the cost.

Things seemed to be at an impasse, so let's think carefully about what capabilities we actually need to improve the situation:

  • Good resource isolation, so that different businesses no longer affect one another
  • Cluster resource orchestration that raises resource utilization as much as possible while keeping clusters highly available
  • Good automated O&M capabilities, with low or even zero day-to-day operations
  • Low development cost, with little reinventing of the wheel
  • Compared with a bare-metal (BMS) cluster, performance must not degrade significantly

After sorting out the core requirements and doing some research, two technical paths remained:

  • A self-developed O&M platform that integrates automatic scheduling and orchestration of cluster topology and resources
  • Running ES/CK in containers and handing resource scheduling and orchestration over to K8s
Solution                   O&M cost   Isolation   Dev cost   Resource utilization
Public cluster             Middle     Weak        Low        Middle
Standalone cluster         Middle     Strong      Low        Low
Self-built O&M platform    Middle     Strong      High       High
Secondary development      Middle     Middle      High       Middle
On K8s                     Low        Strong      Middle     High

Comparison of various solutions

The two paths are essentially the same: both rely on Linux namespaces, cgroups, and rootfs to build isolated environments and to orchestrate and schedule the isolated resources. However, for a self-developed O&M platform, if its automation capability is weak, the hidden day-to-day O&M cost rises sharply; if we instead demand strong automation, including intelligent cluster orchestration and resource allocation, the development cost is huge and amounts to re-implementing K8s orchestration and scheduling, i.e., reinventing the wheel, which is unreasonable.

Containerizing stateful services and hosting them on K8s is not easy. Simply getting a demo to run is completely different from running stably in production and being able to troubleshoot, stop losses, and recover quickly when failures occur. When a fault happens the difficulties compound, and in this field troubleshooting efficiency cannot be improved simply by throwing people at the problem: a K8s expert and an ES expert working side by side may still be helpless against a stubborn issue, because each only understands the local details of the fault from their own perspective, whereas troubleshooting a stateful application running in containers on K8s requires technical reserves that span the whole stack.

Challenging as this is, a technologist who prides themselves on lifelong learning shouldn't rest on their laurels: if you're a database expert, why not become a K8s expert as well?

At the same time, Bilibili has deep technical expertise in the K8s engine and a team of experts to provide support, so we should not shy away from new technology. In the end we chose the second path and began investigating ES containerization and K8s orchestration.

2. Overall architecture


Overall architecture diagram

PS: An ES cluster with only 2 master nodes will not be set up in the production environment

The operator is responsible for creating the various stateful applications (StatefulSets) and storage volumes (PVCs) and for reconciling the StatefulSets. Meanwhile, each node reports its available resources such as CPU, memory, and disk, and the scheduler uses this information to place pods onto suitable nodes, create storage volumes on those nodes, and attach them to the containers.

Service discovery between k8s cluster containers is implemented through CoreDNS. At the same time, communication between pods is done through MACVLAN.

3. Technical Details

This chapter explains the details of controllers, persistent storage, disk scheduling, schedulers, container networking, service discovery, and cluster high availability in k8s.

1. Controller

In K8s, the smallest unit of orchestration and resource scheduling is the pod. A pod is a logical concept that can contain one or more containers; the containers in a pod share network and storage, and resources are isolated and limited through cgroups and namespaces. We can loosely think of each pod as a "virtual machine" (though it is not one), with the containers being the user programs running inside it. The controller is the important component that reconciles the desired state with the actual state so that the application is always kept in the desired state. In K8s, the user defines the "desired state" of a resource through the declarative API; the controller watches the actual state of the resource, and when the actual state diverges from the desired state, it makes the necessary changes to the system to bring the two back in line, a process called reconcile. The main responsibility of these controllers is to keep pod resources running healthily, for example enforcing replica counts and release policies.

K8S has the following default controllers, which determine the behavior of pod orchestration and scheduling

  • Deployment: a stateless controller, mainly used to create stateless services
  • ReplicaSet: a stateless controller that provides replica configuration; the controller keeps the number of pods at the desired value
  • StatefulSet: a stateful controller that manages the workloads of stateful applications (a minimal example is sketched below)
  • Job and CronJob: task controllers that run a task once or on a schedule
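For reference, a minimal StatefulSet looks roughly like the sketch below; the names and image are illustrative, not our actual manifests. The controller keeps the declared number of replicas running as pods named data-0, data-1, data-2 and creates one PVC per pod from volumeClaimTemplates:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: data                       # pods will be named data-0, data-1, data-2
spec:
  serviceName: data-headless       # headless service that gives each pod a stable DNS record
  replicas: 3
  selector:
    matchLabels:
      app: data
  template:
    metadata:
      labels:
        app: data
    spec:
      containers:
        - name: node
          image: example/stateful-app:1.0    # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data
  volumeClaimTemplates:            # one PVC is created per pod and bound independently
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi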

We found that the default controllers in K8s do not meet our needs well. Taking ES as an example, ES nodes are divided into master/data/coordinating/ingest roles, and each instance may hold one or more roles. A StatefulSet relies only on init-container behavior and the pod's ordinal number to maintain the relationships among these instances, which is rather rigid: once the pod names are fixed, the cluster topology is fixed as well, and adding, say, another master node to the cluster becomes a cumbersome process.

K8s also supports custom resources (CRDs) and custom controllers, with which we can perform specific deployment and O&M work in response to changes in our custom API objects.

The following is a description of the CRD and custom operator workflows. Readers who are familiar with the concept may skip reading as appropriate.

CRD (Custom Resource Definition): used to define and manage resources beyond the standard K8s objects. It is an extension of the K8s API that allows users to define resources with their own schemas and behaviors.

Operator: a tool for automating the management of specific applications and services (such as stateful databases, caches, and monitoring) on K8s; it is essentially a K8s controller. It watches change events of the custom resources and processes them according to the state the resources specify.


When an event is triggered on a watched resource, the operator enters the reconcile loop. Put simply, within this loop it takes appropriate actions, such as creating, updating, or deleting resources, based on the difference between the resource's desired state and its actual state.


There are several ways to implement a K8s operator, for example using the Operator SDK or other scaffolding tools/frameworks such as Kubebuilder. With CRD + operator you can develop a controller that fits your own business needs, for example customizing how a pod's data is stored and whether the data is kept or discarded after the pod restarts.
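As a sketch (the API group and field names here are hypothetical, not the CRD we actually use), registering a custom resource type only requires applying a CustomResourceDefinition like the following; instances of kind EsCluster can then be created through the K8s API and watched by a custom operator:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: esclusters.example.bilibili.com      # must be <plural>.<group>
spec:
  group: example.bilibili.com                # hypothetical API group
  scope: Namespaced
  names:
    kind: EsCluster
    plural: esclusters
    singular: escluster
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                version:
                  type: string               # e.g. "7.17.9"
                masterReplicas:
                  type: integer
                dataReplicas:
                  type: integer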

2. Persistent storage

Due to performance considerations such as high IOPS and low latency, we chose to use local pv.

Using local disks means that if a host goes down, data loss is inevitable, so replicas must be configured for indexes in Elasticsearch, and in ClickHouse the ReplicatedMergeTree table engine can be chosen to use replicas; this is the prerequisite for data high availability.

If a machine has multiple disks of different media (SSD/HDD), using each partition independently leads to fragmentation and low utilization. In addition, the storage attached to a pod needs to support online expansion (growing a disk without stopping the service). For these reasons we chose LVM (Logical Volume Manager).


LVM(Logical Volume Manager)

LVM technology combines physical volumes into volume groups and manages disks based on logical volumes, providing dynamic scaling capabilities. Here are some brief concepts.

PV: a physical volume is a hard disk partition or a device that logically behaves like one (such as a RAID device). It is the basic storage block of LVM, but compared with the underlying physical storage media (partitions, disks, etc.) it carries LVM-related management parameters.

VG: a volume group. An LVM volume group is analogous to a physical hard disk in a non-LVM system. It is composed of one or more physical volumes, and one or more LVM partitions (logical volumes) can be created on top of it.

LV: a logical volume. An LVM logical volume is analogous to a hard disk partition in a non-LVM system; a file system can be created on top of it.

With LVM, Kubernetes can create data volumes for pods. We use LVM to unify multiple disks of the same media into a single volume group; if a pod needs a data volume on NVMe media, a logical volume is created in that volume group and the required file system is created on it. The data disk backing the pod's data volume (PV) is then initialized, and the pod can mount and use it directly when it is created.


First, the agent reports the available size of each machine's volume groups. Once reporting is done, the scheduler can pick a suitable machine based on each machine's volume-group category and available size. The agent on that machine then uses LVM to create a logical volume in the volume group according to the size and disk category declared by the PVC, and finally, when the pod is created, the agent mounts the data volume for the pod to use. The specific process is as follows.


3. Disk reporting

The CSI agent uses LVM to manage disk partitions. First the agent identifies bare disks (unformatted, unpartitioned disks, i.e., disks with no file system or partition-table data), creates physical volumes with LVM, groups physical volumes of the same type into LVM volume groups, and finally creates logical volumes on the volume groups. LVM provides out-of-the-box tools to obtain a machine's volume-group types and sizes; the CSI agent only has to report them.

status:
  allocatable:
    csi.storage.io/csi-vg-hdd: "178789"
    csi.storage.io/csi-vg-ssd: "3102"
  capacity:
    csi.storage.io/csi-vg-hdd: "178824"
    csi.storage.io/csi-vg-ssd: "3112"           

The above shows two volume groups on one machine, one HDD and one SSD, along with their capacity information. When creating a volume group, all disk partitions under it must be bare, and the disk media must be distinguished carefully to prevent disks of different media from being grouped into the same volume group, especially when the node's disks sit behind RAID. In addition, we found in practice that the device names of NVMe disks can change after a machine reboot, so creating physical volumes based on device names causes problems; the NVMe slot generally does not change, so we create physical volumes based on the NVMe slot information.


4. Disk scheduling

After the disk report is completed, you can schedule the disk based on the disk information.

The scheduler computes the watermark of each disk type across all machines in the cluster and schedules pods according to a policy (bin-packing or spreading); a pod's target node must have every type of data disk the pod needs. Users only apply for PVCs, declaring the resource type, size, and read/write attributes they need; the scheduler automatically mounts the data disks and binds the PVCs to PVs. Let's look more closely at how these automated steps work.

K8s APIs are declarative, and disks are no exception: the user only declares the disk's size, type, creation timing, and attributes, and the controller automatically drives the system to the desired result.

Let's start with a few key concepts of the declared objects.

PersistentVolume (PV): an abstraction of the underlying storage, such as a directory persisted on the host, whose lifecycle is independent of the pod. In our case it is a logical volume that LVM creates in a particular volume group: after receiving the event, the csi-agent creates the local volume on the node; if it already exists, it is not re-created after the pod restarts.

PersistentVolumeClaim (PVC): describes the attributes of the persistent storage the pod expects to use (e.g., size, read/write permissions) and is usually created by the developer. The developer does not need to care about the underlying details of the PV implementation; once the PVC is bound to a PV, the pod can use it normally. The PVCs of ES and ClickHouse are created automatically by the corresponding operator.

StorageClass: defines the template for creating PVs (PV attributes such as size and type) and uses a storage plug-in to create PVs dynamically, which is one of the prerequisites for low or even zero O&M. If a PVC specifies a StorageClass, then, for example, if the StorageClass defines an LVM volume group, a logical volume is created in that volume group when the PV is created; and if the PV should only be created when the pod is created, the StorageClass's volumeBindingMode can be set to WaitForFirstConsumer. In short, a StorageClass declares the attributes and type of the PVs to be created.

With the Dynamic Provisioning capability in K8s, PVs can be created dynamically without manual maintenance by O&M personnel. However, K8s has no built-in provisioner for local disks, so a custom storage plug-in is needed; we adapted an open-source storage plug-in for this purpose.

The CSI agent plug-in on each node reports the machine's idle disk resources and describes them through a CRD; the scheduler watches these CRD objects and, when scheduling pods, selects a suitable node based on the resource situation each CRD describes.


Architecture diagram of the storage plug-in

status:
  allocatable:
    csi.storage.io/csi-vg-hdd: "178789"
    csi.storage.io/csi-vg-ssd: "3102"
  capacity:
    csi.storage.io/csi-vg-hdd: "178824"
    csi.storage.io/csi-vg-ssd: "3112"

Disk CRD information

By computing each node's total volume-group size per disk type and the amount already occupied, the idle size of each media type on each node can be calculated, and resource objects are created following either a bin-packing or a spreading policy. The scheduling process mainly involves csi-scheduler, csi-agent, and csi-controller, which cooperate to complete resource reporting, scheduling, and data volume creation and mounting.

The following is a detailed description of the three components.

Scheduler (csi-scheduler):

csi-scheduler is a scheduling plug-in for Kubernetes that performs reasonable placement based on the requested PV size, the node's remaining disk space, the disk type, and the node's load. If nodes report more resource information later, richer scheduling can be implemented to meet various business requirements. Building a separate scheduling plug-in on top of the native K8s scheduling framework is relatively simple, and the scheduling strategy is flexible: both bin-packing and spreading can be implemented to satisfy different business needs.

The scheduling plug-in mainly implements two phases, Filter and Score. In the Filter (pre-selection) phase, nodes that cannot satisfy the resource request are filtered out; in the Score (optimization) phase, the remaining nodes are scored differently depending on whether the strategy is bin-packing or spreading. For example, the goal of bin-packing is to reduce resource fragmentation, so among the nodes that satisfy the request, the fewer available resources a node has, the higher its score; for spreading it is the opposite, the more available resources the higher the score. The node with the highest final score is the most likely to be selected.
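A rough sketch of how such a plug-in is wired in through the scheduling framework is shown below; the plug-in name csi-scheduler and the weight are assumptions for illustration, not our actual configuration:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: csi-scheduler       # pods that set this schedulerName are handled by the custom profile
    plugins:
      filter:
        enabled:
          - name: csi-scheduler        # Filter: drop nodes whose volume groups lack the requested space
      score:
        enabled:
          - name: csi-scheduler        # Score: rank remaining nodes for bin-packing or spreading
            weight: 10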

Once scheduling completes, the corresponding PVC is annotated. csi-controller watches PVCs and maintains the relationship between the PVC and the LV; csi-agent watches LVs, creates the corresponding data volume, and is responsible for mounting it when the pod is created.

csi-agent:

A service running on every node that uses LVM to manage the local disks. It automatically identifies the machine's bare SSD/HDD/NVMe disks and groups them into the corresponding volume groups by disk type, with no user intervention; for O&M, machines only need to be delivered with bare disks according to the specification. The csi-agent also watches LVs, creates the local data volumes, and handles mounting the data volume and other operations when the pod is created.

CSI-Controller:

The control plane. It watches PVCs and maintains the relationship between the PVC and the LV; the csi-agent on the node then watches the LV and creates the corresponding data volume.

5. Data volume creation

Name:                  csi-sc
IsDefaultClass:        No
Annotations:           kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"csi-sc"},"mountOptions":["rw"],"parameters":{"csi.storage.io/disk-type":"nvme","csi.storage.k8s.io/fstype":"xfs"},"provisioner":"csi.storage.io","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}


Provisioner:           csi.storage.io
Parameters:            csi.storage.io/disk-type=nvme,csi.storage.k8s.io/fstype=xfs
AllowVolumeExpansion:  True
MountOptions:          rw
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:               <none>           

Based on this storage plug-in, we define a StorageClass named csi-sc: PVs are created dynamically by the csi.storage.io storage plug-in, the disk media is NVMe SSD, and the file system is xfs.

VolumeBindingMode: WaitForFirstConsumer defers binding of the PVC and PV until after the pod is scheduled, preventing scheduling failures caused by binding too early (a very important setting when using local PVs). ReclaimPolicy: Delete means that when the user deletes the PVC, the PV is deleted automatically; this is necessary when the PV corresponds to a local disk, because it prevents a pod from still being pinned to the old host after its PVC is deleted (PVC and PV are bound one-to-one, so if the PV is not deleted the pod will never drift to another node).

In the end, the user only has to declare the desired data disk through a PVC and reference the StorageClass in it; the storage plug-in does the rest.
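For example, a claim for a 500Gi NVMe data disk looks like the sketch below (the name and size are illustrative; in practice the ES/CK operators create these PVCs automatically). Binding and logical-volume creation are deferred until the pod is scheduled because of WaitForFirstConsumer:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-es-test-0          # illustrative name
spec:
  accessModes:
    - ReadWriteOnce             # a local volume is consumed by a single node
  storageClassName: csi-sc      # the StorageClass backed by the LVM storage plug-in
  resources:
    requests:
      storage: 500Gi            # a logical volume of this size is carved out of the NVMe volume group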

6. Container network

By default, containers on different hosts cannot reach each other by IP. Both Elasticsearch and ClickHouse are distributed applications: electing a master node in ES, rebalancing shards, or even a simple query all require internal communication between instances of the same cluster, and those instances sit on different hosts, so we need reliable, high-performance cross-host communication between containers.

There are many container networking solutions in the community. MacVLAN is a Linux kernel network feature that allows Layer 2 networks to be configured in different network namespaces, with low performance loss and low management cost. Bilibili's MacVLAN container network solution has been running in production for several years and its stability is proven, so we decided to reuse the company PaaS team's accumulated work. A CNI plug-in on each node handles pod IP allocation and other initialization.

Comparison of container network modes:

macvlan
  • Advantages: 1) each pod has an independent IP that is reachable both inside and outside the K8s cluster; 2) requests do not pass through kube-proxy, so performance is theoretically optimal
  • Disadvantages: 1) the available IP headroom depends on the network segment of the data center and rack where the host sits, so IPs may run short; 2) clients must be aware of pod IP changes

overlay
  • Advantages: 1) no need to worry about running out of IPs
  • Disadvantages: 1) access from outside the K8s cluster requires an ingress solution and an Nginx reverse proxy; 2) an external request may be forwarded several times (kube-proxy, Nginx reverse proxy), so performance is theoretically worse; 3) currently only Layer 7 proxying is supported, not Layer 4; 4) a large number of Services bloats the iptables rules, which hurts performance

host
  • Advantages: 1) no need to worry about running out of IPs
  • Disadvantages: 1) host ports must be planned, and the cluster-creation logic must be adapted; 2) there is a hidden maintenance cost in maintaining the cluster-to-port mapping for client access

7. Service discovery

ES clusters involve a lot of internal communication, such as shard rebalancing between nodes. We solve this with headless services plus the CoreDNS component: when an internal request needs a pod's address, CoreDNS returns the pod's DNS record, which resolves to the corresponding pod IP.
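A headless service is simply a Service whose clusterIP is None; CoreDNS then resolves the service name directly to the IPs of the matching pods, and StatefulSet pods that reference it also get stable per-pod DNS records. The names below are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: es-test-headless        # illustrative name
  namespace: elastic
spec:
  clusterIP: None               # "headless": DNS returns pod IPs instead of a single virtual IP
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: es-test
  ports:
    - name: transport
      port: 9300                # ES inter-node transport port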

Although the MacVLAN network mode solves access to the Elasticsearch cluster from outside K8s, pod IPs are not fixed: releases, rolling restarts, and scaling can all change a pod's IP.

To address this, we built a unified query gateway: all client queries to ES are proxied by the gateway, which continuously queries and refreshes the address information of each ES cluster inside K8s and keeps it in memory. Considering that heavy write traffic could impact queries, we also split reads from writes: the write side asks the query gateway for the IP list of the instances (pods) of a specific cluster, and once it has the per-instance IPs it connects to those nodes directly to write data.

Similarly, for CK we built a unified query gateway for business access; for communication inside a CK cluster, such as between replicas, we create a headless service for each replica.

In addition, CK relies on ZooKeeper for synchronization between replicas; we replaced ZooKeeper with ClickHouse's own ClickHouse Keeper (hereafter 'Keeper'). Keeper is also containerized, and the Keepers used by different businesses are separated through different namespaces. Keeper cluster instances likewise communicate with each other through headless services.

8. High availability

HA is divided into cluster HA and data HA.

Taking Elasticsearch as an example, cluster high availability means preventing split-brain and keeping the service available when some nodes drop out of the cluster. Likewise, if the data on an instance is lost (for example, a damaged disk), it must be restored and compensated through the replica mechanism.

As a simple example, an Elasticsearch cluster consists of 5 Elasticsearch instances (assuming both master&data nodes), and these 5 instances are distributed across host A and host B.

What happens if host A goes down?

Obviously, the Elasticsearch cluster loses 3 instances, leaving only 2 master-eligible nodes, fewer than the quorum required to elect a master, so the cluster becomes unavailable.

What if host B goes down instead? The cluster loses only 2 instances and still has 3, so it remains available. However, suppose the cluster has an index A whose primary shard 0 and replica shard 0 live on Elasticsearch instance 3 and instance 4, the two instances on host B: when that host goes down, the index's primary shard and its replica are lost at the same time (because local disks are used), and the data becomes unavailable.


To guarantee both cluster availability and data availability, instances of the same role in the same cluster must not be placed on the same host. We achieve this with the affinity/anti-affinity mechanism in K8s, using the hostname as the topologyKey.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            elasticsearch.k8s.elastic.co/cluster-name: es-test
            elasticsearch.k8s.elastic.co/node-data: 'true'
        topologyKey: kubernetes.io/hostname           

For CK, high availability also relies on replicas: through replica synchronization, each replica can serve traffic, so even if one replica goes down the cluster stays available and the data stays complete. We therefore do not allow different replicas of the same shard in a CK cluster to land on the same host, which is again achieved via anti-affinity. Beyond replica-level anti-affinity within a shard, the CK operator also supports other degrees of anti-affinity/affinity, such as CHI instance-level, replica-level, and namespace-level anti-affinity.

podDistribution:
  - type: ShardAntiAffinity
    topologyKey: "kubernetes.io/hostname"           

The same configuration is applied to the Keeper instances as host-level anti-affinity, ensuring that multiple instances of a Keeper cluster cannot go down at the same time and make the whole Keeper cluster unavailable.

9. Memory isolation

cgroups guarantee memory isolation inside the container; meanwhile, because Lucene files are written once and never modified, the operating system's page cache greatly affects ES query performance.

We took a deep dive into how cgroups handle the page cache.

Reading the Linux kernel code, the memory cgroup accounts for RSS (the total physical memory actually used by all processes in the cgroup) plus the page cache, and when memory.usage_in_bytes exceeds memory.limit_in_bytes, the kernel automatically starts reclaiming the page cache within that cgroup.

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
{
    /* pagein of a big page is an event. So, ignore page size */
    if (nr_pages > 0)
        __count_memcg_events(memcg, PGPGIN, 1);
    else {
        __count_memcg_events(memcg, PGPGOUT, 1);
        nr_pages = -nr_pages; /* for event */
    }


    __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
}           

The conclusion is that when sizing a container that runs an ES process, both the JVM heap and the page cache must be taken into account, because the page cache is also bounded by the cgroup and reclaimed by the kernel.
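In practice that means the container's memory limit must cover the JVM heap plus the page cache headroom we want Lucene to enjoy. The sizes and heap flags below are purely illustrative, not our production values:

containers:
  - name: elasticsearch
    env:
      - name: ES_JAVA_OPTS
        value: "-Xms16g -Xmx16g"   # JVM heap
    resources:
      requests:
        memory: 32Gi
      limits:
        memory: 32Gi               # heap (16g) plus headroom the kernel can use as page cache;
                                   # the cgroup reclaims page cache as usage approaches this limit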

Likewise, CK uses the OS page cache when reading files, and queries are faster when the data they touch is already cached; CK's min_bytes_to_use_direct_io setting can be tuned together with the pod's memory to improve query performance.

10. I/O isolation

Considering that the machine models currently carrying ES and CK at Bilibili use NVMe SSDs for hot data storage, in most cases CPU and memory become the bottleneck before I/O does, so I/O isolation brings little benefit there.

For cold data on HDDs, queries use buffered I/O by default, and cgroup v1 does not fully isolate buffered I/O. Although newer kernel versions support this, the PaaS team judged the cost of upgrading and adapting the kernel across the fleet to be high, so after evaluation I/O isolation support was deferred to phase two.

11. Specific practice

Taking ES as an example, to avoid reinventing the wheel we based our work on the open-source Cloud-on-K8s project, adapting it to the company's container platform and optimizing the metrics and log collection methods. The process is as follows.

1) The PaaS team initializes the K8S cluster and standardizes the container installation

2) Log in to the master node of the k8s cluster and install the CRD to the cluster

kubectl create -f crds.yaml           

3) Deploy the operator to the k8s cluster

kubectl apply -f operator-bili.yaml           

In the YAML file, you need to specify the URL of the operator image that has been built in advance

containers:
- image: "{{image-registry-address}}/cloud-on-k8s-bili:2.3.2"

4) Verify that the operator is running properly

kubectl get pod -n elastic-system           

Here we see that the operator is already in the READY state

NAME                 READY   STATUS    RESTARTS   AGE
elastic-operator-0   1/1     Running   27         260d           

Check whether the operator is running properly through the logs

kubectl logs -f -n elastic-system elastic-operator-0           

5) The CRD has been successfully registered, the operator is in place, and you can start to create an ES cluster

A single command is used to create an ES cluster

kubectl apply -f elasticsearch.yaml           
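A minimal elasticsearch.yaml for this operator might look like the sketch below; the node-set name, node count, storage size, and StorageClass are illustrative rather than our production settings, while the cluster name and version match the output shown in the next steps:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: test
  namespace: elastic
spec:
  version: 7.17.9
  nodeSets:
    - name: datanode
      count: 3
      config:
        node.roles: ["master", "data"]     # each instance acts as both master and data node
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data       # claim name expected by the operator
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: csi-sc       # the local-PV StorageClass described earlier
            resources:
              requests:
                storage: 500Gi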

Check the status of the ES cluster

kubectl get elasticsearch -n elastic           

Here you can see that the cluster is ready, and the cluster status is green

NAME   HEALTH   NODES   VERSION   PHASE   AGE
test   green    3       7.17.9    Ready   1min           

Check the node logs of the cluster to confirm that the cluster is running properly

kubectl logs -f -n elastic test-es-datanode-0           

6) The cluster has been initialized, and we need to get the address for external access

View the node information of an ES cluster

kubectl get pod -n elastic -owide           

Since we use MacVLAN, pod IPs are reachable from outside the cluster, so the value of the IP field can be used directly as the node address

NAME                 READY   STATUS    RESTARTS   AGE     IP               NODE
test-es-datanode-0   1/1     Running   1          1min    xx.xxx.xx.xxx    xx.xxx.xx.xx
test-es-datanode-1   1/1     Running   1          1min    xx.xxx.xx.xxx    xx.xxx.xx.xx
test-es-datanode-2   1/1     Running   1          1min    xx.xxx.xx.xxx    xx.xxx.xx.xx           

7) The operator sets the account password for the cluster, and saves it in the secret, which is also obtained through the command

kubectl get secret test-es-elastic-user -o go-template='{{.data.elastic | base64decode}}' -n elastic           

8) At this point, we have obtained the address and account password of the cluster, and the next step is to access the ES cluster from the outside (outside the K8s cluster).

The CK procedure is almost identical to that of ES, based on the open-source Altinity clickhouse-operator; the only difference is that when building a cluster, not only the CK cluster but also the Keeper cluster must be deployed in K8s.
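For reference, the ClickHouse cluster is declared through a ClickHouseInstallation resource in a similar spirit; the sketch below (cluster name, shard/replica counts, and the Keeper service address are illustrative) describes a 2-shard, 2-replica layout that points at a Keeper cluster:

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: ck-test                               # illustrative name
spec:
  configuration:
    zookeeper:                                # Keeper speaks the ZooKeeper protocol, so it is configured here
      nodes:
        - host: keeper-headless.keeper-ns.svc # illustrative Keeper headless-service address
          port: 2181
    clusters:
      - name: main
        layout:
          shardsCount: 2
          replicasCount: 2                    # replicas of a shard are spread across hosts by anti-affinity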

4. Observability

1. Metrics collection

Currently, Elasticsearch cluster metrics are exposed through the es exporter, reported to Prometheus, and visualized by the grafana panel.

For bare-metal Elasticsearch clusters, an ES exporter has to be deployed for each cluster that is built, as shown in the following figure


With ES on K8s, clusters are created, destroyed, and scaled frequently, and binding a dedicated exporter to each cluster as before is clearly not a sustainable way to maintain a large number of clusters. We therefore modified the ES exporter so that it automatically discovers all clusters and collects metrics from multiple clusters; the exporter itself is also deployed in K8s and, being stateless, supports elastic scaling:


Currently, CK clusters use the CK exporter for metric collection and ZooKeeper uses the ZK exporter. Metric collection for physical CK clusters is similar to that for physical ES clusters: each physical cluster has a corresponding CK exporter, which reads the cluster's node list from its configuration file and then scrapes each node's metrics. For a physical ZooKeeper cluster, however, one ZK exporter is deployed per ZK instance.


As with ES, the open-source solution does not suit containerized CK clusters, so we modified the existing exporter so that it automatically discovers all CK clusters and collects the corresponding cluster metrics; information about the containerized clusters is obtained dynamically from the API server over HTTP. The architecture is shown below.


2. Key metrics

We built a Grafana dashboard for the key metrics of the Elasticsearch clusters, which mainly covers the following:

  • Cluster status, number of nodes, and shard status (unassigned/active/relocating)
  • Memory monitoring, including JVM heap memory, GC information, query cache, memory occupied by segments, etc.
  • CPU usage, load, etc.
  • Thread pool status (search/write/refresh), number of rejected threads, queued task counts of the search and write thread pools, etc.
  • Cluster query QPS and query time
  • Cluster write traffic

Elasticsearch cluster monitoring panel

For CK, in addition to the data in system tables such as metrics and events, custom metrics can also be added, including various metrics related to select/insert/alter; some of them are used for monitoring, and the metrics are divided into several categories.


In addition to application monitoring, we use cAdvisor (Container Advisor) to collect container metrics, such as the page cache, which significantly affects ES (Lucene) query performance. If other container metrics matter in your practice, you can configure the dashboard yourself.


Container page cache monitoring

3. Log collection

For now we expose logs through the API provided by K8s to facilitate daily troubleshooting. In the future we will integrate the company's unified logging system to collect and persist cluster logs and slow-query logs, and to provide log alerting, slow-query localization and analysis, and other capabilities.

5. Productization

Compared with the traditional way of building clusters, creating, scaling, rolling-restarting, and releasing a cluster now takes only a few simple commands and YAML configuration files, which greatly reduces day-to-day O&M cost. However, this approach still depends heavily on people and has a low degree of automation: whoever performs the change needs the corresponding K8s domain knowledge, and manual operations increase the probability of mistakes. To automate further and achieve low or even zero O&M, we collaborated deeply with the infrastructure platform product team to productize the components as private cloud services, aiming to improve the iteration efficiency of business R&D and the operation/O&M efficiency of the component platform.

The final productization capabilities are as follows:

1. One-click creation/scaling/destruction of clusters: fill in the corresponding work order on the configuration platform, and once the responsible owner approves it, the creation/scaling/destruction completes automatically (usually the change finishes within 1 minute).

2. Automatic integration with the CMDB, the observability platform, the change-management platform, and so on: resource information is transparent, monitoring and alerting are enabled automatically, and daily changes leave an audit trail, which improves overall stability and troubleshooting efficiency when faults occur.

3. One-click business onboarding: data integration, cluster creation, query access, and similar scenarios are properly encapsulated, so businesses can use the components freely while the underlying details and complex concepts stay hidden, accelerating business iteration and launch.

4. An O&M platform that provides one-click release, iterative updates, scale-out/scale-in, image management, configuration management, automatic repair of bad disks / automatic pod drift, and other capabilities, reducing day-to-day O&M and operational pressure.


Fill out a simple form and the cluster is automatically initialized

6. Benefits

With the technology implemented and productized, let's look at the benefits, focusing again on cost, quality, and efficiency.

1. Cost

Before containerization, whenever a business needed an independent ES/CK cluster, at least 3-4 bare metal servers had to be requested to build it. There are now 30+ containerized ES/CK clusters hosted on K8s at Bilibili; after deducting the resources actually used, this has saved the company 100+ bare metal servers, and as the migration continues the average CPU utilization of the clusters has risen from the original 5% to 15%.

2. Quality

The traditional public-cluster mode has poor isolation, so different services in the same cluster interfere with each other. With ES/CK on K8s we gained new options for resource isolation, and the stability of several businesses improved substantially. Taking Bilibili's relation-chain search scenario as an example, 99th-percentile queries used to time out regularly; after migrating to an independent ES-on-K8s cluster, the query timeouts disappeared, noticeably improving the user experience on Bilibili.


After the cluster migration, the query latency spikes are resolved

3. Efficiency

Previously, setting up a bare-metal cluster involved the following steps:

1) Coordinate resources and have the systems team deliver the bare metal servers

2) Create a cluster configuration, specify the cluster name, log path, data disk path, JVM parameters, etc

3) Write an ansible script to deploy the cluster

4) Log in to Bastionhost and run Ansible scripts in batches for deployment

5) Run the systemd command in batches to start the cluster

6) Verify whether the cluster is initialized successfully and whether the number of nodes meets expectations

7) Configure the cluster address for use on the write and query side

To set up a container cluster, perform the following steps:

1) On the configuration page, fill in the cluster name and other information, and specify the cluster specifications

2) Submit the form and wait for the cluster to be automatically created, usually within 1 minute

In the cluster-creation scenario, the deployer previously had to be familiar with the components and the corresponding deployment scripts, and it took at least 0.5 person-days from application to cluster delivery; now a cluster is available within 1-2 minutes (or within tens of seconds if the cluster has few nodes).

In fault scenarios, under the traditional maintenance model SREs have to subscribe to alerts, receive them, locate the problem through the observability system, and log in to machines to recover, which consumes a lot of manpower and is inefficient.

After the clusters are hosted on K8s, the operator continuously reconciles each cluster toward the desired state. For example, if a node drops out because of an OOM, the operator keeps trying to pull the pod back up to return the cluster to a healthy state. Failures like these need no human intervention, saving considerable O&M cost.

When a cluster needs to be scaled out, a bare-metal cluster requires scripts and configuration files to be prepared in advance because every cluster is configured differently, which is both cumbersome and error-prone. With a containerized cluster, we simply change the instance count and instance specification on the page, and the operator completes the change automatically.

In summary, the operator automatically handles resource orchestration, scheduling, rolling releases, cluster-state reconciliation, and other actions, saving the manpower consumed by traditional manual O&M and coming close to eliminating O&M work.

7. Closing thoughts

Choosing container technology means confronting many low-level concepts of the operating system and hardware, for example container networking, disk scheduling, and IO/CPU/memory/network isolation. Whether during the project itself or in daily troubleshooting, knowledge of a single technical domain is far from enough.

In one container-network failure case, nodes in a cluster kept dropping out intermittently, and from logs and monitoring we could only conclude that TCP communication between the nodes was failing. After ruling out problems in the cluster itself, we suspected the container network and used tooling to stably reproduce the packet loss. Because we use MacVLAN, we first compared the pod's IP with the existing pod IPs and found that a duplicate IP had been assigned to an existing pod; we deleted the pod so it would be rescheduled as a temporary fix and then upgraded the CNI plug-in to resolve the problem.

If you only have database domain knowledge and lack an overall understanding of the container network and how pods communicate with each other, it is hard to locate the root cause quickly, which greatly prolongs the time needed to stop the loss and recover.

Similarly, if you only have K8s/container domain knowledge, you can only see the failure of a single component in isolation; a component failure is many possible causes behind one symptom, a node dropping out is merely the visible phenomenon, and the other possible causes must be located and ruled out one by one.

Facing this tradeoff, we have to accept the added complexity of the overall technical architecture while enjoying the convenience of container and K8s resource scheduling, orchestration, and isolation. That requires continuously learning knowledge outside our own domain: a database expert might as well become a K8s expert too, and in the era of embracing cloud native a K8s expert also needs to understand other technical areas. The intractable problems, challenges, and difficulties we hit during implementation are only the tip of the iceberg; hosting stateful services of different components and architectures on K8s may expose more defects and difficulties, but with the industry consensus on cloud native and the healthy ecosystem of the open-source community, we believe these difficulties can be solved.

We chose to embrace new technology and keep learning. What choice will you make?

Author丨Bilibili Infrastructure Team

Source丨Official Account: Bilibili Technology (ID: bilibili-TC)

The DBAPLUS community welcomes contributions from technical personnel at [email protected]
