
Trino on k8s: Orchestrated Deployment, Advanced Edition

Author: 大資料老司機

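I. Overview

Trino on Kubernetes combines the Trino query engine with the Kubernetes container orchestration platform so that Trino can be deployed, managed, and run on a Kubernetes cluster.

Trino (formerly PrestoSQL) is a high-performance distributed SQL query engine built for large datasets and complex queries. Kubernetes is a popular open-source container orchestration platform that automates the deployment, scaling, and management of containers.

Deploying Trino on Kubernetes brings several advantages:

  • Elastic scaling: Kubernetes can add or remove Trino instances automatically as the workload changes, so the cluster scales with query load and uses resources more efficiently.
  • High availability: Kubernetes provides fault tolerance and failure recovery. With several Trino instances in the cluster, a failed pod is restarted automatically and the remaining instances keep serving work, which keeps the system available.
  • Resource management: Kubernetes schedules and manages the compute, storage, and network resources used by Trino instances. Suitable resource requests and limits keep Trino's resource consumption under control and prevent conflicts and contention with other workloads.
  • Simpler deployment and management: Kubernetes offers declarative configuration and automated deployment. With its standard tools and APIs, Trino instances are easy to create, configure, and monitor.
  • Ecosystem integration: Kubernetes has a rich ecosystem and integrates with other tools and platforms, for example storage systems (such as Hadoop HDFS and Amazon S3) and other data processing tools (such as Apache Spark), so data can be accessed and processed across them.

Note that Trino on Kubernetes needs appropriate configuration and tuning to be performant and reliable. For large and complex query workloads, you may also need to consider data sharding, data partitioning, and data locality.

In short, Trino on Kubernetes is a flexible, scalable, and efficient way to deploy and manage the Trino query engine for big data query workloads.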

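This article covers only the deployment process. For more on Trino, see my earlier articles:

  • Big Data Hadoop: Presto, an In-Memory SQL Query Engine (Presto-Trino Environment Deployment)
  • [Big Data] Presto (Trino) SQL Syntax, Advanced
  • [Big Data] Presto (Trino) REST API and Execution Plans
  • [Big Data] Presto (Trino) Configuration Parameters and SQL Syntax

For a single-machine container deployment, see my article: [Big Data] Quickly Deploying Presto (Trino) with docker-compose, a Step-by-Step Tutorial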


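II. k8s Environment Deployment

Setting up the k8s environment is not repeated here; the focus is Trino on k8s. If you are unsure how to deploy a k8s environment, see my articles:

  • [Cloud Native] Quick k8s Environment Deployment (done in under an hour)
  • [Cloud Native] k8s Offline Deployment: Explanation and Hands-On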

III. Orchestrating the Trino Deployment

1) Build the image (Dockerfile)

FROM registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/centos:7.7.1908

# Set the container time zone to Asia/Shanghai
RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone

# ENV persists into the running container; `RUN export` only lasts for that one build step
ENV LANG=zh_CN.UTF-8

# Create the user and group; the IDs match runAsUser/runAsGroup (10000:10000) in values.yaml
RUN groupadd --system --gid=10000 hadoop && useradd --system --home-dir /home/hadoop --uid=10000 --gid=hadoop hadoop -m

# Install sudo
RUN yum -y install sudo && chmod 640 /etc/sudoers

# Grant the hadoop user passwordless sudo
RUN echo "hadoop ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

# Install network tools (nc is used by bootstrap.sh)
RUN yum -y install net-tools telnet wget nc

RUN mkdir -p /opt/apache/

# Add and configure the JDK
ADD zulu20.30.11-ca-jdk20.0.1-linux_x64.tar.gz /opt/apache/
ENV JAVA_HOME=/opt/apache/zulu20.30.11-ca-jdk20.0.1-linux_x64
ENV PATH=$JAVA_HOME/bin:$PATH

# Add and configure the Trino server
ENV TRINO_VERSION=416
ADD trino-server-${TRINO_VERSION}.tar.gz /opt/apache/
ENV TRINO_HOME=/opt/apache/trino
RUN ln -s /opt/apache/trino-server-${TRINO_VERSION} $TRINO_HOME

# Create the config directory and the data source (catalog) directory
RUN mkdir -p ${TRINO_HOME}/etc/catalog

# Add the Trino CLI
COPY trino-cli-${TRINO_VERSION}-executable.jar $TRINO_HOME/bin/trino-cli

# Copy bootstrap.sh
COPY bootstrap.sh /opt/apache/
RUN chmod +x /opt/apache/bootstrap.sh ${TRINO_HOME}/bin/trino-cli

RUN chown -R hadoop:hadoop /opt/apache

WORKDIR $TRINO_HOME


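Contents of bootstrap.sh:

#!/usr/bin/env sh

# Block until host $1 accepts TCP connections on port $2; skip if either is missing
wait_for() {
    if [ -n "$1" ] && [ -n "$2" ]; then
       echo "Waiting for $1 to listen on $2..."
       while ! nc -z "$1" "$2"; do echo waiting...; sleep 1; done
    fi
}

# $1 and $2 are an optional host and port to wait for before starting Trino
start_trino() {

   wait_for "$1" "$2"

   ${TRINO_HOME}/bin/launcher run --verbose

}

# Usage: bootstrap.sh <trino-coordinator|trino-worker> [wait-host] [wait-port]
case $1 in
        trino-coordinator)
                start_trino "$2" "$3"
                ;;
        trino-worker)
                start_trino "$2" "$3"
                ;;
        *)
                echo "Please pass a valid service name: trino-coordinator or trino-worker"
                exit 1
        ;;
esac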
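
The wait_for loop above retries without limit, so a pod whose dependency never comes up hangs instead of failing. A bounded variant could look like the sketch below. The names wait_for_port and PROBE and the default of 60 attempts are assumptions introduced for illustration; they are not part of the original script.

```shell
#!/usr/bin/env sh
# PROBE is a hypothetical override hook so the check can be swapped out;
# by default it is `nc -z`, the same probe bootstrap.sh uses.
PROBE=${PROBE:-"nc -z"}

# wait_for_port HOST PORT [MAX_ATTEMPTS]
# Returns 0 once HOST:PORT accepts connections, 1 after MAX_ATTEMPTS failures.
wait_for_port() {
    host=$1
    port=$2
    max_attempts=${3:-60}
    attempt=1
    while ! $PROBE "$host" "$port" >/dev/null 2>&1; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "gave up waiting for $host:$port after $attempt attempts"
            return 1
        fi
        attempt=$((attempt + 1))
        sleep 1
    done
    echo "$host:$port is reachable"
}
```

A worker could call `wait_for_port trino-coordinator 8080 120 || exit 1` before starting the launcher, so that Kubernetes sees the failure and restarts the pod.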
           

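Build the image:

docker build -t registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s:416 . --no-cache

### Parameter notes
# -t: name and tag of the image
# . : build context; the Dockerfile in the current directory is used
# -f: path to the Dockerfile (only needed when it is not ./Dockerfile)
# --no-cache: do not use cached layers
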

2) values.yaml configuration

# Default values for trino.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  repository: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart version.
  tag: "416"

imagePullSecrets:
  - name: registry-credentials

server:
  workers: 1
  node:
    environment: production
    dataDir: /opt/apache/trino/data
    pluginDir: /opt/apache/trino/plugin
  log:
    trino:
      level: INFO
  config:
    path: /opt/apache/trino/etc
    http:
      port: 8080
    https:
      enabled: false
      port: 8443
      keystore:
        path: ""
    # Trino supports multiple authentication types: PASSWORD, CERTIFICATE, OAUTH2, JWT, KERBEROS
    # For more info: https://trino.io/docs/current/security/authentication-types.html
    authenticationType: ""
    query:
      maxMemory: "1GB"
      maxMemoryPerNode: "512MB"
    memory:
      heapHeadroomPerNode: "512MB"
  exchangeManager:
    name: "filesystem"
    baseDir: "/tmp/trino-local-file-system-exchange-manager"
  workerExtraConfig: ""
  coordinatorExtraConfig: ""
  autoscaling:
    enabled: false
    maxReplicas: 5
    targetCPUUtilizationPercentage: 50

accessControl: {}
  # type: configmap
  # refreshPeriod: 60s
  # # Rules file is mounted to /etc/trino/access-control
  # configFile: "rules.json"
  # rules:
  #   rules.json: |-
  #     {
  #       "catalogs": [
  #         {
  #           "user": "admin",
  #           "catalog": "(mysql|system)",
  #           "allow": "all"
  #         },
  #         {
  #           "group": "finance|human_resources",
  #           "catalog": "postgres",
  #           "allow": true
  #         },
  #         {
  #           "catalog": "hive",
  #           "allow": "all"
  #         },
  #         {
  #           "user": "alice",
  #           "catalog": "postgresql",
  #           "allow": "read-only"
  #         },
  #         {
  #           "catalog": "system",
  #           "allow": "none"
  #         }
  #       ],
  #       "schemas": [
  #         {
  #           "user": "admin",
  #           "schema": ".*",
  #           "owner": true
  #         },
  #         {
  #           "user": "guest",
  #           "owner": false
  #         },
  #         {
  #           "catalog": "default",
  #           "schema": "default",
  #           "owner": true
  #         }
  #       ]
  #     }

additionalNodeProperties: {}

additionalConfigProperties: {}

additionalLogProperties: {}

additionalExchangeManagerProperties: {}

eventListenerProperties: {}

#additionalCatalogs: {}

additionalCatalogs:
  mysql: |-
    connector.name=mysql
    connection-url=jdbc:mysql://mysql-primary.mysql:3306
    connection-user=root
    connection-password=WyfORdvwVm
  hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://hadoop-hadoop-hive-metastore.hadoop:9083
    hive.allow-drop-table=true
    hive.allow-rename-table=true
    #hive.config.resources=/tmp/core-site.xml,/tmp/hdfs-site.xml


# Array of EnvVar (https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#envvar-v1-core)
env: []

initContainers: {}
  # coordinator:
  #   - name: init-coordinator
  #     image: busybox:1.28
  #     imagePullPolicy: IfNotPresent
  #     command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  # worker:
  #   - name: init-worker
  #     image: busybox:1.28
  #     command: ['sh', '-c', 'echo The worker is running! && sleep 3600']

securityContext:
  runAsUser: 10000
  runAsGroup: 10000

service:
  #type: ClusterIP
  type: NodePort
  port: 8080
  nodePort: 31880

nodeSelector: {}

tolerations: []

affinity: {}

auth: {}
  # Set username and password
  # https://trino.io/docs/current/security/password-file.html#file-format
  # passwordAuth: "username:encrypted-password-with-htpasswd"

serviceAccount:
  # Specifies whether a service account should be created
  create: false
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
  # Annotations to add to the service account
  annotations: {}

secretMounts: []

coordinator:
  jvm:
    maxHeapSize: "2G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"

  additionalJVMConfig: {}

  resources: {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi

worker:
  jvm:
    maxHeapSize: "2G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"

  additionalJVMConfig: {}

  resources: {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
           

3) Trino catalog ConfigMap YAML

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "trino.catalog" . }}
  labels:
    app: {{ template "trino.name" . }}
    chart: {{ template "trino.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
    role: catalogs
data:
  tpch.properties: |
    connector.name=tpch
    tpch.splits-per-node=4
  tpcds.properties: |
    connector.name=tpcds
    tpcds.splits-per-node=4
{{- range $catalogName, $catalogProperties := .Values.additionalCatalogs }}
  {{ $catalogName }}.properties: |
    {{- $catalogProperties | nindent 4 }}
{{- end }}
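
To see what the range loop contributes, here is a rough shell analogue of a single iteration: it prints the key line for one catalog and indents every property line by four spaces, which is what `nindent 4` does in the template. The function name render_catalog_entry is hypothetical and exists only for this illustration; Helm itself does the real rendering.

```shell
#!/usr/bin/env sh
# render_catalog_entry NAME PROPERTIES
# Mimics one loop iteration of the template for a single additionalCatalogs entry.
render_catalog_entry() {
    name=$1
    props=$2
    # Key line, indented two spaces so it sits under `data:`.
    printf '  %s.properties: |\n' "$name"
    # Body of the block scalar: every line indented four spaces.
    printf '%s\n' "$props" | sed 's/^/    /'
}

render_catalog_entry mysql 'connector.name=mysql
connection-url=jdbc:mysql://mysql-primary.mysql:3306'
```

Each key under additionalCatalogs therefore becomes one `<name>.properties` entry in the ConfigMap, next to the built-in tpch and tpcds entries.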
           

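Only the core deployment configuration is listed here; the git download address is given further down. Questions are welcome in the comments or by private message.

4) Install

cd trino-on-kubernetes

# Install
helm install trino ./ -n trino --create-namespace

# Upgrade
# helm upgrade trino ./ -n trino

# Uninstall
# helm uninstall trino -n trino

# Check the pods and services
kubectl get pods,svc -n trino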
           

5) Test and verify

coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`

# Log in
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop

-- List the data sources (catalogs)
show catalogs;
select * from system.runtime.nodes;
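
For scripted checks, the same queries can be run without an interactive session. The sketch below assumes the Trino CLI's `--execute` option and the pod naming, service name, and CLI path used above. The names run_trino_sql, KUBECTL, and NAMESPACE are introduced here for illustration only.

```shell
#!/usr/bin/env sh
# KUBECTL and NAMESPACE are hypothetical override hooks for this sketch.
KUBECTL=${KUBECTL:-kubectl}
NAMESPACE=${NAMESPACE:-trino}

# run_trino_sql "SQL": run one statement on the coordinator pod and print the result.
run_trino_sql() {
    sql=$1
    # `-o name` prints pod/<name>, which `kubectl exec` accepts directly.
    pod=$($KUBECTL get pods -n "$NAMESPACE" -o name | grep coordinator | head -n 1)
    if [ -z "$pod" ]; then
        echo "no coordinator pod found in namespace $NAMESPACE" >&2
        return 1
    fi
    $KUBECTL exec "$pod" -n "$NAMESPACE" -- /opt/apache/trino/bin/trino-cli \
        --server http://trino-coordinator:8080 --user=hadoop --execute "$sql"
}
```

For example, `run_trino_sql "select * from system.runtime.nodes"` lists the coordinator and workers, which is a quick way to confirm that the workers have registered.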
           

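IV. Configuring a Hive Data Source on k8s

For Hive on k8s, see my article: Hadoop on k8s Quick Deployment, Advanced Condensed Edition

Add the data source in trino-on-kubernetes/values.yaml (the hive entry under additionalCatalogs in the values.yaml listing above).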


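Apply the updated configuration and restart the Trino nodes:

helm upgrade trino ./ -n trino

# Restart: ConfigMap changes are not reloaded dynamically, so the pods must be restarted to pick them up
kubectl delete pod -n trino `kubectl get pods -n trino|awk 'NR!=1{print $1}'`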
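
Deleting every pod works, but it takes all nodes down at the same moment. If the chart creates the coordinator and workers as Deployments (an assumption; check with `kubectl get deployment -n trino`), a rolling restart is an alternative. The names restart_trino, KUBECTL, and NAMESPACE below are hypothetical and used only for this sketch.

```shell
#!/usr/bin/env sh
# KUBECTL and NAMESPACE are hypothetical override hooks for this sketch.
KUBECTL=${KUBECTL:-kubectl}
NAMESPACE=${NAMESPACE:-trino}

# Restart every Deployment in the namespace and wait for each rollout to finish.
restart_trino() {
    for d in $($KUBECTL get deployment -n "$NAMESPACE" -o name); do
        $KUBECTL rollout restart "$d" -n "$NAMESPACE" || return 1
        $KUBECTL rollout status "$d" -n "$NAMESPACE" --timeout=300s || return 1
    done
}
```

Queries that are running on a pod when it restarts will still fail, so this is best done while the cluster is idle.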

coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`

# Log in (use the absolute path: ${TRINO_HOME} is not set in the local shell)
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop

-- List the data sources (catalogs)
show catalogs;
-- List the schemas in the hive catalog
show schemas from hive;
-- List the tables
show tables from hive.default;

create schema hive.test;

-- Create a table
CREATE TABLE hive.test.movies (
  movie_id bigint,
  title varchar,
  rating real, -- real is similar to float
  genres varchar,
  release_year int
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['release_year'] -- the partition column must be the last column in the list above
);

-- Load data into the Hive table
INSERT INTO hive.test.movies
VALUES 
(1, 'Toy Story', 8.3, 'Animation|Adventure|Comedy', 1995), 
(2, 'Jumanji', 6.9, 'Action|Adventure|Family', 1995), 
(3, 'Grumpier Old Men', 6.5, 'Comedy|Romance', 1995);

-- Query the data
select * from hive.test.movies;
           

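V. Quick Deployment: Core Steps (jump straight here if you only care about deployment)

If you only want a quick deployment, you can skip everything above and run the following steps:

1) Install git

# 1. Install git
yum -y install git 

2) Download the Trino deployment package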

git clone [email protected]:HBigdata/trino-on-kubernetes.git
cd trino-on-kubernetes
           

3) Configure the data sources

cat -n values.yaml
           

3) Configure resource requests and limits
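
In the values.yaml listing, `coordinator.resources` and `worker.resources` are empty by default. One way to set them without editing the chart is a small override file, sketched below. The file name resources-override.yaml, the function name, and all CPU and memory figures are placeholders chosen for illustration, not recommendations; the only constraint assumed here is that the memory limit leaves room above the 2G JVM heap.

```shell
#!/usr/bin/env sh
# write_resources_override [FILE]
# Writes a hypothetical Helm values override with requests and limits for both roles.
write_resources_override() {
    out=${1:-resources-override.yaml}
    cat > "$out" <<'EOF'
coordinator:
  resources:
    requests:
      cpu: "1"
      memory: 3Gi
    limits:
      cpu: "2"
      memory: 3Gi
worker:
  resources:
    requests:
      cpu: "1"
      memory: 3Gi
    limits:
      cpu: "2"
      memory: 3Gi
EOF
}
```

After calling `write_resources_override`, the file can be applied with `helm upgrade trino ./ -n trino -f resources-override.yaml`.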


4) Fix the Trino configuration


JVM memory configuration
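
The JVM heap (`maxHeapSize`) has to be sized together with `query.maxMemoryPerNode` and `memory.heapHeadroomPerNode`: per the Trino documentation, the sum of the last two must be less than the maximum heap size on the node. The small check below works through that arithmetic for the values in the values.yaml listing (2G heap, 512MB per-node query memory, 512MB headroom). The function name heap_budget_ok is hypothetical, and all arguments are taken to be in megabytes.

```shell
#!/usr/bin/env sh
# heap_budget_ok HEAP_MB QUERY_PER_NODE_MB HEADROOM_MB
# Succeeds when per-node query memory plus headroom is smaller than the heap.
heap_budget_ok() {
    heap_mb=$1
    query_per_node_mb=$2
    headroom_mb=$3
    used=$((query_per_node_mb + headroom_mb))
    if [ "$used" -lt "$heap_mb" ]; then
        echo "ok: $used MB of $heap_mb MB heap reserved"
    else
        echo "too large: $used MB does not fit in $heap_mb MB heap"
        return 1
    fi
}

# Values from the values.yaml listing: 2G heap, 512MB + 512MB.
heap_budget_ok 2048 512 512
```

If you raise `maxMemoryPerNode`, raise `maxHeapSize` and the pod memory limit with it, otherwise the node may refuse to start or the container may be killed for exceeding its limit.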


5) Start the deployment

# git clone [email protected]:HBigdata/trino-on-kubernetes.git
# cd trino-on-kubernetes

# Install
helm install trino ./ -n trino --create-namespace

# Upgrade (run only after changing values.yaml)
# helm upgrade trino ./ -n trino

# Uninstall
# helm uninstall trino -n trino
           

6) Test and verify

coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`

# Log in (use the absolute path: ${TRINO_HOME} is not set in the local shell)
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop

-- List the data sources (catalogs)
show catalogs;
-- List the schemas in the hive catalog
show schemas from hive;
-- List the tables
show tables from hive.default;

create schema hive.test;

-- Create a table
CREATE TABLE hive.test.movies (
  movie_id bigint,
  title varchar,
  rating real, -- real is similar to float
  genres varchar,
  release_year int
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['release_year'] -- the partition column must be the last column in the list above
);

-- Load data into the Hive table
INSERT INTO hive.test.movies
VALUES 
(1, 'Toy Story', 8.3, 'Animation|Adventure|Comedy', 1995), 
(2, 'Jumanji', 6.9, 'Action|Adventure|Family', 1995), 
(3, 'Grumpier Old Men', 6.5, 'Comedy|Romance', 1995);

-- Query the data
select * from hive.test.movies;
           

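That completes the Trino on k8s deployment and the usability demo. If you have questions, follow my WeChat official account 大資料與雲原生技術分享 to join the group chat or message me privately. If this article helped you, a like, share, and bookmark would be much appreciated.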