How to configure a static hostname for multiple replicas of a Flink TaskManager Deployment in Kubernetes and fetch it in a Prometheus ConfigMap?

I have a Flink JobManager with only one TaskManager running on top of Kubernetes. For this I use a Service and a Deployment for the TaskManager with replicas: 1.

apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager
spec:
  type: ClusterIP
  ports:
  - name: prometheus
    port: 9250
  selector:
    app: flink
    component: taskmanager
Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.1-scala_2.12
        # imagePullPolicy: Always
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
        securityContext:
          runAsUser: 9999  # refers to user _flink_ from official flink image, change if necessary
Then I send metrics from the Flink TaskManager to Prometheus. I use a Service, a ConfigMap, and a Deployment to set up Prometheus on top of Kubernetes and have it scrape data from the Flink TaskManager.

apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  type: ClusterIP
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
  selector:
    app: flink
    component: prometheus
The ConfigMap is where I set the Flink TaskManager host in
- targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250']
which matches the Kubernetes Service object for Flink (flink-taskmanager).
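For reference, a ConfigMap with this target list could look roughly like the following. This is a sketch, assuming the same layout as the prometheus-config ConfigMap in the answer below; it is not the original manifest.

```yaml
# Assumed reconstruction, not the original manifest: layout borrowed from the
# answer's ConfigMap, targets taken from the list quoted in the question.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: flink
data:
  prometheus.yml: |+
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'flink'
        scrape_interval: 5s
        static_configs:
          # 'flink-taskmanager' resolves to the ClusterIP Service
          - targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250']
        metrics_path: /
```

With a ClusterIP Service, each scrape of flink-taskmanager:9250 is routed to a single Pod behind the Service, so one target of this kind cannot represent every TaskManager once there are several replicas.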

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: prometheus
  template:
    metadata:
      labels:
        app: flink
        component: prometheus
    spec:
      hostname: prometheus
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
      containers:
      - name: prometheus
        image: prom/prometheus
        ports:
        - containerPort: 9090
        volumeMounts:
          - name: prometheus-config-volume
            mountPath: /etc/prometheus/prometheus.yml
            subPath: prometheus.yml

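This works well, and I can query data from the Flink TaskManager in the Prometheus web UI. However, as soon as I change replicas: 1 to replicas: 3, for instance, I can no longer query data from the TaskManagers. I guess this is because the configuration
- targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250']
is no longer valid when there are more replicas of the Flink TaskManager. But since Kubernetes manages the creation of new TaskManager replicas, I don't know what to configure for this option in Prometheus. I imagine it should be something dynamic, or use a * or some regex that picks up all TaskManagers for me. Does anyone know how to configure it?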

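I got this working by following another answer and a related reference. First, I had to use a StatefulSet instead of a Deployment. A StatefulSet gives each Pod a stable, predictable hostname (the Pod IP itself can still change). What was not clear to me is that I also had to set the Service to clusterIP: None (a headless Service) instead of type: ClusterIP. This is my Service: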

apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager
  labels:
    app: flink-taskmanager
spec:
  clusterIP: None # type: ClusterIP
  ports:
  - name: prometheus
    port: 9250
  selector:
    app: flink-taskmanager
This is my StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flink-taskmanager
spec:
  replicas: 3
  serviceName: flink-taskmanager
  selector:
    matchLabels:
      app: flink-taskmanager # has to match .spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: flink-taskmanager # has to match .spec.selector.matchLabels
    spec:
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.1-scala_2.12
        # imagePullPolicy: Always
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
        securityContext:
          runAsUser: 9999  # refers to user _flink_ from official flink image, change if necessary
In the Prometheus configuration file prometheus.yml, I mapped the hosts using the pattern StatefulSetName-{0..N-1}.ServiceName.default.svc.cluster.local:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: flink
data:
  prometheus.yml: |+
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'flink'
        scrape_interval: 5s
        static_configs:
          - targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local:9250', 'flink-taskmanager-1.flink-taskmanager.default.svc.cluster.local:9250', 'flink-taskmanager-2.flink-taskmanager.default.svc.cluster.local:9250']
        metrics_path: /
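
Listing every replica by hand means the target list has to be edited whenever replicas changes. As a possible alternative, sketched here under the assumption that the headless Service flink-taskmanager lives in the default namespace, Prometheus DNS-based service discovery (dns_sd_configs) can resolve the Service name to the current set of Pod IPs. The job name flink-taskmanagers is made up for illustration.

```yaml
# Sketch only: replaces the hard-coded TaskManager targets in prometheus.yml.
scrape_configs:
  - job_name: 'flink-taskmanagers'   # hypothetical job name
    scrape_interval: 5s
    metrics_path: /
    dns_sd_configs:
      # A headless Service (clusterIP: None) returns one A record per ready Pod
      - names: ['flink-taskmanager.default.svc.cluster.local']
        type: A
        port: 9250
```

With A records, the targets show up in Prometheus by Pod IP rather than by the stable flink-taskmanager-N hostname. The JobManager targets would stay in their own static_configs job.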