How to configure static hostnames for multiple replicas of a Flink TaskManager Deployment in Kubernetes and fetch them in a Prometheus ConfigMap?
I have a Flink JobManager with only one TaskManager running on top of Kubernetes. For this, I use a `Service` and a `Deployment` with `replicas: 1` for the TaskManager. The `Service`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager
spec:
  type: ClusterIP
  ports:
  - name: prometheus
    port: 9250
  selector:
    app: flink
    component: taskmanager
```
The `Deployment`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.1-scala_2.12
        # imagePullPolicy: Always
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
        securityContext:
          runAsUser: 9999  # refers to user _flink_ from official flink image, change if necessary
```
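Then I bridge the metrics from the Flink TaskManager to Prometheus. I use a `Service`, a `ConfigMap`, and a `Deployment` to set up Prometheus on top of Kubernetes and have it scrape the data from the Flink TaskManager. The `Service`: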
```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  type: ClusterIP
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
  selector:
    app: flink
    component: prometheus
```
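The `ConfigMap` is where I set the host of the Flink TaskManager (`flink-taskmanager`) in `- targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250']`, which matches the Kubernetes `Service` object for the Flink TaskManager (`flink-taskmanager`).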
And the `Deployment`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: prometheus
  template:
    metadata:
      labels:
        app: flink
        component: prometheus
    spec:
      hostname: prometheus
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
      containers:
      - name: prometheus
        image: prom/prometheus
        ports:
        - containerPort: 9090
        volumeMounts:
        # A subPath mount does not receive ConfigMap updates:
        # restart this Pod after editing prometheus-config.
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/prometheus.yml
          subPath: prometheus.yml
```
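This works fine, and I can query the Flink TaskManager's data in the Prometheus web UI. However, as soon as I change `replicas: 1` to, for example, `replicas: 3`, I can no longer query data from the TaskManagers. I guess this is because the configuration `- targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251', 'flink-taskmanager:9250']` is no longer valid when there are several `flink-taskmanager` replicas. But since Kubernetes manages the creation of new TaskManager replicas, I don't know what to configure for this option in Prometheus. I imagine it should be something dynamic, or something with `*` or a regex that picks up all TaskManagers for me. Does anyone know how to configure this?

Based on this answer and …, I got it working. First, I had to use a `StatefulSet` instead of a `Deployment`. With this, each Pod gets a stable, predictable name and hostname (`flink-taskmanager-0`, `flink-taskmanager-1`, …) instead of a random suffix. What was not obvious is that I also had to make the Service headless by setting `clusterIP: None` instead of `type: ClusterIP`; a headless Service is what gives each StatefulSet Pod its own DNS record. This is my Service: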
```yaml
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager
  labels:
    app: flink-taskmanager
spec:
  clusterIP: None  # type: ClusterIP
  ports:
  - name: prometheus
    port: 9250
  selector:
    app: flink-taskmanager
```
This is my `StatefulSet`:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flink-taskmanager
spec:
  replicas: 3
  serviceName: flink-taskmanager  # the headless Service that gives each Pod its DNS name
  selector:
    matchLabels:
      app: flink-taskmanager  # has to match .spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: flink-taskmanager  # has to match .spec.selector.matchLabels
    spec:
      # The StatefulSet controller overrides this with each Pod's own name
      # (flink-taskmanager-0, flink-taskmanager-1, ...).
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.1-scala_2.12
        # imagePullPolicy: Always
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
        securityContext:
          runAsUser: 9999  # refers to user _flink_ from official flink image, change if necessary
```
In the Prometheus configuration file `prometheus.yml`, I mapped the hosts using the pattern `StatefulSetName-{0..N-1}.ServiceName.default.svc.cluster.local`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: flink
data:
  prometheus.yml: |+
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'prometheus'
      scrape_interval: 5s
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'flink'
      scrape_interval: 5s
      static_configs:
      - targets:
        - 'flink-jobmanager:9250'
        - 'flink-jobmanager:9251'
        - 'flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local:9250'
        - 'flink-taskmanager-1.flink-taskmanager.default.svc.cluster.local:9250'
        - 'flink-taskmanager-2.flink-taskmanager.default.svc.cluster.local:9250'
      metrics_path: /
```
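Hard-coding one target per ordinal still means editing `prometheus.yml` (and restarting Prometheus) every time `replicas` changes. Because the headless Service above has a named port, one possible alternative is Prometheus DNS service discovery (`dns_sd_configs`) against that Service's SRV record. This is a sketch under assumptions, not the configuration used above and not tested on this cluster: it assumes the `default` namespace, the `cluster.local` cluster domain, and the port name `prometheus` with the default TCP protocol. For StatefulSet Pods behind a headless Service, each SRV answer should point at a stable Pod hostname such as `flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local` on port 9250, so the `instance` labels should match the static list.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: flink
data:
  prometheus.yml: |+
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'prometheus'
      scrape_interval: 5s
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'flink'
      scrape_interval: 5s
      metrics_path: /
      # The single JobManager keeps its static targets.
      static_configs:
      - targets: ['flink-jobmanager:9250', 'flink-jobmanager:9251']
      # Assumption: TaskManagers are discovered from the SRV record of the
      # headless Service "flink-taskmanager" (port name "prometheus", TCP).
      # Prometheus re-resolves the name every refresh_interval, so scaling
      # the StatefulSet should not require editing this file.
      dns_sd_configs:
      - names: ['_prometheus._tcp.flink-taskmanager.default.svc.cluster.local']
        type: SRV
        refresh_interval: 30s
```

Keeping both the static JobManager targets and the DNS-discovered TaskManagers in the same `flink` job leaves the `job` label unchanged. If SRV lookups are not an option, `type: A` with `names: ['flink-taskmanager.default.svc.cluster.local']` and `port: 9250` should also work against a headless Service, but then `instance` is the Pod IP, which changes when a Pod is rescheduled. A `kubernetes_sd_configs` job with `role: pod` and a `keep` relabel rule on the `app` label is closer to the "regex over all TaskManagers" idea from the question, but it requires RBAC permissions for Prometheus to list Pods.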