Spark Docker - Java gateway process exited before sending its port number
I am new to Docker and am trying to run a docker-compose file with Airflow and PySpark. Here is what I have so far:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      # Add this to have third party packages
      - ./requirements.txt:/requirements.txt
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8082:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
I am trying to run the following simple DAG to confirm that PySpark works:
import pyspark
from airflow.models import DAG
from airflow.utils.dates import days_ago, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
import random

args = {
    "owner": "ian",
    "start_date": days_ago(1)
}

dag = DAG(dag_id="pysparkTest", default_args=args, schedule_interval=None)

def run_this_func(**context):
    sc = pyspark.SparkContext()
    print(sc)

with dag:
    run_this_task = PythonOperator(
        task_id='run_this',
        python_callable=run_this_func,
        provide_context=True,
        retries=10,
        retry_delay=timedelta(seconds=1)
    )
When I do that, it fails with the error Java gateway process exited before sending its port number. I found a few posts saying to run the command export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell", so I tried adding that command like this:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: >
      sh -c "bin/spark-class org.apache.spark.deploy.master.Master -h master
      && export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell""
    hostname: master
    ...
But I still get the same error. Any idea what I am doing wrong?

I don't think you need to modify the master's command; just leave it as it is. Also, how is the Python code that you want to run in a different container supposed to connect to the master container? I think you should pass the master URL to the Spark context, for example:
def run_this_func(**context):
    sc = pyspark.SparkContext("spark://master:7077")
    print(sc)
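For illustration, here is a slightly fuller sketch of that suggestion, assuming the Airflow container has pyspark installed (for example via the mounted requirements.txt) and can resolve the master service name on the compose network; the app name is just a placeholder:

from pyspark import SparkConf, SparkContext

def run_this_func(**context):
    # Point the context at the standalone master instead of spawning a local JVM.
    conf = (
        SparkConf()
        .setMaster("spark://master:7077")          # compose service name of the master
        .setAppName("airflow-pyspark-smoke-test")  # placeholder name, not from the original post
    )
    sc = SparkContext(conf=conf)
    try:
        # A trivial job just to confirm that executors are reachable.
        print(sc.parallelize(range(10)).sum())
    finally:
        sc.stop()

Stopping the context at the end also avoids leaving a stale application registered with the master when the task is retried.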
Which container produces this error?

The Spark container. With this docker-compose I can run pure-Python DAG files without any problem.

@DBA108642 Can you be more precise? You don't have a Spark container. Do you see the error in the master or in the worker? Did you try running the original gettyimages docker-compose? Does it work, or does it fail with the same problem?

I haven't tried that yet; I just went to the gettyimages repo and added their docker-compose to mine, but I'll give that a shot.

@DBA108642 Did you solve the problem?

Unfortunately that did not solve my problem, and I had to take a different approach.