
Amazon S3 + PySpark: read from S3 and write to Elasticsearch


I am trying to read data from S3 and write it to Elasticsearch, using Jupyter installed on the Spark master machine.

I have the following configuration:

import pyspark
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
import findspark
findspark.init()
from pyspark.sql import SparkSession
import configparser

# Pull AWS credentials from the shared credentials file
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'DEFAULT'
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")

from pyspark import SparkContext, SparkConf

# Cluster master, executor resources, and the two dependencies:
# hadoop-aws for S3 access and the elasticsearch-spark connector for ES
sc_conf = SparkConf()
sc_conf.setAppName("app-3-logstash")
sc_conf.setMaster('spark://172.31.25.152:7077')
sc_conf.set('spark.executor.memory', '24g')
sc_conf.set('spark.executor.cores', '8')
sc_conf.set('spark.cores.max', '32')
sc_conf.set('spark.logConf', True)
sc_conf.set('spark.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')
sc_conf.set('spark.jars', '/usr/local/spark/jars/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar')
sc = SparkContext(conf=sc_conf)

# Point the s3n:// scheme at the native S3 filesystem and pass credentials
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)


With this configuration I can reach ES but not S3: when I try to read from S3, I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
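
I suspect the packages setting may not be taking effect: as far as I can tell, Spark's key for resolving Maven coordinates is spark.jars.packages rather than spark.packages, which would mean hadoop-aws never reaches the classpath and would explain the missing NativeS3FileSystem class. A variant worth trying:

# 'spark.packages' is not a recognized Spark config key;
# 'spark.jars.packages' is the one spark-submit's --packages maps to
sc_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')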

When I disable sc_conf.set('spark.packages', ...) and sc_conf.set('spark.jars', ...) and instead enable the commented-out os.environ['PYSPARK_SUBMIT_ARGS'] line, I can access S3 but not ES.
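
Since PYSPARK_SUBMIT_ARGS accepts any spark-submit flags, one variant might be to pass both dependencies through it, combining --packages for hadoop-aws with --jars for the ES connector (the jar path is the same one used above); it has to be set before the SparkContext is created:

import os

# Ship hadoop-aws via Maven and the ES connector jar from the local path
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.hadoop:hadoop-aws:2.7.3 "
    "--jars /usr/local/spark/jars/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar "
    "pyspark-shell"
)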

What am I missing?

Thanks,
Yaniv