Apache Beam - Parquet + SparkRunner (reading)

I am currently running an Apache Beam application with the SparkRunner on an on-premise Cloudera Hadoop cluster, using Apache Beam 2.16 and Apache Spark 2.4. I have two versions of the same pipeline: one reads AVRO data and the other reads Parquet (code below).

It would be really helpful if someone could help me figure out the "work overload" problem I am seeing when reading the Parquet data :)

// Read Parquet files under the input path as Avro GenericRecords.
PCollection<GenericRecord> records = pipeline
  .apply("Reading", ParquetIO.read(SCHEMA)
  .from("/foo/bar"));

// Write the same records back out as AVRO files.
records.apply("Writing", AvroIO.writeGenericRecords(SCHEMA)
  .to(options.getOutputPath())
  .withSuffix(".avro"));
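
For comparison, the AVRO-reading version of the same pipeline presumably looks something like the sketch below (only the read transform changes; the same SCHEMA and input path are assumed, since that code is not shown in the post):

// Sketch of the assumed AVRO-reading variant, not copied from the original post:
// only the read transform differs from the Parquet version.
PCollection<GenericRecord> avroRecords = pipeline
  .apply("Reading", AvroIO.readGenericRecords(SCHEMA)
  .from("/foo/bar"));

This is the spark2-submit command used to launch the job: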
spark2-submit --master yarn --class my.class.MainClass \
--driver-memory 8G \
--executor-cores 5 \
--driver-cores 5 \
--conf spark.driver.memory=10G \
--conf spark.executor.memory=10G \
--conf spark.executor.memoryOverhead=2000 \
--conf spark.driver.memoryOverhead=2000 \
--conf spark.yarn.am.memoryOverhead=2000 \
--conf spark.serializer=org.apache.spark.serializer.JavaSerializer \
--conf spark.network.timeout=1200 \
--conf spark.speculation=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=20 \
--conf spark.dynamicAllocation.minExecutors=10 \
--conf spark.shuffle.spill=true \
--conf spark.shuffle.spill.compress=true \
--conf spark.io.compression.codec=snappy \
--conf spark.executor.heartbeatInterval=10000000 \
--conf spark.network.timeout=10000000 \
--conf spark.default.parallelism=100 \
/parent/path/program.jar \
--inputPath="/this/is/the/input/path" \
--outputPath="/this/is/the/output/path" \
--runner=SparkRunner
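
The --inputPath, --outputPath, and --runner arguments at the end of the command are consumed through Beam's PipelineOptions mechanism (the pipeline snippet above calls options.getOutputPath()). A minimal sketch of what that options interface presumably looks like; the interface name and annotations here are assumptions, not taken from the original post:

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

// Hypothetical options interface matching options.getOutputPath() in the snippet
// above and the --inputPath/--outputPath flags passed on the command line.
public interface MyPipelineOptions extends PipelineOptions {
  @Description("Input path, passed as --inputPath")
  @Validation.Required
  String getInputPath();
  void setInputPath(String value);

  @Description("Output path, passed as --outputPath")
  @Validation.Required
  String getOutputPath();
  void setOutputPath(String value);
}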