Apache Spark: How to optimize Hadoop MapReduce compression of Spark output in Google Dataproc?


The goal: millions of rows in Cassandra need to be extracted and compressed into a single file as quickly and efficiently as possible (daily).

The current setup runs a Spark job on a Google Dataproc cluster, which extracts the data directly into a Google Cloud Storage bucket. I have tried two approaches:

  • Using the (now deprecated) FileUtil.copyMerge() to merge the roughly 9,000 Spark partition files into a single uncompressed file, then submitting a Hadoop MapReduce job to compress that single file (the copyMerge step is sketched after the Hadoop job command below)

  • Leaving the roughly 9,000 Spark partition files as the raw output and submitting a Hadoop MapReduce job to merge and compress those files into a single file

  • Some job details: roughly 800 million rows; roughly 9,000 Spark partition files output by the Spark job; the Spark job takes about an hour to run on a 1-master, 4-worker (4 vCPUs and 15 GB each) Dataproc cluster; the default Dataproc Hadoop block size, which I believe is 128 MB

    Some Spark configuration details:

    spark.task.maxFailures=10
    spark.executor.cores=4
    
    spark.cassandra.input.consistency.level=LOCAL_ONE
    spark.cassandra.input.reads_per_sec=100
    spark.cassandra.input.fetch.size_in_rows=1000
    spark.cassandra.input.split.size_in_mb=64
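
    For reference, a minimal sketch of how the settings above are typically applied through SparkConf (they can equally be passed as --conf flags to spark-submit); this is an illustration, not the asker's actual submission code:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Illustration only: the same properties listed above, set programmatically.
    val sparkConf = new SparkConf()
      .set("spark.task.maxFailures", "10")
      .set("spark.executor.cores", "4")
      .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")
      .set("spark.cassandra.input.reads_per_sec", "100")
      .set("spark.cassandra.input.fetch.size_in_rows", "1000")
      .set("spark.cassandra.input.split.size_in_mb", "64")

    val spark = SparkSession.builder.config(sparkConf).getOrCreate()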
    
    The Hadoop job:

    hadoop jar file://usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.output.compress=true \
    -Dmapred.compress.map.output=true \
    -Dstream.map.output.field.separator=, \
    -Dmapred.textoutputformat.separator=, \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input gs://bucket/with/either/single/uncompressed/csv/or/many/spark/partition/file/csvs \
    -output gs://output/bucket \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat
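
    For context, the copyMerge step from the first approach looks roughly like the following (a minimal sketch with illustrative bucket paths, assuming the Hadoop 2.x FileUtil API; not the exact code used):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileUtil, Path}

    // Illustrative paths; the real job reads and writes a GCS bucket.
    val hadoopConf = new Configuration()
    val srcDir  = new Path("gs://bucket/spark-partition-output/")
    val dstFile = new Path("gs://bucket/merged/output.csv")

    // copyMerge streams every part file under srcDir into dstFile sequentially
    // through a single process, which matches the single-node utilization and
    // the ~45 minutes described below. Deprecated in Hadoop 2.8, removed in 3.0.
    FileUtil.copyMerge(
      srcDir.getFileSystem(hadoopConf), srcDir,
      dstFile.getFileSystem(hadoopConf), dstFile,
      false,      // deleteSource: keep the partition files
      hadoopConf,
      null        // addString appended after each file; none here
    )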
    
  • The Spark job took about an hour to extract the Cassandra data into the GCS bucket. Using FileUtil.copyMerge() added roughly 45 minutes; it was executed by the Dataproc cluster but under-utilized the resources, since it appeared to use only one node. The Hadoop job that compressed the single file took another 50 minutes. This is not an optimal approach, because the cluster has to stay up that much longer even though it is not using its full resources. The INFO output of that job:

    INFO mapreduce.Job: Counters: 55
    File System Counters
        FILE: Number of bytes read=5072098452
        FILE: Number of bytes written=7896333915
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        GS: Number of bytes read=47132294405
        GS: Number of bytes written=2641672054
        GS: Number of read operations=0
        GS: Number of large read operations=0
        GS: Number of write operations=0
        HDFS: Number of bytes read=57024
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=352
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Killed map tasks=1
        Launched map tasks=353
        Launched reduce tasks=1
        Rack-local map tasks=353
        Total time spent by all maps in occupied slots (ms)=18495825
        Total time spent by all reduces in occupied slots (ms)=7412208
        Total time spent by all map tasks (ms)=6165275
        Total time spent by all reduce tasks (ms)=2470736
        Total vcore-milliseconds taken by all map tasks=6165275
        Total vcore-milliseconds taken by all reduce tasks=2470736
        Total megabyte-milliseconds taken by all map tasks=18939724800
        Total megabyte-milliseconds taken by all reduce tasks=7590100992
    Map-Reduce Framework
        Map input records=775533855
        Map output records=775533855
        Map output bytes=47130856709
        Map output materialized bytes=2765069653
        Input split bytes=57024
        Combine input records=0
        Combine output records=0
        Reduce input groups=2539721
        Reduce shuffle bytes=2765069653
        Reduce input records=775533855
        Reduce output records=775533855
        Spilled Records=2204752220
        Shuffled Maps =352
        Failed Shuffles=0
        Merged Map outputs=352
        GC time elapsed (ms)=87201
        CPU time spent (ms)=7599340
        Physical memory (bytes) snapshot=204676702208
        Virtual memory (bytes) snapshot=1552881852416
        Total committed heap usage (bytes)=193017675776
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=47132294405
    File Output Format Counters 
        Bytes Written=2641672054 
    
  • I expected this approach to perform as well as or better than the other one, but it performed much worse. The Spark job was unchanged. Skipping FileUtil.copyMerge() and jumping straight to the Hadoop MapReduce job... after an hour and a half, the map portion of the job was only at about 50%. The job was cancelled at that point, because it was clearly not going to be viable.
    I have full control over the Spark job and the Hadoop job. I know we could create a larger cluster, but I would rather do that only after making sure the jobs themselves are optimized. Any help is appreciated. Thank you.

    Could you provide some more details about the Spark job? Which Spark API are you using - RDD or Dataframe? And why not perform the merge stage entirely in Spark (with repartition().write()) and avoid chaining a Spark job and an MR job?
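
    A minimal sketch of that suggestion, assuming the Dataframe API and the spark-cassandra-connector; the keyspace, table, and bucket names are placeholders, not the asker's actual job:

    import org.apache.spark.sql.SparkSession

    // Sketch of the Spark-only merge-and-compress suggested above; keyspace,
    // table, and output path are placeholders.
    val spark = SparkSession.builder.appName("cassandra-daily-export").getOrCreate()

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // coalesce(1) (or repartition(1)) funnels the output through one task, so
    // Spark writes a single gzip-compressed CSV in one pass and the follow-up
    // MapReduce merge/compress job is no longer needed.
    df.coalesce(1)
      .write
      .option("compression", "gzip")
      .csv("gs://bucket/daily-export/")

    The final write still runs through a single task, but it avoids a second pass over the roughly 9,000 part files in GCS.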