
Performance tuning Spark's Word2Vec


I want to improve the performance of Spark's Word2Vec model on an EMR cluster. I have roughly 54 GB of cleaned patent text data on which I want to train Spark's Word2Vec. It looks like it is running, but I think the performance could be improved. Can anyone give me some advice?

Preprocessing steps taken (a PySpark sketch of these steps follows the list):

  • Remove special characters from the text and collapse unnecessary whitespace
  • Tokenize the words
  • Remove stop words from the tokens
  • Lemmatize the words
  • Remove frequently occurring words (words that appear in more than 30% of the documents)
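
Below is a minimal PySpark sketch of these preprocessing steps. It rests on assumptions not stated in the question: the input path and the column names ("text", "tokens", "filtered") are invented for illustration, and lemmatization, which Spark ML does not provide, is left as a placeholder comment.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder.appName("patent-preprocessing").getOrCreate()
df = spark.read.text("s3://my-bucket/patents/")  # hypothetical input path

# Remove special characters and collapse unnecessary whitespace
clean = df.withColumn(
    "text",
    F.regexp_replace(F.regexp_replace("value", r"[^a-zA-Z\s]", " "), r"\s+", " "),
)

# Tokenize (RegexTokenizer also lowercases by default)
tokens = RegexTokenizer(inputCol="text", outputCol="tokens",
                        pattern=r"\s+").transform(clean)

# Remove stop words
no_stop = StopWordsRemover(inputCol="tokens",
                           outputCol="filtered").transform(tokens)

# (Lemmatization would go here, e.g. a UDF wrapping NLTK or spaCy.)

# Drop words that appear in more than 30% of documents: compute the
# document frequency of each token, collect the over-frequent ones,
# and filter them out of every token array.
n_docs = no_stop.count()
frequent = {
    r["word"]
    for r in no_stop.select(F.explode(F.array_distinct("filtered")).alias("word"))
                    .groupBy("word").count()
                    .where(F.col("count") > 0.3 * n_docs)
                    .collect()
}
bc = spark.sparkContext.broadcast(frequent)
drop_frequent = F.udf(lambda ws: [w for w in ws if w not in bc.value],
                      "array<string>")
result = no_stop.withColumn("filtered", drop_frequent("filtered"))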
Sample of the cleaned data:

+----------------------------------------------------------------------------------------------------+
|[water, cooling, cooled, type, pre, burning, present, invention, provides, kind, water, cooling, ...|
|[new, energetic, liquid, invention, discloses, kind, new, energetic, liquid, made, head, outlet, ...|
|[pre, assembly, pre, disclosed, pre, cylindrical, body, member, extending, axially, opposite, pre...|
|[part, feed, ozone, feed, form, difference, ozone, concentration, space, wise, time, wise, premix...|
|[homogeneous, charge, thereof, invention, discloses, homogeneous, type, thereof, cover, arranged,...|
|[gasoline, pre, plug, pre, communicating, plug, associated, pre, respectively, gasoline, injected...|
|[pre, pre, homogeneous, charge, hcci, mode, providing, pre, fluidly, creating, radical, pre, achi...|
|[pre, 105, 351, another, aspect, pre, equal, greater, main, 107, 355, ieast, prior, main, aspect,...|
|[energy, apparatus, energy, apparatus, presented, herein, energy, conversion, module, containing,...|
|[diesel, invention, provides, inlet, processing, diesel, diesel, inlet, treatment, diesel, charac...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
EMR hardware setup:

  • Master: m5.2xlarge, 8 vCore, 32 GiB memory, EBS-only storage, EBS storage: 128 GiB
  • Core (10x): m5.4xlarge, 16 vCore, 64 GiB memory, EBS-only storage, EBS storage: 256 GiB
spark-submit settings:

spark-submit --master yarn \
  --conf "spark.executor.instances=40" \
  --conf "spark.default.parallelism=640" \
  --conf "spark.executor.cores=4" \
  --conf "spark.executor.memory=12g" \
  --conf "spark.driver.memory=12g" \
  --conf "spark.driver.maxResultSize=12g" \
  --conf "spark.dynamicAllocation.enabled=false" \
  run_program.py
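
For reference, these flags add up: 40 executors x 4 cores = 160 executor cores, matching the 10 core nodes x 16 vCores, and 4 executors per node x 12 GiB fits into each node's 64 GiB once YARN overhead is accounted for. The same settings expressed programmatically, as a sketch (on YARN, executor sizing passed on the spark-submit command line is what actually takes effect at launch):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.executor.instances", "40")       # 4 executors per core node
    .config("spark.default.parallelism", "640")     # 4 tasks per executor core
    .config("spark.executor.cores", "4")            # 40 x 4 = 160 cores in total
    .config("spark.executor.memory", "12g")
    .config("spark.driver.memory", "12g")
    .config("spark.driver.maxResultSize", "12g")
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)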
Word2Vec settings (where not mentioned, I use the defaults; a fitting sketch follows the list):

  • vectorSize=200
  • minCount=5
  • numIterations=15
  • numPartitions=120
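
A minimal sketch of the Word2Vec stage with these settings, using the DataFrame-based pyspark.ml API, where the RDD API's numIterations corresponds to maxIter; the input column "filtered" and the DataFrame result are carried over from the hypothetical preprocessing sketch above:

from pyspark.ml.feature import Word2Vec

w2v = Word2Vec(
    inputCol="filtered",    # token arrays from the preprocessing sketch
    outputCol="embedding",
    vectorSize=200,
    minCount=5,
    maxIter=15,             # called numIterations in the RDD-based MLlib API
    numPartitions=120,
)
model = w2v.fit(result)
model.findSynonyms("engine", 10).show()  # quick sanity check on a domain word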
Further considerations:

  • During estimation, the cluster uses roughly 70% of its CPU
  • During estimation, total RAM usage is roughly 50-60%
Should I increase numPartitions to push CPU utilization toward 100%? Would that reduce the model's accuracy, and if so by how much? How should numIterations be set, and what value is sufficient in this case?

Can anyone help me with this?


Thanks in advance.

Why is such a large corpus necessary?

@Nacho So you suggest sampling the corpus and fitting Word2Vec on the sample?

If there is no particular reason to have such a large corpus, I would suggest taking a sample of at most 15-20 GB. That is enough for general-purpose applications.

Okay, that seems reasonable. I will give it a try :)
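
Following the suggestion in these comments, a one-line sketch of drawing a roughly 15-20 GB sample from the 54 GB corpus (about 30% of documents; result and w2v are the hypothetical names from the sketches above):

# ~30% of 54 GB is about 16 GB, inside the suggested 15-20 GB range
sample = result.sample(withReplacement=False, fraction=0.3, seed=42)
model = w2v.fit(sample)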