
Performance tuning Spark's Word2Vec


I want to improve the performance of Spark's Word2Vec model on an EMR cluster. I have roughly 54 GB of cleaned patent text data on which I want to train Spark's Word2Vec. It looks like it is running, but I think the performance could be improved. Can anyone give me some advice?

Preprocessing steps taken (a PySpark sketch of these steps follows the list):

  • Remove special characters from the text and collapse unnecessary whitespace
  • Tokenize the words
  • Remove stop words from the tokens
  • Lemmatize the words
  • Remove frequently occurring words (words that appear in more than 30% of the documents)
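
Below is a minimal PySpark sketch of these preprocessing steps. It rests on assumptions not stated in the question: the input path and the column names ("text", "tokens", "filtered") are invented for illustration, and lemmatization, which Spark ML does not provide, is left as a placeholder comment.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder.appName("patent-preprocessing").getOrCreate()
df = spark.read.text("s3://my-bucket/patents/")  # hypothetical input path

# Remove special characters and collapse unnecessary whitespace
clean = df.withColumn(
    "text",
    F.regexp_replace(F.regexp_replace("value", r"[^a-zA-Z\s]", " "), r"\s+", " "),
)

# Tokenize (RegexTokenizer also lowercases by default)
tokens = RegexTokenizer(inputCol="text", outputCol="tokens",
                        pattern=r"\s+").transform(clean)

# Remove stop words
no_stop = StopWordsRemover(inputCol="tokens",
                           outputCol="filtered").transform(tokens)

# (Lemmatization would go here, e.g. a UDF wrapping NLTK or spaCy.)

# Drop words that appear in more than 30% of documents: compute the
# document frequency of each token, collect the over-frequent ones,
# and filter them out of every token array.
n_docs = no_stop.count()
frequent = {
    r["word"]
    for r in no_stop.select(F.explode(F.array_distinct("filtered")).alias("word"))
                    .groupBy("word").count()
                    .where(F.col("count") > 0.3 * n_docs)
                    .collect()
}
bc = spark.sparkContext.broadcast(frequent)
drop_frequent = F.udf(lambda ws: [w for w in ws if w not in bc.value],
                      "array<string>")
result = no_stop.withColumn("filtered", drop_frequent("filtered"))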
Sample of the cleaned data:

+----------------------------------------------------------------------------------------------------+
|[water, cooling, cooled, type, pre, burning, present, invention, provides, kind, water, cooling, ...|
|[new, energetic, liquid, invention, discloses, kind, new, energetic, liquid, made, head, outlet, ...|
|[pre, assembly, pre, disclosed, pre, cylindrical, body, member, extending, axially, opposite, pre...|
|[part, feed, ozone, feed, form, difference, ozone, concentration, space, wise, time, wise, premix...|
|[homogeneous, charge, thereof, invention, discloses, homogeneous, type, thereof, cover, arranged,...|
|[gasoline, pre, plug, pre, communicating, plug, associated, pre, respectively, gasoline, injected...|
|[pre, pre, homogeneous, charge, hcci, mode, providing, pre, fluidly, creating, radical, pre, achi...|
|[pre, 105, 351, another, aspect, pre, equal, greater, main, 107, 355, ieast, prior, main, aspect,...|
|[energy, apparatus, energy, apparatus, presented, herein, energy, conversion, module, containing,...|
|[diesel, invention, provides, inlet, processing, diesel, diesel, inlet, treatment, diesel, charac...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
EMR hardware setup:

  • Master: m5.2xlarge, 8 vCore, 32 GiB memory, EBS-only storage, EBS storage: 128 GiB
  • Core (10x): m5.4xlarge, 16 vCore, 64 GiB memory, EBS-only storage, EBS storage: 256 GiB
spark-submit settings:

spark-submit --master yarn \
  --conf "spark.executor.instances=40" \
  --conf "spark.default.parallelism=640" \
  --conf "spark.executor.cores=4" \
  --conf "spark.executor.memory=12g" \
  --conf "spark.driver.memory=12g" \
  --conf "spark.driver.maxResultSize=12g" \
  --conf "spark.dynamicAllocation.enabled=false" \
  run_program.py
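
For reference, these flags add up: 40 executors x 4 cores = 160 executor cores, matching the 10 core nodes x 16 vCores, and 4 executors per node x 12 GiB fits into each node's 64 GiB once YARN overhead is accounted for. The same settings expressed programmatically, as a sketch (on YARN, executor sizing passed on the spark-submit command line is what actually takes effect at launch):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.executor.instances", "40")       # 4 executors per core node
    .config("spark.default.parallelism", "640")     # 4 tasks per executor core
    .config("spark.executor.cores", "4")            # 40 x 4 = 160 cores in total
    .config("spark.executor.memory", "12g")
    .config("spark.driver.memory", "12g")
    .config("spark.driver.maxResultSize", "12g")
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)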
Word2Vec settings (where not mentioned, I use the defaults; a fitting sketch follows the list):

  • vectorSize=200
  • minCount=5
  • numIterations=15
  • numPartitions=120
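
A minimal sketch of the Word2Vec stage with these settings, using the DataFrame-based pyspark.ml API, where the RDD API's numIterations corresponds to maxIter; the input column "filtered" and the DataFrame result are carried over from the hypothetical preprocessing sketch above:

from pyspark.ml.feature import Word2Vec

w2v = Word2Vec(
    inputCol="filtered",    # token arrays from the preprocessing sketch
    outputCol="embedding",
    vectorSize=200,
    minCount=5,
    maxIter=15,             # called numIterations in the RDD-based MLlib API
    numPartitions=120,
)
model = w2v.fit(result)
model.findSynonyms("engine", 10).show()  # quick sanity check on a domain word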
Further considerations:

  • During estimation, the cluster uses roughly 70% of its CPU
  • During estimation, total RAM usage is roughly 50-60%
Should I increase numPartitions to push CPU utilization toward 100%? Would that reduce the model's accuracy, and if so by how much? How should numIterations be set, and what value is sufficient in this case?

Can anyone help me with this?


Thanks in advance.

Why is such a large corpus necessary?

@Nacho So you suggest sampling the corpus and fitting Word2Vec on the sample?

If there is no particular reason to have such a large corpus, I would suggest taking a sample of at most 15-20 GB. That is enough for general-purpose applications.

Okay, that seems reasonable. I will give it a try :)
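
Following the suggestion in these comments, a one-line sketch of drawing a roughly 15-20 GB sample from the 54 GB corpus (about 30% of documents; result and w2v are the hypothetical names from the sketches above):

# ~30% of 54 GB is about 16 GB, inside the suggested 15-20 GB range
sample = result.sample(withReplacement=False, fraction=0.3, seed=42)
model = w2v.fit(sample)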