Performance tuning Spark's Word2Vec
I want to improve the performance of Spark's Word2Vec model on an EMR cluster. I have about 54 GB of cleaned patent text data on which I want to train Spark's Word2Vec. It appears to be running, but I think the performance could be better. Can anyone give me some advice?

Preprocessing steps taken:
- Remove special characters from the text and collapse unnecessary whitespace
- Tokenize the words
- Remove stopwords from the tokens
- Lemmatize the words
- Remove frequent words (words appearing in more than 30% of the documents)
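The per-document part of the steps above can be sketched as a plain Python function (the kind of logic that would run inside a Spark UDF). This is a minimal, hypothetical sketch: the stopword list is an illustrative subset, and lemmatization and the corpus-level frequent-word filter are omitted because they need an external model and corpus statistics.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative subset

def clean_and_tokenize(text):
    """Strip special characters, collapse whitespace, lowercase,
    tokenize on whitespace, and drop stopwords."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)       # remove special characters
    text = re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(clean_and_tokenize("The water-cooling, pre-burning system of the invention!"))
# → ['water', 'cooling', 'pre', 'burning', 'system', 'invention']
```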
+----------------------------------------------------------------------------------------------------+
|[water, cooling, cooled, type, pre, burning, present, invention, provides, kind, water, cooling, ...|
|[new, energetic, liquid, invention, discloses, kind, new, energetic, liquid, made, head, outlet, ...|
|[pre, assembly, pre, disclosed, pre, cylindrical, body, member, extending, axially, opposite, pre...|
|[part, feed, ozone, feed, form, difference, ozone, concentration, space, wise, time, wise, premix...|
|[homogeneous, charge, thereof, invention, discloses, homogeneous, type, thereof, cover, arranged,...|
|[gasoline, pre, plug, pre, communicating, plug, associated, pre, respectively, gasoline, injected...|
|[pre, pre, homogeneous, charge, hcci, mode, providing, pre, fluidly, creating, radical, pre, achi...|
|[pre, 105, 351, another, aspect, pre, equal, greater, main, 107, 355, ieast, prior, main, aspect,...|
|[energy, apparatus, energy, apparatus, presented, herein, energy, conversion, module, containing,...|
|[diesel, invention, provides, inlet, processing, diesel, diesel, inlet, treatment, diesel, charac...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
EMR hardware setup:
- Master: m5.2xlarge, 8 vCores, 32 GiB memory, EBS-only storage: 128 GiB
- Core (10x): m5.4xlarge, 16 vCores, 64 GiB memory, EBS-only storage: 256 GiB
spark-submit settings:
spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.executor.cores=4" --conf "spark.executor.memory=12g" --conf "spark.driver.memory=12g" --conf "spark.driver.maxResultSize=12g" --conf "spark.dynamicAllocation.enabled=false" run_program.py
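As a quick sanity check, the requested executors can be compared against the core-node capacity listed above (plain arithmetic, assuming YARN's default executor memory overhead of roughly 10%):

```python
# Cluster capacity from the EMR hardware setup (core nodes only).
core_nodes = 10
vcores_per_node = 16
mem_gib_per_node = 64

# Resources requested in the spark-submit command.
executors = 40
cores_per_executor = 4
mem_gib_per_executor = 12

# 40 executors x 4 cores exactly matches the 160 available vCores.
assert executors * cores_per_executor == core_nodes * vcores_per_node

# 4 executors land on each node; with ~10% YARN memory overhead each
# executor needs about 13.2 GiB, i.e. ~52.8 GiB per node, which fits
# comfortably in 64 GiB.
per_node_executors = executors // core_nodes
per_node_mem = round(per_node_executors * mem_gib_per_executor * 1.10, 1)
print(per_node_mem)  # → 52.8
assert per_node_mem < mem_gib_per_node
```

So the CPU side is fully subscribed while memory leaves headroom, which matches the utilization numbers reported below.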
Word2Vec settings (I use the defaults where not mentioned):
vectorSize=200
minCount=5
numIterations=15
numPartitions=120
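Wired into the DataFrame API, the settings above would look roughly like this. This is a configuration sketch, not a tested run: `tokens` and `token_df` are assumed names for the preprocessed column and DataFrame, and note that in `pyspark.ml` the iteration count is called `maxIter` (it is `numIterations` in the older RDD-based `mllib` API).

```python
from pyspark.ml.feature import Word2Vec

# Sketch of the settings listed above; "tokens" is an assumed name for
# the array<string> column produced by the preprocessing steps.
w2v = Word2Vec(
    inputCol="tokens",
    outputCol="vectors",
    vectorSize=200,
    minCount=5,
    maxIter=15,         # numIterations in the RDD-based mllib API
    numPartitions=120,
)
# model = w2v.fit(token_df)
```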
- During estimation, the cluster uses roughly 70% of its CPU
- Total RAM usage during estimation is about 50-60%
Should I increase numPartitions to get to roughly 100% CPU utilization? How much (if at all) would that reduce the model's accuracy? How should I set numIterations? What is sufficient in this case?

Can anyone help me with this?
Thanks in advance.

Comments: Why is such a large corpus a must? — @Nacho So you suggest sampling the corpus and estimating Word2Vec on the sample? — If there is no particular reason to have such a large corpus, I would suggest taking a sample of at most 15-20 GB. That is enough for general-purpose applications. — OK, that seems reasonable. I'll give it a try :)
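Following the suggestion in the comments, the sample fraction for shrinking the corpus works out as below (plain arithmetic from the numbers in the discussion; the `df.sample` call in the comment is illustrative, with `token_df` an assumed name):

```python
# Back-of-the-envelope sample fraction for shrinking the ~54 GB corpus
# to the ~20 GB upper bound suggested in the comments.
corpus_gb = 54
target_gb = 20
fraction = round(target_gb / corpus_gb, 2)
print(fraction)  # → 0.37

# Applied in PySpark (illustrative):
#   sampled_df = token_df.sample(fraction=fraction, seed=42)
```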