PySpark UDF optimization challenge

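I am trying to optimize the code below. When run on 1000 rows of data, it takes about 12 minutes to complete. Our use case will require data sizes of roughly 25K-50K rows, which would make this implementation completely infeasible.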

import pyspark.sql.types as Types
import numpy
import spacy
from pyspark.sql.functions import udf

inputPath = "s3://myData/part-*.parquet"
df = spark.read.parquet(inputPath)

test_df = df.select('uid', 'content').limit(1000).repartition(10)

# print(df.rdd.getNumPartitions()) -> 4
# print(test_df.rdd.getNumPartitions()) -> 1

def load_glove(fn):
    vector_dict = {}
    count = 0
    with open(fn) as inf:
        for line in inf:
            count += 1
            eles = line.strip().split()
            token = eles[0]
            try:
                vector_dict[token] = numpy.array([float(x) for x in eles[1:]])
                assert len(vector_dict[token]) == 300
            except:
                print("Exception in load_glove")
                pass
    return vector_dict

# Returning an Array of doubles from the udf
@udf(returnType=Types.ArrayType(Types.FloatType()))
def generateVectorRepresentation(text):
  # TODO: move the load function out if possible, and remove unused modules
  # nlp = spacy.load('en', disable=['parser', 'tagger'])
  nlp = spacy.load('en', max_length=6000000)
  gloveEmbeddingsPath = "/home/hadoop/short_glove_1000.300d.txt"
  glove_embeddings_dict = load_glove(gloveEmbeddingsPath)
  spacy_doc = nlp(text)
  doc_vec = numpy.array([0.0] * 300)
  doc_vec = numpy.float32(doc_vec)
  wordcount = 0
  for sentence_id, sentence in enumerate(spacy_doc.sents):
      for word in sentence:
          if word.text in glove_embeddings_dict:
              # Pre-convert to glove dictionary to float32 representations
              doc_vec += numpy.float32(glove_embeddings_dict[word.text])
              wordcount += 1

  # Document Vector is the average of all word vectors in the document
  doc_vec = doc_vec/(1.0 * wordcount)
  return doc_vec.tolist()

spark.udf.register("generateVectorRepresentation", generateVectorRepresentation)

document_vector_df = test_df.withColumn("Glove Document Vector", generateVectorRepresentation('content'))

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pandas_document_vector_df = document_vector_df.toPandas()

# print(pandas_document_vector_df)
pandas_document_vector_df.head()

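I was wondering if you could help me answer the questions below.

Are the spacy.load() and load_glove() methods invoked on every iteration? Is there a way to prepare the load_glove() data once per worker node instead of once per row of data?

The load_glove method returns a dictionary object that could be as large as 5GB. Is there a way to prepare it on the master node and then pass it as a parameter to the UDF?

I appreciate your suggestions. Thanks in advance.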

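Yes, in the current implementation all of the model-loading code is executed every time your function runs, which is far from optimal. There is no way to pass it directly from the driver to the worker nodes, but there is a similar approach: initialize the model on each worker, but only once. To do that, you have to use a lazy function that is executed only when the actual result is needed, and therefore on the workers.

Try something like this: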

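# Here we do not load the model at import time; only the worker code
# calls this routine and gets the spaCy object. This means a new
# spaCy model is loaded on each executor.
SPACY_MODEL = None

def get_spacy_model():
    global SPACY_MODEL
    if not SPACY_MODEL:
        _model = spacy.load('en', max_length=6000000)
        # Assign inside the `if` so that later calls reuse the cached model
        SPACY_MODEL = _model
    return SPACY_MODEL

@udf(returnType=Types.ArrayType(Types.FloatType()))
def generateVectorRepresentation(text):
  # TODO: move the load function out if possible, and remove unused modules
  # nlp = spacy.load('en', disable=['parser', 'tagger'])
  nlp = get_spacy_model()
  # your further processing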

I think you can try moving the GloVe loading code into a similar function.
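As a rough sketch of what that could look like for the GloVe dictionary (not tested on a cluster): the names `_GLOVE_CACHE`, `get_glove_embeddings` and `average_vector`, and the `dim` parameter, are made up for illustration. Unlike the code in the question, it returns None when no word matches instead of dividing by zero.

```python
import numpy

# path -> {token: float32 vector}; filled once per Python worker process
_GLOVE_CACHE = {}


def get_glove_embeddings(path, dim=300):
    """Load the GloVe text file on first use and reuse the dict afterwards."""
    if path not in _GLOVE_CACHE:
        vectors = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                parts = line.rstrip().split(" ")
                if len(parts) != dim + 1:
                    continue  # skip lines with the wrong number of fields
                try:
                    values = [float(x) for x in parts[1:]]
                except ValueError:
                    continue  # skip lines with non-numeric fields
                vectors[parts[0]] = numpy.array(values, dtype=numpy.float32)
        _GLOVE_CACHE[path] = vectors
    return _GLOVE_CACHE[path]


def average_vector(tokens, embeddings, dim=300):
    """Average the vectors of the known tokens; None if nothing matched."""
    total = numpy.zeros(dim, dtype=numpy.float32)
    matched = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is not None:
            total += vector
            matched += 1
    if matched == 0:
        return None  # Spark stores None as a null cell
    return (total / matched).tolist()
```

Inside the UDF you would then call `get_glove_embeddings("/home/hadoop/short_glove_1000.300d.txt")` next to `get_spacy_model()` and pass the word texts to `average_vector`. The file still has to exist at that path on every worker node.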

You can try reading more about this here: (this is not my blog; I just found this information while trying to do the same thing with a spaCy model).

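The main reason UDFs are so slow is that Spark cannot optimize them (they are treated as black boxes). So, to make this faster, you need to take as much as possible out of the UDF and replace it with vanilla Spark functions. Ideally, keep only the spacy part (I am not familiar with that module) in the UDF, get a resulting DF, and then do the rest of the required transformations with vanilla Spark functions.

For example, load_glove() will be executed for every row, as the other answer says. But looking at the code, its format looks like it could be converted into a DataFrame with 301 columns. You could then join on it to get the values you need. (That is, if you can get the other DF to use word.text as a key. It is a bit hard to tell without the data, but in theory it looks possible.)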
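A minimal sketch of that join idea, under some assumptions: `parse_glove_row`, `glove_dataframe` and `average_by_join` are hypothetical helper names, the GloVe file is assumed to sit somewhere Spark can read from every node (for example S3 or HDFS, not a driver-local path), and `tokens_df` is assumed to already hold one `(uid, token)` row per word, produced by a UDF that only does the spaCy tokenization. The pyspark imports sit inside the functions so the parsing helper can be tried without a cluster.

```python
def parse_glove_row(line, dim=300):
    """Turn 'token v1 ... v_dim' into a flat tuple, or None if malformed."""
    parts = line.rstrip().split(" ")
    if len(parts) != dim + 1:
        return None
    try:
        return (parts[0],) + tuple(float(x) for x in parts[1:])
    except ValueError:
        return None


def glove_dataframe(spark, glove_path, dim=300):
    """Build a DataFrame with columns token, v0 .. v{dim-1} (301 for dim=300)."""
    from pyspark.sql.types import FloatType, StringType, StructField, StructType

    schema = StructType(
        [StructField("token", StringType(), False)]
        + [StructField(f"v{i}", FloatType(), False) for i in range(dim)]
    )
    rows = (
        spark.sparkContext.textFile(glove_path)
        .map(lambda line: parse_glove_row(line, dim))
        .filter(lambda row: row is not None)
    )
    return spark.createDataFrame(rows, schema)


def average_by_join(tokens_df, glove_df, dim=300):
    """Join words to their vectors and average each dimension per document."""
    from pyspark.sql import functions as F

    joined = tokens_df.join(glove_df, on="token", how="inner")
    means = [F.avg(f"v{i}").alias(f"v{i}") for i in range(dim)]
    return joined.groupBy("uid").agg(*means)
```

Note that with an inner join, documents with no matching words drop out of the result; a left join from the documents would keep them with null averages.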

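@Rai Thank you very much for the reply. This worked for loading both the spaCy model and the GloVe dictionary, and it brought a huge performance gain!

Thanks for the reply. One small correction to the pattern above: the SPACY_MODEL = _model line needs to be inside the if statement. I tried to replicate this with another, similar NLP pipeline from a different library, but I ran into an issue with it.

Does it cause a race condition when the UDFs run in parallel?

Thanks for the reply. Yes, I am exploring that route as well.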
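On the race-condition question: PySpark normally runs Python UDFs in separate Python worker processes, so each process has its own copy of the module-level global and the lazy loader is not shared between them. If concurrent first calls within one process are still a concern, a lock around the initialization is a cheap safeguard. A minimal sketch, where `get_model` and its `loader` argument are hypothetical names:

```python
import threading

_MODEL = None
_MODEL_LOCK = threading.Lock()


def get_model(loader):
    """Return the shared model, calling `loader` at most once per process."""
    global _MODEL
    if _MODEL is None:  # fast path: no locking once the model is loaded
        with _MODEL_LOCK:
            if _MODEL is None:  # re-check: another thread may have loaded it
                _MODEL = loader()
    return _MODEL
```

With the code from the answer, this would be called as `nlp = get_model(lambda: spacy.load('en', max_length=6000000))`.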