Updating values in a global dictionary from a PySpark UDF

I have a user-defined function (UDF) that adds a new column to a Spark DataFrame, but it is somewhat slow.

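The UDF computes the edit distance between a user's entry and a list of correctly spelled words. I would like to speed it up by storing each user entry and its closest word match in a global dictionary. The idea is to check the global dictionary first, before spending time calculating scores against every word.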

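I'm new to Spark/PySpark, so I don't know all the correct terminology, but from what I've read, executors don't seem to keep track of global variables across threads (or whatever they are). I've also read about broadcast variables, but I think those are passed in as inputs, and accumulators only allow numeric data.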

Here is some sample code showing what I'm currently working with:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumed to be defined elsewhere: distance(a, b) returns the edit distance,
# word_dataset is an iterable of correctly spelled words, and user_data is a
# DataFrame with a 'user_entry' column.


def guess_word(user_entry):
    user_entry = user_entry.upper().strip()

    # Check if the best match has already been calculated from a previous row;
    # if not, calculate scores and return the word with the lowest score
    if user_entry not in global_dict:
        scores = {}
        # Calculate scores against every word, skipping duplicates
        for word in word_dataset:
            word = word.upper().strip()
            if word not in scores:
                scores[word] = distance(user_entry, word)
        # Get the word with the lowest score (aka best match)
        word_guess, score = min(scores.items(), key=lambda kv: kv[1])

        # Update the global dictionary
        global_dict[user_entry] = (word_guess, score)

    else:
        # The dictionary stores (word, score) tuples; return only the word
        word_guess, _ = global_dict[user_entry]

    return word_guess


global_dict = {}

guess_word_udf = udf(guess_word, StringType())

user_data = user_data.withColumn('word_guess', guess_word_udf('user_entry'))

After running this code, the global dictionary is always empty. Is it possible to update it from inside the UDF?
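
A likely explanation for the empty dictionary: the UDF runs in Python worker processes on the executors, and each of those works with its own copy of `global_dict`. Updates are made to those copies and are never sent back to the driver. The sketch below only imitates that round trip with `pickle` inside a single process. It does not use Spark, and `run_on_worker` and `driver_cache` are hypothetical names introduced for illustration.

```python
import pickle

# Driver-side cache, empty like global_dict in the question.
driver_cache = {}


def run_on_worker(serialized_cache, entries):
    """Hypothetical stand-in for an executor: it works on a deserialized copy."""
    worker_cache = pickle.loads(serialized_cache)
    for entry in entries:
        # Placeholder value; a real UDF would store the best match here.
        worker_cache[entry.upper().strip()] = "some guess"
    return worker_cache


# "Ship" the cache to the worker as bytes, then let the worker fill its copy.
worker_cache = run_on_worker(pickle.dumps(driver_cache), [" apple", "pear "])

print(len(worker_cache))  # size of the worker's copy after the updates
print(len(driver_cache))  # size of the driver's original dictionary
```

The worker's copy grows while the driver's dictionary does not change, which matches what the question observes after `withColumn` has run.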

I just realized that I don't need the dictionary once the UDF has finished running, so this question is now moot :D
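
If the dictionary is only needed as a lookup cache while the function runs, a per-process memo may be enough. The following is a minimal plain-Python sketch, not the original code: it swaps the hand-managed dictionary for `functools.lru_cache`, and `WORDS`, `edit_distance` and `guess_word_cached` are hypothetical stand-ins for the question's `word_dataset`, `distance` and `guess_word`.

```python
from functools import lru_cache

# Assumption: a tiny stand-in for the question's word_dataset.
WORDS = ("APPLE", "PEAR", "PLUM")


def edit_distance(a, b):
    """Plain Levenshtein distance, standing in for the question's distance()."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


@lru_cache(maxsize=None)
def guess_word_cached(user_entry):
    """Return the closest word; repeated inputs are served from the cache."""
    entry = user_entry.upper().strip()
    return min(WORDS, key=lambda word: edit_distance(entry, word))


print(guess_word_cached("aple"))             # computed against every word
print(guess_word_cached("aple"))             # second call hits the cache
print(guess_word_cached.cache_info().hits)   # number of cache hits so far
```

Inside a Spark job, any such cache would exist separately in each Python worker process, so it can save repeated work within a worker but is not shared across workers or visible on the driver. Null entries would also need handling before calling `.upper()`.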