PySpark: count occurrences and create a new column with a UDF


I have a DataFrame with several columns, including video_id and tags.

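I need to create a new column in my df called ocurrencias_music, holding the number of tags that contain the string "music" as a substring. A tag does not have to be exactly "music"; it only has to contain it.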

Later, the idea is to implement a UDF, subtag_music_UDF, that returns IntegerType() and wraps the plain Python function subcadena_en_vector(tags).

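For that, I need a function called subcadena_en_vector(tags) that takes a list of strings as its argument and counts how many elements of the list contain the word "music" as a substring. I have to test it with this list: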

["a life in music", "music for life", "bso", "hans zimmer"]
The result should be 2.

Here is my attempt at the subcadena_en_vector(tags) function:

def subcadena_en_vector(tags, strToSearch):
    nTimes = 0
    for item in tags:
        #print(item.split())
        for subitem in item.split():
            if subitem==strToSearch:
                nTimes += 1

    return nTimes

if __name__ == "__main__":
  tags = ["a life in music", "music for life", "bso", "hans zimmer"]
  palabra = 'music'
  print(subcadena_en_vector(tags, palabra))
The problem with this function shows up later, in the grading section, which contains this assertion:

assert(subcadena_en_vector(["a life in music", "music for life", "bso", "hans zimmer"]) == 2)
I get the following error:

> TypeErrorTraceback (most recent call last)
> <ipython-input-3-7a51ae031d9e> in <module>()
> ----> 1 assert(subcadena_en_vector(["a life in music", "music for life", "bso", "hans zimmer"]) == 2) TypeError: subcadena_en_vector()
> takes exactly 2 arguments (1 given)
Any ideas on how to simplify the function so that it counts the occurrences without the argument error?


Thanks in advance.

I finally solved it like this:

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window

subtag_music = ["a life in music", "music for life", "bso", "hans zimmer"]

def subcadena_en_vector(tags):
    return(sum([1 for c in tags if "music" in c]))

print(subcadena_en_vector(subtag_music))

subtag_music_UDF = F.udf(subcadena_en_vector, T.IntegerType())
videosOcurrenciasMusicDF = videosDiasViralDF.withColumn("ocurrencias_music", subtag_music_UDF(F.col("tags")))

Thanks
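One possible hardening of the one-argument function, offered as a sketch rather than as part of the original answer: Spark passes None to a Python UDF when the column value is null, and iterating over None raises a TypeError. The version below assumes that tags arrives as a list of strings (or None); the name subcadena_en_vector_safe is hypothetical.

```python
def subcadena_en_vector_safe(tags):
    """Count the tags that contain "music" as a substring, tolerating nulls."""
    if tags is None:  # null column value: nothing to count
        return 0
    # Skip null elements inside the list as well
    return sum(1 for tag in tags if tag is not None and "music" in tag)


print(subcadena_en_vector_safe(["a life in music", "music for life", "bso", "hans zimmer"]))
print(subcadena_en_vector_safe(None))
```

It could be wrapped with F.udf(subcadena_en_vector_safe, T.IntegerType()) in the same way as the function in the answer.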

You need a string to match against, so why don't you pass that string when you call the assert?
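Following the comment above, one option is to keep the search string as a parameter but give it a default value, so that both the one-argument assert and a two-argument call work. This is only a sketch: the default "music" is taken from the question, and it checks with in rather than comparing split words with ==, so that a tag such as "musical theatre" also counts.

```python
def subcadena_en_vector(tags, strToSearch="music"):
    # Count the tags that contain strToSearch anywhere in the string
    return sum(1 for item in tags if strToSearch in item)


tags = ["a life in music", "music for life", "bso", "hans zimmer"]
print(subcadena_en_vector(tags))          # one-argument call, as in the assert
print(subcadena_en_vector(tags, "life"))  # explicit search string
```

With a default value, the grading assert from the question no longer raises the "takes exactly 2 arguments" TypeError.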