PySpark: count occurrences and create a new column with a UDF
Tags: pyspark, split, count, user-defined-functions, assert

I have a dataframe with several columns, including video_id and tags. I need to create a new column in my df called ocurrencias_music holding the number of occurrences of the string "music" as a substring of any tag. A tag does not have to be exactly equal to "music"; it only has to contain it as a substring.

Later, the idea is to implement a UDF, subtag_music_UDF, that returns IntegerType() and wraps a plain Python function subcadena_en_vector(tags).

For this I need a function called subcadena_en_vector(tags) that receives a list of strings as its argument and checks how many elements of the list contain the word "music" as a substring. I have to test that it runs with this list:

["a life in music", "music for life", "bso", "hans zimmer"]

and that the result is 2.

I have a draft of the subcadena_en_vector(tags) function:
def subcadena_en_vector(tags, strToSearch):
    nTimes = 0
    for item in tags:
        # print(item.split())
        for subitem in item.split():
            if subitem == strToSearch:
                nTimes += 1
    return nTimes
if __name__ == "__main__":
    tags = ["a life in music", "music for life", "bso", "hans zimmer"]
    palabra = 'music'
    print(subcadena_en_vector(tags, palabra))
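One way to keep the two-parameter design and still satisfy a one-argument call (which is what the grading assert makes) is to give strToSearch a default value. This is a sketch of that option, not part of the original submission:

```python
def subcadena_en_vector(tags, strToSearch="music"):
    # Count whole words equal to strToSearch across all tags.
    nTimes = 0
    for item in tags:
        for subitem in item.split():
            if subitem == strToSearch:
                nTimes += 1
    return nTimes

tags = ["a life in music", "music for life", "bso", "hans zimmer"]
print(subcadena_en_vector(tags))  # 2 -- the one-argument call now works
```

With the default in place, both subcadena_en_vector(tags) and subcadena_en_vector(tags, "music") are valid calls.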
The problem with this function is that later, in the grading section, which contains this assert:

assert(subcadena_en_vector(["a life in music", "music for life", "bso", "hans zimmer"]) == 2)

I get the following error:
> TypeError Traceback (most recent call last)
> <ipython-input-3-7a51ae031d9e> in <module>()
> ----> 1 assert(subcadena_en_vector(["a life in music", "music for life", "bso", "hans zimmer"]) == 2)
> TypeError: subcadena_en_vector() takes exactly 2 arguments (1 given)
Any ideas on how to simplify the function so that it counts the occurrences without the argument error?

Thanks in advance.

I finally solved it by doing this:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window

subtag_music = ["a life in music", "music for life", "bso", "hans zimmer"]

def subcadena_en_vector(tags):
    return sum([1 for c in tags if "music" in c])

print(subcadena_en_vector(subtag_music))

subtag_music_UDF = F.udf(subcadena_en_vector, T.IntegerType())
videosOcurrenciasMusicDF = videosDiasViralDF.withColumn("ocurrencias_music", subtag_music_UDF(F.col("tags")))
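Note that this fix uses substring matching ("music" in c), while the original draft split each tag into words and compared them for equality. The two approaches agree on the test list but diverge on tags such as "musical instruments". A small comparison, using a made-up tag list:

```python
def count_substring(tags, needle="music"):
    # Substring match: "musical" contains "music", so it counts.
    return sum(1 for c in tags if needle in c)

def count_word(tags, needle="music"):
    # Whole-word match: "musical" is not equal to "music", so it does not count.
    return sum(1 for item in tags for w in item.split() if w == needle)

tags = ["a life in music", "music for life", "bso", "musical instruments"]
print(count_substring(tags))  # 3
print(count_word(tags))       # 2
```

Since the requirement says "music" only has to appear as a substring of a tag, the substring version is the one that matches the assignment.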
Thanks!

You need a string to match, so why not pass that string when the assert runs?
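That suggestion, passing the search string explicitly in the assert, might look like this sketch; note it only helps if you control the test, since the graded assert passes a single argument:

```python
def subcadena_en_vector(tags, strToSearch):
    # Count elements that contain strToSearch as a substring.
    return sum(1 for c in tags if strToSearch in c)

# Pass the search string explicitly, as suggested:
assert subcadena_en_vector(
    ["a life in music", "music for life", "bso", "hans zimmer"], "music") == 2
print("assert passed")
```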