
Python: How to compute TF-IDF for groups of a DataFrame using PySpark

My question is the same as this one, but I am using PySpark and that question has no solution.

My DataFrame df looks like the following, where id_2 denotes the document id and id_1 denotes the corpus the document belongs to:

+------+-------+--------------------+
|  id_1|   id_2|              tokens|
+------+-------+--------------------+
|122720| 139936|[front, offic, op...|
|122720| 139935|[front, offic, op...|
|122720| 126854|[great, pitch, lo...|
|122720| 139934|[front, offic, op...|
|122720| 126895|[front, offic, op...|
|122726| 139943|[challeng, custom...|
|122726| 139944|[custom, servic, ...|
|122726| 139946|[empowerment, chapt...|
|122726| 139945|[problem, solv, c...|
|122726| 761272|[deliv, excel, gu...|
|122728| 131068|[assign, mytholog...|
|122728| 982610|[trim, compar,...|
|122779| 226646|[compar, face, to...|
|122963|1019657|[rock, tekno...|
|122964| 134344|[market, chapter,...|
|122964| 134343|[market, chapter,...|
|122965|1554436|[human, resourc, ...|
|122965|1109173|[solut, hrm...|
|122965|2328172|[right, set...|
|122965|1236259|[hrm, chapter, st...|
+------+-------+--------------------+
How can I compute the TF-IDF of the documents for each corpus?

from pyspark.ml.feature import HashingTF, IDF

hashingTF = HashingTF(inputCol='tokens', outputCol='tf')
idf = IDF(inputCol='tf', outputCol='tfidf')

tf = hashingTF.transform(df)
idfModel = idf.fit(tf)         # fits over every document at once
tfidf = idfModel.transform(tf)

For the given scenario, tf should work correctly, since it is computed per document; but fitting idf this way treats all documents as if they belonged to a single corpus.
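
To make the concern concrete, here is a tiny numeric illustration (the counts below are invented for the example) of the smoothed formula Spark's IDF uses, log((numDocs + 1) / (docFreq + 1)). A token that is common inside one corpus but rare overall gets a much higher weight from a globally fitted IDF than it should:

import math

# Assumed counts, purely for illustration: 20 documents overall,
# 5 of them in corpus 122720, and a token occurring in 4 of those 5.
n_all, n_group, doc_freq = 20, 5, 4

idf_global = math.log((n_all + 1) / (doc_freq + 1))    # ~1.44: looks informative
idf_group  = math.log((n_group + 1) / (doc_freq + 1))  # ~0.18: near-stopword in its corpus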

I ran into a similar problem, and this is my working but inefficient solution. Any ideas for further improvement would be appreciated.

from functools import reduce

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF
from pyspark.sql import DataFrame
import pyspark.sql.functions as sf

# Both stages need explicit column names; HashingTF and IDF
# have no usable defaults for this DataFrame.
hashingTF = HashingTF(inputCol='tokens', outputCol='tf')
idf = IDF(inputCol='tf', outputCol='tfidf')

pipeline = Pipeline(stages=[hashingTF, idf])

def compute_idf_in_group(df):
    # Fit and apply TF-IDF within a single corpus only.
    model = pipeline.fit(df)
    return model.transform(df)

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

resolved_groups = []
grouped_ids = [row.id_1 for row in df.select('id_1').distinct().collect()]
for group_id in grouped_ids:
    sub_df = df.filter(sf.col('id_1') == group_id)
    resolved_groups.append(compute_idf_in_group(sub_df))

final_df = unionAll(*resolved_groups)   # note the unpacking: unionAll takes *dfs
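
One possible improvement (a sketch of my own, not a tested implementation): skip MLlib entirely and compute per-corpus TF-IDF with plain DataFrame aggregations, so everything stays in a single Spark job instead of one fit() per id_1. The column names token, tf, df, idf, and tfidf below are my own; the formula is the same smoothed IDF, applied per group:

from pyspark.sql import functions as sf

# Term frequency: how often each token occurs within one document (id_1, id_2).
tf = (df.select('id_1', 'id_2', sf.explode('tokens').alias('token'))
        .groupBy('id_1', 'id_2', 'token')
        .count()
        .withColumnRenamed('count', 'tf'))

# Document frequency per corpus: in how many documents of an id_1 a token occurs.
doc_freq = (tf.groupBy('id_1', 'token')
              .agg(sf.countDistinct('id_2').alias('df')))

# Number of documents per corpus.
n_docs = df.groupBy('id_1').agg(sf.countDistinct('id_2').alias('n_docs'))

# Smoothed IDF per corpus, then TF-IDF per (document, token) pair.
tfidf = (tf.join(doc_freq, ['id_1', 'token'])
           .join(n_docs, 'id_1')
           .withColumn('idf', sf.log((sf.col('n_docs') + 1) / (sf.col('df') + 1)))
           .withColumn('tfidf', sf.col('tf') * sf.col('idf')))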


I think you'll have to build something yourself: 1. collect the list of id_1 values; 2. loop over that list, filtering df before producing a TF-IDF model for each group; 3. add each model to a dictionary (key = id_1, value = model).
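
A minimal sketch of that recipe (my own illustration; the models dict and the tf/tfidf column names are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF
import pyspark.sql.functions as sf

pipeline = Pipeline(stages=[
    HashingTF(inputCol='tokens', outputCol='tf'),
    IDF(inputCol='tf', outputCol='tfidf'),
])

models = {}  # key = id_1, value = the PipelineModel fitted on that corpus
for group_id in [row.id_1 for row in df.select('id_1').distinct().collect()]:
    sub_df = df.filter(sf.col('id_1') == group_id)
    models[group_id] = pipeline.fit(sub_df)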