Python: from TF-IDF to LDA clustering in PySpark
I am trying to cluster tweets stored in the format <key>,<listofwords>. My first step is to extract TF-IDF values for the lists of words using a DataFrame:
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)

# Define data frame schema
fields = [StructField('key', StringType(), False),
          StructField('content', StringType(), False)]
schema = StructType(fields)

# Data in format <key>,<listofwords>
file_temp = file.map(lambda l: l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)

# Extract TF-IDF, following https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures', outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)
But now I have not found a good way to convert my DataFrame into the format suggested by the earlier examples. I would be very grateful if someone could point me in the right direction, or correct me if my approach is wrong. Extracting TF-IDF vectors from a collection of documents and clustering them seems like it should be a classic task, but I have not found a simple way to do it.

LDA expects (id, features) as input, so assuming a KeyIndex column serves as the id:
from pyspark.mllib.clustering import LDA

k = ...  # number of clusters
# Note: `col` must be imported from pyspark.sql.functions;
# on Spark 2.x, DataFrame.map is gone, so use .rdd.map(list) instead.
corpus = indexed_data.select(col("KeyIndex").cast("long"), "features").map(list)
model = LDA.train(corpus, k=k)
LDA does not take a TF-IDF matrix as input. Instead, it only accepts a TF (term-frequency) matrix. For example:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.ml.clustering import LDA

tokenizer = Tokenizer(inputCol="hashTagDocument", outputCol="words")
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                    stopWords=stopwords)
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=40000, minDF=5)
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(corpus)
pipelineModel.stages
Indeed, that is exactly what I am trying to solve, but casting the column seems to fail every time:

input_c = input.withColumn("keyindx2", input['Index']).cast("int")

and input_c.take(1) raises a timeout error, pointing me into /usr/lib/python2.7/socket.pyc in read(self, size):

                # fragmentation issues on many platforms.
    379         try:
--> 380             data = self._sock.recv(left)
    381         except error, e:
    382             if e.args[0] == EINTR:

Is this not the correct way to cast a column?

There is a lot going on here. If simple data transformations do not work there, you have more serious problems. Please prepare an MCVE and ask it as a separate question.

I am facing the same problem; it gives me this error: NameError: name 'col' is not defined. My code is the same as yours. How did you solve it? Please mention your Spark version.

1. LDA is not initialized. 2. stopwords: where do they come from? Please check.