Machine learning: pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist'


I am trying to do topic modelling and sentiment analysis on text data through Spark NLP. I have performed all the pre-processing steps on the dataset, but I get an error in LDA.

The program is:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, CountVectorizer, IDF, Normalizer
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col, lit, concat, regexp_replace, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.utils import AnalysisException

dataframe_new = spark.read.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true') \
.load('/home/cdh@psnet.com/Gourav/chap3/abcnews-date-text.csv')

get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)

remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)

counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)

idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)

lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)
Schema of the dataframe after IDF:
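
(The schema screenshot is not reproduced here; based on the transformations above, outcome.printSchema() should look roughly as follows - the publish_date column is an assumption about the source CSV:)

root
 |-- publish_date: integer (nullable = true)
 |-- headline_text: string (nullable = true)
 |-- get_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- row: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- get_features: vector (nullable = true)
 |-- get_idf_feature: vector (nullable = true)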

According to the documentation, LDA includes a featuresCol argument, with default value featuresCol='features', i.e. the name of the column that holds the actual features; according to the schema you show, no such column exists in your dataframe, so the error you get is expected.
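
You can confirm that default straight from the estimator itself; a quick check (not part of the original question) along these lines:

from pyspark.ml.clustering import LDA

# explainParam() prints the parameter's doc string together with its default value
print(LDA().explainParam("featuresCol"))
# should report something like: featuresCol: features column name. (default: features)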

It is not clear which column of your dataframe holds the features - get_features or get_idf_feature (they look identical in the sample you show); assuming it is get_idf_feature, you should change the LDA call to:

lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)
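
With that single change the rest of the workflow should go through; a minimal continuation (not part of the original answer), assuming get_idf_feature is indeed the intended input:

model = lda.fit(outcome)

# each row of describeTopics() lists the top term indices and weights per topic
model.describeTopics(5).show(truncate=False)

# transform() adds a topicDistribution vector column to the dataframe
topics_per_doc = model.transform(outcome)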
The Spark (including PySpark) ML API has a quite different logic from scikit-learn and similar frameworks; one of the differences is that the features must all be in a single column of the respective dataframe. For a general demonstration of the idea, see my own answer on the subject (it is about K-Means, but the logic is identical); a short sketch of the same point follows below.
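
To illustrate that point here as well (a minimal sketch with made-up column names, not taken from the linked answer): with several plain numeric columns you would first assemble them into one vector column, which estimators such as KMeans or LDA then read via featuresCol:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# toy dataframe with three separate numeric columns (hypothetical data)
df = spark.createDataFrame(
    [(0, 1.0, 0.5, 3.2), (1, 2.0, 1.5, 0.7), (2, 0.1, 4.0, 2.2)],
    ["id", "x1", "x2", "x3"],
)

# combine the individual columns into the single 'features' vector column
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
assembled = assembler.transform(df)

# KMeans, like LDA, reads its input from featuresCol (default 'features')
kmeans_model = KMeans(k=2, seed=1).fit(assembled)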