PySpark ML Pipeline error when fitting the pipeline in Python

Tags: python, pyspark, pipeline, apache-spark-ml

The first time I built a pipeline from three pyspark.ml.feature stages (tokenizer, CV, idf), everything ran fine, but on the second attempt it gives me Py4JJavaError: An error occurred while calling o175.fit. Does anyone know what causes this error?

    import findspark
    findspark.init()
    import pyspark.sql.types as typ
    import pyspark as ps
    from pyspark.sql import SparkSession
    import pandas as pd
    import numpy as np
    import warnings
    from pyspark.sql import SQLContext

    sparkSession = SparkSession.builder \
        .master("local[2]") \
        .appName("Pyspark Sentiment") \
        .getOrCreate()

    df = sparkSession.read.load('data/Microblog_Trialdata.csv',
                                format='com.databricks.spark.csv',
                                header='true',
                                inferSchema='true')
    df = df.select("sentiment score", "spans")
    (train_set, val_set, test_set) = df.randomSplit([0.6, 0.2, 0.2], seed=42)

    from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml import Pipeline

    tokenizer = Tokenizer(inputCol="spans", outputCol="words")
    CV = CountVectorizer(vocabSize=2**11, inputCol="words", outputCol='cv_')
    idf = IDF(inputCol='cv_', outputCol="features", minDocFreq=5)  # minDocFreq: remove sparse terms

    #model = CV.fit(data)
    #vo = model.vocabulary
    #print(type(vo))
    pipeline = Pipeline(stages=[tokenizer, CV, idf])

    pipelineFit = pipeline.fit(train_set)
    train_df = pipelineFit.transform(train_set)
    val_df = pipelineFit.transform(val_set)
    train_df.select("cv_").show(5, truncate=False)
    train_df.show(5)

Words that appear in the val set but were never seen in the train set can cause this error. CountVectorizer has a handleInvalid option to deal with it.

    # this will ignore unseen words
    CV = CountVectorizer(vocabSize=2**11, inputCol="words", outputCol='cv_', handleInvalid='skip')
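
The comments below dispute whether CountVectorizer actually exposes handleInvalid in every Spark release. As a quick check (a small sketch of my own, not taken from the thread), you can ask the installed estimator which parameters it really has before relying on the option:

    from pyspark.ml.feature import CountVectorizer

    cv = CountVectorizer()
    # False if the installed CountVectorizer has no handleInvalid param
    print(cv.hasParam("handleInvalid"))
    # prints every parameter (with docs and defaults) this Spark version supports
    print(cv.explainParams())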

You need to provide more details about the error, but my guess is that categorical IDs not seen the first time could be causing it.
Hi Hamza, I edited the question. Sorry, I don't understand what exactly you mean by categorical IDs?
Hi jowwel, I added an answer, can you try it?
Hamza, I tried it but it doesn't work. It seems CountVectorizer has no handleInvalid parameter, and the error comes from pipeline.fit(train_set), before the pipeline is ever applied to val_set.
I ran into the same problem and solved it with this parameter. My Spark version is 2.3.1.
Can you paste the full error message?
Py4JJavaError: An error occurred while calling o49.fit: org.apache.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): org.apache.SparkException: Failed to execute user defined function ($anonfun$createTransformFunc$1: (string) => array)
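
The traceback quoted above fails inside the Tokenizer's transform function, which maps a string to an array of words; a frequent cause of that particular failure is a null (or non-string) value in the input column. As an additional thing to try, here is a minimal sketch (my own assumption, reusing the column and variable names from the question) that drops rows with a null spans value before fitting:

    from pyspark.sql.functions import col

    # keep only rows whose text column actually holds a value
    train_clean = train_set.filter(col("spans").isNotNull())
    val_clean = val_set.filter(col("spans").isNotNull())

    pipelineFit = pipeline.fit(train_clean)
    train_df = pipelineFit.transform(train_clean)
    val_df = pipelineFit.transform(val_clean)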