Python IllegalArgumentException: the input column cityIndex should have at least two distinct values


I'm trying to test-run some ML models in Databricks Community Edition. I've imported a CSV, created a Spark DataFrame, and now need to convert some string columns into compatible data types. When I run the transformations through a Pipeline, I get an error saying that cityIndex should have at least two distinct values. How do I resolve this error? Or, more generally, how do I convert strings into a format compatible with Spark ML algorithms?

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categories = ["ticket_id", "city", "issue_type", "ticket_status", "issue_description",
              "rating", "acknowledged_at", "ticket_created_date_time",
              "ticket_last_updated_date_time", "address", "lat", "lng", "location"]

stages = []
for categoricalCol in categories:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
    print(stringIndexer.getOutputCol())
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='issue_type', outputCol='label')
stages += [label_stringIdx]

assemblerInputs = [c + "classVec" for c in categories]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
Answer (from the comments): It looks like the city column has the same value for all rows; if that's the case, just don't encode that column. — Thanks, that worked!

Full error message:
IllegalArgumentException: 'requirement failed: The input column 
cityIndex should have at least two distinct values.'
IllegalArgumentException                  Traceback (most recent call last)
<command-3264374776080465> in <module>()
      1 from pyspark.ml import Pipeline
      2 pipeline = Pipeline(stages = stages)
----> 3 pipelineModel = pipeline.fit(df)
      4 df = pipelineModel.transform(df)

/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
    130                 return self.copy(params)._fit(dataset)
    131             else:
--> 132                 return self._fit(dataset)
    133         else:
    134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
    107                     dataset = stage.transform(dataset)
    108                 else:  # must be an Estimator
--> 109                     model = stage.fit(dataset)
    110                     transformers.append(model)
    111                     if i < indexOfLastEstimator: