Python IllegalArgumentException: the input column cityIndex should have at least two distinct values


I'm trying to test-run some ML models in Databricks Community Edition. I've imported a CSV, created a Spark DataFrame, and now need to convert some string columns into compatible data types. When I run the transformations through a Pipeline, I get an error saying that cityIndex should have at least two distinct values. How do I resolve this error? Or, more generally, how do I convert strings into a format compatible with Spark ML algorithms?

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categories = ["ticket_id", "city", "issue_type", "ticket_status", "issue_description",
              "rating", "acknowledged_at", "ticket_created_date_time",
              "ticket_last_updated_date_time", "address", "lat", "lng", "location"]

stages = []
for categoricalCol in categories:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
    print(stringIndexer.getOutputCol())
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='issue_type', outputCol='label')
stages += [label_stringIdx]

assemblerInputs = [c + "classVec" for c in categories]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
Answer (from the comments): It looks like the city column has the same value for all rows; if that's the case, just don't encode that column. — Thanks, that worked!

Full error message:
IllegalArgumentException: 'requirement failed: The input column 
cityIndex should have at least two distinct values.'
IllegalArgumentException                  Traceback (most recent call last)
<command-3264374776080465> in <module>()
      1 from pyspark.ml import Pipeline
      2 pipeline = Pipeline(stages = stages)
----> 3 pipelineModel = pipeline.fit(df)
      4 df = pipelineModel.transform(df)

/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
    130                 return self.copy(params)._fit(dataset)
    131             else:
--> 132                 return self._fit(dataset)
    133         else:
    134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
    107                     dataset = stage.transform(dataset)
    108                 else:  # must be an Estimator
--> 109                     model = stage.fit(dataset)
    110                     transformers.append(model)
    111                     if i < indexOfLastEstimator: