Python 3.x TypeError: unorderable types: int() < str()

Tags: python-3.x, spark-dataframe, apache-spark-mllib, tf-idf, lda

I am getting this error:

Using Python version 3.5.2+ (default, Sep 22 2016 12:18:14)
SparkSession available as 'spark'.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/main.py", line 30, in <module>
    corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
  File "/home/saria/tf27/lib/python3.5/site-packages/pyparsing.py", line 956, in col
    return 1 if 0<loc<len(s) and s[loc-1] == '\n' else loc - s.rfind("\n", 0, loc)
TypeError: unorderable types: int() < str()

Process finished with exit code 1
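In Python 3, ordering comparisons between unrelated types such as int and str always raise a TypeError (Python 2 silently allowed them). A minimal reproduction, independent of Spark:

```python
# Python 3 refuses ordering comparisons between int and str.
try:
    0 < "KeyIndex"
except TypeError as e:
    print(e)  # e.g. "unorderable types: int() < str()" on Python 3.5,
              # "'<' not supported between instances of 'int' and 'str'" on 3.6+
```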
I have reviewed these similar cases:

But those are about converting between int and string, specifically when reading user input. Here there is no input.

Code description: this code performs TF-IDF + LDA using a DataFrame.

# I used alias to avoid confusion with the mllib library
from pyparsing import col
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import HashingTF as MLHashingTF, Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.feature import IDF as MLIDF
from pyspark.python.pyspark.shell import sqlContext, sc

from pyspark.sql.types import DoubleType, StructField, StringType, StructType
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

dbURL = "hdfs://en.wikipedia.org/wiki/Music"
file = sc.textFile("1.txt")
#Define data frame schema
fields = [StructField('key',StringType(),False),StructField('content',StringType(),False)]
schema = StructType(fields)
#Data in format <key>,<listofwords>
file_temp = file.map(lambda l : l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)
#Extract TF-IDF From https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words',outputCol='rawFeatures',numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures',outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)
indexer = StringIndexer(inputCol='key',outputCol='KeyIndex')
indexed_data = indexer.fit(rescaled_data).transform(rescaled_data).drop('key').drop('content').drop('words').drop('rawFeatures')
corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
model = LDA.train(corpus, k=2)
which throws a new error:

    TypeError: col() missing 1 required positional argument: 'strg'

(pyparsing's col takes two positional arguments, loc and strg, so calling it with only one argument fails this way.)
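Both errors point at the same root cause: col is imported from pyparsing, whose col(loc, strg) computes a 1-based column number for character position loc inside string strg, instead of from pyspark.sql.functions, whose col(name) returns a DataFrame column. A pure-Python sketch, re-implementing pyparsing's col for illustration only, reproduces the first traceback:

```python
# Simplified re-implementation of pyparsing.col(loc, strg) for illustration:
# it expects loc to be an int character position inside the string strg.
def pp_col(loc, strg):
    return 1 if 0 < loc < len(strg) and strg[loc - 1] == "\n" else loc - strg.rfind("\n", 0, loc)

print(pp_col(4, "ab\ncd"))   # column of position 4 -> 2

# Passing a column *name* instead of an int position, as in col("KeyIndex", str),
# makes the chained comparison 0 < loc compare an int to a str and raises:
try:
    pp_col("KeyIndex", str)
except TypeError as e:
    print(e)  # the "int() < str()" TypeError from the traceback
```

Assuming that diagnosis, the likely fix is to import col from pyspark.sql.functions and call it with just the column name, e.g. col("KeyIndex").cast("long"). Note also (untested sketch, not verified against this setup): on Spark 2.x a DataFrame has no .map, so .rdd.map(list) would be needed, and LDA.train belongs to pyspark.mllib.clustering, whereas pyspark.ml.clustering.LDA, which is what is imported here, is fitted with .fit(...).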

Update: my main goal is to run the following code:


Comments:

- It looks like loc is a string rather than a number, hence the error. Have you tried print(repr(loc)) before the failing comparison? If so, what did it show?
- Thank you for your answer @9000, but what do you mean by loc? I don't have a loc here.
- @AnthonySottile sorry, I didn't understand what you mean :( do you mean the whole error?
- @AnthonySottile sure, I'll update the question.
- I updated the question @AnthonySottile, thanks :)
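The commenter's print(repr(loc)) suggestion is a quick way to confirm the diagnosis. A hypothetical illustration, using the value the traceback shows pyparsing's col received:

```python
# Hypothetical debugging step: repr() shows at a glance that loc is a
# string rather than the int character position the function expects.
loc = "KeyIndex"             # the value passed as loc in this traceback
print(repr(loc))             # 'KeyIndex'
print(isinstance(loc, int))  # False
```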
(the same code as the block shown above)