Python: Cannot create a DataFrame from an RDD


I'm trying to build a recommender system from this Kaggle dataset: f7a1f242-c

The file is named: "user_artist_data_small.txt"

The data looks like this:

1059637 1000010 238
1059637 1000049 1
1059637 1000056 1
1059637 1000062 11
1059637 1000094 1

The error occurs at the third line from the end of the code:

!pip install pyspark==3.0.1 py4j==0.10.9

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType
from pyspark.sql.functions import col
from pyspark.ml.recommendation import ALS
from google.colab import drive

drive.mount('/content/gdrive')

appName = "Collaborative Filtering with PySpark"
spark = SparkSession.builder.appName(appName).getOrCreate()
sc = spark.sparkContext

# Load the raw text file and split each line into [userId, artistId, playCount]
userArtistData1 = sc.textFile("/content/gdrive/My Drive/data/user_artist_data_small.txt")

schema_user_artist = StructType([
    StructField("userId", StringType(), True),
    StructField("artistId", StringType(), True),
    StructField("playCount", StringType(), True),
])

userArtistRDD = userArtistData1.map(lambda k: k.split())

user_artist_df = spark.createDataFrame(userArtistRDD, schema_user_artist)

ua = user_artist_df.alias('ua')

# Split the data and train the model
(training, test) = ua.randomSplit([0.8, 0.2])
als = ALS(maxIter=5, implicitPrefs=True, userCol="userId", itemCol="artistId",
          ratingCol="playCount", coldStartStrategy="drop")
model = als.fit(training)

# Predict using the test dataset
predictions = model.transform(test)
predictions.show()
The error is:

IllegalArgumentException: requirement failed: Column userId must be of type numeric but was actually of type string.
So I changed the types in the schema from StringType to IntegerType, and got this error instead:

TypeError: field userId: IntegerType can not accept object '1059637' in type <class 'str'>

That number happens to be the first item in the dataset. The schema rejects it because split() yields str values, so every row in the RDD still holds strings. How do I fix this?

Just create the DataFrame with the CSV reader (using a space separator) instead of going through an RDD:

user_artist_df = spark.read.schema(schema_user_artist).csv('/content/gdrive/My Drive/data/user_artist_data_small.txt', sep=' ')
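Note that the IllegalArgumentException will come back if the schema still declares the columns as StringType: ALS requires numeric userCol, itemCol and ratingCol. A minimal sketch of the full flow with an integer schema, assuming the same file path and column names as in the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("Collaborative Filtering with PySpark").getOrCreate()

# Declare the fields as integers so the CSV reader parses them
# and ALS accepts the columns directly
schema_user_artist = StructType([
    StructField("userId", IntegerType(), True),
    StructField("artistId", IntegerType(), True),
    StructField("playCount", IntegerType(), True),
])

user_artist_df = spark.read.schema(schema_user_artist) \
    .csv('/content/gdrive/My Drive/data/user_artist_data_small.txt', sep=' ')

(training, test) = user_artist_df.randomSplit([0.8, 0.2])
als = ALS(maxIter=5, implicitPrefs=True, userCol="userId", itemCol="artistId",
          ratingCol="playCount", coldStartStrategy="drop")
model = als.fit(training)
model.transform(test).show()

And if you do want to keep the RDD route from the question, the TypeError goes away once you convert the split strings to ints yourself, since createDataFrame verifies each value against the schema. A sketch under the same assumptions:

# Cast each whitespace-separated token to int before building the DataFrame
userArtistRDD = userArtistData1.map(lambda k: [int(x) for x in k.split()])
user_artist_df = spark.createDataFrame(userArtistRDD, schema_user_artist)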