Apache Spark: Pyspark error "'split' is not in list" when calling the split() function

Tags: apache-spark, pyspark, apache-spark-sql, pyspark-sql

I created a DataFrame as shown below:

from pyspark.sql import SparkSession

# Read the file as plain text: each line becomes one Row with a single 'value' column
spark = SparkSession.builder.appName("test").getOrCreate()
categories = spark.read.text("resources/textFile/categories")
categories.show(n=2)
+------------+
|       value|
+------------+
|1,2,Football|
|  2,2,Soccer|
+------------+
only showing top 2 rows
Now I convert this DataFrame to an RDD and try to split each row of the RDD on ',' (comma).
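The failing line itself is not preserved in this copy of the question; judging from the traceback and the fix below, it was presumably calling split() directly on each Row, something like:

crdd = categories.rdd.map(lambda line: line.split(',')[1])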

When building the crdd RDD with the element at position 1 of each split row, I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Users\Downloads\bigdataSetup\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1504, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'split' is not in list
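The traceback points at the root cause: spark.read.text returns Rows with a single column named value, and Row.__getattr__ resolves attribute access by looking the name up in the Row's field list, so line.split is treated as a lookup for a (nonexistent) column called split. A minimal sketch of that behavior, runnable without a cluster:

from pyspark.sql import Row

row = Row(value="1,2,Football")  # shape of each record produced by spark.read.text
print(row.value.split(',')[1])   # works: 'value' is a real field -> prints 2
row.split(',')                   # Row.__getattr__ searches the field list for 'split'
                                 # -> ValueError: 'split' is not in list on Spark 2.2.x
                                 #    (newer Spark versions raise AttributeError instead)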

Note: the data is in CSV format here just to make it easy to reproduce.

Since your data is in CSV format, you can use the read.csv API:

categories = spark.read.csv("resources/textFile/categories")
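With read.csv, each comma-separated field already lands in its own column (auto-named _c0, _c1, _c2 when no header is given), so the second field can be selected without any manual splitting, for example:

categories = spark.read.csv("resources/textFile/categories")
categories.select("_c1").show(n=2)  # _c1 is the auto-generated name of the second column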
Alternatively, modify your code as follows:

crdd = categories.rdd.map(lambda line: line.value.split(',')[1])

for i in crdd.take(10): print (i)
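For the two sample rows shown, this prints the second comma-separated field of each line:

2
2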

I know there is a CSV module; the CSV format here is just for ease of reproduction. I am more interested in fixing this issue.

See the code added to resolve the issue. value is the column name of the DataFrame.

That means map is receiving Rows, and then you need to do: >>> categories.rdd.map(lambda x: x[0].split(',')[1]).take(3)