
Pyspark: Is there a way to load more than 255 columns into a Spark DataFrame?


I am trying to load Site Catalyst data, which has close to 1000 columns. Below is the code I am using:

    from pyspark.sql.types import *
    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)
    omni_rdd = sc.textFile('hdfs://user/temp/sitecatalyst20170101.gz')
    # the Site Catalyst export is tab-delimited
    omni_rdd_delim = omni_rdd.map(lambda line: line.split("\t"))
    omni_df = omni_rdd_delim.map(lambda line: Row(
          col_1 = line[0]
        , col_2 = line[1]
        , col_3 = line[2]
        , ..
        , ..
        , col_999 = line[998]
    )).toDF()
I ran into the following error:

  File "<stdin>", line 2
  SyntaxError: more than 255 arguments
Is there a way to load all 1000 columns into my DataFrame?

- You can do it this way. The SyntaxError comes from Python's limit of 255 arguments per function call (lifted in Python 3.7), so avoid passing every column to Row() as a keyword argument. Instead, define a list with the column names:

    cols = ['col_0', 'col_1', 'col_2', .........., 'col_999']
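
Since the names follow a single pattern, the list can also be generated instead of typed out by hand (a sketch, assuming the columns really are named col_0 through col_999):

    # build the 1000 column names programmatically; adjust the pattern
    # if your real headers differ
    cols = ['col_{}'.format(i) for i in range(1000)]
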
Then use it when creating the DataFrame:

    omni_rdd = sc.textFile('hdfs://user/temp/sitecatalyst20170101.gz')
    # split on "\t" to match the tab-delimited source from the question
    omni_rdd_delim = omni_rdd.map(lambda line: line.split("\t"))
    omni_df = omni_rdd_delim.toDF(cols)
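
A quick sanity check after the load (a sketch, using only the variables from the snippet above):

    # confirm that all 1000 columns arrived with the expected names
    print(len(omni_df.columns))   # expect 1000
    omni_df.printSchema()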

This causes the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/spark/python/pyspark/sql/context.py", line 64, in toDF
        return sqlContext.createDataFrame(self, schema, sampleRatio)
      File "/opt/spark/python/pyspark/sql/context.py", line 423, in createDataFrame
        rdd, schema = self._createFromRDD(data, schema, samplingRatio)
      File "/opt/spark/python/pyspark/sql/context.py", line 315, in _createFromRDD
        struct.fields[i].name = name
    IndexError: list index out of range

Any idea how to correct this??
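
A likely cause, judging from the traceback: toDF(cols) renames the fields Spark inferred from the RDD, and the IndexError at struct.fields[i].name = name fires when a row splits into fewer fields than len(cols). A minimal check, reusing omni_rdd_delim from the snippet above:

    # list the distinct field counts per row; anything other than [1000]
    # means cols and the data are out of step (wrong delimiter, a header
    # line, or ragged rows)
    print(omni_rdd_delim.map(len).distinct().collect())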