Python: Why does PySpark report a KeyError on a column name even though the column exists in the DataFrame?
After reading parquet files from HDFS, I am trying to save the DataFrame into Snowflake as shown below. Before loading it into the Snowflake table, an ID column holding sequence numbers is added to it:
df = spark.read.parquet('file_path')
schema = StructType([StructField("ID", LongType(), True)] + df.schema.fields[:])
data_rdd = df.rdd.zipWithIndex()
new_rdd = data_rdd.map(lambda row: (row[1],) + tuple(row[0].asDict()[c] for c in file_schema.fieldNames()[:-1]))
final_df = spark.createDataFrame(new_rdd, schema)
print(final_df.printSchema())
final_df.show()
When I submit the job, I can see the DataFrame's schema as follows:
root
|-- ID: long (nullable = true)
|-- COL1: string (nullable = true)
|-- COL2: string (nullable = true)
|-- COL3: string (nullable = true)
|-- COL4: string (nullable = true)
|-- COLn: string (nullable = true)
But the error occurs at the final_df.show() line:
None
Traceback (most recent call last):
  File "autocheck.py", line 66, in <module>
    if read_and_load_parquet_files(file_path):
  File "autocheck.py", line 42, in read_and_load_parquet_files
    final_df.show()
  File "/opt/hadoop/data/08/hadoop/yarn/local/usercache/hdfstest/appcache/application_1603175231393_0446/container_e500_1603175231393_0446_02_000001/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
  File "/opt/hadoop/data/08/hadoop/yarn/local/usercache/hdfstest/appcache/application_1603175231393_0446/container_e500_1603175231393_0446_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/hadoop/data/08/hadoop/yarn/local/usercache/hdfstest/appcache/application_1603175231393_0446/container_e500_1603175231393_0446_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/hadoop/data/08/hadoop/yarn/local/usercache/hdfstest/appcache/application_1603175231393_0446/container_e500_1603175231393_0446_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o131.showString.
  File "autocheck.py", line 38, in <lambda>
  File "autocheck.py", line 38, in <genexpr>
KeyError: 'ID'
Line 38 in the code is new_rdd = data_rdd.map(lambda row: (row[1],) + tuple(row[0].asDict()[c] for c in file_schema.fieldNames()[:-1])).
I don't understand what I should do here. I am adding the ID column to the existing rows inside the lambda function, yet I still see KeyError: 'ID'.
Can anyone tell me what mistake I am making here and how to correct it?
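A likely explanation, sketched in plain Python with illustrative names (no Spark required): the new schema prepends ID as the FIRST field, so slicing off the last name with fieldNames()[:-1] still leaves 'ID' in the iterated list, while asDict() on a row from the original parquet data has no 'ID' key. Assuming file_schema was built the same way as schema in the question, that would reproduce exactly this KeyError:

```python
# Field names of the new schema: "ID" was prepended, so it is the FIRST entry.
new_field_names = ["ID", "COL1", "COL2", "COL3"]

# asDict() of an original row only contains the original columns.
row_dict = {"COL1": "a", "COL2": "b", "COL3": "c"}

# Slicing off the LAST name does not remove "ID" ...
names = new_field_names[:-1]          # ['ID', 'COL1', 'COL2']

# ... so the lookup fails with KeyError: 'ID', as in the traceback,
# and silently drops the real last column ("COL3") as well.
try:
    tuple(row_dict[c] for c in names)
except KeyError as exc:
    print("KeyError:", exc)

# Iterating over the ORIGINAL column names (equivalently, skipping the
# first entry of the new schema) avoids both problems.
values = tuple(row_dict[c] for c in new_field_names[1:])
print(values)
```

In the question's code, that would mean building the tuple from df.schema.fieldNames() (or file_schema.fieldNames()[1:]) instead of file_schema.fieldNames()[:-1].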