Pyspark Redshift: Unmatched number of columns between table and file, generated by Spark in Parquet format

The input dataset is generated by the PySpark command,

dfs_ids1.write.parquet('./outputs-parquet/i94-apr16.parquet', mode='overwrite')    
The COPY statement and the error message are

COPY staging_ids FROM 's3://sushanth-dend-capstone-files/i94-apr16.parquet/' IAM_ROLE 'arn:aws:iam::164084742828:role/dwhRole' FORMAT AS PARQUET ;

S3 Query Exception (Fetch)
DETAIL:  
  -----------------------------------------------
  error:  S3 Query Exception (Fetch)
  code:      15001
  context:   Task failed due to an internal error. Unmatched number of columns between table and file. Table columns: 23, Data columns: 22, File name: https://s3.us-west-2.amazonaws.com/sushanth-dend-capstone-files/i94-apr16.parquet/part-00000-6034cb60-860e-4a6c-a86d-6
  query:     1867
  location:  dory_util.cpp:1119
  process:   fetchtask_thread [pid=1331]
  -----------------------------------------------
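One way to confirm the column count the error is comparing against is to read the generated files back with Spark and inspect the recorded schema (a minimal sketch; the local path is the one used in the write above, and a live SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema-check").getOrCreate()

# Read the files that were just written and inspect the schema Spark recorded.
df_check = spark.read.parquet('./outputs-parquet/i94-apr16.parquet')
df_check.printSchema()

# COPY compares this count against the 23 columns of staging_ids.
print(len(df_check.columns))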
When inspecting the Parquet files with S3 "Select from", I found that for some JSON records the column occup is missing; this particular column mostly contains null or "STU".

To verify whether this column is missing, I read this Parquet file in Athena with the following parameters,

S3 path:
s3://sushanth-dend-capstone-files/i94-apr16.parquet/

Bulk add columns:
coc int, cor int, poe string, landing_state string, age int, visa_issued_in string, occup string, biryear int, gender string, airline string, admnum int, fltno string, visatype string, arrival_mode string, visit_purpose string, arrival_dt date, departure_dt date, daysinus int, added_to_i94 date, allowed_until date, entry_exit string, month int

Athena DDL

CREATE EXTERNAL TABLE IF NOT EXISTS capstone.staging_ids (
  `coc` int,
  `cor` int,
  `poe` string,
  `landing_state` string,
  `age` int,
  `visa_issued_in` string,
  `occup` string,
  `biryear` int,
  `gender` string,
  `airline` string,
  `admnum` int,
  `fltno` string,
  `visatype` string,
  `arrival_mode` string,
  `visit_purpose` string,
  `arrival_dt` date,
  `departure_dt` date,
  `daysinus` int,
  `added_to_i94` date,
  `allowed_until` date,
  `entry_exit` string,
  `month` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://sushanth-dend-capstone-files/i94-apr16.parquet/'
TBLPROPERTIES ('has_encrypted_data'='false');
When I run the query, I can see the column 'occup' with STU in some of the rows.
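The Parquet footer of each part file is what COPY actually compares against the table, so it can also be inspected directly (a sketch, assuming pyarrow and s3fs are installed and AWS credentials are configured):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Print the column count stored in each part file's footer.
for path in fs.ls('sushanth-dend-capstone-files/i94-apr16.parquet/'):
    if path.endswith('.parquet'):
        with fs.open(path, 'rb') as f:
            schema = pq.ParquetFile(f).schema_arrow
            print(path, len(schema.names), schema.names)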

The questions are,

  • How to write all the columns of the Spark dataframe into the Parquet files (see the sketch after this list)
  • How to load such files into Redshift through the COPY statement; or is Parquet not the right format for loading into Redshift
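For the first bullet, one option is to select the full, ordered column list explicitly before writing, so a mostly-null column such as occup cannot be silently dropped upstream and the write fails fast if a column is missing (a sketch; the column list is taken from the Athena DDL above):

# Columns expected by staging_ids, in table order (from the Athena DDL).
expected_cols = [
    'coc', 'cor', 'poe', 'landing_state', 'age', 'visa_issued_in',
    'occup', 'biryear', 'gender', 'airline', 'admnum', 'fltno',
    'visatype', 'arrival_mode', 'visit_purpose', 'arrival_dt',
    'departure_dt', 'daysinus', 'added_to_i94', 'allowed_until',
    'entry_exit', 'month',
]

# select() raises AnalysisException if any expected column is absent,
# instead of writing files with fewer columns than the target table.
dfs_ids1.select(expected_cols).write.parquet(
    './outputs-parquet/i94-apr16.parquet', mode='overwrite')

For the second bullet, Redshift's COPY for columnar formats maps file columns to table columns by position, so the column count and order in the Parquet files must line up with the target table.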