Apache Spark: insert overwrite row count mismatch on every second run in PySpark

I am running my ETL code in the PySpark (version 2.1.1) shell. The last few lines of the PySpark ETL code look like this:
usage_fact = usage_fact_stg.union(gtac_usage).union(gtp_usage).union(upaf_src).repartition("data_date","data_product")
usage_fact.createOrReplaceTempView("usage_fact_staging")
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
Now, after the first execution of the last line (the insert overwrite), the code runs fine and the output table (usageWideFactTable) has about 2.4 million rows, as expected.
If we execute the last line again, I get the warnings/errors shown below, and the row count of the output table (usageWideFactTable) drops to 0.84 million.
Then, if we execute the last line a third time, it surprisingly runs fine and the count of the output table (usageWideFactTable) is corrected back to 2.4 million.
On the 4th run, the warnings/errors appear again and the count(*) of the output table drops back to 0.84 million.
The above 4 runs in the PySpark shell are shown below:
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
I have also tried running the same ETL job with Oozie, and on every second Oozie run the same count mismatch appears.
The DDL of the output table (usageWideFactTable = datawarehouse.usage_fact) is shown below:
CREATE EXTERNAL TABLE `datawarehouse.usage_fact`(
`mcs_session_id` string,
`meeting_id` string,
`session_tracking_id` string,
`session_type` string,
`session_subject` string,
`session_date` string,
`session_start_time` string,
`session_end_time` string,
`session_duration` double,
`product_name` string,
`product_tier` string,
`product_version` string,
`product_build_number` string,
`native_user_id` string,
`native_participant_id` string,
`native_participant_user_id` string,
`participant_name` string,
`participant_email` string,
`participant_type` string,
`participant_start_time` timestamp,
`participant_end_time` timestamp,
`participant_duration` double,
`participant_ip` string,
`participant_city` string,
`participant_state` string,
`participant_country` string,
`participant_end_point` string,
`participant_entry_point` string,
`os_type` string,
`os_ver` string,
`os_locale` string,
`os_architecture` string,
`os_timezone` string,
`model_id` string,
`machine_address` string,
`model_name` string,
`browser` string,
`browser_version` string,
`audio_type` string,
`voip_duration` string,
`pstn_duration` string,
`webcam_duration` string,
`screen_share_duration` string,
`is_chat_used` string,
`is_screenshare_used` string,
`is_dialout_used` string,
`is_webcam_used` string,
`is_webinar_scheduled` string,
`is_webinar_deleted` string,
`is_registrationquestion_create` string,
`is_registrationquestion_modify` string,
`is_registrationquestion_delete` string,
`is_poll_created` string,
`is_poll_modified` string,
`is_poll_deleted` string,
`is_survey_created` string,
`is_survey_deleted` string,
`is_handout_uploaded` string,
`is_handout_deleted` string,
`entrypoint_access_time` string,
`endpoint_access_time` string,
`panel_connect_time` string,
`audio_connect_time` string,
`endpoint_install_time` string,
`endpoint_download_time` string,
`launcher_install_time` string,
`launcher_download_time` string,
`join_time` string,
`likely_to_recommend` string,
`rating_reason` string,
`customer_support` string,
`native_machinename_key` string,
`download_status` string,
`native_plan_key` string,
`useragent` string,
`native_connection_key` string,
`active_time` string,
`csid` string,
`arrival_time` string,
`closed_by` string,
`close_cause` string,
`viewer_ip_address` string,
`viewer_os_type` string,
`viewer_os_ver` string,
`viewer_build` string,
`native_service_account_id` string,
`license_key` string,
`session_id` string,
`session_participant_id` string,
`featureusagefactid` string,
`join_session_fact_id` string,
`responseid` string,
`sf_data_date` string,
`spf_data_date` string,
`fuf_data_date` string,
`jsf_data_date` string,
`nps_data_date` string,
`upaf_data_date` string,
`data_source_name` string,
`data_load_date_time` timestamp)
PARTITIONED BY (
`data_date` string,
`data_product` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3://saasdata/datawarehouse/fact/UsageFact/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://saasdata/datawarehouse/fact/UsageFact/'
What could be the problem, and how can it be corrected? Also, is there any other way to achieve the same result?

I suspect this question is related to your previous one, and that it boils down to the error you experienced before: "Cannot overwrite a path that is also being read from". That check exists to protect you from data loss, not to make your life harder. There are cases where Spark cannot enforce it automatically, but let me repeat the point I made before: never overwrite data, in full or in part, that is used as a source for the pipeline:
- at best, it will result in complete data loss (hopefully you have a good backup strategy);
- at worst, it will silently corrupt the data.