Apache Spark: insert overwrite data count not matching on every second run in PySpark

I am running my ETL code using the PySpark (version 2.1.1) shell.

The last few lines of the PySpark ETL code look like this:

# Combine all staged usage sources and repartition by the table's partition columns
usage_fact = usage_fact_stg.union(gtac_usage).union(gtp_usage).union(upaf_src).repartition("data_date","data_product")

# Register the combined DataFrame as a temporary view for the SQL statement below
usage_fact.createOrReplaceTempView("usage_fact_staging")

# Dynamically overwrite the target table's partitions from the view
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")

Now, after the first execution of the last line (the insert overwrite), the code runs fine and the output table (usageWideFactTable) has about 2.4 million rows, which is as expected.

If we execute the last line again, I get the warnings shown below, and the count of the output table (usageWideFactTable) drops to 0.84 million.

Likewise, if we execute the last line a third time, it surprisingly runs fine and the count of the output table (usageWideFactTable) is corrected back to 2.4 million.

On the 4th run, the warnings appear again and the count(*) of the output table drops back to 0.84 million.

These 4 runs on the PySpark shell look like this:

>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from  usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from  usage_fact_staging")
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from  usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from  usage_fact_staging")
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
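
The counts quoted above were verified between runs with a plain count query, e.g. (a minimal sketch, using the same usageWideFactTable variable):

# Check the current row count of the target table after each run
spark.sql("select count(*) from " + usageWideFactTable).show()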
I have also tried running the same ETL job with Oozie, and the count mismatch shows up on every second Oozie run as well.

The DDL of the output table (usageWideFactTable = datawarehouse.usage_fact) is as follows:

CREATE EXTERNAL TABLE `datawarehouse.usage_fact`(
`mcs_session_id` string,
`meeting_id` string,
`session_tracking_id` string,
`session_type` string,
`session_subject` string,
`session_date` string,
`session_start_time` string,
`session_end_time` string,
`session_duration` double,
`product_name` string,
`product_tier` string,
`product_version` string,
`product_build_number` string,
`native_user_id` string,
`native_participant_id` string,
`native_participant_user_id` string,
`participant_name` string,
`participant_email` string,
`participant_type` string,
`participant_start_time` timestamp,
`participant_end_time` timestamp,
`participant_duration` double,
`participant_ip` string,
`participant_city` string,
`participant_state` string,
`participant_country` string,
`participant_end_point` string,
`participant_entry_point` string,
`os_type` string,
`os_ver` string,
`os_locale` string,
`os_architecture` string,
`os_timezone` string,
`model_id` string,
`machine_address` string,
`model_name` string,
`browser` string,
`browser_version` string,
`audio_type` string,
`voip_duration` string,
`pstn_duration` string,
`webcam_duration` string,
`screen_share_duration` string,
`is_chat_used` string,
`is_screenshare_used` string,
`is_dialout_used` string,
`is_webcam_used` string,
`is_webinar_scheduled` string,
`is_webinar_deleted` string,
`is_registrationquestion_create` string,
`is_registrationquestion_modify` string,
`is_registrationquestion_delete` string,
`is_poll_created` string,
`is_poll_modified` string,
`is_poll_deleted` string,
`is_survey_created` string,
`is_survey_deleted` string,
`is_handout_uploaded` string,
`is_handout_deleted` string,
`entrypoint_access_time` string,
`endpoint_access_time` string,
`panel_connect_time` string,
`audio_connect_time` string,
`endpoint_install_time` string,
`endpoint_download_time` string,
`launcher_install_time` string,
`launcher_download_time` string,
`join_time` string,
`likely_to_recommend` string,
`rating_reason` string,
`customer_support` string,
`native_machinename_key` string,
`download_status` string,
`native_plan_key` string,
`useragent` string,
`native_connection_key` string,
`active_time` string,
`csid` string,
`arrival_time` string,
`closed_by` string,
`close_cause` string,
`viewer_ip_address` string,
`viewer_os_type` string,
`viewer_os_ver` string,
`viewer_build` string,
`native_service_account_id` string,
`license_key` string,
`session_id` string,
`session_participant_id` string,
`featureusagefactid` string,
`join_session_fact_id` string,
`responseid` string,
`sf_data_date` string,
`spf_data_date` string,
`fuf_data_date` string,
`jsf_data_date` string,
`nps_data_date` string,
`upaf_data_date` string,
`data_source_name` string,
`data_load_date_time` timestamp)
PARTITIONED BY (
`data_date` string,
`data_product` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3://saasdata/datawarehouse/fact/UsageFact/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://saasdata/datawarehouse/fact/UsageFact/'

What could the problem be, and how can it be corrected? Also, is there another way to achieve the same result?

I suspect this problem is related to your previous question and boils down to the error you have experienced before:

Cannot overwrite a path that is also being read from

It exists to protect you from data loss, not to make your life harder. In some cases Spark cannot enforce this automatically, but let me repeat the point I made before: never overwrite (in full or in part) data that is used as a source for the pipeline; a sketch of a safe pattern follows the list below.

  • At best, it will result in complete data loss (hopefully you have a good backup strategy).
  • At worst, it will silently corrupt your data.
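
In practice, the fix is to break the cycle: never let the insert overwrite read, directly or through a view, from the table it writes to. A minimal sketch, assuming an intermediate S3 location (tmp_path below is a placeholder, not part of your job), would be:

# Materialize the combined result to an intermediate location first, so the
# insert overwrite never reads from the path it is about to overwrite.
tmp_path = "s3://saasdata/tmp/usage_fact_staging/"  # placeholder path, adjust to your layout
usage_fact.write.mode("overwrite").orc(tmp_path)

# Re-read from the intermediate copy; this breaks the read-from/write-to cycle
spark.read.orc(tmp_path).createOrReplaceTempView("usage_fact_staging")

# The overwrite now only consumes the intermediate copy, never the target table
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")

Writing the union out and re-reading it costs an extra pass over the data, but it guarantees the overwrite never deletes files it is still reading.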

Thanks for the solution. Since I am new to Spark (2 days of work experience, to be exact), could you please write the exact code needed to overcome this problem? Please let me know the exact code; that would be really helpful!