Apache Spark: Hive can't find partitioned data written by Spark Structured Streaming
I have a Spark Structured Streaming job that writes data to IBM Cloud Object Storage (S3):
I can see the data using the hdfs CLI:
[clsadmin@xxxxx ~]$ hdfs dfs -ls s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0 | head
Found 616 items
-rw-rw-rw- 1 clsadmin clsadmin 38085 2018-09-25 01:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 45874 2018-09-25 00:31 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-28ff873e-8a9c-4128-9188-c7b763c5b4ae.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 5124 2018-09-25 01:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-5f768960-4b29-4bce-8f31-2ca9f0d42cb5.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 40154 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-70abc027-1f88-4259-a223-21c4153e2a85.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 41282 2018-09-25 00:50 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-873a1caa-3ecc-424a-8b7c-0b2dc1885de4.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 41241 2018-09-25 00:40 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-88b617bf-e35c-4f24-acec-274497b1fd31.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 3114 2018-09-25 00:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-deae2a19-1719-4dfa-afb6-33b57f2d73bb.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 38877 2018-09-25 00:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-e07429a2-43dc-4e5b-8fe7-c55ec68783b3.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 39060 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00001-1553da20-14d0-4c06-ae87-45d22914edba.c000.snappy.parquet
However, when I try to query the data, no rows are returned:
hive> select * from invoiceitems limit 5;
OK
Time taken: 2.392 seconds
My table DDL looks like this:
CREATE EXTERNAL TABLE `invoiceitems`(
`invoiceno` int,
`stockcode` int,
`description` string,
`quantity` int,
`invoicedate` bigint,
`unitprice` double,
`customerid` int,
`country` string,
`lineno` int,
`invoicetime` string,
`storeid` int,
`transactionid` string,
`invoicedatestring` string)
PARTITIONED BY (
`invoiceyear` int,
`invoicemonth` int,
`invoiceday` int,
`invoicehour` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://streaming-data-landing-zone-partitioned/data'
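I also tried using the exact case of the column and partition names, but that did not work either.
Any idea why my query can't find the data?
Update 1: I have tried setting the location to a directory that contains data without partition subdirectories, and that still does not work, so I wonder whether this is a data format problem: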
CREATE EXTERNAL TABLE `invoiceitems`(
`InvoiceNo` int,
`StockCode` int,
`Description` string,
`Quantity` int,
`InvoiceDate` bigint,
`UnitPrice` double,
`CustomerID` int,
`Country` string,
`LineNo` int,
`InvoiceTime` string,
`StoreID` int,
`TransactionID` string,
`InvoiceDateString` string)
PARTITIONED BY (
`InvoiceYear` int,
`InvoiceMonth` int,
`InvoiceDay` int,
`InvoiceHour` int)
STORED AS PARQUET
LOCATION
's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/';
hive> Select * from invoiceitems limit 5;
OK
Time taken: 2.066 seconds
Reading Snappy-compressed Parquet files
The data is in Snappy-compressed Parquet format:
s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
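Therefore, set the 'PARQUET.COMPRESS'='SNAPPY' table property in the CREATE TABLE DDL statement. Alternatively, you can set parquet.compression=SNAPPY in the "Custom hive-site settings" section in Ambari for IOP or HDP.
Here is an example of using the table property in a Hive CREATE TABLE statement: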
hive> CREATE TABLE inv_hive_parquet(
trans_id int, product varchar(50), trans_dt date
)
PARTITIONED BY (
year int)
STORED AS PARQUET
TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY');
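Updating partition metadata in external tables
In addition, for an external partitioned table, the partition metadata must be updated whenever an external job (the Spark job in this case) writes partitions directly to the data folder, because Hive is not aware of those partitions unless they are explicitly registered.
This can be done with either of the following: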
ALTER TABLE inv_hive_parquet RECOVER PARTITIONS;
-- or
MSCK REPAIR TABLE inv_hive_parquet;
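If automatic discovery does not pick the partitions up, another option is to register each partition explicitly with Hive's ALTER TABLE ... ADD PARTITION ... LOCATION statement. The following is only a minimal sketch, not part of the original answer: the helper names `partition_spec` and `add_partition_statement` are hypothetical, and it assumes integer-valued partition columns (as in the `invoiceitems` DDL) and the `name=value` directory layout shown in the listing. It only builds the statement text; running it in Hive is left to the reader.

```python
import re

# One "name=value" directory segment, e.g. "InvoiceYear=2018".
_SEGMENT = re.compile(r"^([^=/]+)=([^=/]+)$")


def partition_spec(path, table_location):
    """Parse the partition directories found below table_location.

    Returns (spec, partition_dir), where spec is a list of
    (lower-cased column name, value) pairs and partition_dir is the
    full path of the partition directory.
    """
    base = table_location.rstrip("/") + "/"
    if not path.startswith(base):
        raise ValueError("path is not under the table location")
    spec, dirs = [], []
    for segment in path[len(base):].split("/"):
        match = _SEGMENT.match(segment)
        if match is None:
            break  # reached the data file name (or an empty segment)
        name, value = match.groups()
        if not value.isdigit():
            # Assumption of this sketch: partition values are integers.
            raise ValueError("non-integer partition value: " + segment)
        spec.append((name.lower(), value))
        dirs.append(segment)
    if not spec:
        raise ValueError("no partition directories found in path")
    return spec, base + "/".join(dirs)


def add_partition_statement(table, path, table_location):
    """Build an ALTER TABLE ... ADD PARTITION statement for one path."""
    spec, partition_dir = partition_spec(path, table_location)
    columns = ", ".join(name + "=" + value for name, value in spec)
    return ("ALTER TABLE " + table + " ADD IF NOT EXISTS PARTITION ("
            + columns + ") LOCATION '" + partition_dir + "';")


if __name__ == "__main__":
    print(add_partition_statement(
        "invoiceitems",
        "s3a://streaming-data-landing-zone-partitioned/data/"
        "InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/"
        "part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet",
        "s3a://streaming-data-landing-zone-partitioned/data",
    ))
```

The column names are lower-cased because the first DDL in the question declares them in lower case, while the LOCATION keeps the directory names exactly as they were written to object storage.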
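Do you get any error messages in the Hive logs when you run the select?
@vishad Unfortunately, there are no error messages in the Hive logs.
Setting 'PARQUET.COMPRESS'='SNAPPY' in the DDL did not work.
The repair needs to be run as described.
Sorry, I should have mentioned that I had already tried the MSCK REPAIR TABLE command. It returns OK, but querying invoiceitems still returns no data.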