
Performance: Unacceptably slow Hive queries


I'm running Hive 0.14 on an HDP 2 cluster. My dataset was built with the Kite SDK and registered with Hive as an external table.
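
For reference (this DDL is not from the original post; Kite normally issues the equivalent registration itself), an external Avro table over an HDFS directory would look roughly like this, with a hypothetical LOCATION path:

CREATE EXTERNAL TABLE hivetweets (
  created_at            BIGINT,
  id                    BIGINT,
  in_reply_to_user_id   BIGINT,
  in_reply_to_status_id BIGINT,
  lang                  STRING,
  text                  STRING,
  retweet_count         INT
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS AVRO                     -- native Avro SerDe, supported in Hive 0.14
LOCATION '/data/hivetweets';       -- hypothetical HDFS path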

See my table layout below:

hive> describe hivetweets;
OK
created_at              bigint                  from deserializer
id                      bigint                  from deserializer
in_reply_to_user_id     bigint                  from deserializer
in_reply_to_status_id   bigint                  from deserializer
lang                    string                  from deserializer
text                    string                  from deserializer
retweet_count           int                     from deserializer
year                    int                     Partition column derived from 'created_at' column, generated by Kite.
month                   int                     Partition column derived from 'created_at' column, generated by Kite.
day                     int                     Partition column derived from 'created_at' column, generated by Kite.
hour                    int                     Partition column derived from 'created_at' column, generated by Kite.

# Partition Information
# col_name              data_type               comment

year                    int                     Partition column derived from 'created_at' column, generated by Kite.
month                   int                     Partition column derived from 'created_at' column, generated by Kite.
day                     int                     Partition column derived from 'created_at' column, generated by Kite.
hour                    int                     Partition column derived from 'created_at' column, generated by Kite.
Time taken: 0.15 seconds, Fetched: 19 row(s)
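
Since Kite derived hour-level partitions from created_at, one quick sanity check (not part of the original post) is how many partitions the table actually has; a very large number of small hourly partitions can slow both query planning and scanning:

hive> show partitions hivetweets;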
My initial test query against this setup just fetches a single row of the dataset; I've removed the actual row output from the example:

hive> select * from hivetweets limit 1;
OK
Time taken: 103.726 seconds, Fetched: 1 row(s)
104 seconds is far too long for this query to run.
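
Two standard Hive knobs are worth checking in this situation (this is a hedged sketch, not something from the original post): whether Hive launches a full job for a trivial fetch, and whether the partition columns can narrow the scan. The partition values below are made up for illustration:

-- Let simple SELECT ... LIMIT queries run as a local fetch task instead
-- of launching a full cluster job (valid values: none / minimal / more):
set hive.fetch.task.conversion=more;

-- Prune to a single hour-level partition so only one directory is read:
select * from hivetweets
where year = 2015 and month = 7 and day = 15 and hour = 12
limit 1;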

That query may not have been running distributed at all, so I tried testing with more data:

hive> select count(*) from hivetweets limit 100000;
Query ID = root_20150715132222_81e386ef-2990-4251-a61f-82ca8da4c48d
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (Executing on YARN cluster with App id application_1436910684121_0006)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     19         19        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 567.52 s
--------------------------------------------------------------------------------
OK
197371741
Nearly ten minutes to count what I assumed would be 100,000 records is still far too long.


I'd appreciate any suggestions on how to debug this.

"This can't be diagnosed from the data you posted. Post at least the Hive query EXPLAIN and the job logs. Also make sure the Kite dataset is formatted as Parquet, not CSV or JSON."

"Does it have to be Parquet? Currently I use Avro records. I can provide the query EXPLAIN later, but I just noticed one thing: how does this query - select count(*) from hivetweets limit 100000; - return 197371741 as the result? It seems to parse and count the entire dataset?!"

"That query absolutely parses the entire dataset and then returns the count; the LIMIT is applied to the result afterwards. That's SQL."

"It's been a long time since I've done SQL, and I feel stupid now - thanks for the hint!"

"Regarding Parquet vs. Avro: Parquet is definitely well suited to Hive; it's columnar. ORC would be best, but I'm not sure Kite can support it."
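
To make both fixes from the comments concrete, here is a hedged sketch (the new table name is illustrative): to count only a sample, the LIMIT must bound the input inside a subquery rather than follow the aggregate, and a columnar copy of the table can be produced with a plain CTAS:

-- Count only the first 100000 rows: the LIMIT goes inside a subquery
-- so it bounds the input, not the single-row result of count(*).
select count(*) from (select * from hivetweets limit 100000) t;

-- One way to get a columnar copy for faster scans (hypothetical table
-- name; native STORED AS PARQUET is available from Hive 0.13 onward):
create table hivetweets_parquet stored as parquet
as select * from hivetweets;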