Hive query not working
I am trying to get the HiveQL equivalent of a MySQL query. In MySQL, I have the following table:
CREATE TABLE votes(
user_id INT UNSIGNED NOT NULL,
list_id INT UNSIGNED NOT NULL,
node_id INT UNSIGNED NOT NULL,
direction ENUM('UP', 'DOWN') NOT NULL,
PRIMARY KEY (user_id, list_id, node_id)
) ENGINE=innodb;
I created a similar table in Hive using the following:
CREATE TABLE votes (
user_id INT,
list_id INT,
node_id INT,
direction STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
I copied the 6 rows from the MySQL table into the Hive table. So in Hive, I have:
hive> SELECT * FROM votes;
OK
28 390 400058 "UP"
28 390 400059 "DOWN"
90113 390 400058 "DOWN"
90113 390 400059 "UP"
323694 390 400058 "UP"
323694 390 400059 "UP"
Time taken: 0.059 seconds, Fetched: 6 row(s)
The following statement works fine in MySQL:
SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
which outputs:
390 400058 390 400058 2 0 0 1
390 400058 390 400059 1 1 1 0
390 400059 390 400058 1 1 1 0
390 400059 390 400059 2 0 0 1
However, the same statement does not give the correct counts in Hive:
hive> SELECT v1.list_id AS lid, v1.node_id AS nid, v2.list_id AS rlid, v2.node_id AS rnid,
> SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
> SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
> SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
> SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
> FROM votes v1
> JOIN votes v2 ON v1.user_id=v2.user_id
> GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
...
Status: Finished successfully
OK
390 400058 390 400058 0 0 0 0
390 400058 390 400059 0 0 0 0
390 400059 390 400058 0 0 0 0
390 400059 390 400059 0 0 0 0
Time taken: 19.127 seconds, Fetched: 4 row(s)
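How can I fix this?
Also, I came across a post in which someone mentioned that it is best to avoid self-joins in Hive. If that is true, could you explain why, and suggest a better query to achieve what I want?
It looks like the quotes are actually part of the UP/DOWN value strings, so you need to include them in the comparisons. I can get your expected result with this Hive query: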
SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
SUM(IF(v1.direction='"UP"' AND v2.direction='"UP"', 1, 0)) AS uu,
SUM(IF(v1.direction='"UP"' AND v2.direction='"DOWN"', 1, 0)) AS ud,
SUM(IF(v1.direction='"DOWN"' AND v2.direction='"UP"', 1, 0)) AS du,
SUM(IF(v1.direction='"DOWN"' AND v2.direction='"DOWN"', 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
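To check whether the quotes really are stored in the column, one option is to list the distinct values together with their lengths. This is only a sketch against the `votes` table from the question; the aliases `len` and `n` are illustrative:

```sql
-- Sketch: list the distinct stored values and their lengths.
-- If the quotes are part of the data, the values appear as "UP" (length 4)
-- and "DOWN" (length 6) instead of UP (length 2) and DOWN (length 4).
SELECT direction,
       length(direction) AS len,
       COUNT(*)          AS n
FROM votes
GROUP BY direction, length(direction);
```

If the lengths come back as 4 and 6, a comparison against a bare `UP` or `DOWN` can never match, which would explain the all-zero counts.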
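Note that the UP/DOWN values in this query are enclosed in single quotes, so that the double quotes are interpreted as part of the value.
I suggest using the CSV SerDe when creating the table. That way, the double quotes are handled automatically in SELECT queries, because the default quote character in the CSVSerde is ":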
CREATE TABLE votes (user_id INT, list_id INT, node_id INT, direction STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = "\t") -- match your file's delimiter; the default separator is ","
STORED AS TEXTFILE;
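One caveat: OpenCSVSerde reads every column as a string, whatever types the table declares. If numeric types matter downstream, a view that casts the columns back is a possible workaround. The following is a minimal sketch; the view name `votes_typed` is hypothetical:

```sql
-- Sketch: expose the SerDe-backed table with typed columns.
-- votes_typed is an illustrative name, not part of the original answer.
CREATE VIEW votes_typed AS
SELECT CAST(user_id AS INT) AS user_id,
       CAST(list_id AS INT) AS list_id,
       CAST(node_id AS INT) AS node_id,
       direction            -- already unquoted by the SerDe
FROM votes;
```

The aggregate query itself does not need the view, since joining and grouping on the string columns gives the same groups.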
Running the SELECT query:
SELECT v1.list_id AS lid, v1.node_id AS nid,
v2.list_id AS rlid, v2.node_id AS rnid,
SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
+------+---------+-------+---------+-----+-----+-----+-----+--+
| lid | nid | rlid | rnid | uu | ud | du | dd |
+------+---------+-------+---------+-----+-----+-----+-----+--+
| 390 | 400058 | 390 | 400058 | 2 | 0 | 0 | 1 |
| 390 | 400058 | 390 | 400059 | 1 | 1 | 1 | 0 |
| 390 | 400059 | 390 | 400058 | 1 | 1 | 1 | 0 |
| 390 | 400059 | 390 | 400059 | 2 | 0 | 0 | 1 |
+------+---------+-------+---------+-----+-----+-----+-----+--+
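If recreating the table is not an option, the quotes can instead be stripped at query time. The sketch below assumes the original comma-delimited `votes` table, in which `direction` still contains the literal double quotes; the CTE name `cleaned` is illustrative:

```sql
-- Sketch: remove the literal double quotes before comparing.
-- Assumes direction holds values such as "UP" with the quotes included.
WITH cleaned AS (
  SELECT user_id, list_id, node_id,
         regexp_replace(direction, '"', '') AS direction
  FROM votes
)
SELECT v1.list_id AS lid, v1.node_id AS nid,
       v2.list_id AS rlid, v2.node_id AS rnid,
       SUM(IF(v1.direction = 'UP'   AND v2.direction = 'UP',   1, 0)) AS uu,
       SUM(IF(v1.direction = 'UP'   AND v2.direction = 'DOWN', 1, 0)) AS ud,
       SUM(IF(v1.direction = 'DOWN' AND v2.direction = 'UP',   1, 0)) AS du,
       SUM(IF(v1.direction = 'DOWN' AND v2.direction = 'DOWN', 1, 0)) AS dd
FROM cleaned v1
JOIN cleaned v2 ON v1.user_id = v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
```

This leaves the stored data untouched, at the cost of running `regexp_replace` on every row each time the query runs.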
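Exactly! I had quotes in the CSV dump, so the quotes became part of the value as well. Thx
I have added an answer that uses Hive SerDes. Please add your feedback.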