Hive query not working


I'm trying to write the HiveQL equivalent of a MySQL query.

In MySQL, I have the following table:

CREATE TABLE votes(
 user_id INT UNSIGNED NOT NULL,
 list_id INT UNSIGNED NOT NULL,
 node_id INT UNSIGNED NOT NULL,
 direction ENUM('UP', 'DOWN') NOT NULL, 
 PRIMARY KEY (user_id, list_id, node_id)
) ENGINE=innodb;
I created a similar table in Hive using:

CREATE TABLE votes (
 user_id INT,
 list_id INT,
 node_id INT,
 direction STRING
) ROW FORMAT DELIMITED  
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
I copied 6 rows from the MySQL table into the Hive table. So in Hive I have:

hive> SELECT * FROM votes;
OK
28      390     400058  "UP"
28      390     400059  "DOWN"
90113   390     400058  "DOWN"
90113   390     400059  "UP"
323694  390     400058  "UP"
323694  390     400059  "UP"
Time taken: 0.059 seconds, Fetched: 6 row(s)
The following statement works fine in MySQL:

SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id, 
SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu, 
SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud, 
SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du, 
SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
which outputs:

390 400058  390 400058  2   0   0   1
390 400058  390 400059  1   1   1   0
390 400059  390 400058  1   1   1   0
390 400059  390 400059  2   0   0   1
However, the same statement does not give the correct counts in Hive:

hive> SELECT v1.list_id AS lid, v1.node_id AS nid, v2.list_id AS rlid, v2.node_id AS rnid,
    > SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
    > SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
    > SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
    > SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
    > FROM votes v1
    > JOIN votes v2 ON v1.user_id=v2.user_id
    > GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;

...

Status: Finished successfully
OK
390     400058  390     400058  0       0       0       0
390     400058  390     400059  0       0       0       0
390     400059  390     400058  0       0       0       0
390     400059  390     400059  0       0       0       0
Time taken: 19.127 seconds, Fetched: 4 row(s)
How can I fix this?


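Also, I came across a post where someone mentioned that self-joins are best avoided in Hive. If that is true, could you explain why, and suggest a better query for what I'm trying to get?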

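It looks like the quotes are actually part of the UP/DOWN string values, so you need to include them in the comparisons. I was able to get your expected result with this Hive query: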

SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
  SUM(IF(v1.direction='"UP"' AND v2.direction='"UP"', 1, 0)) AS uu,
  SUM(IF(v1.direction='"UP"' AND v2.direction='"DOWN"', 1, 0)) AS ud,
  SUM(IF(v1.direction='"DOWN"' AND v2.direction='"UP"', 1, 0)) AS du,
  SUM(IF(v1.direction='"DOWN"' AND v2.direction='"DOWN"', 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
Note that the UP/DOWN values are now wrapped in single quotes, so that the double quotes are interpreted as part of the value.
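If you would rather not change every comparison, another option is to confirm that the quotes are stored and then strip them at query time. This is only a sketch: it assumes the stray characters are plain ASCII double quotes, and the CTE name clean is made up for illustration. length and regexp_replace are Hive built-ins.

```sql
-- Check whether the quotes are stored in the value:
-- a clean UP has length 2, while "UP" with its quotes has length 4.
SELECT direction, length(direction) AS len FROM votes;

-- Strip the double quotes once in a CTE (hypothetical name: clean),
-- then run the original comparisons against the cleaned values.
WITH clean AS (
  SELECT user_id, list_id, node_id,
         regexp_replace(direction, '"', '') AS direction
  FROM votes
)
SELECT v1.list_id AS lid, v1.node_id AS nid, v2.list_id AS rlid, v2.node_id AS rnid,
  SUM(IF(v1.direction='UP' AND v2.direction='UP', 1, 0)) AS uu,
  SUM(IF(v1.direction='UP' AND v2.direction='DOWN', 1, 0)) AS ud,
  SUM(IF(v1.direction='DOWN' AND v2.direction='UP', 1, 0)) AS du,
  SUM(IF(v1.direction='DOWN' AND v2.direction='DOWN', 1, 0)) AS dd
FROM clean v1
JOIN clean v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
```

This leaves the data files untouched, at the cost of running the regex on every row each time the query executes.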

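I suggest using a CSV SerDe when you create the table. That way the double quotes are handled automatically in SELECT queries, because the default quote character (quoteChar) in OpenCSVSerde is ".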

CREATE TABLE votes (user_id INT, list_id INT, node_id INT, direction STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
     WITH SERDEPROPERTIES ("separatorChar" = ",") -- "," is the default separator; set it to match your file
STORED AS TEXTFILE;
Running the SELECT query:

SELECT v1.list_id AS lid, v1.node_id AS nid, 
     v2.list_id AS rlid, v2.node_id AS rnid,
     SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
     SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
     SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
     SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
     FROM votes v1
     JOIN votes v2 ON v1.user_id=v2.user_id
     GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;


+------+---------+-------+---------+-----+-----+-----+-----+--+
| lid  |   nid   | rlid  |  rnid   | uu  | ud  | du  | dd  |
+------+---------+-------+---------+-----+-----+-----+-----+--+
| 390  | 400058  | 390   | 400058  | 2   | 0   | 0   | 1   |
| 390  | 400058  | 390   | 400059  | 1   | 1   | 1   | 0   |
| 390  | 400059  | 390   | 400058  | 1   | 1   | 1   | 0   |
| 390  | 400059  | 390   | 400059  | 2   | 0   | 0   | 1   |
+------+---------+-------+---------+-----+-----+-----+-----+--+
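One thing to keep in mind with this approach: OpenCSVSerde reads every column as STRING, whatever types the DDL declares. The join above still works because it compares user_id values for equality. If you need integer columns back, a view that casts them is one possible sketch; the view name votes_typed is hypothetical.

```sql
-- Hypothetical view that restores integer types on top of the
-- OpenCSVSerde-backed table, which exposes all columns as STRING.
CREATE VIEW votes_typed AS
SELECT CAST(user_id AS INT) AS user_id,
       CAST(list_id AS INT) AS list_id,
       CAST(node_id AS INT) AS node_id,
       direction
FROM votes;
```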

That's right! I had quotes in my CSV dump, so the quotes became part of the value. Thx

I've added an answer using Hive SerDes. Please add your feedback.