Google BigQuery: find duplicate rows based on a function rather than raw content
I have a BigQuery table logs with two columns containing log messages:
time TIMESTAMP
message STRING
I want to select all messages that match the pattern job .+ got machine (\d+) and whose machine appears more than once. E.g., given the rows:
10000, "job foo got machine 10"
10010, "job bar got machine 10"
10010, "job baz got machine 20"
the query would select the first two rows.
I can select the machines that have duplicates with this query:
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1
But I can't figure out how to get from here to the rows that contain those machines. I tried the following:
SELECT
[time],
message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
HAVING
machine_id IN (
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1)
But this produces the error "Error: Field 'machine_id' not found".
Is it possible to achieve what I want in a single query?
Don't use HAVING in that context; just use WHERE:
SELECT
[time],
message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
AND REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') IN (
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1)
I was able to work around this with the following query:
SELECT
[time],
message
FROM (
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1) AS A
JOIN (
SELECT
[time],
message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')) AS B
ON
A.machine_id = B.machine_id
This feels a bit clunky, but it seems to do the job.
Try the following:
SELECT [time], message
FROM (
SELECT [time], message, machine_id,
COUNT(1) OVER(PARTITION BY machine_id) AS dups
FROM (
SELECT [time], message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
)
)
WHERE dups > 1
No joins, and less clunky.
Or, more simply:
SELECT [time], message FROM (
SELECT [time], message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id,
COUNT(1) OVER(PARTITION BY machine_id) AS dups
FROM
[logs]
)
WHERE dups > 1
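Unfortunately, the WHERE ... IN version fails with the message "Expression on the left side of IN must be a field. Found EXTRACT_REGEXP." I have managed to work around the problem, though, and I'll document it below.
The window-function versions look better! Unfortunately, on my dataset they blow up with "Resources exceeded during query execution". That is 28101819 rows (10.8 GB of data), but that isn't part of the question, so I'll mark this as the answer.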
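For anyone who wants to sanity-check the expected result without running BigQuery, the logic shared by the queries in this thread (extract the machine id, count rows per id, keep rows whose id occurs more than once) can be sketched locally. This is only an illustration in Python, not BigQuery code; the function name rows_with_duplicate_machines and the in-memory sample list are assumptions made for the example, and the sample rows are the three from the question.

```python
import re
from collections import Counter

# Same pattern as in the queries above; the capture group is the machine id.
PATTERN = re.compile(r"job .+ got machine (\d+)")


def rows_with_duplicate_machines(rows):
    """Return the (time, message) rows whose extracted machine id occurs more than once.

    Hypothetical helper for illustration; `rows` is an iterable of (time, message).
    """
    keyed = []
    for time, message in rows:
        match = PATTERN.search(message)
        if match:  # rows that do not match the pattern are skipped
            keyed.append((match.group(1), time, message))

    # Count rows per machine id, like COUNT(1) OVER (PARTITION BY machine_id).
    counts = Counter(machine_id for machine_id, _, _ in keyed)

    # Keep only rows whose machine id is shared with at least one other row.
    return [(time, message) for machine_id, time, message in keyed
            if counts[machine_id] > 1]


# The three sample rows from the question.
sample = [
    (10000, "job foo got machine 10"),
    (10010, "job bar got machine 10"),
    (10010, "job baz got machine 20"),
]
result = rows_with_duplicate_machines(sample)
print(result)
# [(10000, 'job foo got machine 10'), (10010, 'job bar got machine 10')]
```

The sketch holds every matching row in memory, so it is only meant for checking small samples, not for a table of the size mentioned above.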