SQL: Jaccard similarity on Reddit data
I am trying to implement a fancy SQL query, but I am having trouble with the join and the counts. I have a long table of data:
author | group | id |
daniel | group1| 118
adam | group2| 126
harry | group1| 221
daniel | group2| 323
daniel | group2| 122
daniel | group5| 322
harry | group1| 222
harry | group1| 225
...
I would like my output to look like this:
author1 | author2 | intersection | union
daniel | adam | 2 | 3
daniel | harry| 2 | 11
adam | harry| 0 | 10
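Here, intersection is the number of groups that author1 and author2 have in common, and union = (number of groups of author1) + (number of groups of author2) - intersection.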
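Written as set operations, the two counts and the resulting Jaccard similarity look as follows. The notation $G(x)$ for the set of distinct groups author $x$ has posted in is an assumption introduced here for illustration; it does not appear in the question.

```latex
\begin{aligned}
\text{intersection}(a_1, a_2) &= |G(a_1) \cap G(a_2)| \\
\text{union}(a_1, a_2) &= |G(a_1)| + |G(a_2)| - |G(a_1) \cap G(a_2)| = |G(a_1) \cup G(a_2)| \\
J(a_1, a_2) &= \frac{|G(a_1) \cap G(a_2)|}{|G(a_1) \cup G(a_2)|}
\end{aligned}
```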
I think the right approach is something like
table a LEFT JOIN table b ON a.group == b.group
but I don't know how to do the total counts.
Thanks
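Jumping in because 1) I still don't see any answers and 2) I saw the author's related question and the BigQuery tag.
So, in theory, the query below should do the job (the example uses the bigquery-samples.reddit.full table):
BigQuery Legacy SQL: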
SELECT
  a.author AS author1,
  b.author AS author2,
  SUM(a.subr = b.subr) AS count_intersection,
  EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
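In a similar way, you can adjust this legacy SQL to make it work for your case.
This may not be the best approach, but it at least gives some hope that a task like this can run easily in BigQuery without resorting to other workarounds.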
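To try the same self-join idea without BigQuery, here is a minimal sketch that runs it in SQLite from Python on the eight sample rows visible in the question. The table name `posts`, the column name `grp` (standing in for `group`, which is a reserved word), and the helper `pairwise_counts` are all assumptions made for this illustration. Because only the visible rows are loaded, the counts will not match the truncated expected output in the question.

```python
import sqlite3

# The eight sample rows visible in the question (author, group, id).
ROWS = [
    ("daniel", "group1", 118),
    ("adam", "group2", 126),
    ("harry", "group1", 221),
    ("daniel", "group2", 323),
    ("daniel", "group2", 122),
    ("daniel", "group5", 322),
    ("harry", "group1", 222),
    ("harry", "group1", 225),
]

QUERY = """
WITH pairs AS (
    -- one row per distinct (author, group) combination
    SELECT DISTINCT author, grp FROM posts
)
SELECT
    a.author AS author1,
    b.author AS author2,
    SUM(a.grp = b.grp) AS count_intersection,
    COUNT(DISTINCT a.grp) + COUNT(DISTINCT b.grp) - SUM(a.grp = b.grp) AS count_union
FROM pairs AS a
CROSS JOIN pairs AS b
WHERE a.author < b.author  -- keep each unordered author pair once
GROUP BY a.author, b.author
ORDER BY a.author, b.author
"""


def pairwise_counts(rows):
    """Load rows into an in-memory SQLite table and run the pairwise query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE posts (author TEXT, grp TEXT, id INTEGER)")
        conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in pairwise_counts(ROWS):
        print(row)
```

Dividing `count_intersection` by `count_union` gives the Jaccard similarity for each pair. Note that the cross join compares every author with every other author, so on a large table it may be worth restricting the set of authors first.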
CREATE OR REPLACE FUNCTION public.jaccard_similarity(IN vector anyarray)
RETURNS double precision[]
LANGUAGE 'plpgsql'
AS $BODY$
BEGIN
    -- One score per row of public.tbl_topic: intersection size / union size,
    -- comparing the input array ($1) with that row's "TOPIC_VECTOR" array.
    RETURN ARRAY(
        SELECT
            -- Cast to double precision so the division is not integer division.
            (SELECT COUNT(*) FROM (SELECT unnest($1) INTERSECT SELECT unnest(t."TOPIC_VECTOR")) AS intersect_elements)::double precision
            -- NULLIF avoids a division by zero when both arrays are empty.
            / NULLIF((SELECT COUNT(*) FROM (SELECT unnest($1) UNION SELECT unnest(t."TOPIC_VECTOR")) AS union_elements), 0)
        FROM public.tbl_topic AS t
    );
END;
$BODY$;
ALTER FUNCTION public.jaccard_similarity(anyarray)
OWNER TO postgres;
COMMENT ON FUNCTION public.jaccard_similarity(anyarray)
IS 'this function is used for calculating the Jaccard similarity of the input vector with all vectors in the database.';
You can use this function for reference.
Thanks, everyone.
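For readers who want to check the intended behaviour outside the database, here is a small Python sketch of what a "compare one vector against all stored vectors" function is expected to return. The names `jaccard` and `jaccard_against_all` and the sample data are hypothetical, and returning 0.0 for two empty inputs is a convention chosen here, not something the original function specifies.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention when both inputs are empty
    return len(set_a & set_b) / len(union)


def jaccard_against_all(vector, stored_vectors):
    """Score the input vector against every stored vector, in order."""
    return [jaccard(vector, other) for other in stored_vectors]


if __name__ == "__main__":
    topics = [[1, 2, 3], [2, 3, 4], [7, 8]]  # made-up example data
    print(jaccard_against_all([1, 2, 3], topics))
```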