SQL REDDIT - Jaccard similarity


I'm trying to implement a fancy SQL query, but I'm having trouble with the joins and the counts.

I have a long data table:

author | group | id |

daniel | group1| 118
adam   | group2| 126
harry  | group1| 221
daniel | group2| 323
daniel | group2| 122
daniel | group5| 322
harry  | group1| 222 
harry  | group1| 225
...

I'd like my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 3
daniel | harry| 2 | 11
adam   | harry| 0 | 10
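where intersection is defined as the # of groups that author1 and author2 have in common, and union = (# of groups of author1) + (# of groups of author2) - intersection.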
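To make the definition concrete, here is a minimal Python sketch that applies it to only the eight sample rows shown above. The counts it produces will not match the example output in the question, which presumably comes from the full table. The names `groups_by_author`, `pair_counts` and `results` are made up for this illustration.

```python
from itertools import combinations

# The eight sample rows from the question: (author, group, id).
rows = [
    ("daniel", "group1", 118),
    ("adam", "group2", 126),
    ("harry", "group1", 221),
    ("daniel", "group2", 323),
    ("daniel", "group2", 122),
    ("daniel", "group5", 322),
    ("harry", "group1", 222),
    ("harry", "group1", 225),
]

# Collect the distinct groups each author has posted in.
groups_by_author = {}
for author, group, _id in rows:
    groups_by_author.setdefault(author, set()).add(group)


def pair_counts(author1, author2):
    """Return (intersection, union) of the two authors' group sets."""
    set1, set2 = groups_by_author[author1], groups_by_author[author2]
    intersection = len(set1 & set2)
    # union = |A| + |B| - |A ∩ B|, as defined in the question.
    union = len(set1) + len(set2) - intersection
    return intersection, union


# One entry per unordered pair of authors.
results = {
    (a, b): pair_counts(a, b)
    for a, b in combinations(sorted(groups_by_author), 2)
}

for (a, b), (intersection, union) in results.items():
    print(a, b, intersection, union, intersection / union)
```

The last printed column is the Jaccard similarity itself, intersection divided by union.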

I think the right approach is

table a LEFT JOIN table b ON a.group == b.group

but I don't know how to do the total counts.

Thanks

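Jumping in because 1) I still don't see any answers and 2) I saw the author's related question with the BigQuery tag.

So, in theory, the query below will do your task (see the example below, which uses the bigquery-samples.reddit.full table):

BigQuery Legacy SQL: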

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.subr = b.subr) AS count_intersection,
  EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
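By analogy, you can adapt this legacy SQL query to make it work for your case.

This may not be the best approach, but it at least gives some hope that a task like this can be run easily in BigQuery without resorting to other workarounds.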
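As a rough check of the cross-join logic, the sketch below runs the same idea against the sample rows from the question using SQLite through Python's standard `sqlite3` module. This is not BigQuery: the table name `posts` and the column name `grp` (used because `group` is a reserved word) are assumptions for this illustration, and `COUNT(DISTINCT ...)` stands in for `EXACT_COUNT_DISTINCT`.

```python
import sqlite3

# The eight sample rows from the question: (author, group, id).
rows = [
    ("daniel", "group1", 118),
    ("adam", "group2", 126),
    ("harry", "group1", 221),
    ("daniel", "group2", 323),
    ("daniel", "group2", 122),
    ("daniel", "group5", 322),
    ("harry", "group1", 222),
    ("harry", "group1", 225),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author TEXT, grp TEXT, id INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)

# Deduplicate (author, grp) pairs first, then cross join them. For each pair
# of authors, SUM(a.grp = b.grp) counts the shared groups because SQLite
# evaluates a comparison to 0 or 1.
query = """
SELECT
  a.author AS author1,
  b.author AS author2,
  SUM(a.grp = b.grp) AS count_intersection,
  COUNT(DISTINCT a.grp) + COUNT(DISTINCT b.grp) - SUM(a.grp = b.grp) AS count_union
FROM (SELECT DISTINCT author, grp FROM posts) AS a
CROSS JOIN (SELECT DISTINCT author, grp FROM posts) AS b
WHERE a.author < b.author
GROUP BY a.author, b.author
ORDER BY a.author, b.author
"""
pair_counts = conn.execute(query).fetchall()

for author1, author2, intersection, union in pair_counts:
    print(author1, author2, intersection, union)
```

Unlike an inner join on the group column, the cross join keeps pairs with no group in common, so they show up with an intersection of 0. The cost is that the intermediate result grows with the product of the two deduplicated inputs.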

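CREATE OR REPLACE FUNCTION public.jaccard_similarity(IN vector anyarray)
    RETURNS double precision[]
    LANGUAGE 'plpgsql'

AS $BODY$
BEGIN
    -- One score per row of public.tbl_topic: |A ∩ B| / |A ∪ B|, where A is the
    -- input array and B is that row's "TOPIC_VECTOR" array.
    RETURN ARRAY(
        SELECT
            -- Cast to double precision; bigint / bigint would truncate to 0 or 1.
            (SELECT COUNT(*) FROM (SELECT unnest($1) INTERSECT SELECT unnest(t."TOPIC_VECTOR")) AS intersect_elements)::double precision
            -- NULLIF avoids division by zero when both arrays are empty.
            / NULLIF((SELECT COUNT(*) FROM (SELECT unnest($1) UNION SELECT unnest(t."TOPIC_VECTOR")) AS union_elements), 0)
        FROM public.tbl_topic AS t
    );

END;
$BODY$;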

ALTER FUNCTION public.jaccard_similarity(anyarray)
    OWNER TO postgres;

COMMENT ON FUNCTION public.jaccard_similarity(anyarray)
    IS 'This function calculates the Jaccard similarity of the input vector with every vector in the database.';
You can use this function for reference. Thank you.
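For comparison, here is a small Python sketch of the same one-against-all idea: score one input vector against every stored vector. The function name `jaccard_against_all` and the sample vectors are hypothetical; they only illustrate the float division and the empty-union case, and are not taken from the table in the answer.

```python
def jaccard_against_all(vector, stored_vectors):
    """Return the Jaccard similarity of `vector` with each stored vector.

    Duplicates inside a vector are ignored, because the score is defined
    on sets. A pair with an empty union gets None instead of a number.
    """
    query_set = set(vector)
    scores = []
    for stored in stored_vectors:
        stored_set = set(stored)
        union_size = len(query_set | stored_set)
        if union_size == 0:
            scores.append(None)
        else:
            # True division, so the result is a float between 0.0 and 1.0.
            scores.append(len(query_set & stored_set) / union_size)
    return scores


# Hypothetical sample data for illustration only.
scores = jaccard_against_all([1, 2, 3], [[2, 3, 4], [1, 2, 3], [7, 8], []])
print(scores)
```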
