Google BigQuery: finding exact word matches across separate table fields while honoring negative keywords

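I have tried so many different queries to get the right results, but they all end up a mess. Long story short, I am trying to find an exact word match (an isolated word delimited by spaces) based on 3 separate keywords, and to exclude any match that contains a negative keyword.

field_name_1, field_name_2 and field_name_3 are the positive words. negative_keywords is a set of comma-separated words that is split first and then used to reject any result where ut.title contains a negative keyword.

Essentially, the query asks: "Find where ut.title has field_name_1, field_name_2 or field_name_3, but at the same time has no word from the split negative_keywords field."

Any help is greatly appreciated. Unfortunately, a regular expression does not seem to be an option, since the keywords come from fields rather than constants. Thanks in advance.

My current, overblown query is as follows: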

SELECT ut.i_id as i_id, up.id AS p_id, up.option_id as option_id
    FROM ds_test.table_1 AS ut 
    CROSS JOIN 
(
SELECT field_name_1, field_name_2, field_name_3, SPLIT(negative_keywords ,",")  as negative_keywords, option_id, id
FROM ds_test.table_2 ) AS up 

    WHERE 
(
(ut.title contains " "+up.field_name_1+" ")  or 
(LEFT(ut.title, LENGTH(up.field_name_1+" ")) contains up.field_name_1+" ")  or
(RIGHT(ut.title, LENGTH(" "+up.field_name_1)) contains " "+up.field_name_1)  or
(ut.title contains " "+up.field_name_2+" ")  or 
(LEFT(ut.title, LENGTH(up.field_name_2+" ")) contains up.field_name_2+" ")  or
(RIGHT(ut.title, LENGTH(" "+up.field_name_2)) contains " "+up.field_name_2)  or
(ut.title contains " "+up.field_name_3+" ")  or 
(LEFT(ut.title, LENGTH(up.field_name_3+" ")) contains up.field_name_3+" ")  or
(RIGHT(ut.title, LENGTH(" "+up.field_name_3)) contains " "+up.field_name_3) or
(ut.title CONTAINS CONCAT(SUBSTR(up.field_name_1, 1 , LENGTH(up.field_name_1))," "))  or  
(ut.title CONTAINS CONCAT(SUBSTR(up.field_name_2, 1 , LENGTH(up.field_name_2))," "))  or  
(ut.title CONTAINS CONCAT(SUBSTR(up.field_name_3, 1 , LENGTH(up.field_name_3))," ")) 
and (NOT ut.title CONTAINS CONCAT(SUBSTR(up.negative_keywords, 1 , LENGTH(up.negative_keywords))," ")) 
)

GROUP EACH BY i_id, p_id, option_id

IGNORE CASE

For example:

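In table ds_test.table_1, the title field contains "The X301-p and x301-b are Top of the charts".

In table ds_test.table_2, the values of field_name_1, field_name_2, field_name_3 and negative_keywords are, respectively: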

ROW 1 = |x301-f|x301p|x301-p|x301-a,x301-c|

ROW 2 = |x301-b|x301b|x301-d|x301-h,x301-p|

ROW 3 = |x301  |x30  |      |             |
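
Row 1 would be true. x301-p is there, and none of the negative keywords are in the title.

Row 2 would be false. Although x301-b is in the title, x301-p is there too, and it is a negative keyword.

Row 3 would be false. Although x301 and/or x30 appear in the title, they only match as substrings of x301-p or x301-b, so neither x301 nor x30 is a complete, standalone word in the title.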
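
To pin down the expected outcomes before touching the query, here is a minimal reference check written in Python rather than BigQuery. It is only a sketch of the rules as stated above; the function name `row_matches` is made up for illustration, and it assumes that matching is case-insensitive (as IGNORE CASE suggests), that blank keywords are ignored, and that negative keywords are also compared as whole words.

```python
def row_matches(title, positives, negative_keywords):
    """True if some positive keyword is a whole word in title
    and no negative keyword is a whole word in title."""
    # Lowercase and split on whitespace: each token is one isolated word.
    words = set(title.lower().split())
    # Blank keywords (such as the empty field_name_3 in row 3) are ignored.
    wanted = {p.strip().lower() for p in positives if p.strip()}
    # negative_keywords is a single comma-separated string, as in table_2.
    banned = {n.strip().lower() for n in negative_keywords.split(",") if n.strip()}
    return bool(words & wanted) and not (words & banned)


title = "The X301-p and x301-b are Top of the charts"
rows = [
    (["x301-f", "x301p", "x301-p"], "x301-a,x301-c"),  # row 1
    (["x301-b", "x301b", "x301-d"], "x301-h,x301-p"),  # row 2
    (["x301", "x30", ""], ""),                          # row 3
]
results = [row_matches(title, pos, neg) for pos, neg in rows]
print(results)  # [True, False, False]
```

Any query that solves the problem should agree with this check on the three sample rows.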

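The idea is:

  • Split the negative keywords into a repeated field
  • Remove rows that hit a negative word using the OMIT RECORD IF SOME(title CONTAINS negative) construct
  • Match whole words using CONTAINS with surrounding spaces, or LIKE with custom patterns to catch the beginning/end of the string

Putting it together using the data from the example: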

SELECT title, field_1, field_2, field_3 FROM (
SELECT title, field_1, field_2, field_3, SPLIT(table2.negative) negative FROM
(SELECT * FROM 
 (SELECT 'The x301-b tops the x301-p' title),
 (SELECT 'The X301-p and x301-b are Top of the charts' title)) table1
CROSS JOIN
(SELECT * FROM
(SELECT 'x301-f' field_1, 'x301p' field_2, 'x301-p' field_3, 'x301-a,x301-c' negative),
(SELECT 'x301-b' field_1, 'x301b' field_2, 'x301-d' field_3, 'x301-h,x301-p' negative),
(SELECT 'x301'   field_1, 'x30'   field_2, '' field_3, '' negative)) table2
)
WHERE title CONTAINS ' ' + field_1 + ' ' OR title LIKE '% ' + field_1 OR title LIKE field_1 + ' %' OR
      title CONTAINS ' ' + field_2 + ' ' OR title LIKE '% ' + field_2 OR title LIKE field_2 + ' %' OR
      title CONTAINS ' ' + field_3 + ' ' OR title LIKE '% ' + field_3 OR title LIKE field_3 + ' %'
OMIT RECORD IF SOME(title CONTAINS negative)
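
The three predicates per field in the WHERE clause above (word in the middle, word at the end, word at the start) can be tried out on plain strings. The following Python sketch mirrors them outside BigQuery; `matches_three_patterns` is a hypothetical helper, the comparison is case-sensitive, and LIKE wildcard characters inside a keyword are not modeled.

```python
def matches_three_patterns(title, word):
    """Mirror of: title CONTAINS ' w ' OR title LIKE '% w' OR title LIKE 'w %'."""
    return (
        f" {word} " in title              # word somewhere in the middle
        or title.endswith(f" {word}")     # word at the end of the string
        or title.startswith(f"{word} ")   # word at the start of the string
    )


title = "The x301-b tops the x301-p"
middle = matches_three_patterns(title, "x301-b")
at_end = matches_three_patterns(title, "x301-p")
at_start = matches_three_patterns(title, "The")
substring_only = matches_three_patterns(title, "x301")
whole_title = matches_three_patterns("x301-p", "x301-p")
print(middle, at_end, at_start, substring_only, whole_title)
# True True True False False
```

The last case is worth noting: when the title consists of the keyword alone, none of the three patterns applies, so a fourth equality check would be needed to cover it.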
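
Update: since evaluating LIKE seems to be too expensive on the real dataset, an alternative is to pad the title with spaces on both sides before doing the CONTAINS checks. The modified query is as follows: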

SELECT title, field_1, field_2, field_3 FROM (
SELECT title, field_1, field_2, field_3, SPLIT(table2.negative) negative FROM
(SELECT ' ' + title + ' ' AS title FROM 
 (SELECT 'The x301-b tops the x301-p' title),
 (SELECT 'The X301-p and x301-b are Top of the charts' title)) table1
CROSS JOIN
(SELECT * FROM
(SELECT 'x301-f' field_1, 'x301p' field_2, 'x301-p' field_3, 'x301-a,x301-c' negative),
(SELECT 'x301-b' field_1, 'x301b' field_2, 'x301-d' field_3, 'x301-h,x301-p' negative),
(SELECT 'x301'   field_1, 'x30'   field_2, '' field_3, '' negative)) table2
)
WHERE title CONTAINS ' ' + field_1 + ' ' OR
      title CONTAINS ' ' + field_2 + ' ' OR
      title CONTAINS ' ' + field_3 + ' '
OMIT RECORD IF SOME(title CONTAINS negative)
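
The padding trick can be checked on plain strings as well. Below is a rough Python sketch, not BigQuery code; `contains_word` and `keep_row` are hypothetical names. It assumes case-insensitive matching and that a blank keyword never matches, and it applies the same padded test to the negative keywords, which is stricter than the plain `title CONTAINS negative` in the query above.

```python
def contains_word(title, word):
    """Whole-word test via padding: ' ' + title + ' ' contains ' ' + word + ' '."""
    if not word:
        return False  # assumption: a blank keyword never matches
    padded = f" {title.lower()} "  # lowercase to imitate IGNORE CASE
    return f" {word.lower()} " in padded


def keep_row(title, positives, negatives):
    """Some positive whole word is present and no negative whole word is."""
    hit = any(contains_word(title, p) for p in positives)
    blocked = any(contains_word(title, n) for n in negatives)
    return hit and not blocked


title = "The x301-b tops the x301-p"
rows = [
    (["x301-f", "x301p", "x301-p"], ["x301-a", "x301-c"]),
    (["x301-b", "x301b", "x301-d"], ["x301-h", "x301-p"]),
    (["x301", "x30", ""], [""]),
]
kept = [keep_row(title, pos, neg) for pos, neg in rows]
print(kept)  # [True, False, False]
```

With padding, a single substring test covers a word at the start, in the middle, at the end, or as the entire title. It still treats only spaces as delimiters, so a keyword followed by punctuation, as in "x301-p," would not be matched.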

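It would help if you could give a few examples of what should match and what should not.

Mosha, I messed up and made it look like there was only one table, when there are actually two. I have edited the original code. You will see ds_test.table_1 and ds_test.table_2. Table 1 contains the titles, and table 2 contains the positive/negative words. I have added true/false examples.

Thanks Mosha. This works except at the beginning/end of the string. For example, if the title is "The x301-b tops the x301-p", then x301-p from row 1 = |x301-f|x301p|x301-p|x301-a,x301-c| is not matched, because it sits at the end of the string. I wish there were a way to use a regular expression with CONTAINS, so that I could write title CONTAINS " "+field_3+$

Yes, there is a way - I modified the query to use LIKE with custom patterns to catch that.

Thanks Mosha. Another question: should I use the same syntax to apply the negatives as exact matches? For example, OMIT RECORD IF SOME(title CONTAINS " " + negative + " ")...

Yes, you can use exactly the same predicate inside SOME to do whole-word matching again.

I am running the query now. It has been running for about 5 hours. Hopefully it will not time out. BigQuery seems to take much longer when the LIKE function is used. Table 1 has 280k rows and table 2 has 8k rows.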