Google BigQuery: LIKE % and GROUP BY

I have a table that lists product names, and I need to count each product. Some product names are written in different forms; for example, the "Juice" product appears as Juice, juice, and so on. I need to group them together and display the counts using BigQuery.

Juice - 100
juice - 14
Milk - 10
milk - 3
mil - 1

The table above should end up looking like this:

Juice - 114
Milk - 14

Does this work for you:
SELECT product_name, COUNT(*) from <table> GROUP BY 1 IGNORE CASE
If you did not have to account for misspelled words, the solution would be as simple as the following:
SELECT LOWER(word) AS word, SUM(cnt) AS cnt
FROM YourTable
GROUP BY 1
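To see why lowercasing alone is not enough, here is a minimal JavaScript sketch of what the query above computes. The sample rows are an assumption: the counts come from the question, but the exact spellings are illustrative.

```javascript
// Hypothetical sample rows: counts follow the question, spellings are assumed.
const rows = [
  { word: "Juice", cnt: 100 },
  { word: "juice", cnt: 14 },
  { word: "Milk", cnt: 10 },
  { word: "milk", cnt: 3 },
  { word: "mil", cnt: 1 },
];

// Rough equivalent of: SELECT LOWER(word) AS word, SUM(cnt) AS cnt ... GROUP BY 1
function groupByLower(rows) {
  const totals = new Map();
  for (const { word, cnt } of rows) {
    const key = word.toLowerCase();
    totals.set(key, (totals.get(key) || 0) + cnt);
  }
  return totals;
}

const totals = groupByLower(rows);
console.log(Object.fromEntries(totals)); // { juice: 114, milk: 13, mil: 1 }
```

Case variants are merged, but the misspelling "mil" still ends up in its own group.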
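But in your case, you first need to deal with the similarity problem.
Consider the option below.
First, let's go over the high-level logic and steps.
Step 0 - Assume your table (YourTable) holds one row per product name, with the columns word and cnt.
Step 1 - Calculate similarity
We only consider pairs whose similarity is greater than 0.5 and less than 1.
The expected result therefore looks like this: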
word replacement similarity
milkk milk 0.8
mil milk 0.6666666666666667
milkk mil 0.6
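The similarity values in this table can be checked by hand. The sketch below assumes the same definition the query uses, 1 - distance / length of the first word, with a plain Levenshtein distance. The function names `levenshtein` and `similarity` are illustrative, not part of BigQuery.

```javascript
// Dynamic-programming Levenshtein distance, keeping only two rows in memory.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(
        prev[j] + 1,        // remove a character from a
        cur[j - 1] + 1,     // add a character to a
        prev[j - 1] + cost  // substitute (or keep) a character
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// Similarity relative to the length of the first word.
function similarity(word, replacement) {
  return 1 - levenshtein(word, replacement) / word.length;
}

console.log(similarity("milkk", "milk")); // distance 1 over length 5
console.log(similarity("mil", "milk"));   // distance 1 over length 3
console.log(similarity("milkk", "mil"));  // distance 2 over length 5
```

Because the distance is divided by the length of the first word only, the measure is not symmetric: swapping the two arguments can give a different value.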
Step 2 - Find the winner
You would expect:
word replacement
milkk milk
mil milk
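The rule for picking the winner is not spelled out here. The sketch below assumes it means keeping, for each word, the candidate with the highest similarity, which reproduces the expected result. The name `pickWinners` is hypothetical.

```javascript
// Candidate rows copied from the Step 1 result.
const candidates = [
  { word: "milkk", replacement: "milk", similarity: 0.8 },
  { word: "mil", replacement: "milk", similarity: 0.6666666666666667 },
  { word: "milkk", replacement: "mil", similarity: 0.6 },
];

// Assumption: the winner is the candidate with the highest similarity per word.
function pickWinners(candidates) {
  const best = new Map();
  for (const c of candidates) {
    const current = best.get(c.word);
    if (!current || c.similarity > current.similarity) {
      best.set(c.word, c);
    }
  }
  const winners = new Map();
  for (const [word, c] of best) {
    winners.set(word, c.replacement);
  }
  return winners;
}

const winners = pickWinners(candidates);
console.log(Object.fromEntries(winners)); // { milkk: 'milk', mil: 'milk' }
```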
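Step 3 - Final aggregation
Below is the respective code.
Most likely this can be optimized, improved, and combined, but the point here is to give you an idea (and working code).
Query 1 (Step 1) - Replacement candidates
Let's write the output to a table called Replacements.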
SELECT text1 AS word, text2 AS replacement, similarity FROM
JS(
  // input table: every pair of distinct lowercased words,
  // with the less frequent word first (text1) and the more frequent second (text2)
  (
    SELECT
      word1 AS text1,
      word2 AS text2
    FROM (
      SELECT
        CASE WHEN a.cnt < b.cnt THEN a.word ELSE b.word END AS word1,
        CASE WHEN a.cnt < b.cnt THEN b.word ELSE a.word END AS word2
      FROM (
        SELECT LOWER(word) AS word, SUM(cnt) AS cnt
        FROM YourTable
        GROUP BY 1
      ) AS a
      CROSS JOIN (
        SELECT LOWER(word) AS word, SUM(cnt) AS cnt
        FROM YourTable
        GROUP BY 1
      ) AS b
      WHERE a.word <= b.word
    )
  ),
  // input columns
  text1, text2,
  // output schema
  "[{name: 'text1', type:'string'},
    {name: 'text2', type:'string'},
    {name: 'similarity', type:'float'}]
  ",
  // function
  "function(r, emit) {
    var Levenshtein = {
      /**
       * Calculate levenshtein distance of the two strings.
       *
       * @param str1 String the first string.
       * @param str2 String the second string.
       * @return Integer the levenshtein distance (0 and above).
       */
      get: function(str1, str2) {
        // base cases
        if (str1 === str2) return 0;
        if (str1.length === 0) return str2.length;
        if (str2.length === 0) return str1.length;
        // two rows
        var prevRow = new Array(str2.length + 1),
            curCol, nextCol, i, j, tmp;
        // initialise previous row
        for (i=0; i<prevRow.length; ++i) {
          prevRow[i] = i;
        }
        // calculate current row distance from previous row
        for (i=0; i<str1.length; ++i) {
          nextCol = i + 1;
          for (j=0; j<str2.length; ++j) {
            curCol = nextCol;
            // substitution
            nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
            // insertion
            tmp = curCol + 1;
            if (nextCol > tmp) {
              nextCol = tmp;
            }
            // deletion
            tmp = prevRow[j + 1] + 1;
            if (nextCol > tmp) {
              nextCol = tmp;
            }
            // copy current col value into previous (in preparation for next iteration)
            prevRow[j] = curCol;
          }
          // copy last col value into previous (in preparation for next iteration)
          prevRow[j] = nextCol;
        }
        return nextCol;
      }
    };
    // both variables are declared here so that neither leaks as an implicit global
    var the_text1, the_text2;
    try {
      the_text1 = decodeURI(r.text1).toLowerCase();
    } catch (ex) {
      the_text1 = r.text1.toLowerCase();
    }
    try {
      the_text2 = decodeURI(r.text2).toLowerCase();
    } catch (ex) {
      the_text2 = r.text2.toLowerCase();
    }
    // similarity is measured relative to the length of the first word
    emit({text1: the_text1, text2: the_text2,
      similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});
  }"
)
WHERE similarity > 0.5 AND similarity < 1
ORDER BY similarity DESC
Query 3 (Steps 2 and 3 combined) - Replacement and final aggregation
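As a rough illustration of what this step does (a JavaScript sketch, not the original query), each lowercased word is mapped to its winning replacement and the counts are then summed. The sample rows and the `replacements` map are assumptions: the counts come from the question, and the mappings from the expected Step 2 result.

```javascript
// Hypothetical input rows: counts follow the question, spellings are assumed.
const rows = [
  { word: "Juice", cnt: 100 },
  { word: "juice", cnt: 14 },
  { word: "Milk", cnt: 10 },
  { word: "milk", cnt: 3 },
  { word: "mil", cnt: 1 },
];

// Winners from Step 2 (word -> replacement).
const replacements = new Map([
  ["milkk", "milk"],
  ["mil", "milk"],
]);

// Lowercase each word, swap in its replacement if one exists, then sum the counts.
function aggregate(rows, replacements) {
  const totals = new Map();
  for (const { word, cnt } of rows) {
    const lower = word.toLowerCase();
    const key = replacements.get(lower) || lower;
    totals.set(key, (totals.get(key) || 0) + cnt);
  }
  return totals;
}

const finalCounts = aggregate(rows, replacements);
console.log(Object.fromEntries(finalCounts)); // { juice: 114, milk: 14 }
```

With these inputs the totals match the output requested in the question. A word without a winning replacement simply keeps its own lowercased spelling as the group key.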
Although the above works (you can run it against this example), I cannot guarantee it will behave the way you expect on your real data. Still, I hope it gives you a good direction to explore.
Thanks, but it is not just a matter of case. There are some spelling mistakes, and some names are shortened. Basically, I need to group similar words together and total their counts. I would like to know how to achieve this with regular expressions.

Thanks, Mikhail! I now have an overall idea of how to solve my problem.