
Google BigQuery: LIKE %-style grouping (GROUP BY)


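I have a table that lists product names, and I need to count each product. Some product names are written in different forms; for example, the product "Juice" appears as Juice, juice, and so on. I need to group these variants together and show the counts using BigQuery.

Juice - 100
juice - 14
Milk - 10
milk - 3
mil - 1

The table above must end up looking like this:

Juice - 114
Milk - 14

Does this work for you: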

SELECT product_name, COUNT(*) from <table> GROUP BY 1 IGNORE CASE

If you did not have to account for misspelled words, the solution would be as simple as the following:

SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
FROM YourTable
GROUP BY 1
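
But in your case you first need to deal with similarity. Consider the option below.

First, let's go through the high-level logic / steps.

Step 0 - Assume your table (YourTable) looks like this

Step 1 - Calculate similarity

We only consider pairs whose similarity is between 0.5 and 1, so the expected result looks like this: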

word    replacement similarity   
milkk   milk        0.8  
mil     milk        0.6666666666666667   
milkk   mil         0.6 
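
As a rough check of the numbers in this table, here is a minimal standalone sketch in JavaScript. It assumes similarity is defined as 1 minus the Levenshtein distance divided by the length of the first word, which is what the UDF in Query 1 emits. The function names `levenshtein` and `similarity` are illustrative only and are not part of the answer's code.

```javascript
// Full-matrix Levenshtein distance, written for clarity rather than speed.
function levenshtein(a, b) {
  const d = [];
  for (let i = 0; i <= a.length; i++) {
    d.push([i]);               // first column: i deletions
  }
  for (let j = 1; j <= b.length; j++) {
    d[0][j] = j;               // first row: j insertions
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match)
      );
    }
  }
  return d[a.length][b.length];
}

// Assumption: `word` is non-empty, otherwise this divides by zero.
function similarity(word, replacement) {
  return 1 - levenshtein(word, replacement) / word.length;
}

console.log(similarity("milkk", "milk"));
console.log(similarity("mil", "milk"));
console.log(similarity("milkk", "mil"));
```

Because the denominator is the length of the first word only, the measure is not symmetric: swapping the two arguments can change the result.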

Step 2 - Find the winner

You would expect:

word    replacement  
milkk   milk     
mil     milk    
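
The rule for choosing a winner is not spelled out here. The sketch below assumes it means keeping, for each word, the candidate with the highest similarity, which is consistent with the expected output shown. `pickWinners` is a hypothetical helper name, and the input rows are copied from the Step 1 table.

```javascript
// Candidate rows as listed in the Step 1 table.
const candidates = [
  { word: "milkk", replacement: "milk", similarity: 0.8 },
  { word: "mil",   replacement: "milk", similarity: 0.6666666666666667 },
  { word: "milkk", replacement: "mil",  similarity: 0.6 },
];

// Assumed rule: for each word keep the replacement with the highest similarity.
function pickWinners(rows) {
  const best = new Map();
  for (const row of rows) {
    const current = best.get(row.word);
    if (!current || row.similarity > current.similarity) {
      best.set(row.word, row);
    }
  }
  const winners = {};
  for (const [word, row] of best) {
    winners[word] = row.replacement;
  }
  return winners;
}

const winners = pickWinners(candidates);
console.log(winners);
```

Ties are resolved in favour of the first row seen; a real query would probably need an explicit tie-breaker.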
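
Step 3 - Final aggregation

Below is the respective code.

Most likely this can be optimized, improved and combined, but the aim here is just to give you an idea (and working code).

Query 1 (Step 1) - Replacement candidates

Let's write the output to a table --> replacements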

SELECT text1 AS word, text2 AS replacement, similarity FROM 
JS(
// input table
(
  SELECT 
    word1 AS text1, 
    word2 AS text2
  FROM (
    SELECT
      CASE WHEN a.cnt < b.cnt THEN a.word ELSE b.word END AS word1,
      CASE WHEN a.cnt < b.cnt THEN b.word ELSE a.word END AS word2
    FROM (
      SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
      FROM YourTable
      GROUP BY 1
    ) AS a
    CROSS JOIN (
      SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
      FROM YourTable
      GROUP BY 1
    ) AS b
    WHERE a.word <= b.word 
  )
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
  {name: 'text2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  // declare both locals; the_text2 was previously an implicit global
  var the_text1, the_text2;

  try {
    the_text1 = decodeURI(r.text1).toLowerCase();
  } catch (ex) {
    the_text1 = r.text1.toLowerCase();
  }

  try {
    the_text2 = decodeURI(r.text2).toLowerCase();
  } catch (ex) {
    the_text2 = r.text2.toLowerCase();
  }

  emit({text1: the_text1, text2: the_text2,
        similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});

  }"
)
WHERE similarity > 0.5 AND similarity < 1
ORDER BY similarity DESC
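
Query 3 (Steps 2 and 3 combined) - Replacement and final aggregation


Although the above works (you can run it against this example), I can't guarantee it will behave the way you expect on your real data. But I hope it gives you a good direction to explore.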
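
As a rough illustration of what the final aggregation in Step 3 amounts to, here is a minimal JavaScript sketch. It is not the answer's query. The input rows, the `replacementMap` object and the `aggregate` helper are all hypothetical, with counts modelled on the question's example.

```javascript
// Hypothetical input modelled on the question's example data.
const rows = [
  { word: "Juice", cnt: 100 },
  { word: "juice", cnt: 14 },
  { word: "Milk",  cnt: 10 },
  { word: "milk",  cnt: 3 },
  { word: "mil",   cnt: 1 },
];

// Hypothetical output of Step 2: misspelled word -> winning replacement.
const replacementMap = { mil: "milk" };

// Lower-case each word, swap in its replacement if one exists, then sum counts.
function aggregate(inputRows, replacements) {
  const totals = new Map();
  for (const { word, cnt } of inputRows) {
    const lower = word.toLowerCase();
    const key = Object.prototype.hasOwnProperty.call(replacements, lower)
      ? replacements[lower]
      : lower;
    totals.set(key, (totals.get(key) || 0) + cnt);
  }
  return Object.fromEntries(totals);
}

const totals = aggregate(rows, replacementMap);
console.log(totals);
```

This applies a single replacement hop only. If a winning replacement could itself be replaced by another word, the map would need to be resolved transitively first.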

Thanks, but that's not the whole story. There are misspellings and shortened names. Basically, I need to group similar words together and total the counts. I'd like to know how to do this with a regular expression.

Thanks Mikhail! I now have an overall idea of how to solve my problem.