How can we remove duplicate data from BigQuery and save it to another table, keeping the rows with the most attributes? (tags: sql, google-bigquery)

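I uploaded 99,628 rows to Google BigQuery. The schema has suppose, company_name, phone, email, address, city, state, and so on. I want to keep only one row per company name: the one with the most attributes filled in. Suppose I have these rows:

Microsoft | 2355 |

Microsoft | 1234 | ms@example.com | seattle | XYZ | KC

Microsoft | 2355 | any@example.com

I want to keep the second row because it has the most attributes.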

I tried the query below, but it only returns distinct rows, not the ones with the most attributes:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
      OVER (PARTITION BY company_name)
      row_number
  FROM `local-bastion-154121.Property_Dataset.pmDATA`
)
WHERE row_number = 1

You can create a subquery that counts the number of filled columns in each row, and then order by that count:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY company_name ORDER BY columns_filled DESC)
          row_number
  FROM (
        SELECT *,
        -- Each IF yields 1 for a non-empty value; NULL or '' counts as 0.
        IF(suppose != "", 1, 0) + IF(company_name != "", 1, 0) + IF(phone != "", 1, 0) +
        IF(email != "", 1, 0) + IF(address != "", 1, 0) + IF(city != "", 1, 0) +
        IF(state != "", 1, 0) + <SAME FOR EACH FIELD> AS columns_filled
        FROM `local-bastion-154121.Property_Dataset.pmDATA`
   )
)
WHERE row_number = 1

That's all there is to it.

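I'm interpreting "highest attributes" as the row with the most non-null values for a given company name. You should be able to do something like this: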

CREATE TABLE dataset.new_table AS
SELECT
  company_name,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (company_name))
    -- TO_JSON_STRING emits compact JSON, so a null field appears as ':null' with no space.
    ORDER BY ARRAY_LENGTH(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':null'))
  )[OFFSET(0)].*
FROM dataset.existing_table AS t
GROUP BY company_name

With sample data:

WITH existing_table AS (
  SELECT 'Microsoft' AS company_name, 2355 AS x, NULL AS email, NULL AS city, NULL AS y, NULL AS z UNION ALL
  SELECT 'Microsoft', 1234, 'ms@example.com', 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, NULL, NULL, NULL, NULL
)
SELECT
  company_name,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (company_name))
    ORDER BY ARRAY_LENGTH(SPLIT(TO_JSON_STRING(t), ':null'))
  )[OFFSET(0)].*
FROM existing_table AS t
GROUP BY company_name

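The benefit of this trick, counting null values with SPLIT and TO_JSON_STRING, is that you don't need to write out the list of the other columns explicitly. It builds a struct of all columns except company_name and sorts ascending by the number of nulls in the row, which means you get the row with the most filled-in values for each company name.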
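
To see why counting `:null` in the JSON text works, here is a minimal Python sketch of the same idea outside BigQuery. It assumes that `json.dumps` with compact separators is a fair stand-in for `TO_JSON_STRING`; the helper names `null_count` and `best_row_per_company` are hypothetical and exist only for this illustration.

```python
import json

# Sample rows modelled on the answer's test data; None plays the role of SQL NULL.
rows = [
    {"company_name": "Microsoft", "x": 2355, "email": None, "city": None, "y": None, "z": None},
    {"company_name": "Microsoft", "x": 1234, "email": "ms@example.com", "city": "seattle", "y": "XYZ", "z": "KC"},
    {"company_name": "Microsoft", "x": 2355, "email": None, "city": None, "y": None, "z": None},
]


def null_count(row):
    """Count nulls by scanning the compact JSON text, as the SQL answer does."""
    # Compact separators give '"email":null' with no space after the colon.
    text = json.dumps(row, separators=(",", ":"))
    # Splitting on ':null' yields (matches + 1) pieces, so subtract 1.
    return len(text.split(":null")) - 1


def best_row_per_company(rows):
    """Keep, for each company_name, the row with the fewest nulls."""
    best = {}
    for row in rows:
        key = row["company_name"]
        if key not in best or null_count(row) < null_count(best[key]):
            best[key] = row
    return list(best.values())


result = best_row_per_company(rows)
print(result)
```

One caveat that applies to the sketch and, as far as I can tell, to the SQL as well: a string value that itself contains `:null` would be miscounted, which is the price of not listing the columns explicitly.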

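I would consider a slightly different interpretation of "highest attributes" by introducing a weight for each field. For example, I want email to matter more than city, so a single field can outweigh two others.

Below is BigQuery Standard SQL that tries this weighted approach: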

#standardSQL
WITH weights AS (
  SELECT 'phone' field, 4 weight UNION ALL
  SELECT 'email', 100 UNION ALL
  SELECT 'city', 2 UNION ALL
  SELECT 'address', 1 UNION ALL
  SELECT 'state', 7
)
SELECT
  ARRAY_AGG(r ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT 
    ANY_VALUE(t) r,
    SUM(weight) score
  FROM `local-bastion-154121.Property_Dataset.pmDATA` t
  CROSS JOIN weights w 
  WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', field, '":"?(.*?)"?[,}]')) != 'null'
  GROUP BY TO_JSON_STRING(t)
)
GROUP BY r.company_name    

You can test and play with it using the sample data from the question, as below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Microsoft' company_name, 2355 phone, NULL email, NULL city, NULL address, NULL state UNION ALL
  SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
), weights AS (
  SELECT 'phone' field, 4 weight UNION ALL
  SELECT 'email', 100 UNION ALL
  SELECT 'city', 2 UNION ALL
  SELECT 'address', 1 UNION ALL
  SELECT 'state', 7
)
SELECT
  ARRAY_AGG(r ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT 
    ANY_VALUE(t) r,
    SUM(weight) score
  FROM `project.dataset.table` t
  CROSS JOIN weights w 
  WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', field, '":"?(.*?)"?[,}]')) != 'null'
  GROUP BY TO_JSON_STRING(t)
)
GROUP BY r.company_name   

Result:

Row company_name    phone   email           city    address state    
1   Microsoft       2355    any@example.com null    null    null      
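
As you can see here, the winner has fewer filled attributes than one of the other rows, but it wins because its attributes are more valuable.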

You can see the scores with the query below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Microsoft' company_name, 2355 phone, NULL email, NULL city, NULL address, NULL state UNION ALL
  SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
), weights AS (
  SELECT 'phone' field, 4 weight UNION ALL
  SELECT 'email', 100 UNION ALL
  SELECT 'city', 2 UNION ALL
  SELECT 'address', 1 UNION ALL
  SELECT 'state', 7
)
SELECT 
  ANY_VALUE(t).*,
  SUM(weight) score
FROM `project.dataset.table` t
CROSS JOIN weights w 
WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', field, '":"?(.*?)"?[,}]')) != 'null'
GROUP BY TO_JSON_STRING(t)
ORDER BY score DESC

So the scores are:

Row company_name    phone   email           city    address state   score   
1   Microsoft       2355    any@example.com null    null    null    104  
2   Microsoft       1234    null            seattle XYZ     KC      14   
3   Microsoft       2355    null            null    null    null    4    
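
Each score is simply the sum of the weights of the row's non-null fields. A small Python sketch, assuming the same weights and sample rows, reproduces the arithmetic; the `score` helper is hypothetical and mirrors only the scoring, not the regex-based SQL.

```python
# Weights copied from the `weights` CTE in the answer.
weights = {"phone": 4, "email": 100, "city": 2, "address": 1, "state": 7}

# Sample rows from the answer; None stands in for SQL NULL.
rows = [
    {"company_name": "Microsoft", "phone": 2355, "email": None, "city": None, "address": None, "state": None},
    {"company_name": "Microsoft", "phone": 1234, "email": None, "city": "seattle", "address": "XYZ", "state": "KC"},
    {"company_name": "Microsoft", "phone": 2355, "email": "any@example.com", "city": None, "address": None, "state": None},
]


def score(row):
    """Sum the weights of every weighted field that is not null."""
    return sum(w for field, w in weights.items() if row[field] is not None)


# Highest score first, matching ORDER BY score DESC in the SQL.
for row in sorted(rows, key=score, reverse=True):
    print(row["phone"], row["email"], score(row))
```

The email-only row scores 4 + 100 = 104 and beats the row with city, address and state filled in, which scores 4 + 2 + 1 + 7 = 14. One difference worth checking on real data: in the SQL, a row whose weighted fields are all NULL appears to match no weight in the WHERE clause, so it would be dropped from the result rather than scored 0.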

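Thank you, this is great and a big help. I wanted to weight the attributes, but I never thought I could query it this way too.

When you have enough reputation here, come back and vote up the answer.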