Google BigQuery: querying historical weather data

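I'm trying to get weather data for the 7 days before a given date, near the coordinates lat, lon, within a radius of about 20 km. If there are multiple stations, I would probably want the data averaged and grouped by day.

Is there a way to compute all of this directly in BigQuery? For testing, I computed the minimum and maximum coordinates and created the following query: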

SELECT
  *
FROM
  [bigquery-public-data:noaa_gsod.gsod2016] a
JOIN
  [bigquery-public-data:noaa_gsod.stations] b
ON
  a.stn=b.usaf
  AND a.wban=b.wban
WHERE
  (b.lat >= 46.248332
    AND b.lat <= 47.147654)
  AND (b.lon >= 5.689853
    AND b.lon <= 7.001115)
  AND a.mo='03'

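Based on the information you've provided, I'm not sure whether you can compute the max/min values inside the query. Working in legacy SQL, I would probably try nesting multiple queries, or joining to a query that computes them, or both.

You could also write something that adjusts the search query when needed, but I don't understand the structure of what you're doing well enough to write a suggestion.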
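
As one way to adjust the search query from outside BigQuery, the minimum and maximum coordinates can be derived from a center point and a radius before the query text is built. The following is a minimal Python sketch, not part of the original answer: the function name bounding_box and the sample center are hypothetical, and the longitude span uses a small-radius approximation that breaks down near the poles and across the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius in km (assumed spherical Earth)

def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) for a box around a circle.

    Hypothetical helper: approximates the box that encloses a circle of
    radius_km centered on (lat, lon), both given in degrees.
    """
    # One radian of latitude spans EARTH_RADIUS_KM kilometers.
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    # Meridians converge toward the poles, so a degree of longitude
    # covers less ground at higher latitudes.
    dlon = math.degrees(radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

# Example with a made-up center point and the 20 km radius from the question.
min_lat, max_lat, min_lon, max_lon = bounding_box(46.7, 6.35, 20)
print(min_lat, max_lat, min_lon, max_lon)
```

The four values can then be substituted into the b.lat and b.lon comparisons of the query. A box is only a coarse filter: its corners are farther from the center than the radius, so an exact distance check is still needed if that matters.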

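As for your other questions:

Getting averages: instead of using * to select everything, you have to name the columns individually, both the ones you want averaged and the ones you want to group by or leave out.

Selecting the 7 days before a specific date: unfortunately there is no timestamp column, so you have to build one yourself from the year, mo, and da columns.

In legacy SQL, I would write it like this: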

SELECT dte, avg_temp, avg_cnt_temp
FROM
  (SELECT
     /* Combine the separate year, month, and day strings into a
        timestamp so that DATE_ADD can be used in the outer query */
     CAST(CONCAT(a.year, '-', a.mo, '-', a.da) AS timestamp) AS dte,
     /* Include every column you want averaged here;
        I only tested with these two */
     AVG(a.temp) AS avg_temp,
     AVG(a.count_temp) AS avg_cnt_temp
   FROM [bigquery-public-data:noaa_gsod.gsod2016] AS a
   JOIN [bigquery-public-data:noaa_gsod.stations] AS b
   ON a.stn=b.usaf AND a.wban=b.wban
   GROUP BY dte, mo, da)
WHERE dte >= (DATE_ADD('2016-12-31 00:00:00', -7, "DAY"))
  AND dte <= TIMESTAMP('2016-12-31 00:00:00') /* replace with your date */
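
I think nesting works somewhat differently in standard SQL.

If you want to combine data across stations, do not select the station identifier.

This is the best solution I could come up with: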

#standardSQL
CREATE TEMP FUNCTION distance(lat1 FLOAT64, lat2 FLOAT64, lon1 FLOAT64, lon2 FLOAT64) AS((
WITH data AS(
SELECT POW(SIN((ACOS(-1) / 180 * (lat1 -lat2)) / 2), 2) + COS(ACOS(-1) / 180 * (lat1)) * COS(ACOS(-1) / 180 * (lat2)) * POW(SIN((ACOS(-1) / 180 * (lon1 -lon2)) / 2), 2) a
)
SELECT 6371 * 2 * ATAN2(SQRT((SELECT a FROM data)), SQRT(1 - (SELECT a FROM data)))
));

WITH temperature_data AS(
SELECT
  CONCAT(year, mo, da) date,
  temp,
  b.lat lat,
  b.lon lon
FROM `bigquery-public-data.noaa_gsod.gsod2016` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.stn = b.usaf AND a.wban = b.wban
WHERE concat(year, mo, da) BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', '20160725'), INTERVAL 7 DAY)) AND '20160725'
)

SELECT
  date,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 20, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 20, temp, NULL)) AS std_temp) data_20km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 50, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 50, temp, NULL)) AS std_temp) data_50km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 100, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 100, temp, NULL)) AS std_temp) data_100km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 200, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 200, temp, NULL)) AS std_temp) data_200km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 500, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 500, temp, NULL)) AS std_temp) data_500km
FROM temperature_data t
WHERE
distance(t.lat, 10.1, t.lon, 10.2) < 2000
GROUP BY date
ORDER BY date
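
The WHERE clause inside temperature_data is where the last 7 days before the given date are selected. Just change the value '20160725' to choose the date you want to analyze.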
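
To see which date strings the BETWEEN filter compares against, the same arithmetic can be reproduced outside BigQuery. This is a small Python sketch under the assumption that the dates are zero-padded YYYYMMDD strings, as in the query; the function name date_window is hypothetical.

```python
from datetime import datetime, timedelta

def date_window(end_yyyymmdd, days=7):
    """Return the (start, end) strings that bound the date filter.

    Mirrors DATE_SUB(PARSE_DATE('%Y%m%d', ...), INTERVAL 7 DAY) followed by
    FORMAT_DATE('%Y%m%d', ...). Hypothetical helper for illustration.
    """
    end = datetime.strptime(end_yyyymmdd, "%Y%m%d").date()
    start = end - timedelta(days=days)
    return start.strftime("%Y%m%d"), end.strftime("%Y%m%d")

print(date_window("20160725"))  # ('20160718', '20160725')
```

Because BETWEEN includes both endpoints, this window covers 8 calendar days (the given date plus the 7 before it). Comparing the values as strings works only because zero-padded YYYYMMDD strings sort in the same order as the dates they represent.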

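Can the maximum and minimum lat/lon be computed directly in the query?

Yes. I take it you are asking whether it is possible to select points within a given range, say 20 km. One approach is to define a temporary function that computes the distance between the point of interest and each station, which is what this part of the query does: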

CREATE TEMP FUNCTION distance(lat1 FLOAT64, lat2 FLOAT64, lon1 FLOAT64, lon2 FLOAT64) AS((
WITH data AS(
SELECT POW(SIN((ACOS(-1) / 180 * (lat1 -lat2)) / 2), 2) + COS(ACOS(-1) / 180 * (lat1)) * COS(ACOS(-1) / 180 * (lat2)) * POW(SIN((ACOS(-1) / 180 * (lon1 -lon2)) / 2), 2) a
)
SELECT 6371 * 2 * ATAN2(SQRT((SELECT a FROM data)), SQRT(1 - (SELECT a FROM data)))
));

You can try out and test this function, for example:

SELECT distance(50, 60, 30, 10) # result is ~ 1680km

The function is used here:

WHERE
distance(t.lat, 10.1, t.lon, 10.2) < 2000

So this filters out every station that is more than 2000 km from the point (10.1, 10.2). You can change this value as needed.
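
To sanity-check the distance function without running a query, the same haversine formula can be written in a few lines of Python. This is a sketch rather than part of the original answer; it deliberately keeps the unusual argument order (lat1, lat2, lon1, lon2) and the Earth radius of 6371 km used in the temporary function.

```python
import math

def distance(lat1, lat2, lon1, lon2):
    """Great-circle distance in km between two points given in degrees.

    Argument order matches the temporary SQL function:
    both latitudes first, then both longitudes.
    """
    to_rad = math.pi / 180  # the SQL version writes this as ACOS(-1) / 180
    a = (math.sin(to_rad * (lat1 - lat2) / 2) ** 2
         + math.cos(to_rad * lat1) * math.cos(to_rad * lat2)
         * math.sin(to_rad * (lon1 - lon2) / 2) ** 2)
    return 6371 * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Same test call as in the answer; it should land close to 1680 km.
print(round(distance(50, 60, 30, 10)))
```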

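One last point: I also included STDDEV_SAMP, which is the sample standard deviation. It may be of some value to you, because it shows how widely the values are spread around the mean, with a correction for the sample size. A mean on its own is not worth much if we do not know how close it is likely to be to the true value.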
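
As a small illustration of what the avg_temp and std_temp pair conveys, here is a Python sketch using the standard statistics module. The temperature readings are made up for the example, and the function name summarize is hypothetical; statistics.stdev is the sample standard deviation (it divides by n - 1), which corresponds to STDDEV_SAMP.

```python
import statistics

def summarize(temps):
    """Return (mean, sample standard deviation) for a list of readings."""
    return statistics.mean(temps), statistics.stdev(temps)

# Hypothetical readings from three stations on the same day.
avg_temp, std_temp = summarize([50.0, 52.0, 54.0])
print(avg_temp, std_temp)  # 52.0 2.0
```

A sample standard deviation needs at least two readings: statistics.stdev raises StatisticsError when given only one, so a day covered by a single station has a mean but no meaningful spread.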

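Can I get better, free historical weather data?

I don't know. Hopefully this public dataset is good enough for you.

Please provide some data to clarify your question; I don't fully understand it.

Which part is unclear? If you run the query above, you get data from multiple stations. They should be averaged and grouped by day and month. Also, the query only shows data for a given month, whereas I would rather have the last week before a given date.

In any case, this is not a direct answer, but this answer may be useful.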