
Python: computing correlation in xarray with missing data

Tags: python, numpy, correlation, python-xarray

I am trying to compute the correlation between two datasets in xarray along the time dimension. Both of my datasets are lat x lon x time. One of them is missing enough data that interpolating to close the gaps is not reasonable; instead, I would simply like to ignore the missing values. I have some simple code that works, but none of it fits my exact use case. For example:

def covariance(x, y, dims=None):
    # Mean of the product of anomalies; count() gives the sample size.
    return xr.dot(x - x.mean(dims), y - y.mean(dims), dims=dims) / x.count(dims)

def correlation(x, y, dims=None):
    # Normalise the covariance by the two standard deviations.
    return covariance(x, y, dims) / (x.std(dims) * y.std(dims))
This works fine if no data are missing, but of course it breaks on NaNs. Although there is a good example of handling NaNs, even with that code I am still struggling to compute a Pearson correlation rather than a Spearman correlation (a NaN-aware Pearson adaptation is sketched after this example):

import numpy as np
import xarray as xr
import bottleneck

def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])
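
For reference, here is a minimal sketch of how the same apply_ufunc pattern might be adapted to a NaN-aware Pearson correlation. The nan_pearson_* names are made up for illustration; the idea is simply to mask each pair so both series are reduced over the same valid samples:

import numpy as np
import xarray as xr

def nan_pearson_gufunc(x, y):
    # Keep only positions where both series are valid, so the mean, std and
    # covariance are all computed over the same paired sample.
    valid = ~np.isnan(x) & ~np.isnan(y)
    x = np.where(valid, x, np.nan)
    y = np.where(valid, y, np.nan)
    x_anom = x - np.nanmean(x, axis=-1, keepdims=True)
    y_anom = y - np.nanmean(y, axis=-1, keepdims=True)
    cov = np.nanmean(x_anom * y_anom, axis=-1)
    return cov / (np.nanstd(x, axis=-1) * np.nanstd(y, axis=-1))

def nan_pearson_correlation(x, y, dim):
    return xr.apply_ufunc(
        nan_pearson_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])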

Finally, there is a proposal to add this as a feature to xarray, but it has not been implemented yet. Is there an efficient way to do this on datasets with gaps in the data?

I have been following the GitHub discussion as well as the subsequent attempts to implement a .corr() method, and it looks like we have gotten pretty close, but it is still not there yet.

In the meantime, the basic code that most people are attempting to merge is outlined pretty well in another answer. It is a good solution that leverages vectorized operations in NumPy and, with some small tweaking (see the accepted answer at the link), can be made to account for NaNs along the time axis:

import numpy as np
import xarray as xr
from scipy.stats import t

def lag_linregress_3D(x, y, lagx=0, lagy=0):
    """
    Input: Two xr.DataArrays of any dimensions with the first dim being time.
    Thus the input data could be a 1D time series, or for example, have three
    dimensions (time, lat, lon).
    Datasets can be provided in any order, but note that the regression slope
    and intercept will be calculated for y with respect to x.
    Output: Covariance, correlation, regression slope and intercept, p-value,
    and standard error on regression between the two datasets along their
    aligned time dimension.
    Lag values can be assigned to either of the data, with lagx shifting x and
    lagy shifting y by the specified lag amount.
    """
    # 1. Ensure that the data are properly aligned to each other.
    x, y = xr.align(x, y)

    # 2. Add lag information if any, and shift the data accordingly.
    if lagx != 0:
        # If x lags y by 1, x must be shifted 1 step backwards.
        # But as the 'zero-th' value is nonexistent, xr assigns it as invalid
        # (nan). Hence it needs to be dropped.
        x = x.shift(time=-lagx).dropna(dim='time')

        # Next important step is to re-align the two datasets so that y
        # adjusts to the changed coordinates of x.
        x, y = xr.align(x, y)

    if lagy != 0:
        y = y.shift(time=-lagy).dropna(dim='time')
        x, y = xr.align(x, y)

    # 3. Compute data length, mean and standard deviation along time axis:
    n = y.notnull().sum(dim='time')
    xmean = x.mean(axis=0)
    ymean = y.mean(axis=0)
    xstd = x.std(axis=0)
    ystd = y.std(axis=0)

    # 4. Compute covariance along time axis.
    cov = np.sum((x - xmean) * (y - ymean), axis=0) / n

    # 5. Compute correlation along time axis.
    cor = cov / (xstd * ystd)

    # 6. Compute regression slope and intercept:
    slope = cov / (xstd**2)
    intercept = ymean - xmean * slope

    # 7. Compute p-value and standard error.
    # Compute t-statistics.
    tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor**2)
    stderr = slope / tstats

    pval = t.sf(tstats, n - 2) * 2
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

    return cov, cor, slope, intercept, pval, stderr
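
As a rough illustration of how this function might be called, here is a small synthetic lat x lon x time example (the data and coordinate values below are made up purely for demonstration):

import numpy as np
import pandas as pd
import xarray as xr

# Two small correlated fields with time as the first dimension.
time = pd.date_range("2000-01-01", periods=120, freq="MS")
lat = np.arange(-30, 31, 10)
lon = np.arange(0, 360, 60)
rng = np.random.default_rng(0)

x = xr.DataArray(
    rng.standard_normal((len(time), len(lat), len(lon))),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"))
y = x + 0.5 * xr.DataArray(
    rng.standard_normal(x.shape), coords=x.coords, dims=x.dims)

# Punch gaps into y (drop every July) to mimic missing observations.
y = y.where(y["time"].dt.month != 7)

cov, cor, slope, intercept, pval, stderr = lag_linregress_3D(x, y)
print(cor.shape)  # one correlation value per (lat, lon) grid point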
Hope this helps! Fingers crossed that the merge comes soon.

The solution is in the GitHub thread: the PR for xr.cov and xr.corr has been merged and should land in the upcoming xarray v0.16.0 release. xr.corr and xr.cov are on the way! In the meantime:
def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)

    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)
    
    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
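
A quick usage sketch, assuming x and y are the lat x lon x time DataArrays from the question (the variable names are just placeholders):

# NaN-aware correlation along time using the functions above.
r = correlation(x, y, dim='time')

# Once xarray v0.16.0 is out, the built-in equivalent should simply be:
# r = xr.corr(x, y, dim='time')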