Python: skipping NaN values to get the distance


Part of my dataset (my actual dataset has shape (106, 1800)):

df=

Based on Tom's answer, here is what I can do so far:

  • I manually wrote rows 1-2 as the p and q values:
p=

q=

Then:

Then:

It worked. But how can I apply this to the whole dataset instead of picking p and q row by row?


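In the end, I need a 106 by 106 symmetric matrix with 0 on the diagonal.

I think the only change you need to make is in the frdist function: first remove the nan values from p. That means dropping the condition that p and q have the same length, but I think that should be fine, since you said yourself that p has 1 value and q has 1800 values.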

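import numpy as np

def frdist(p, q):
    # Discrete Fréchet distance between curves p and q; relies on the
    # recursive helper _c, which must be defined alongside this function.

    # Remove nan values from p
    p = np.array([i for i in p if np.any(np.isfinite(i))], np.float64)
    q = np.array(q, np.float64)

    len_p = len(p)
    len_q = len(q)

    if len_p == 0 or len_q == 0:
        raise ValueError('Input curves are empty.')

    # p and q no longer have to be the same length,
    # but their points must have the same number of coordinates
    if len(p[0]) != len(q[0]):
        raise ValueError('Input curves do not have the same dimensions.')

    # Memo table for _c; -1 marks entries not yet computed
    ca = (np.ones((len_p, len_q), dtype=np.float64) * -1)

    dist = _c(ca, len_p-1, len_q-1, p, q)
    return dist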

This then gives:

frdist(p, q)
1.9087938076177846

Removing NaN values

Simple and straightforward:

p = p[~np.isnan(p)]
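
As a small illustration of the masking step, the sketch below applies it to a single row and then regroups the remaining values into (x, y) points. The variable names `row`, `clean` and `points` are made up for this example, and it assumes NaN appears only as trailing padding in whole (x, y) pairs; with an odd number of remaining values, `reshape(-1, 2)` would raise an error.

```python
import numpy as np

# Hypothetical row: three (x, y) points followed by NaN padding.
row = np.array([35.0281, 7.7525, 45.0322, 3.7465,
                14.0369, 3.7463, np.nan, np.nan])

# The boolean mask keeps only the non-NaN entries (result is a shorter 1-D array).
clean = row[~np.isnan(row)]

# Regroup the flat values into one (x, y) point per row.
points = clean.reshape(-1, 2)
print(points.shape)
```

Note that the mask flattens its result, so the reshape is what restores the point structure.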

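Calculating the Fréchet distance for the whole dataset

The easiest way is to use the pairwise distance calculation pdist from SciPy. It takes an m observations by n dimensions array, so we need to reshape our row arrays with reshape(-1,2) inside frdist. pdist returns the condensed (upper triangular) distance matrix. We use squareform to get the m x m symmetric matrix with 0 diagonal, as requested.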

import pandas as pd
import numpy as np
import io
from scipy.spatial.distance import pdist, squareform

data = """    1           1.1     2           2.1     3           3.1     4           4.1     5           5.1
0   43.1024     6.7498  NaN         NaN     NaN         NaN     NaN         NaN     NaN         NaN
1   46.0595     1.6829  25.0695     3.7463  NaN         NaN     NaN         NaN     NaN         NaN
2   25.0695     5.5454  44.9727     8.6660  41.9726     2.6666  84.9566     3.8484  44.9566     1.8484
3   35.0281     7.7525  45.0322     3.7465  14.0369     3.7463  NaN         NaN     NaN         NaN
4   35.0292     7.5616  45.0292     4.5616  23.0292     3.5616  45.0292     6.7463  NaN         NaN
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

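def _c(ca, i, j, p, q):
    # Recursive helper: fills the memo table ca, where ca[i, j] is the
    # discrete Fréchet distance between p[:i+1] and q[:j+1] (-1 = not computed)
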
    if ca[i, j] > -1:
        return ca[i, j]
    elif i == 0 and j == 0:
        ca[i, j] = np.linalg.norm(p[i]-q[j])
    elif i > 0 and j == 0:
        ca[i, j] = max(_c(ca, i-1, 0, p, q), np.linalg.norm(p[i]-q[j]))
    elif i == 0 and j > 0:
        ca[i, j] = max(_c(ca, 0, j-1, p, q), np.linalg.norm(p[i]-q[j]))
    elif i > 0 and j > 0:
        ca[i, j] = max(
            min(
                _c(ca, i-1, j, p, q),
                _c(ca, i-1, j-1, p, q),
                _c(ca, i, j-1, p, q)
            ),
            np.linalg.norm(p[i]-q[j])
            )
    else:
        ca[i, j] = float('inf')

    return ca[i, j]

def frdist(p, q):

    # Remove nan values and reshape into two column array
    p = p[~np.isnan(p)].reshape(-1,2)
    q = q[~np.isnan(q)].reshape(-1,2)

    len_p = len(p)
    len_q = len(q)

    if len_p == 0 or len_q == 0:
        raise ValueError('Input curves are empty.')

    # p and q will no longer be the same length
    if len(p[0]) != len(q[0]):
        raise ValueError('Input curves do not have the same dimensions.')

    ca = (np.ones((len_p, len_q), dtype=np.float64) * -1)

    dist = _c(ca, len_p-1, len_q-1, p, q)
    return dist

print(squareform(pdist(df.values, frdist)))
Result:

[[ 0.         18.28131545 41.95464432 29.22027212 20.32481187]
 [18.28131545  0.         38.9573328  12.59094238 20.18389517]
 [41.95464432 38.9573328   0.         39.92453004 39.93376923]
 [29.22027212 12.59094238 39.92453004  0.         31.13715882]
 [20.32481187 20.18389517 39.93376923 31.13715882  0.        ]]
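
To make the relationship between the condensed output of pdist and the square matrix from squareform easier to see, here is a minimal sketch on a tiny made-up array. The array `X` and the metric `euclid` are assumptions for illustration only; any callable that takes two 1-D rows and returns a number can be passed the same way, as frdist is above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical observations (rows) with two features each.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

def euclid(u, v):
    # pdist calls this with two 1-D rows of X at a time.
    return np.linalg.norm(u - v)

# One distance per unordered pair of rows: (0,1), (0,2), (1,2).
condensed = pdist(X, euclid)

# Expand to a 3 x 3 symmetric matrix with zeros on the diagonal.
square = squareform(condensed)
print(condensed)
print(square)
```

Because the metric is only evaluated once per pair, a custom distance does not need to handle the diagonal itself.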

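No need to reinvent the wheel

The Fréchet distance calculation is already provided by the similaritymeasures package. So the following gives the same result as above: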

from scipy.spatial.distance import pdist, squareform
import similaritymeasures

def frechet(p, q):
    p = p[~np.isnan(p)].reshape(-1,2)
    q = q[~np.isnan(q)].reshape(-1,2)
    return similaritymeasures.frechet_dist(p,q)

print(squareform(pdist(df.values, frechet))) 

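You could drop the NaN values from p and also drop the corresponding values from q. For example,

@Poolka That's not possible, because the minimum is 1 and the maximum is 1500.

I don't understand your "because" part. What prevents you from simply dropping all NaNs from p? Say you have 100 values with 2 NaNs -> drop NaNs -> you have 98 values and can perform the calculation.

@Poolka Sorry, my fault. That is not the real dataset. In the real dataset, p has 1 value and q has 1800 values.

It looks like you deleted most of the question; I can only see 2 lines without any code.

Hi, one thing: `NameError: name 'squareform' is not defined`

My fault. I imported it but forgot to run it. Thanks.

Also, both your code and mine work on small data. On my real data it gives me `RecursionError: maximum recursion depth exceeded in comparison`. I think I'll open a new question, but maybe you can give me some advice on how to avoid it.

Does the RecursionError also occur with similaritymeasures.frechet_dist?

Yes. I could split the data, but I'm looking for a better solution.

You could try increasing the recursion limit, e.g. sys.setrecursionlimit(1500)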
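
Another way to sidestep the RecursionError discussed in these comments, instead of raising the limit, is to fill the same table bottom-up with loops. The function below is only a sketch under that assumption: `frdist_iterative` is a hypothetical name, not part of the original answer or of similaritymeasures, and it assumes the same row layout as frdist (flat x, y pairs padded with NaN).

```python
import numpy as np

def frdist_iterative(p, q):
    """Discrete Fréchet distance via a bottom-up table (no recursion)."""
    # Same preprocessing as frdist: drop NaN padding, regroup into (x, y) points.
    p = p[~np.isnan(p)].reshape(-1, 2)
    q = q[~np.isnan(q)].reshape(-1, 2)
    if len(p) == 0 or len(q) == 0:
        raise ValueError('Input curves are empty.')

    # d[i, j] is the Euclidean distance between p[i] and q[j].
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)

    # ca[i, j] plays the same role as the memo table in _c.
    ca = np.empty_like(d)
    ca[0, 0] = d[0, 0]
    for i in range(1, len(p)):          # first column
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, len(q)):          # first row
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, len(p)):          # interior cells
        for j in range(1, len(q)):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d[i, j])
    return ca[-1, -1]

# Rows 0 and 1 of the sample data from the answer.
nan = np.nan
p = np.array([43.1024, 6.7498, nan, nan, nan, nan, nan, nan, nan, nan])
q = np.array([46.0595, 1.6829, 25.0695, 3.7463, nan, nan, nan, nan, nan, nan])
print(frdist_iterative(p, q))
```

On these two rows it should reproduce the 18.2813 entry of the matrix in the answer, and it can be passed to pdist in place of frdist. The double loop is still pure Python, so very long curves will be slow, but the recursion depth no longer grows with the curve length.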