An efficient way to get the number of unique intersecting intervals in Python

python algorithm numpy pandas

I have a DataFrame of intervals:

start end
1     10
3     7
8     10

I need to find, for each value in another DataFrame, the number of intervals it intersects:

value
2
5
9

The result should be

1
2
2

The second part of the problem is trickier. My DataFrame of intervals also contains a type column:

start end type
1     10  1
3     7   1
8     10  2

I need to know how many unique (by type) intervals are intersected. The result should be:

1
1
2

I think the first part can be done with numpy.searchsorted, but what about the second part?
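For reference, here is one way the first part might be done with numpy.searchsorted. It is only a sketch: it assumes closed intervals [start, end] with start <= end, and the names starts, ends, started and finished are illustrative. The idea is that the number of intervals containing a value equals the number of intervals that have started at or before it minus the number that ended strictly before it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10]})
values = np.array([2, 5, 9])

# Sort starts and ends independently; their pairing does not matter for counting.
starts = np.sort(df['start'].to_numpy())
ends = np.sort(df['end'].to_numpy())

# Number of intervals with start <= v.
started = np.searchsorted(starts, values, side='right')
# Number of intervals with end < v (already closed before v).
finished = np.searchsorted(ends, values, side='left')

counts = started - finished
print(counts)  # [1 2 2]
```

This costs two sorts plus a binary search per value, but it cannot distinguish types, which is why the second part needs a different approach.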

Let's call your first DataFrame df. For a given value, the intersecting intervals can be found as follows:

mask = (df['start'] <= value) & (df['end'] >= value)

The following returns the number of intersected types:

len(df['type'][mask].unique())

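Now you can apply a lambda function to the Series of values (note that name must be a hashable value such as a string, not a list):

values = pd.Series([2, 5, 9], name='value')
values.apply(lambda value: len(df['type'][(df['start'] <= value) & (df['end'] >= value)].unique()))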

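The snippets above can be assembled into a self-contained script. This is a minimal sketch of the mask-based approach; the helper names count_intervals and count_types are made up for illustration, and the intervals are treated as closed on both ends.

```python
import pandas as pd

df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})
values = pd.Series([2, 5, 9], name='value')

def count_intervals(value):
    # Part one: how many intervals contain the value.
    mask = (df['start'] <= value) & (df['end'] >= value)
    return int(mask.sum())

def count_types(value):
    # Part two: how many distinct types those intervals have.
    mask = (df['start'] <= value) & (df['end'] >= value)
    return df.loc[mask, 'type'].nunique()

n_intervals = values.apply(count_intervals)
n_types = values.apply(count_types)
print(n_intervals.tolist())  # [1, 2, 2]
print(n_types.tolist())      # [1, 1, 2]
```

Each value scans every interval, so the cost grows with len(values) * len(df); that is fine for small inputs but is the efficiency concern raised in the question.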

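DSM shows a pattern for this using pandas. Following that pattern, we can combine the start and end values into a single column of idx values, with a second column (change) that equals 1 when idx corresponds to a start and -1 when idx corresponds to an end: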

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
#    type  change  idx
# 3     1       1    1
# 4     1       1    3
# 1     1      -1    7
# 5     2       1    8
# 0     1      -1   10
# 2     2      -1   10

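Now, since we want to keep track of the type of each interval, we can use event.pivot to place each type in its own column. Taking the cumsum then counts the number of intervals covering each idx: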

event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
# type  1  2
# idx       
# 1     1  0
# 3     2  0
# 7     1  0
# 8     1  1
# 10    0  0
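One caveat worth noting: DataFrame.pivot raises a ValueError when the same (idx, type) pair occurs more than once, for example when two intervals of the same type start at the same point. A possible workaround, sketched here with made-up data, is pivot_table with aggfunc='sum', which adds the coinciding +1/-1 changes together before the cumulative sum.

```python
import pandas as pd

# Hypothetical data: two type-1 intervals both start at 1.
df = pd.DataFrame({'start': [1, 1, 8], 'end': [10, 5, 10], 'type': [1, 1, 2]})

event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start': 1, 'end': -1})

# pivot_table sums duplicate (idx, type) entries instead of raising.
counts = event.pivot_table(
    index='idx', columns='type', values='change',
    aggfunc='sum', fill_value=0).cumsum(axis=0)
print(counts)
```

Here the type-1 column starts at 2 for idx 1, because both overlapping intervals are open at that point.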

For each type, we only care whether a value is covered, not how many times it is covered. So let's compute event > 0 to find the covered values:

event = event > 0
# type      1      2
# idx               
# 1      True  False
# 3      True  False
# 7      True  False
# 8      True   True
# 10    False  False

Now we can use searchsorted to find the desired result:

other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
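To see which rows the lookup selects, the small sketch below reproduces only the index arithmetic, using the idx values from the example. With the default side='left', searchsorted returns the position of the first index entry that is >= the value, so subtracting 1 gives the last entry strictly below it.

```python
import numpy as np
import pandas as pd

index = pd.Index([1, 3, 7, 8, 10], name='idx')

# Interior values: each maps to the nearest idx strictly below it.
pos = index.searchsorted(np.array([2, 5, 9]))
rows = pos - 1
print(pos.tolist())          # [1, 2, 4]
print(index[rows].tolist())  # [1, 3, 8]

# Values that sit exactly on an idx map to the previous row, and a value
# at or below the smallest idx gives -1, which iloc reads as the last row.
boundary = index.searchsorted(np.array([1, 8, 10])) - 1
print(boundary.tolist())     # [-1, 2, 3]
```

The wrap-around to the last row is harmless here because every interval has ended by the largest idx, so that row is all False.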

Putting it all together:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})

event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)

yields

   value  result
0      2       1
1      5       1
2      9       2

To check the correctness of the calculation, let's look at

other = pd.DataFrame({'value': np.arange(13)})

Then rerunning the lookup yields

    value  result
0       0       0
1       1       0   <-- The half-open interval (1, 10] does not include 1
2       2       1
3       3       1
4       4       1
5       5       1
6       6       1
7       7       1
8       8       1   <-- The half-open interval (8, 10] does not include 8
9       9       2
10     10       2
11     11       0
12     12       0
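The same check can be automated by comparing the event-based lookup against a brute-force count. This is a sketch under the half-open (start, end] convention shown in the table; the names fast, slow and brute are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})
values = np.arange(13)

# Event-based lookup, following the steps of the answer.
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start': 1, 'end': -1})
event = event.pivot(index='idx', columns='type', values='change')
event = event.fillna(0).cumsum(axis=0) > 0
idx = event.index.searchsorted(values) - 1
fast = event.iloc[idx].sum(axis=1).to_numpy()

def brute(v):
    # Brute force with the same half-open (start, end] convention.
    mask = (df['start'] < v) & (df['end'] >= v)
    return df.loc[mask, 'type'].nunique()

slow = np.array([brute(v) for v in values])
print(fast.tolist())
print((fast == slow).all())
```

If the two arrays ever disagree on other data, the first thing to check is whether endpoints are meant to be included.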

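In the second row the type is 1. If the value in the other DataFrame is 5, then both [1,10] and [3,7] are intersected. So shouldn't the result (i.e. the number of unique intervals intersected) be [1,2,2] rather than [1,1,2]?

If you do the first part and show us your code, it will be easier to help you with the second part.

@unutbu [1,10] and [3,7] both belong to the same type (type 1), so the value 5 intersects only one type of interval. It is not about unique intervals, but unique types.

I'm concerned about the efficiency of the code. Can it be done in a 100% broadcasting manner?

I knew we needed unutbu to the rescue :)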
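On the broadcasting question, one possible NumPy-only formulation is sketched below. It is not taken from the answers above: it assumes closed intervals, and it builds a len(values) x len(df) boolean matrix, so memory use grows with the product of the two sizes. The names hits, onehot and uniq are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})
values = np.array([2, 5, 9])

start = df['start'].to_numpy()
end = df['end'].to_numpy()
types = df['type'].to_numpy()

# hits[i, j] is True when values[i] lies in the closed interval j.
hits = (start[None, :] <= values[:, None]) & (values[:, None] <= end[None, :])

# onehot[j, k] is True when interval j has the k-th unique type.
uniq = np.unique(types)
onehot = types[:, None] == uniq[None, :]

# (hits @ onehot)[i, k] counts the type-k intervals hit by values[i];
# counting the nonzero entries per row gives the number of distinct types.
result = ((hits.astype(int) @ onehot.astype(int)) > 0).sum(axis=1)
print(result)  # [1 1 2]
```

For large inputs the event-based approach is likely to scale better, since it avoids materialising the full value-by-interval matrix.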

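In the searchsorted lookup, passing side='right' treats the intervals as half-open on the right, [start, end), instead of (start, end]:

idx = event.index.searchsorted(other['value'], side='right')-1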