Python pandas: iterate over rows and compare with previous values, faster

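I am trying to get my result faster (it currently takes 13 minutes for 800 rows). I asked a similar question before, but I could not adapt the good solutions from there to my variant. The difference is that if the number of previous values in 'col2' that overlap the current row reaches n = 3, the value of 'col2' in that row is set to 0, and this affects the counts for all following rows.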

import pandas as pd

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

df["overlap_count"] = ""  # create new column
n = 3  # if x >= n, then col2 is set to 0

for row in range(len(df)):
    # count earlier col2 values that are greater than this row's col1
    x = (df["col2"].loc[0:row-1] > df["col1"].loc[row]).sum()
    # df.loc[row, col] avoids chained assignment, which may not write back to df
    df.loc[row, "overlap_count"] = x

    if x >= n:
        df.loc[row, "col2"] = 0
        df.loc[row, "overlap_count"] = 'x'
df
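I get the following result: the values in col2 are replaced with 0 wherever the count reaches n, and the overlap_count column is filled in: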

   col1 col2 overlap_count
0   20  39  0
1   23  32  1
2   40  42  0
3   41  50  1
4   46  63  1
5   47  67  2
6   48  0   x
7   49  0   x
8   50  68  2
9   50  0   x
10  52  0   x
11  55  0   x
12  56  0   x
13  69  71  0
14  70  66  1

Thanks for your help and time.

Create a function and then apply it as follows:


df['overlap_count'] = [fn(i) for i in df['overlap_count']]

Try this, it might be faster:

df['overlap_count'] = df.groupby('col1')['col2'].transform(lambda g: len((g >= g.name).index))

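I think you can use numba to improve performance. It only works with numeric values, so 'x' is stored as -1, and the new column is filled with 0 instead of an empty string: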

df["overlap_count"] = 0  #create new column
n = 3 #if x >= n, then value = 0

a = df[['col1','col2','overlap_count']].values

from numba import njit

@njit
def custom_sum(arr, n):
    for row in range(arr.shape[0]):
        x = (arr[0:row, 1] > arr[row, 0]).sum()
        arr[row, 2] = x
        if x >= n:
            arr[row, 1] = 0
            arr[row, 2] = -1
    return arr
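
To get back to the layout shown in the question, the numeric result has to be written into the DataFrame and -1 mapped to 'x' again. The following is only a sketch of one way to do that, not the answerer's code: `count_overlaps` is a hypothetical 1-D variant of `custom_sum`, the arrays are copied explicitly so that they are writable, and the `try`/`except` fallback to plain Python when numba is missing is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption: run as plain (slow) Python when numba is not installed.
    def njit(func):
        return func


@njit
def count_overlaps(col1, col2, n):
    """Zero col2 in place where the overlap count reaches n; return the counts."""
    counts = np.zeros(col1.shape[0], dtype=np.int64)
    for row in range(col1.shape[0]):
        x = 0
        # count earlier (possibly already zeroed) col2 values above this col1
        for prev in range(row):
            if col2[prev] > col1[row]:
                x += 1
        if x >= n:
            col2[row] = 0
            counts[row] = -1  # numeric stand-in for 'x'
        else:
            counts[row] = x
    return counts


d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

# explicit copies, so the function never writes into the DataFrame's own buffers
col1 = df["col1"].to_numpy(dtype=np.int64, copy=True)
col2 = df["col2"].to_numpy(dtype=np.int64, copy=True)
counts = count_overlaps(col1, col2, 3)

df["col2"] = col2
df["overlap_count"] = ["x" if c == -1 else int(c) for c in counts]
print(df)
```

The final list comprehension turns the column into mixed integers and strings, so it is best done once at the end, after all numeric work is finished.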

Performance

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
    'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

#4500rows
df = pd.concat([df] * 300, ignore_index=True)

print (df)
In [115]: %%timeit
     ...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
     ...: 
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [116]: %%timeit 
     ...: for row in range(len(df)):
     ...:         x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
     ...:         df["overlap_count"].loc[row] = x
     ...: 
     ...:         if x >= n:                 
     ...:             df["col2"].loc[row] = 0
     ...:             df["overlap_count"].loc[row] = 'x'
     ...:             
     ...:             
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
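
Both timed versions rescan all previous rows for every row. As a possible alternative that nobody in the thread proposed, the earlier col2 values can be kept in a sorted list so that each count becomes a binary search. This is a minimal sketch using only the standard library; `overlap_counts_sorted` is a hypothetical name, and the function has not been benchmarked against the numba version.

```python
from bisect import bisect_right, insort


def overlap_counts_sorted(col1, col2, n):
    """Return (new_col2, counts) following the rule from the question."""
    seen = []  # effective col2 values of earlier rows, kept sorted
    new_col2, counts = [], []
    for c1, c2 in zip(col1, col2):
        # number of earlier values strictly greater than c1
        x = len(seen) - bisect_right(seen, c1)
        if x >= n:
            c2 = 0  # the zeroed value is what later rows will see
            counts.append("x")
        else:
            counts.append(x)
        new_col2.append(c2)
        insort(seen, c2)
    return new_col2, counts


d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
new_col2, counts = overlap_counts_sorted(d['col1'], d['col2'], 3)
print(new_col2)
print(counts)
```

`insort` still shifts list elements on every insertion, so the worst case is not strictly logarithmic per row, but the comparison work per row drops from a full scan to a binary search.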

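Sorry, the result is not what I expected. Thank you very much.

Do you want the number of unique values >= x, or the number of rows? In the unique case, use g.drop_duplicates() >= g.name in place of g >= g.name.

If you look at index 8, where col1 = 50, the previous values of col2 are 39, 32, ... 63, 67, 0, 0. The code counts how many of those values are greater than 50. The result is 2 (63, 67). If the result had reached n = 3, the col2 value at index 8 would change from 68 to 0.

I can't really understand the logic. Why 68? Are you trying to recompute the count, or do something else?

@jezarel, could you please take a look at the extension ...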
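
The index 8 example from the comments can be checked by hand. The snippet below is only an illustration; the list of previous col2 values is taken from the result table in the question, after rows 6 and 7 were set to 0.

```python
n = 3
col1_at_8 = 50
# col2 for rows 0-7 as they stand when row 8 is processed
previous_col2 = [39, 32, 42, 50, 63, 67, 0, 0]

# strictly greater, so the earlier value 50 itself is not counted
larger = [v for v in previous_col2 if v > col1_at_8]
count = len(larger)

print(larger)      # [63, 67]
print(count)       # 2
print(count >= n)  # False, so col2 at index 8 stays 68
```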