Python: check whether a tuple column in pandas contains any of the values in a list


python pandas


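I have a DataFrame with a column of tuples. I want a mask that tells me, for each row, whether any value in the tuple matches any value in a predefined tuple. My attempt is below: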

import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
print(df)

codes = (3, 4, 20, 22)
mask = df.b.str.contains_any(codes)  # This line is incorrect

Expected output:

0     True
1    False

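I hoped the str functions would work on tuples, but I cannot get a match even for a single value from codes:

a = df['has_code'] = df['b'].str.contains(4)

which does not give the expected mask.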

Try this:

res = df['b'].apply(lambda x: any(val in x for val in codes))
print(res)

Output:

0     True
1    False
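
Once the mask exists, it can be used for boolean indexing. A minimal sketch, reusing the df and codes from the question; the name matching_rows is only illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
codes = (3, 4, 20, 22)

# True for rows whose tuple shares at least one value with codes.
mask = df['b'].apply(lambda x: any(val in x for val in codes))

# Boolean indexing keeps only the matching rows.
matching_rows = df[mask]
print(matching_rows)
```

Here only the first row is kept, because (2, 3, 4) contains 3 and 4 while (6, 7, 8) shares nothing with codes.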

Another option:

df['b'].apply(lambda x: any(set(x).intersection(codes)))
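
One caveat with this variant: any() tests the truthiness of the elements in the intersection, so a match on a falsy value such as 0 is reported as False. This does not affect the codes in the question. The sketch below uses a hypothetical codes_with_zero to show the difference, with bool() or set.isdisjoint as possible replacements:

```python
import pandas as pd

s = pd.Series([(0, 1), (5, 6)])
codes_with_zero = (0, 9)  # hypothetical codes that include a falsy value

# any() looks at the elements of the intersection {0}, and 0 is falsy.
with_any = s.apply(lambda x: any(set(x).intersection(codes_with_zero)))

# bool() tests whether the intersection is non-empty, whatever it contains.
with_bool = s.apply(lambda x: bool(set(x).intersection(codes_with_zero)))

# isdisjoint() answers the same question without building the intersection.
code_set = set(codes_with_zero)
with_isdisjoint = s.apply(lambda x: not code_set.isdisjoint(x))

print(with_any.tolist())         # [False, False]
print(with_bool.tolist())        # [True, False]
print(with_isdisjoint.tolist())  # [True, False]
```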

You can use set.intersection together with astype(bool).
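
Applied to the df from the question, this approach could look like the sketch below. It mirrors the code that is timed later in this answer; the intermediate names code_set and common are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
codes = (3, 4, 20, 22)

# Build the set once, outside the per-row call.
code_set = set(codes)

# map() calls code_set.intersection(row_tuple) for every row, giving a Series of sets.
common = df['b'].map(code_set.intersection)

# An empty set is falsy, so astype(bool) turns the sets into the mask.
mask = common.astype(bool)
print(mask)
```

Because the truth value of the whole set is tested rather than its elements, a match on 0 would still count as a match here.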

Timing analysis

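# setup
import numpy as np
import pandas as pd

codes = (3, 4, 20, 22)  # assumed to be the same codes as in the question
o = [np.random.randint(0, 10, (3,)) for _ in range(10_000)]
len(o)
# 10000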

s = pd.Series(o)
s
0       [6, 2, 5]
1       [7, 4, 0]
2       [1, 8, 2]
3       [4, 8, 9]
4       [7, 3, 4]
          ...
9995    [3, 9, 4]
9996    [6, 2, 9]
9997    [2, 0, 5]
9998    [5, 0, 7]
9999    [7, 4, 2]
Length: 10000, dtype: object

For a Series of size 1 million:

bigger_o = np.repeat(o,100,axis=0)
bigger_o.shape
# (1000000, 3)
s = pd.Series((list(bigger_o)))
s
0         [6, 2, 5]
1         [6, 2, 5]
2         [6, 2, 5]
3         [6, 2, 5]
4         [6, 2, 5]
            ...
999995    [7, 4, 2]
999996    [7, 4, 2]
999997    [7, 4, 2]
999998    [7, 4, 2]
999999    [7, 4, 2]
Length: 1000000, dtype: object

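Added timeit results ;)

Done. Just to add: you should always benchmark with larger DataFrames as well, and then see how each proposed solution behaves. For example, on small data my solution is slower than all the other proposed solutions, but it is the fastest on large data. komatiraju's solution works well on small data, but on large data it is much slower (8.9 s versus 1.54 s in the timings below, roughly 6 times).

For my use case, codes should be a set. This works well.

@AttilatheFun Yes, a set suits your use case. Glad this helped.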


# Timings on the 10,000-row Series

# Adam's answer
In [38]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
19.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#komatiraju's answer
In [39]: %timeit s.apply(lambda x: any(val in x for val in codes))
83.8 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#My answer
In [42]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
15.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#wwnde's answer
In [74]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
19.5 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Timings on the 1,000,000-row Series, in the same order as above
In [54]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit s.apply(lambda x: any(val in x for val in codes))
8.9 s ± 652 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [56]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
1.54 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
1.95 s ± 88.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
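
The %timeit magic only exists inside IPython. To repeat the comparison in a plain script, something like the following sketch could be used. It relies on the standard timeit module, assumes codes is the tuple from the question, and uses a smaller Series than the ones timed above so that it finishes quickly; the labels in candidates are made up for this example:

```python
import timeit

import numpy as np
import pandas as pd

codes = (3, 4, 20, 22)  # assumed: same codes as in the question
code_set = set(codes)

# Smaller than the Series timed above so the sketch runs quickly.
rng = np.random.default_rng(0)
s = pd.Series(list(rng.integers(0, 10, size=(1_000, 3))))

candidates = {
    "any + intersection": lambda: s.apply(lambda x: any(set(x).intersection(codes))),
    "any + in": lambda: s.apply(lambda x: any(val in x for val in codes)),
    "map + astype(bool)": lambda: s.map(code_set.intersection).astype(bool),
    "set unpacking": lambda: s.apply(lambda x: len([*{*x} & {*codes}]) > 0),
}

# All candidates should agree before their speed is compared.
reference = candidates["any + in"]()
for name, func in candidates.items():
    assert func().equals(reference), name

# Report the best of 3 repeats, each averaging 5 calls.
for name, func in candidates.items():
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Absolute numbers will differ by machine and pandas version, so only the relative ordering is worth comparing.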

df.b.apply(lambda x:len([*{*x}&{*codes}])>0)  # my preferred option, speed-wise
#df.b.apply(lambda x:[*{*x}&{*codes}]).str.len()>0  # works as well

0     True
1    False
Name: b, dtype: bool

# Timings on the two-row df from the question
%timeit df.b.apply(lambda x:len([*{*x}&{*codes}])>0)
220 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
code = set(codes)
df.b.map(code.intersection).astype(bool)
364 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(val in x for val in codes))
210 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(set(x).intersection(codes)))
211 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
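
All of the answers above call a Python function once per row. A different technique, not taken from any of the answers and not benchmarked here, is to flatten the tuples with Series.explode and test membership with isin. A sketch under the assumption that the DataFrame index is unique:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
codes = (3, 4, 20, 22)

# explode() gives one row per tuple element and repeats the original index label.
exploded = df['b'].explode()

# isin() flags each element; grouping by the index label collapses
# the flags back to one value per original row.
mask = exploded.isin(codes).groupby(level=0).any()
print(mask)
```

Whether this is faster than the apply-based versions depends on the data, so it would need to be timed on a realistic Series before relying on it.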