Python: check whether a tuple column in pandas contains any values from a list

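I have a DataFrame with a column of tuples. I want a mask that identifies, for each row, whether any value in the tuple column matches any value in a predefined tuple. My attempt is below: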

import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
print(df)

codes = (3, 4, 20, 22)
mask = df.b.str.contains_any(codes)  # This line is incorrect
Expected output:

0     True
1    False
I hoped the str functions would work on tuples, but I cannot even test for a single value from codes:

a = df['has_code'] = df['b'].str.contains(4)
This gives:

Try this:

res = df['b'].apply(lambda x: any(val in x for val in codes))
print(res)
Output:

0     True
1    False
Another option:

df['b'].apply(lambda x: any(set(x).intersection(codes)))
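One caveat with any() in this form, shown as a minimal sketch: any() tests the truthiness of the elements in the intersection, so a row whose only match is the value 0 comes out as False. Wrapping the intersection in bool() tests whether the set is non-empty instead. The names df0 and codes0 and their data are made up for illustration. With the codes from the question, which contain no zero, both forms agree.

```python
import pandas as pd

# Hypothetical data: the first row matches the codes only on the value 0.
df0 = pd.DataFrame({'b': [(0, 1, 2), (6, 7, 8)]})
codes0 = (0, 20)

# any() looks at the elements of the intersection {0}, and 0 is falsy.
with_any = df0['b'].apply(lambda x: any(set(x).intersection(codes0)))

# bool() only asks whether the intersection is non-empty.
with_bool = df0['b'].apply(lambda x: bool(set(x).intersection(codes0)))

print(with_any.tolist())   # [False, False]
print(with_bool.tolist())  # [True, False]
```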

You can use set.intersection together with astype(bool).
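A minimal sketch of that approach on the DataFrame from the question. The name code_set is only illustrative. The idea is to build the set once, map its bound intersection method over the column, and let astype(bool) turn empty sets into False and non-empty sets into True.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
codes = (3, 4, 20, 22)

# Build the set once so it is not recreated for every row.
code_set = set(codes)

# Each row becomes the set of shared values; an empty set is falsy.
mask = df['b'].map(code_set.intersection).astype(bool)
print(mask.tolist())  # [True, False]
```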

Timing analysis

# setup
import numpy as np
import pandas as pd

codes = (3, 4, 20, 22)  # the codes from the question
o = [np.random.randint(0,10,(3,)) for _ in range(10_000)]
len(o)
# 10000

s = pd.Series(o)
s
0       [6, 2, 5]
1       [7, 4, 0]
2       [1, 8, 2]
3       [4, 8, 9]
4       [7, 3, 4]
          ...
9995    [3, 9, 4]
9996    [6, 2, 9]
9997    [2, 0, 5]
9998    [5, 0, 7]
9999    [7, 4, 2]
Length: 10000, dtype: object

For a Series of size 1 million:

bigger_o = np.repeat(o,100,axis=0)
bigger_o.shape
# (1000000, 3)
s = pd.Series((list(bigger_o)))
s
0         [6, 2, 5]
1         [6, 2, 5]
2         [6, 2, 5]
3         [6, 2, 5]
4         [6, 2, 5]
            ...
999995    [7, 4, 2]
999996    [7, 4, 2]
999997    [7, 4, 2]
999998    [7, 4, 2]
999999    [7, 4, 2]
Length: 1000000, dtype: object


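Added timeit results ;)
Done. Just to add: you should always benchmark with a larger DataFrame and then see how each proposed solution behaves. For example, on small data my solution is slower than all the other proposed solutions, but it is faster on large data. komatiraju's solution handles small data well, but on large data it is almost 8 times slower.
For my use case, codes should be a set anyway. This works well.
@AttilatheFun Yes, a set fits your use case. Glad this helped.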
Timings on the 10,000-row Series:
# Adam's answer
In [38]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
19.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#komatiraju's answer
In [39]: %timeit s.apply(lambda x: any(val in x for val in codes))
83.8 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#My answer
In [42]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
15.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#wwnde's answer
In [74]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
19.5 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timings on the 1,000,000-row Series:
In [54]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit s.apply(lambda x: any(val in x for val in codes))
8.9 s ± 652 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [56]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
1.54 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
1.95 s ± 88.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
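For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard timeit module. The data is generated here as tuples of Python ints (the runs above used NumPy arrays), and the row count, the seed, the labels, and number=3 are arbitrary choices, so the absolute numbers will not match the figures above. The intersection variant uses bool() rather than any(). The script first checks that all four approaches produce the same mask.

```python
import timeit

import numpy as np
import pandas as pd

codes = (3, 4, 20, 22)
code_set = set(codes)

# Assumed test data: 10,000 rows of 3 random ints in [0, 10), stored as tuples.
rng = np.random.default_rng(0)
rows = rng.integers(0, 10, size=(10_000, 3)).tolist()
s = pd.Series([tuple(r) for r in rows])

# The four approaches discussed in the thread, under illustrative labels.
candidates = {
    "any(val in x)": lambda: s.apply(lambda x: any(val in x for val in codes)),
    "set(x).intersection": lambda: s.apply(lambda x: bool(set(x).intersection(codes))),
    "map(code_set.intersection)": lambda: s.map(code_set.intersection).astype(bool),
    "set unpacking with &": lambda: s.apply(lambda x: len({*x} & {*codes}) > 0),
}

# Sanity check: every approach must give the same mask before timing it.
results = {name: fn() for name, fn in candidates.items()}
baseline = results["any(val in x)"]
for name, res in results.items():
    assert res.tolist() == baseline.tolist(), name

# Average seconds per call over a few runs.
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{name:28s} {seconds * 1000:8.2f} ms per call")
```

Increasing the row count is the easy way to see how the ranking changes with size, which is the point made in the comments above.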
df.b.apply(lambda x: len([*{*x} & {*codes}]) > 0)  # my preferred option, speed-wise
# df.b.apply(lambda x: [*{*x} & {*codes}]).str.len() > 0  # works as well

0     True
1    False
Name: b, dtype: bool

%timeit df.b.apply(lambda x:len([*{*x}&{*codes}])>0)
220 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
code = set(codes)
df.b.map(code.intersection).astype(bool)
364 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(val in x for val in codes))
210 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(set(x).intersection(codes)))
211 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)