Python Pandas: how to limit the results of str.contains?

python performance pandas

I have a DataFrame with more than 1M rows. I want to select all rows where a certain column contains a certain substring:

matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()

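But this selection is slow and I would like to speed it up. Suppose I only need the first n results. Is there a way to stop matching once n results have been found? I have tried: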

matching = df['col2'].str.contains('substr', case=True, regex=False).head(n)

and:

But neither was faster. The second statement is a boolean selection and is very fast. How can I speed up the first statement?

You can speed it up with:

matching = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows = df['col1'].head(n)[matching==True]

However, this solution retrieves the matches within the first n rows, not the first n matches.
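The distinction is easy to see on a toy frame. The sketch below is only an illustration: the six-row DataFrame and the variable names are invented for this example and are not part of the question.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": ["x", "y", "has substr", "z", "substr too", "substr again"],
})
n = 3

# Matches within the first n rows: only row 2 qualifies.
head_mask = df["col2"].head(n).str.contains("substr", regex=False)
in_first_n_rows = df["col1"].head(n)[head_mask]

# First n matches overall: rows 2, 4 and 5.
full_mask = df["col2"].str.contains("substr", regex=False)
first_n_matches = df.loc[full_mask, "col1"].head(n)

print(in_first_n_rows.tolist())   # ['c']
print(first_n_matches.tolist())   # ['c', 'e', 'f']
```

The first variant is cheap because it only looks at n rows, but it can return fewer than n results, or none at all.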

If you really want the first n matches, you should use:

rows =  df['col1'][df['col2'].str.contains("substr")==True].head(n)

But this option is, of course, much slower.

Inspired by @ScottBoston's answer, you can use the following approach for a faster complete solution:

rows = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
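This is faster than the previous option while still returning the complete result, though it is not as fast as restricting the search to the first n rows. With this solution you get the first n matches.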
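The list comprehension still tests every row before `.head(n)` is applied. A possible variation that actually stops after n hits is sketched below. It is not taken from any of the answers: the function name `first_n_matches` and its parameters are hypothetical, and it assumes a plain Python substring test is acceptable, with non-string values such as NaN treated as non-matches.

```python
from itertools import islice

import pandas as pd


def first_n_matches(df, substr, n, search_col="col2", value_col="col1"):
    """Return value_col for the first n rows whose search_col contains substr."""
    # Lazily yield positions of matching rows; nothing is scanned until consumed.
    hits = (
        pos
        for pos, text in enumerate(df[search_col].to_numpy())
        if isinstance(text, str) and substr in text
    )
    # islice stops pulling from the generator as soon as n positions are found.
    positions = list(islice(hits, n))
    return df[value_col].iloc[positions]


# Hypothetical toy data: matches sit at positions 1, 3 and 5.
df = pd.DataFrame({
    "col1": list("abcdef"),
    "col2": ["x", "substr", "y", "a substr", "z", "substr"],
})
print(first_n_matches(df, "substr", 2).tolist())  # ['b', 'd']
```

How much this helps depends on where the matches are. If they are rare or absent, the loop still walks the whole column, and a pure Python loop may then be slower than the vectorized call.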
The following test code shows the speed of each solution and its result:

import time

import pandas as pd

n = 10
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
col1 = a*1000000
col2 = b*1000000
df = pd.DataFrame({"col1":col1,"col2":col2})

# Original option
start_time = time.time()
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
print("--- %s seconds ---" % (time.time() - start_time))

# Faster option (matches within the first n rows only)
start_time = time.time()
matching_fast = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows_fast = df['col1'].head(n)[matching_fast==True]  # use matching_fast, not the full-length matching mask
print("--- %s seconds for fast solution ---" % (time.time() - start_time))

# Other option (first n matches, str accessor)
start_time = time.time()
rows_other = df['col1'][df['col2'].str.contains("substr")==True].head(n)
print("--- %s seconds for other solution ---" % (time.time() - start_time))

# Complete option (first n matches, list comprehension)
start_time = time.time()
rows_complete = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
print("--- %s seconds for complete solution ---" % (time.time() - start_time))
This produces:

>>>
--- 2.33899998665 seconds ---
--- 0.302999973297 seconds for fast solution ---
--- 4.56700015068 seconds for other solution ---
--- 1.61599993706 seconds for complete solution ---

The resulting Series are:

>>> rows
4     for
5    this
Name: col1, dtype: object
>>> rows_fast
4     for
5    this
Name: col1, dtype: object
>>> rows_other
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
>>> rows_complete
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
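A middle ground between scanning only the first n rows and scanning the whole column is to run the vectorized `str.contains` on one chunk at a time and stop once n matches have been collected. This is a sketch, not something proposed in the thread: the function name `first_n_matches_chunked` and the `chunk_size` parameter are hypothetical, and a good chunk size depends on how dense the matches are.

```python
import pandas as pd


def first_n_matches_chunked(df, substr, n, chunk_size=100_000):
    """Scan col2 chunk by chunk; stop once n matching col1 values are found."""
    found = []
    remaining = n
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Vectorized substring test on this chunk only; NaN counts as no match.
        mask = chunk["col2"].str.contains(substr, case=True, regex=False, na=False)
        hits = chunk.loc[mask, "col1"].head(remaining)
        found.append(hits)
        remaining -= len(hits)
        if remaining <= 0:
            break  # enough matches: later chunks are never scanned
    if not found:
        return df["col1"].iloc[:0]
    return pd.concat(found)


# Same word lists as the test code above, with fewer repetitions.
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
df = pd.DataFrame({"col1": a * 1000, "col2": b * 1000})

rows = first_n_matches_chunked(df, "substr", 10, chunk_size=20)
print(rows.index.tolist())  # [4, 5, 13, 14, 22, 23, 31, 32, 40, 41]
```

Unlike the "fast" option, this keeps reading further chunks when the first rows contain too few matches, so it returns the first n matches whenever at least n exist.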

Believe it or not, the .str accessor is slow. You can use a list comprehension, which performs better.

df = pd.DataFrame({'col2':np.random.choice(['substring','midstring','nostring','substrate'],100000)})
Equality check:

all(df['col2'].str.contains('substr', case=True, regex=False) == pd.Series(['substr' in i for i in df['col2']]))
Output:

True
Timings:

%timeit df['col2'].str.contains('substr', case=True, regex=False)
10 loops, best of 3: 37.9 ms per loop

vs.

%timeit pd.Series(['substr' in i for i in df['col2']])
100 loops, best of 3: 19.1 ms per loop
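Outside IPython, where `%timeit` is not available, a similar comparison can be run with the standard `timeit` module. The sketch below makes some assumptions: the helper names are hypothetical, a seeded NumPy generator replaces `np.random.choice` so the data is reproducible, and no timings are quoted because they vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
words = ["substring", "midstring", "nostring", "substrate"]
df = pd.DataFrame({"col2": rng.choice(words, 100_000)})


def with_str_accessor():
    return df["col2"].str.contains("substr", case=True, regex=False)


def with_list_comprehension():
    return pd.Series(["substr" in s for s in df["col2"]])


# Both approaches must produce the same boolean values before timing them.
same = bool((with_str_accessor().to_numpy() == with_list_comprehension().to_numpy()).all())
print(same)  # True

# Average seconds per call over a few repetitions.
runs = 3
t_accessor = timeit.timeit(with_str_accessor, number=runs) / runs
t_listcomp = timeit.timeit(with_list_comprehension, number=runs) / runs
print(f"str accessor: {t_accessor:.4f} s, list comprehension: {t_listcomp:.4f} s")
```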

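This doesn't really answer my question. I'm skeptical about limiting the search space: it obviously improves performance, but at the expense of the results. That said, after trying the "faster" solution with n=10000, the results weren't bad and the time improved significantly. In the end, though, I can't deploy this "faster" solution, because it assumes there will be matches within the first n rows, which may not be true! I'll edit my question to clarify this.

Yes, I figured you wanted the first n matches rather than the matches in the first n rows. I'll look for a way to improve the time, if that's any help to you. Perhaps @ScottBoston's answer is a fairly good solution.

Note that your solution also returns the matches in the first n rows.

That's right. Indeed, your "other" solution returns the first n matches, but it is slower than not using .head() at all, i.e. not limiting the search.

Please check my update. I believe the "complete solution" is a fairly good approach.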