Python: check whether the string values in a DataFrame column start with any string element of a tuple (other than str.startswith)

python pandas dataframe optimization

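I have a pandas DataFrame column with random values ("457645", "458762496", "1113423453", ...), and I need to check whether these values start with any element of a tuple ("323", "229", "111"). In this case the result should be True for "1113423453".

I tried df[column].str.startswith(tuple), which works well; but on a large amount of data (2M df rows and 3K tuple elements) it is much slower (about 28 seconds) than on 10K df rows and 3K tuple elements (1.47 seconds).

Is there a more efficient way to do this?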

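Since startswith() is not optimized for a large number of prefix strings and only searches through them linearly, a binary search may be more efficient here. For that, we need to sort the prefixes.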

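from bisect import bisect_right

# tuple and column are the question's placeholders for the prefixes and the column name
s = sorted(tuple)
# test only the sorted prefix that sits immediately at or before each value
df[column].apply(lambda value: value.startswith(s[bisect_right(s, value) - 1]))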

Is it possible to extract the prefix into a new column of the dataframe?

Yes, e.g. with this function:

def startwiths(value):
    # s is the sorted prefix list from above
    prefix = s[bisect_right(s, value) - 1]
    if value.startswith(prefix):
        return prefix
    return None  # no prefix matched

df['new column'] = df[column].apply(startwiths)
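
The two snippets above can be tried out without pandas on the sample values from the question. This is only a minimal sketch: the name find_prefix and the explicit guard for values that sort before every prefix are additions for illustration, and it assumes, like the answer, that all prefixes have the same length.

```python
from bisect import bisect_right

def find_prefix(value, sorted_prefixes):
    """Return the prefix that value starts with, or None (equal-length prefixes assumed)."""
    i = bisect_right(sorted_prefixes, value)
    if i == 0:
        return None  # value sorts before every prefix, so none can match
    candidate = sorted_prefixes[i - 1]
    return candidate if value.startswith(candidate) else None

prefixes = sorted(("323", "229", "111"))
values = ["457645", "458762496", "1113423453"]

found = [find_prefix(v, prefixes) for v in values]
flags = [p is not None for p in found]
print(found)  # [None, None, '111']
print(flags)  # [False, False, True]
```

With pandas, the same function could presumably be applied per row as df[column].apply(find_prefix, sorted_prefixes=prefixes), since Series.apply forwards extra keyword arguments to the function.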

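Armali's solution only works if all the prefix strings have the same length. If their lengths vary, you need to group them by length and then apply Armali's algorithm to each group. This is still much faster than the built-in solution on large dataframes.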

import numpy  as np
import pandas as pd 
import random
from pandas._testing import rands_array
from bisect import bisect_right

# create random strings
def zufallsdaten(anz):   
    result = pd.DataFrame()         
    # rands_array is a private pandas test helper (imported above); it may be missing in some pandas versions
    result['string_A'] = rands_array(10, anz)
    result['string_B'] = rands_array(10, anz)
    def bearbeite_element( skalar ):
        l = random.randint(2,5)
        return skalar[0:l] 
    result['string_B'] = result['string_B'].apply(bearbeite_element)
    return result

# create data to search in
manystrings = pd.DataFrame(zufallsdaten(1000000)['string_A'])

# create data to search
search_me   = pd.DataFrame(zufallsdaten(100000)['string_B'].drop_duplicates())



# Fast startswith alternative. Finds the longest (find_longest=True) or shortest matching fragment and writes it into the column foundfieldname.
# If find_identical is False, a fragment that is identical to the whole string does not count as a match.
def fast_startswith(df, searchfieldname, foundfieldname, searchseries, find_longest=True, find_identical=True):
    
    # startswith alternative, works only if all strings in searchme have the same length. Also returns the matching fragment
    def startwiths(data, searchme, find_identical):
        prefix = searchme[bisect_right(searchme, data)-1]
        if ((data!=prefix) or find_identical ) and data.startswith(prefix): 
            return prefix    
        
    search = pd.DataFrame(searchseries)
    search.columns = ['searchstring']
    search['len'] = search.searchstring.str.len()
    grouped = search.groupby('len')
    lengroups = grouped.agg(list).reset_index().sort_values('len', ascending=find_longest)  
    result = df.copy()
    result[foundfieldname] = None
    for index, row in lengroups.iterrows():
        result[foundfieldname].update(result[searchfieldname].apply(startwiths, searchme=sorted(row.searchstring), find_identical=find_identical)  )  
        #result[foundfieldname] = result[foundfieldname].fillna(  result[searchfieldname].apply(startwiths, searchme=sorted(row.searchstring))  )
    return result
    
    

def fast_startswith2(df, searchfieldname, foundfieldname, searchseries):

    # startswith alternative, works only if all strings in searchme have the same length. Also returns the matching fragment
    def startwiths(data, searchme):
        prefix = searchme[bisect_right(searchme, data)-1]
        if data.startswith(prefix): return prefix    
    
    def grouped_startswith(searchme, data):
        data[foundfieldname].update(data[searchfieldname].apply(startwiths, searchme=sorted(searchme.searchstring)))
        return list(searchme.searchstring)   
    
    search = pd.DataFrame(searchseries)
    search.columns = ['searchstring']
    search['len'] = search.searchstring.str.len()
    grouped = search.groupby('len')     
    result = df.copy()
    result[foundfieldname] = None
    grouped.apply(grouped_startswith, data=result)
    return result    



%%time 
mask = manystrings.string_A.str.startswith(tuple(search_me.string_B))
result0 = manystrings[mask]
# result0: built-in startswith
# Wall time: 1min 6s 



%%time
df = fast_startswith(manystrings, 'string_A', 'found', search_me.string_B) 
mask = df.found.notnull()
result1 = df[mask]   
#print( result0.shape[0],   result1.shape[0])
assert result0.shape[0] == result1.shape[0]

# result1: iterate through groups of strings with same length.
# also returns the matching fragment
# Wall time: 6.33 s



%%time
df = fast_startswith2(manystrings, 'string_A', 'found', search_me.string_B) 
mask = df.found.notnull()
result2 = df[mask]    
#print( result0.shape[0],   result2.shape[0])
assert result0.shape[0] == result2.shape[0]

# result2: apply fast startswith method on groups of strings with same length
# also returns the matching fragment
# Wall time: 5.94 s



# differences? May occur if you use find_longest=False
result = pd.merge(result1,result2, on='string_A', suffixes=('_1','_2'))
mask = (result.found_1 != result.found_2)
result[mask]



# search self
df = fast_startswith(search_me, 'string_B', 'found', search_me.string_B, find_identical=False) 
mask = df.found.notnull()
df[mask]
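
The core idea of the code above, grouping the search strings by length and bisecting within each group, can be reduced to a few lines of plain Python. This is a hedged sketch rather than the author's implementation: group_by_length and longest_prefix are hypothetical names, and the sample strings are made up around the AI / AIKHB0 case discussed in the comments.

```python
from bisect import bisect_right
from collections import defaultdict

def group_by_length(prefixes):
    """Map each prefix length to a sorted list of the prefixes of that length."""
    groups = defaultdict(set)
    for p in prefixes:
        groups[len(p)].add(p)
    return {n: sorted(ps) for n, ps in groups.items()}

def longest_prefix(value, groups):
    """Return the longest prefix that value starts with, or None."""
    for n in sorted(groups, reverse=True):  # try the longer prefixes first
        candidates = groups[n]
        i = bisect_right(candidates, value)
        # within one length group, only the element just before the
        # insertion point can be a prefix of value
        if i and value.startswith(candidates[i - 1]):
            return candidates[i - 1]
    return None

groups = group_by_length(["AI", "AIKHB0", "Zq", "AIl"])
print(longest_prefix("AIljP9udNn", groups))  # AIl
print(longest_prefix("AIKHB0xyz1", groups))  # AIKHB0
print(longest_prefix("BBBBBBBBBB", groups))  # None
```

The cost per value is one binary search per distinct prefix length, which stays small when the prefixes only come in a handful of lengths, as in the benchmark above (2 to 5 characters).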

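Actually, that was my mistake. Anyway, what I meant is that for smaller datasets it doesn't really affect performance, but for larger datasets I'm looking for a way (if possible) to do it faster.

Even at 100K: the rate for 2M rows in 28 seconds is almost exactly the same as for 100K rows in 1.47 seconds.

I know, I see that.. but I'm looking for a more efficient way, if possible!

This is much faster.. Is it possible to extract the prefix into a new column of the dataframe?

Yes - I amended the answer.

I came across your post purely by chance. I would like to evaluate it, but that is hard without the missing parts.

Thanks for the reply! I edited the post, so the code should be complete now.

Indeed, the bisect approach as it stands does not work if the tuple elements have different lengths and one element is the beginning of the next. E.g. with the elements AI and AIKHB0 in search_me, the string AIljP9udNn in manystrings is not found.

I played around with it a bit and don't understand the algorithm. bisect_right has two additional parameters, lo and hi; are they of any use?

The parameters lo and hi can be used to specify a subset of the list that should be considered. That doesn't help here, since we have to search the whole list. The algorithm simply uses bisect_right to quickly find the one element of the sorted tuple elements that is closest to the current 'string_fixlen', so that only this one tuple element has to be tested with startswith. But this fails if (as in my example) another element (AIKHB0) is closer to the 'string_fixlen' (AIljP9udNn) than the correct element (AI).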
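
The failure described in the last comment can be reproduced in a few lines. The second half of the sketch shows one possible way around it that is not taken from this thread: instead of bisecting, slice the value at each distinct prefix length and look the slice up in a set. The helper name starts_with_any is an assumption made for illustration.

```python
from bisect import bisect_right

prefixes = sorted(["AI", "AIKHB0"])
value = "AIljP9udNn"

# Plain bisect: the sorted element nearest to value is the longer prefix,
# so the shorter, correct prefix "AI" is never tested.
nearest = prefixes[bisect_right(prefixes, value) - 1]
print(nearest, value.startswith(nearest))  # AIKHB0 False

# Alternative (not from the thread): hash lookups on slices of the value.
prefix_set = set(prefixes)
lengths = sorted({len(p) for p in prefix_set})

def starts_with_any(value, prefix_set, lengths):
    """True if value[:n] is a known prefix for some prefix length n."""
    return any(value[:n] in prefix_set for n in lengths)

print(starts_with_any(value, prefix_set, lengths))         # True
print(starts_with_any("ZZljP9udNn", prefix_set, lengths))  # False
```

Like the grouped bisect version, this does one lookup per distinct prefix length, so it should behave similarly when the prefixes come in only a few lengths.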