Binary search in a Python DataFrame?


python string pandas search dataframe


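I am searching for a large number of words in a pandas DataFrame, and performance is a problem. Is there a way to do a binary search over the strings in a DataFrame column?

Right now my code looks like this: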

import pandas as pd
# word_tokenize is assumed to be NLTK's tokenizer (it needs the NLTK tokenizer data)
from nltk.tokenize import word_tokenize

names = pd.DataFrame(data=['one', 'two', 'three', 'four'], index=range(0, 4), columns=['Name'])
sentence = 'There are two trees in the street.'

for word in word_tokenize(sentence):
    # Search for each word in all the names
    new_names = names[names['Name'].str.startswith(word)]
    # then do some operations on the names

But I need better performance from names[names['Name'].str.startswith(word)], and I think I should find a way to do a binary search on the 'Name' column.

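There are at least two problems with this approach. First, names['Name'].str.startswith(word) is evaluated from scratch for every word, even though the part that does not depend on the word can be computed once and cached. Second, startswith() matches prefixes, so a name such as "there" would match the word "the". In code, both can be addressed like this: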

# calculate startword only once.
# (vectorized string split; DataFrame.apply without axis=1 would pass columns, not rows)
startword = names['Name'].str.split(" ", n=1).str[0]

for word in word_tokenize(sentence):
    # also, match by the full word only
    new_names = names[startword == word]
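To make the difference between the two matching rules concrete, here is a small self-contained sketch. It assumes a plain whitespace split in place of word_tokenize, and the extra 'two cats' row is made up for illustration.

```python
import pandas as pd

# Sample data; the 'two cats' row is a hypothetical addition for illustration.
names = pd.DataFrame({"Name": ["one", "two", "three", "four", "two cats"]})

# Assumption: a plain whitespace split stands in for word_tokenize.
words = "There are two trees in the street".split()

# Prefix matching: any name that merely begins with the text is a hit.
prefix_hits = names[names["Name"].str.startswith("t")]

# Whole-word matching: take the first word of each name once, then compare.
startword = names["Name"].str.split(" ", n=1).str[0]
whole_hits = {w: names[startword == w] for w in words}

print(prefix_hits["Name"].tolist())
print(whole_hits["two"]["Name"].tolist())
```

Each comparison startword == word still scans the whole column, so this removes the repeated string processing but not the linear scan.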

If startword is used as the index, it can be even faster:

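names.index = startword
for word in word_tokenize(sentence):
    # .loc raises KeyError for labels that are not in the index, so skip those
    if word not in names.index:
        continue
    # also, match by the full word only
    new_names = names.loc[word]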
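If prefix matching is really what is wanted, as in the original question, a binary search over a sorted copy of the column is one option. The following is only a minimal sketch: the helper name rows_with_prefix and the SENTINEL constant are hypothetical, and it assumes the largest Unicode code point never appears in a name.

```python
import bisect

import pandas as pd

names = pd.DataFrame({"Name": ["one", "two", "three", "four"]})

# Sort once up front; binary search only works on sorted data.
sorted_names = names.sort_values("Name").reset_index(drop=True)
keys = sorted_names["Name"].tolist()

# Assumption: the largest code point does not occur in any name, so
# prefix + SENTINEL sorts after every string that starts with prefix.
SENTINEL = chr(0x10FFFF)


def rows_with_prefix(prefix):
    """Return the rows whose Name starts with prefix, using two binary searches."""
    lo = bisect.bisect_left(keys, prefix)             # first key >= prefix
    hi = bisect.bisect_left(keys, prefix + SENTINEL)  # first key past the prefix range
    return sorted_names.iloc[lo:hi]


print(rows_with_prefix("t")["Name"].tolist())
```

Each lookup costs two O(log n) searches after the one-time sort, instead of a full scan of the column per word. The match is case-sensitive, like str.startswith.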

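What exactly have you tried? You need to give more details. A sample DataFrame along with some code you have tried would help a lot.

@TEDPROU Thanks! I changed the question a bit.

Still not enough detail to provide an answer. What happens below iterrows? You should generally avoid iterrows at all costs. A sample DataFrame with more information would help a lot.

@TEDPROU I added sample data at the beginning. That part does not matter; I can use other methods for the subsequent operations. The main problem is searching the DataFrame once it gets too large.

@AmirAhmad, you might want to check this.

Thanks! When several rows share that startword, this .loc[] returns a DataFrame, but when there is only one row it returns a different data type, which I have to handle differently. Is there a workaround?

@AmirAhmad Just use names.loc[[word]] instead of names.loc[word].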
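The return-type difference described in these comments can be reproduced with a small sketch. The data and index labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the index holds the first word of each name.
names = pd.DataFrame(
    {"Name": ["one", "two cats", "two dogs", "three"]},
    index=["one", "two", "two", "three"],
)

single = names.loc["one"]        # label occurs once: a Series (one row)
several = names.loc["two"]       # label occurs twice: a DataFrame
always_df = names.loc[["one"]]   # list of labels: a DataFrame, even for one row

print(type(single).__name__, type(several).__name__, type(always_df).__name__)
```

Both forms still raise KeyError when the label is missing from the index, so a membership check such as word in names.index is needed either way.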