Python: How do I search for multiple search terms across multiple rows of a DataFrame?

python pandas dataframe search

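I previously asked a simpler version of this question.

What I want to do is feed a text document containing multiple phrases (not just single words, e.g. "new jersey") into a search, look for those terms across multiple rows, and add a new column to the table that is True if the term is present and False if it is not. For example, here is a very small part of my table, where I want to search for the phrases "new jersey" and "grew up", each of which is spread over separate rows: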

             subtitle        start          end  duration
14                new    71.986000    72.096000  0.110000
15             jersey    72.106000    72.616000  0.510000
16               grew    72.696000    73.006000  0.310000
17                 up    73.007000    73.147000  0.140000
18          believing    73.156000    73.716000  0.560000

Thanks to the kind help on the older thread, this is what I have so far, where terms.txt is the list of search terms:

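import re
import pandas as pd

# Build one alternation pattern from the lines of terms.txt
search = [term.strip() for term in open("terms.txt").readlines()]
search = fr"({'|'.join(search)})"
# Join all words into one string and compute each word's character offsets in it
text = " ".join(df["subtitle"])
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
# Note: these two assignments overwrite the existing timing columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False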

Everything works up to this point. But when I run the following loop:

for match in re.finditer(search, text, re.IGNORECASE):
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True

I get this error message:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-9f347152f616> in <module>
      1 for match in re.finditer(search, text, re.IGNORECASE):
----> 2     idx1 = df[df["start"] == match.start()].index[0]
      3     idx2 = df[df["end"] == match.end()].index[0]
      4     df.loc[idx1:idx2, "match"] = True

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   4099         if is_scalar(key):
   4100             key = com.cast_scalar_indexer(key, warn_float=True)
-> 4101             return getitem(key)
   4102 
   4103         if isinstance(key, slice):

IndexError: index 0 is out of bounds for axis 0 with size 0

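Does anyone know how I can fix this, or whether there is another approach I could use to get the desired result? Any help is much appreciated, and apologies for any formatting issues, as I am new here.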

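Your DataFrame already has two columns named "start" and "end" (the subtitle timings), so keep the character offsets in separate variables instead of writing them into df: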

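import re
import pandas as pd

terms = [term.strip() for term in open("terms.txt").readlines()]
# Strip stray whitespace so the computed offsets line up with the joined text
word = df["subtitle"].str.strip()
# Character offsets of each word in the joined text, kept outside df
end = word.apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
text = " ".join(word)
df["match"] = False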

Output:

$ cat terms.txt
new jersey
hello

>>> df
   id   subtitle   start     end  duration  match
0  14        new  71.986  72.096      0.11   True
1  15     jersey  72.106  72.616      0.51   True
2  16       grew  72.696  73.006      0.31  False
3  17         up  73.007  73.147      0.14  False
4  18  believing  73.156  73.716      0.56  False
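
The snippet above stops before the matching loop. As a minimal sketch of how the pieces might fit together, the following reuses the loop from the question but looks the offsets up in the separate `char_start` and `char_end` Series. Those two names, the in-memory `terms` list (standing in for the stripped lines of terms.txt), and the use of `np.arange` are assumptions made for illustration, not part of the original answer.

```python
import re

import numpy as np
import pandas as pd

# Sample rows from the question; "start"/"end" hold the subtitle timings.
df = pd.DataFrame({
    "id": [14, 15, 16, 17, 18],
    "subtitle": ["new", "jersey", "grew", "up", "believing"],
    "start": [71.986, 72.106, 72.696, 73.007, 73.156],
    "end": [72.096, 72.616, 73.006, 73.147, 73.716],
})

# Assumption: stands in for the stripped lines of terms.txt.
terms = ["new jersey", "hello"]
search = "(" + "|".join(terms) + ")"

word = df["subtitle"].str.strip()
# Exclusive end offset of each word in the joined text:
# cumulative word lengths plus one space per preceding word.
char_end = word.str.len().cumsum() + np.arange(len(df))
# Each word starts one character after the previous word's end offset.
char_start = char_end.shift(fill_value=-1) + 1
text = " ".join(word)

df["match"] = False
for m in re.finditer(search, text, re.IGNORECASE):
    # Find the rows whose word begins / ends exactly at the match bounds.
    first = char_start[char_start == m.start()].index[0]
    last = char_end[char_end == m.end()].index[0]
    # .loc slicing is inclusive on both ends.
    df.loc[first:last, "match"] = True
```

On the sample data this flags the "new" and "jersey" rows and leaves the timing columns untouched. The exact-offset lookup still assumes every match begins and ends on a word boundary.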

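Thanks for your help, but unfortunately I am still getting the same error as above!

Did you test this code only on the sample, or on your full dataset? What are your Python and pandas versions? Maybe you should share your Excel file. I only tried it on this data.

How exactly would I share my Excel file? Sorry, I am new here! Also, it is actually a JSON file that I am working with. Edit: would it cause a problem if some of the search terms do not match the exact wording in the table?

No, I tried terms that do not exist and got an empty DataFrame back, but no error was raised.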
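
The thread does not settle what triggers the IndexError on the full dataset. One possibility (an assumption, not something confirmed above) is that a match does not begin or end exactly on a word offset, for example when terms.txt contains a blank line, or when a term matches part of a longer token. The sketch below sidesteps the exact lookup by marking every row whose character span overlaps a match. The helper name `mark_matches`, the `re.escape` call, and the `\b` word boundaries are additions for illustration.

```python
import re

import numpy as np
import pandas as pd


def mark_matches(df, terms, column="subtitle"):
    """Return a boolean Series: True where a row's word overlaps a term match."""
    hit = np.zeros(len(df), dtype=bool)
    # Drop blank lines: an empty alternative would match at every position.
    terms = [t.strip() for t in terms if t.strip()]
    if not terms:
        return pd.Series(hit, index=df.index)

    # Escape regex metacharacters and require whole-word boundaries.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"

    words = df[column].astype(str).str.strip()
    lengths = words.str.len().to_numpy()
    # Half-open [start, end) character span of each word in the joined text.
    ends = lengths.cumsum() + np.arange(len(df))
    starts = ends - lengths
    text = " ".join(words)

    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Overlap test instead of an exact-offset lookup, so an
        # unexpected match position cannot raise IndexError.
        hit |= (starts < m.end()) & (ends > m.start())
    return pd.Series(hit, index=df.index)


df = pd.DataFrame({"subtitle": ["new", "jersey", "grew", "up", "believing"]})
df["match"] = mark_matches(df, ["new jersey", "", "ew", "grew up"])
```

Here the blank term is ignored, "ew" does not match inside "new" or "grew" because of the word boundaries, and the rows for "new jersey" and "grew up" are flagged.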
