Python: using str.contains to select DataFrame rows that contain all strings from a list of search terms

python pandas dataframe

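I have a pandas DataFrame in which one column ("processed") holds a single string per row, containing preprocessed text of varying length.

I want to search it with a keyword list of arbitrary length and return the processed notes only for rows where the "processed" string contains every element of the list.

Of course, I can search for the terms individually, like this: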

words = ['searchterm1', 'searchterm2']
notes = df.loc[(df.processed.str.contains(words[0])) & (df.processed.str.contains(words[1]))].processed

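But this seems inefficient, and it requires different code depending on how many search terms I use.

What I am looking for is something like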

notes = (df.loc[[(df.processed.str.contains(words[i])) for i in range(len(words))]]).processed

which would include

"searchterm1 foo bar searchterm2"

but would not include

"foo bar searchterm1"

or

"searchterm2"

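But this does not work: loc does not accept a generator object or a list of masks as input.

So, what is the best way to find the strings that contain multiple substrings? Thanks.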
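One way to generalize the two-term version to a list of any length is to build one boolean mask per term and combine them with a logical AND, since `.loc` expects a single boolean Series rather than a list of them. The following is a minimal sketch, not a definitive solution. The sample rows are hypothetical, and `regex=False` (plain substring matching) is an assumption about what the search terms look like.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"processed": [
    "searchterm1 foo bar searchterm2",
    "foo bar searchterm1",
    "searchterm2",
]})
words = ["searchterm1", "searchterm2"]

# One boolean Series per search term.
# regex=False treats each term as a literal substring (an assumption).
masks = [df["processed"].str.contains(word, regex=False) for word in words]

# AND the masks together: a row survives only if every term matched.
all_match = reduce(operator.and_, masks)

notes = df.loc[all_match, "processed"]
print(notes.tolist())
```

Note that `reduce` without an initial value raises an error on an empty `words` list, so that case would need separate handling.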

Sample data:

import pandas as pd

# Single "processed" column, matching the printed output
df = pd.DataFrame({'processed': ['one', 'two', 'three', 'one one', 'two, one']})

words = ['two', 'three', 'two']

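I modified your original code. Note that this version keeps the rows that match at least one of the words:

notes = df.loc[sum([df.processed.str.contains(word) for word in words]) > 0]

Output (df, followed by notes):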

  processed
0       one
1       two
2     three
3   one one
4  two, one

  processed
1       two
2     three
4  two, one
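The `> 0` comparison selects rows containing any of the words. If every word has to be present, one possible variant compares the per-row match count against the number of distinct words. This is a sketch under assumptions: the DataFrame is rebuilt from the printed output, the `words` list here is a hypothetical one chosen so that a row matches all terms, and `regex=False` plus the `set` de-duplication are my additions.

```python
import pandas as pd

# Rebuilt from the printed output.
df = pd.DataFrame({"processed": ["one", "two", "three", "one one", "two, one"]})
words = ["two", "one"]  # hypothetical terms, chosen so one row contains both

# De-duplicate so a repeated term is not counted twice.
unique_words = set(words)

# Summing boolean Series counts, per row, how many distinct terms were found.
hits = sum(df["processed"].str.contains(word, regex=False) for word in unique_words)

any_match = df.loc[hits > 0]                   # at least one term present
all_match = df.loc[hits == len(unique_words)]  # every term present

print(all_match)
```

Only the row "two, one" contains both terms, so it is the only row left in `all_match`.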

Are you looking for any (i.e. at least one) or all of the substrings to match?

Looking for all of the substrings to match. Will edit the question to clarify.
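For the "all substrings" case clarified in the comments, another option is a single regular expression with one lookahead per term, so that only one `str.contains` call is needed. This is a hedged sketch rather than the accepted approach: the sample data is rebuilt from the printed output, the `words` list is hypothetical, and escaping the terms with `re.escape` assumes they are meant as literal text.

```python
import re

import pandas as pd

# Rebuilt from the printed output; the terms below are hypothetical.
df = pd.DataFrame({"processed": ["one", "two", "three", "one one", "two, one"]})
words = ["two", "one"]

# Each (?=.*term) asserts that the term occurs somewhere ahead,
# without consuming text, so the order of the terms does not matter.
pattern = "".join(f"(?=.*{re.escape(word)})" for word in words)

# re.DOTALL lets ".*" span newlines inside multi-line notes.
mask = df["processed"].str.contains(pattern, regex=True, flags=re.DOTALL)
notes = df.loc[mask, "processed"]
print(notes.tolist())
```

Whether this is faster than combining one mask per term depends on the data, so it is worth timing both on real notes before choosing.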