Python: adding a second column when a column matches in str.contains


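I want to loop through searchList and check whether the column text str.contains one or more of the search words. If I get a match, I want to append the data to masterdf, which is easily accomplished as shown below. But I also want to add a new column holding searchWord, so that I know which text matched which word. The code below fills the whole searchWord column with the latest search word that matched: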

masterdf = pd.DataFrame(columns=['doc_id','text',])

for searchWord in searchList:
    search = jsons_data[jsons_data['text'].str.contains(searchWord)]
    if len(search) > 0:
        masterdf = masterdf.append(search)
        masterdf['searchWord'] = searchWord
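
The behavior described above comes from the last line of the loop: assigning a scalar to a DataFrame column sets that value on every row, so each iteration overwrites the labels written by earlier ones. A minimal sketch with a made-up two-row frame (the name demo and its data are illustrative, not from the original post):

```python
import pandas as pd

# Hypothetical two-row frame standing in for masterdf
demo = pd.DataFrame({'doc_id': [0, 1],
                     'text': ['I want this', 'fills the column']})

for searchWord in ['want', 'fills']:
    # Scalar assignment broadcasts to ALL rows, not only the rows that matched
    demo['searchWord'] = searchWord

# Every row now carries the last word from the loop
print(demo['searchWord'].tolist())
```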

I think this is what you want.

Let's set up some sample data (assuming pandas has been imported as pd):

tt = '''I want to search through the. searchList and check if column text str.contains one or more of each searchWord. If I get a match I want to append the data to masterdf which is easily accomplished as seen below. But I also want to add a new column with searchWord so that I know which text matched with what. This code below fills the column searchWord with the. latest search that matched'''
text_col = tt.split('.')
id_col = range(len(text_col))
jsons_data = pd.DataFrame({'doc_id':id_col,'text':text_col})

searchList = ['code','fills', 'But','also','want']
The sample jsons_data looks like this:

    doc_id  text
0   0       I want to search through the
1   1       searchList and check if column text str
2   2       contains one or more of each searchWord
3   3       If I get a match I want to append the data to...
4   4       But I also want to add a new column with sear...
5   5       This code below fills the column searchWord w...
6   6       latest search that matched
Modifying your code to set search['searchWord'] = searchWord on each match before appending, we get:

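masterdf = pd.DataFrame(columns=['doc_id','text','searchWord'])

for searchWord in searchList:
    # .copy() avoids a SettingWithCopyWarning when the new column is added
    search = jsons_data[jsons_data['text'].str.contains(searchWord)].copy()
    if len(search) > 0:
        # label only the rows that matched this search word
        search['searchWord'] = searchWord
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        # (doc_id may print as 5 rather than 5.0 depending on the pandas version)
        masterdf = pd.concat([masterdf, search])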
and masterdf is:

doc_id  text                                                searchWord
5   5.0 This code below fills the column searchWord w...    code
5   5.0 This code below fills the column searchWord w...    fills
4   4.0 But I also want to add a new column with sear...    But
4   4.0 But I also want to add a new column with sear...    also
0   0.0 I want to search through the                        want
3   3.0 If I get a match I want to append the data to...    want
4   4.0 But I also want to add a new column with sear...    want
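
For reference, here is a self-contained sketch of the same loop idea that collects the matching pieces in a list and concatenates once at the end. The three-row jsons_data is a small stand-in rather than the full example above, and the use of assign() and regex=False is my own choice, not part of the answer:

```python
import pandas as pd

# Small stand-in for jsons_data (illustrative data only)
jsons_data = pd.DataFrame({
    'doc_id': [0, 1, 2],
    'text': ['I want to search', 'no match here', 'This code fills the column'],
})
searchList = ['code', 'fills', 'want']

pieces = []
for searchWord in searchList:
    # regex=False treats the search word as a literal substring
    mask = jsons_data['text'].str.contains(searchWord, regex=False)
    if mask.any():
        # assign() returns a new frame, so jsons_data itself is not modified
        pieces.append(jsons_data[mask].assign(searchWord=searchWord))

# Concatenate once at the end; fall back to an empty frame if nothing matched
if pieces:
    masterdf = pd.concat(pieces)
else:
    masterdf = pd.DataFrame(columns=['doc_id', 'text', 'searchWord'])

print(masterdf)
```

Building the list first keeps the original dtypes of doc_id and avoids growing a DataFrame inside the loop.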

I would suggest a vectorized (loop-free) approach:

In [84]: df
Out[84]:
   doc_id                                                                                                text
0       0                                                                        I want to search through the
1       1                                                             searchList and check if column text str
2       2                                                             contains one or more of each searchWord
3       3   If I get a match I want to append the data to masterdf which is easily accomplished as seen below
4       4     But I also want to add a new column with searchWord so that I know which text matched with what
5       5                                                This code below fills the column searchWord with the
6       6                                                                          latest search that matched

In [85]: searchList = ['code', 'fills', 'but', 'also', 'want']

In [86]: words_re = '{}'.format('|'.join(searchList).lower())

In [87]: words_re
Out[87]: 'code|fills|but|also|want'

In [88]: masterdf = df[df.text.str.contains('(?:{})'.format(words_re))].copy()

In [89]: masterdf['searchWord'] = masterdf.text.str.findall('({})'.format(words_re)).str.join('|')

In [90]: masterdf
Out[90]:
   doc_id                                                                                                text  searchWord
0       0                                                                        I want to search through the        want
3       3   If I get a match I want to append the data to masterdf which is easily accomplished as seen below        want
4       4     But I also want to add a new column with searchWord so that I know which text matched with what   also|want
5       5                                                This code below fills the column searchWord with the  code|fills
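
The session above lowercases the search words, so the capitalized "But" in row 4 is not reported. A possible variant, sketched under my own assumptions (re.escape for literal matching, re.IGNORECASE for case-insensitive matching, and explode to get one row per matched word; none of these appear in the answer), could look like this on a small illustrative frame:

```python
import re
import pandas as pd

# Illustrative data, not the full example frame from the answer
df = pd.DataFrame({
    'doc_id': [0, 1, 2],
    'text': ['I want to search', 'no match here', 'But I also want more'],
})
searchList = ['code', 'fills', 'But', 'also', 'want']

# Escape each word so regex metacharacters are matched literally
pattern = '|'.join(re.escape(w) for w in searchList)

# findall returns a list of matches per row; IGNORECASE ignores letter case
matches = df['text'].str.findall(pattern, flags=re.IGNORECASE)

# Keep rows with at least one match, then explode to one row per matched word
masterdf = df.assign(searchWord=matches)
masterdf = masterdf[masterdf['searchWord'].str.len() > 0].explode('searchWord')

print(masterdf)
```

Note that this matches substrings (so "want" would also match inside "wanted") and that a word occurring twice in one text yields two rows.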

This looks really nice. Why is this better than looping? Is there a clearer or more correct way to think about this problem than "add a column when matching…"?
@user3471881, because for larger (1000+) datasets, vectorized solutions are usually orders of magnitude faster than "looping" solutions.
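
How large the gap is depends on the data and on the number of search words, so it is worth measuring on your own frame. Below is a rough timing sketch on synthetic data; the function names, the 10,000-row frame, and the repeat count are all my own assumptions. Both versions rely on pandas string methods internally and the loop iterates over search words rather than rows, so the difference here may be modest:

```python
import timeit
import pandas as pd

# Synthetic data (illustrative): 4 sentences repeated to 10,000 rows
base = ['I want to search', 'nothing here',
        'this code fills the column', 'but I also want more']
df = pd.DataFrame({'text': base * 2500})
df['doc_id'] = range(len(df))

searchList = ['code', 'fills', 'but', 'also', 'want']
words_re = '|'.join(searchList)

def loop_version():
    # One str.contains pass per search word, one output row per (text, word)
    pieces = [df[df['text'].str.contains(w)].assign(searchWord=w)
              for w in searchList]
    return pd.concat(pieces)

def vectorized_version():
    # Single regex over all words, one output row per matching text
    out = df[df['text'].str.contains(words_re)].copy()
    out['searchWord'] = out['text'].str.findall(words_re).str.join('|')
    return out

print('loop      :', timeit.timeit(loop_version, number=5))
print('vectorized:', timeit.timeit(vectorized_version, number=5))
```

The two functions return differently shaped results (one row per match versus one row per text), but they select the same set of documents.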