How to match partial strings in the text of a DataFrame in Python

python python-3.x pandas

My DataFrame looks like this:

id                               text
1         good,i am interested..please mail me.
2         call me...good to go with you
3         not interested...bye
4         i am not interested don't call me
5         price is too high so not interested
6         i have some requirement..please mail me

I want the DataFrame to look like this:

id                               text                          is_relevant
1         good,i am interested..please mail me.                    yes
2         call me...good to go with you                            yes
3         not interested...bye                                      no
4         i am nt interested don't call me                          no
5         price is too high so not interested                       no
6         i have some requirement..please mail me                   yes

I have tried the following code:

d1 = {'no': ['Not interested','nt interested']}
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
df["is_relevant"] = df['new_text'].map(d).fillna('yes')
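A likely reason the attempt above labels every row 'yes' is that Series.map looks up each whole cell value as a dictionary key, so a phrase that only appears inside a longer sentence never matches. The following is a minimal sketch of that difference; the sample Series `s` and the `lookup` dict are illustrative assumptions, not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample: one sentence containing the phrase, one exact phrase
s = pd.Series(["not interested...bye", "not interested"])
lookup = {"not interested": "no", "nt interested": "no"}

# map() uses each whole cell value as a dictionary key (exact match only)
exact = s.map(lookup).fillna("yes")

# str.contains() tests whether the phrase occurs anywhere in the cell
partial = s.str.contains("not interested", regex=False)

print(exact.tolist())
print(partial.tolist())
```

Only the second cell equals a dictionary key, so map() leaves the first cell unmatched, while str.contains() flags both.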

You can do:

d1 = {'no': ['not interested','nt interested']}

# create regex 
reg = '|'.join([f'\\b{x}\\b' for x in list(d1.values())[0]])

# apply function
df['is_relevant'] = df['text'].str.lower().str.contains(reg).map({True: 'no', False: 'yes'})

print(df)

   id                                     text is_relevant
0   1    good,i am interested..please mail me.         yes
1   2            call me...good to go with you         yes
2   3                     not interested...bye          no
3   4        i am not interested don't call me          no
4   5      price is too high so not interested          no
5   6  i have some requirement..please mail me         yes
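When the phrase list grows to hundreds of entries, some phrases may contain regex metacharacters or mixed case. The sketch below is one way to harden the same idea, assuming it is acceptable to escape each phrase with re.escape and to match case-insensitively via the `case=False` argument of str.contains. The sample rows and the `phrases` list are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data with mixed case
df = pd.DataFrame({"text": ["Not interested...bye",
                            "call me...good to go with you",
                            "price is too high so NOT interested"]})
phrases = ["not interested", "nt interested"]

# Escape each phrase so it is matched literally, then wrap the alternation
# in a non-capturing group with word boundaries on both sides
pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"

# case=False removes the need to lowercase the column first
mask = df["text"].str.contains(pattern, case=False)
df["is_relevant"] = mask.map({True: "no", False: "yes"})

print(df)
```

The non-capturing group `(?:...)` keeps the word boundaries applied to the whole alternation without triggering the pandas warning about capture groups.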

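This is similar to YOLO's answer above, but allows multiple text classes:

import pandas as pd
from functools import reduce

df = pd.DataFrame(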
    data = ["good,i am interested..please mail me.",
            "call me...good to go with you",
            "not interested...bye",
            "i am not interested don't call me",
            "price is too high so not interested",
            "i have some requirement..please mail me"],
    columns=['text'], index=[1,2,3,4,5,6])

d1 = {'no': ['Not interested','nt interested','not interested'],
      'maybe': ['requirement']}
df['is_relevant'] = 'yes'

for k in d1:
    match_inds = reduce(lambda x,y: x | y,
                        [df['text'].str.contains(pat) for pat in d1[k]])
    df.loc[match_inds, 'is_relevant'] = k

print(df)

Output:

                                      text is_relevant
1    good,i am interested..please mail me.         yes
2            call me...good to go with you         yes
3                     not interested...bye          no
4        i am not interested don't call me          no
5      price is too high so not interested          no
6  i have some requirement..please mail me       maybe
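As an alternative sketch of the multi-class idea, numpy.select can assign one label per row from a list of boolean conditions. This is not the answerer's method: note that np.select gives priority to the first matching class, whereas the loop above lets later classes overwrite earlier ones. The `classes` dict mirrors `d1`, and case-insensitive matching is an added assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"text": ["good,i am interested..please mail me.",
                            "call me...good to go with you",
                            "not interested...bye",
                            "i am not interested don't call me",
                            "price is too high so not interested",
                            "i have some requirement..please mail me"]},
                  index=[1, 2, 3, 4, 5, 6])

# Label -> list of phrases; dict order decides priority in np.select
classes = {"no": ["not interested", "nt interested"],
           "maybe": ["requirement"]}

# One boolean array per class: True where any phrase of that class occurs
conditions = [
    df["text"].str.contains("|".join(pats), case=False).to_numpy(dtype=bool)
    for pats in classes.values()
]

# Rows matching no class fall back to the default label
df["is_relevant"] = np.select(conditions, list(classes.keys()), default="yes")

print(df)
```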

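If all you want is to flag the phrases in the list

['not interested', 'nt interested']

then np.where is enough. If the phrases are stored as the values of a dict, put them into a list first, for example

lst = list(d.values())

(here d is assumed to map keys directly to phrases; a dict of lists such as d1 above has to be flattened instead), and still use np.where:

import numpy as np

lst = ['not interested', 'nt interested']
df['is_relevant'] = np.where(df.text.str.contains("|".join(lst)), 'no', 'yes')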

                                     text    is_relevant
1    good,i am interested..please mail me.         yes
2            call me...good to go with you         yes
3                     not interested...bye          no
4        i am not interested don't call me          no
5      price is too high so not interested          no
6  i have some requirement..please mail me         yes
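For a dict whose values are lists of phrases, like `d1` in the question, calling list() on the values gives a list of lists, so the phrases need to be flattened before joining. A minimal sketch under that assumption, with a two-row sample frame invented for illustration:

```python
import numpy as np
import pandas as pd

d1 = {"no": ["not interested", "nt interested"]}

# Flatten the dict's list values into one flat list of phrases
lst = [phrase for phrases in d1.values() for phrase in phrases]

# Hypothetical two-row sample
df = pd.DataFrame({"text": ["not interested...bye",
                            "call me...good to go with you"]})

# 'no' where any phrase occurs in the text, otherwise 'yes'
df["is_relevant"] = np.where(df["text"].str.contains("|".join(lst)), "no", "yes")

print(df)
```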

Suppose my text is like "Thanks but no thanks". Now I include "Thanks but no thanks" in my list, i.e. a = ["not interested", "nt interested", "Thanks but no thanks"] ... It gives me is_relevant = yes, which is incorrect ... and the list may have 100 or 1000 entries.

@JohnDavis You have to use lowercase in the list: "thanks but no thanks".

Do we need to modify this part? if (a[0] or a[1] in x.lower()) ... because it only captures the first two phrases. I have already lowercased "thanks but no thanks".

See my attempt. I believe it is the fastest.
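A condition that names a[0] and a[1] explicitly ignores any phrase added to the list later, which would explain why a third phrase is not picked up. One possible fix, sketched below, is to test every phrase with any() and to lowercase both the phrases and the text. The helper name `is_relevant` and the sample texts are hypothetical.

```python
a = ["not interested", "nt interested", "Thanks but no thanks"]

# Lowercase the phrases once so the comparison is case-insensitive
phrases = [p.lower() for p in a]

def is_relevant(text):
    """Return 'no' if any phrase occurs in the text, else 'yes'."""
    lowered = text.lower()
    return "no" if any(p in lowered for p in phrases) else "yes"

labels = [is_relevant(t) for t in ["Thanks but no thanks", "please mail me"]]
print(labels)
```

With a DataFrame, the same helper could be applied as `df['text'].apply(is_relevant)`, although the vectorized str.contains approaches above are usually preferable for large frames.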