Python: How to extract all string matches from a column in pandas using an input corpus/list?


python regex pandas


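For example, I have the list of strings below as an input corpus (in practice it is a larger list with about 100 values).

action = ['jump', 'fly', 'run', 'swim']

The data contains a column named action_description. How can I extract all string matches in action_description, using the action list as the input corpus?

Note: I have already lemmatized action_description, so if the column contained inflected forms of jump (for example jumping), they have already been converted to jump.

Sample input and output

Note: I found a pandas function that looks relevant, but it is not well documented, so I do not know how to use it.

Please suggest the most efficient solution, as my input dataframe has 200K rows.

EDIT: The algorithm should ignore words such as jumper and runway, i.e. they should not be classified as jump and run.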

action=['jump','fly','run','swim']


str1 = "I    love to run and while my friend prefer to swim"  # --> "run swim"
str2 = "Allan excels at high jump but he is not a good at running"  # --> "jump run"

# Collect every action that equals a word or occurs inside a word.
matches = []
for word in str1.split():
    if word in action:
        matches.append(word)  # exact match
    else:
        for act in action:
            # substring match, e.g. "running" -> "run" (also "runway" -> "run")
            if act in word:
                matches.append(act)
                break

actionDtl = " ".join(matches)
print(actionDtl)
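
The loop above accepts substring hits, so a word like runway would still be reported as run, which the EDIT in the question asks to avoid. A minimal sketch of one way around that, assuming whole-word matching is what is wanted: build a single regular expression from the list with \b word boundaries. The helper name exact_matches is made up for this illustration.

```python
import re

action = ['jump', 'fly', 'run', 'swim']

# \b anchors every alternative at word boundaries, so "runway" or
# "running" do not count as a match for "run".
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, action)) + r')\b')

def exact_matches(text):
    """Return the actions that occur as whole words in text, space-joined."""
    return ' '.join(pattern.findall(text))

print(exact_matches("I    love to run and while my friend prefer to swim"))
print(exact_matches("Allan excels at high jump but he is not a good at running"))
print(exact_matches("The runway was wet hence the Jumper flew over it"))
```

On a dataframe, the same pattern string could be passed to Series.str.findall on the action_description column, which avoids a Python-level loop over the 200K rows.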

Steps:

df

#   A                                                  B  ApproxMatch   ExactMatch
#0  1  I    love to run and while my friend prefer to...  [run, swim]  [run, swim]
#1  2  Allan excels at high jump but he is not a good...  [jump, run]       [jump]
#2  3           Ostrich can run very fast but cannot fly   [fly, run]   [fly, run]
#3  4   The runway was wet hence the Jumper flew over it        [run]           []

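We lemmatize only the verbs by supplying pos='v', leaving nouns as they are, while iterating over each word in the list produced by the str.split operation.

Then, using set, we take the intersection of the words in the lookup list and the lemmatized list.

Finally, we join them to return a string as output.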
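
The three steps just described can be sketched roughly as follows. This is an illustration under assumptions, not the answer's original code: lemmatize_verb and its small lookup table are hypothetical stand-ins for a real verb lemmatizer (for example NLTK's WordNetLemmatizer called with pos='v'), and lowercasing the words is an extra choice made here.

```python
action = ['jump', 'fly', 'run', 'swim']

# Hypothetical stand-in for a verb lemmatizer. A real lemmatizer would
# replace this toy table; it only covers the words used in the examples.
VERB_LEMMAS = {'running': 'run', 'flew': 'fly', 'jumped': 'jump', 'swimming': 'swim'}

def lemmatize_verb(word):
    # Unknown words (including nouns such as "runway") are returned unchanged.
    return VERB_LEMMAS.get(word, word)

def extract_actions(text, lookup=action):
    # Step 1: split on whitespace and lemmatize every word.
    lemmas = {lemmatize_verb(w.lower()) for w in text.split()}
    # Step 2: set intersection with the lookup list.
    found = lemmas & set(lookup)
    # Step 3: join the hits, in lookup order so the output is deterministic.
    return ' '.join(a for a in lookup if a in found)

print(extract_actions("Allan excels at high jump but he is not a good at running"))
print(extract_actions("The runway was wet hence the Jumper flew over it"))
```

Because nouns pass through untouched, runway and Jumper never become run or jump. On a dataframe the function could be applied with df['action_description'].apply(extract_actions).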

Starting DF used:

import pandas as pd

df = pd.DataFrame(dict(action_description=["I love to run and while my friend prefer to swim", 
                                           "Allan excels at high jump but he is not a good at running"]))

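To generate binary flags (0/1), we can use the str.get_dummies method, which splits each string on whitespace and computes indicator variables for the resulting words, as shown: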

bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')
pd.concat([df['action_description'], bin_flag], axis=1)
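
Note that get_dummies produces one flag column for every distinct word in the column, not only for the actions. If only the action words are of interest, one possible follow-up (a sketch, assuming the action list from the question; the names wanted and action_flags are introduced here) is to reindex the flag columns against that list:

```python
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']
df = pd.DataFrame(dict(action_description=[
    "I love to run and while my friend prefer to swim",
    "Allan excels at high jump but he is not a good at running"]))

# One indicator column per distinct whitespace-separated word.
bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')

# Keep only the columns for the action words. Actions that never occur
# in the data (here "fly") get an all-zero column via fill_value.
wanted = [a + '_flag' for a in action]
action_flags = bin_flag.reindex(columns=wanted, fill_value=0)

result = pd.concat([df['action_description'], action_flags], axis=1)
print(result)
```

Since the split is on whole words, running in the second row does not set run_flag; the text would need to be lemmatized first for that.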