Text pattern recognition in a Python DataFrame [python, pandas, pattern-matching]

I'm trying to match text patterns against a column of a pandas DataFrame in Python.

What I'm doing is:

list = ['sarcasm','irony','humor']
pattern = '|'.join(list)
pattern2 = str("( " + pattern.strip().lstrip().rstrip() + " )").strip().lstrip().rstrip()

frame = pd.DataFrame(docs_list, columns=['words'])
# docs_list is the list containing the snippets

#Skipping the inbetween steps for the simplicity of viewing
cp2 = frame.words.str.extract(pattern2)
c2 = cp2.to_frame().fillna("No Matching Word Found")

This gives output like the following:

Snips                                     pattern_found    matching_Word
A different type of humor                    True             humor
A different type of sarcasm                  True             sarcasm 
A different type of humor and irony          True             humor
A different type of reason                   False            NA
A type of humor and sarcasm                  True             humor
A type of comedy                             False            NA

So Python checks the pattern and gives the corresponding output.

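Now, here is my problem. As I understand it, as long as Python has not found any word from the pattern in a snippet, it keeps checking the whole pattern. As soon as it finds one of the words, it takes that word and skips the remaining ones.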
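
To make the behaviour described above concrete, here is a minimal sketch. The two-row frame and the names `first_only` and `pattern` are made up for illustration, and the pattern is written without the extra spaces; `expand=False` is passed so that a Series comes back regardless of pandas version.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration.
frame = pd.DataFrame(
    {"words": ["A different type of humor and irony", "A type of comedy"]}
)

# One capture group holding the alternation of search words.
pattern = r"(sarcasm|irony|humor)"

# str.extract returns only the first match in each row,
# and NaN for rows without any match.
first_only = frame["words"].str.extract(pattern, expand=False)
print(first_only.tolist())
```

The first row contains both humor and irony, but only the leftmost match (humor) is returned.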

How can I make Python look for every word, not just the first one that matches, so that the output looks like this:

Snips                                     pattern_found    matching_Word
A different type of humor                    True             humor
A different type of sarcasm                  True             sarcasm 
A different type of humor and irony          True             humor
A different type of humor and irony          True             irony
A different type of reason                   False            NA
A type of humor and sarcasm                  True             humor
A type of humor and sarcasm                  True             sarcasm
A type of comedy                             False            NA

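A simple solution would obviously be to put the words in a list and run a for loop that checks every word against every snippet. But time is a constraint, especially since the dataset I'm working with is huge and the snippets are quite long.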

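For me, str.extractall works, with reset_index to remove the inner level of the MultiIndex, and finally a join back to the original frame:
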
L = ['sarcasm','irony','humo', 'humor', 'hum']
#sorting by http://stackoverflow.com/a/4659539/2901002
L.sort()
L.sort(key = len, reverse=True)
print (L)
['sarcasm', 'humor', 'irony', 'humo', 'hum']

pattern2 = r'(?P<COL>{})'.format('|'.join(L))
print (pattern2)
(?P<COL>sarcasm|humor|irony|humo|hum)

cp2 = frame.words.str.extractall(pattern2).reset_index(level=1, drop=True)
print (cp2)
       COL
0    humor
1  sarcasm
2    humor
2    irony
4    humor
4  sarcasm

frame = frame.join(cp2['COL']).reset_index(drop=True)
print (frame)
                                 words pattern_found matching_Word      COL
0            A different type of humor          True         humor    humor
1          A different type of sarcasm          True       sarcasm  sarcasm
2  A different type of humor and irony          True         humor    humor
3  A different type of humor and irony          True         humor    irony
4           A different type of reason         False           NaN      NaN
5          A type of humor and sarcasm          True         humor    humor
6          A type of humor and sarcasm          True         humor  sarcasm
7                     A type of comedy         False           NaN      NaN
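
For anyone who wants to run the steps above end to end, here is a self-contained sketch. The `docs_list` contents are taken from the snippets in the question; the names `ordered`, `matches` and `result`, the use of `re.escape`, and the recomputed `pattern_found` column are my own additions and not part of the original answer.

```python
import re

import pandas as pd

# Snippets copied from the question's example table.
docs_list = [
    "A different type of humor",
    "A different type of sarcasm",
    "A different type of humor and irony",
    "A different type of reason",
    "A type of humor and sarcasm",
    "A type of comedy",
]
frame = pd.DataFrame(docs_list, columns=["words"])

words = ["sarcasm", "irony", "humo", "humor", "hum"]

# Longest words first, so 'humor' is tried before 'humo' and 'hum'.
ordered = sorted(words, key=len, reverse=True)

# re.escape is an extra precaution (an assumption on my side) in case a
# search word ever contains regex metacharacters.
pattern = r"(?P<COL>{})".format("|".join(re.escape(w) for w in ordered))

# extractall returns one row per match, indexed by (original row, match number);
# dropping the match-number level leaves the original row label.
matches = frame["words"].str.extractall(pattern).reset_index(level=1, drop=True)

# Joining on the index repeats a snippet once per match; rows without a match get NaN.
result = frame.join(matches["COL"]).reset_index(drop=True)
result["pattern_found"] = result["COL"].notna()
print(result)
```

This should give one row per snippet and matched word, with NaN in COL for the snippets that contain none of the words.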

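Have you checked? By the way, do you know that the spaces in your pattern2 are meaningful? You need to remove the spaces in "( " and " )". With the way you define the regex, it is better to use pattern = r'({})'.format('|'.join(list)). However, since the alternation is not anchored, you need to sort the items by length in descending order.

But now I did, and together with the other answer provided, it works.

What if your input contains L = ['sarcasm','irony','humo','humor','hum']? It won't work anymore.

@WiktorStribiżew - Unfortunately, you are right. I'm not a regex expert, so right now I have no solution.

Well, I have already shared what needs to be done in my second comment on the question. Just sort the L list by length in descending order and then join it with |.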
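
The two pitfalls raised in these comments can be checked with the standard re module alone. This is only a small sketch: the sample sentence is one of the snippets from the question, and the names `naive`, `ordered` and `spaced` are hypothetical.

```python
import re

text = "A different type of humor"

# Unanchored alternation: alternatives are tried left to right,
# so a shorter word listed first wins over a longer one.
unsorted_words = ["hum", "humo", "humor"]
naive = re.compile("({})".format("|".join(unsorted_words)))
print(naive.search(text).group(1))  # hum

# Sorting by length in descending order lets the longest word win.
by_length = sorted(unsorted_words, key=len, reverse=True)
ordered = re.compile("({})".format("|".join(by_length)))
print(ordered.search(text).group(1))  # humor

# Spaces inside the group are part of the pattern: this one looks for
# ' sarcasm', 'irony' or 'humor ' (with a trailing space), so a snippet
# that ends in 'humor' does not match at all.
spaced = re.compile("( sarcasm|irony|humor )")
print(spaced.search(text))  # None
```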