Extract all patterns from a pandas DataFrame column (Python 3)
I am using a Jupyter notebook (Python 3). I am trying to extract the keywords in my list from a pandas DataFrame column. The list will hold about 50 keywords. For example:
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
pattern2 = "("+'|'.join(rgx_words1)+")"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
# Create the pandas DataFrame
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt,x['text']),axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2,x['text']),axis=1)
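With re_patt I extract exact words and get the correct result: for id 1 my output is algaecide, algaecid, algaecides. With re_patt2 I want to find the keywords anywhere inside a word, so that 'ssssalgaecidllll' gives the output 'algaecid'. For id 1, re_patt2 returns algaecid, algaecid, algaecid, but my desired output is algaecide, algaecid, algaecides.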
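Any advice would be greatly appreciated. Thanks in advance.
You can change pattern2 so that it optionally matches non-whitespace characters other than a comma, [^\s,]*, on the left and right of the keywords: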
pattern2 = r"[^\s,]*(?:" + '|'.join(rgx_words1) + r")[^\s,]*"
The code could look like this:
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
pattern2 = r"[^\s,]*(?:" + '|'.join(rgx_words1) + r")[^\s,]*"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt, x['text']), axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2, x['text']), axis=1)
print(mydf)
Output
   id                                               text                            matches                           matches2
0   1  I, will, find, algaecide, dd, algaecid, algaec...  [algaecide, algaecid, algaecides]  [algaecide, algaecid, algaecides]
1   2                       fff, algaecid, dd, algaecide              [algaecid, algaecide]              [algaecid, algaecide]
2   3                       ssssalgaecidllll, algaecides                       [algaecides]     [ssssalgaecidllll, algaecides]
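As a side note, the same two columns can probably be built without a row-wise apply by using the pandas string accessor. The sketch below assumes the keywords and sample data from the question; wrapping each keyword in re.escape and using non-capturing groups are additions of this sketch, not part of the answer above.

```python
import re
import pandas as pd

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# re.escape guards against keywords that contain regex metacharacters.
alternation = '|'.join(re.escape(word) for word in rgx_words1)

mydf = pd.DataFrame(
    [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'],
     [2, 'fff, algaecid, dd, algaecide'],
     [3, 'ssssalgaecidllll, algaecides']],
    columns=['id', 'text'])

# Whole-word matches only: \b on both sides of the keyword group.
mydf['matches'] = mydf['text'].str.findall(r'\b(?:' + alternation + r')\b')

# Keyword plus any surrounding characters that are not whitespace or a comma.
mydf['matches2'] = mydf['text'].str.findall(
    r'[^\s,]*(?:' + alternation + r')[^\s,]*')

print(mydf[['id', 'matches', 'matches2']])
```

Because the groups are non-capturing, findall returns the full matched text rather than only the group contents.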
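Is there a way, for example for 'ssssalgaecidllll' in id 3, to get 'algaecid' in column matches2 instead of 'ssssalgaecidllll'?
@DinkoJantoš You can do that by reordering the words, for example
rgx_words1 = ['algaecides', 'algaecide', 'algaecid', 'anti-bakterien']
and then joining the words with a pipe only:
pattern2 = '|'.join(rgx_words1)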
I did this:
rgx_words1.sort(key=len, reverse=True)  # ['anti-bakterien', 'algaecides', 'algaecide', 'algaecid']
mypatt = '|'.join(rgx_words1)
re_mypatt = re.compile(mypatt)
mydf['matches3'] = mydf.apply(lambda x: re.findall(re_mypatt, x['text']), axis=1)
In my new column matches3 I get algaecid, algaecid, while column matches2 gives ssssalgaecidllll, algaecides. For the word ssssalgaecidllll the output algaecid is correct, but for the word algaecides I get algaecid and I need algaecides.
@DinkoJantoš If you join the words with | only, the alternation takes the first alternative that matches. After sorting, the alternatives look like anti-bakterien|algaecides|algaecide|algaecid, so algaecides is tried first and is matched whenever it can be.
So is there any solution? Or is this wrong -> pattern2 = '|'.join(rgx_words1)? Can I get, from the word 'sssalgaecidesllll', the output algaecides instead of algaecid?
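To illustrate the point about alternation order, here is a minimal sketch that compares the original keyword order with a longest-first order. The names short_first, long_first and by_length are hypothetical and exist only for this example; re.escape is an added precaution, not something the thread uses.

```python
import re

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# Original order: 'algaecid' is tried first and wins wherever it fits,
# so the longer keywords that start with it are never reached.
short_first = re.compile('|'.join(map(re.escape, rgx_words1)))

# Longest first: a longer keyword is tried before any of its prefixes.
by_length = sorted(rgx_words1, key=len, reverse=True)
long_first = re.compile('|'.join(map(re.escape, by_length)))

for text in ['sssalgaecidesllll', 'ssssalgaecidllll, algaecides']:
    print(text)
    print('  original order:', short_first.findall(text))
    print('  longest first: ', long_first.findall(text))
```

With the longest-first pattern, 'sssalgaecidesllll' should yield algaecides, while 'ssssalgaecidllll' still yields algaecid. Note that a pattern compiled before the list was sorted keeps the old order, so it has to be rebuilt and recompiled after sorting.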