Python regex: capture group captures/consumes subsequent matches
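In a regular expression, how do you match any number of any characters (for example, (.|\n)*) without consuming other matches that may occur later? If that question is unclear, here is my situation:
In a text file I have a bunch of emails, headers included, all pasted together.
EDIT: The cleaner version below has each header at the beginning of a line. My actual data may or may not look like that. Each header component (such as "From: xxx") can be preceded by anything or by nothing. In some cases many emails and headers may sit on a single line, after a bunch of other junk. On top of that, I need to recognize other email headers that also include "From", so I need to identify the whole header pattern.
Before my edit, several of the answers below relied on things like ^ or tab delimiting, which I can't count on. They probably need only slight changes to work, but I'm (obviously) not very good with regex and couldn't adjust them myself. I apologize for leaving this out earlier; only a few answerers caught it, which is another consequence of my inexperience with regular expressions.
Here is the ugly version, a string I am actually trying to match. It contains two headers and messages to pull out: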
emailsString = u"""From:\n Lastname, Firstname\n Sent:\n Monday, June 24, 2013 1:48 PM\n To:\n Othername, Name\n Subject:\n RE: Center update\n Message message message.\n Such a lovely message\n Take care,\n Firstname Lastname, MS\n Long signature\n in this email\n \n E-mail:\n email@email.com\n Web\n my blog\n From:\n Lastname, Firstname\n Sent:\n Monday, June 24, 2013 9:33 AM\n To:\n Othername, Name\n Subject:\n Center update\n Importance:\n High\n Good Morning Name,\n I hope this finds you doing well.\n I wanted to inform you of some changes. The Center will be closing August 30\n th\n . or September 1\n st\n . I\u2019ve enjoyed my experience. """
Here is a cleaner version that shows what the headers look like:
From: Lastname, Firstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: Othername, Name
Subject: blah
Importance: High
Message message message
second line of message
second para of message
From: Lastname, Firstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: Othername, Name
Subject: blahblah
message
...
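I'm trying to pull the information in the headers out with a regex, along with the message itself. I have a regex that successfully matches all the headers, but I'm struggling with the message. The problem is that the message can contain anything (or nothing). There may be multiple line breaks, and so on. I want to capture all of that, but I still want to split the emails apart. My attempt (note that the "Importance" part of the header is optional):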
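for hit in re.finditer(r'[\s\n]*From:[\s\n]*(?P<from>.*)[\s\n]*Sent:[\s\n]*(?P<date>.*)[\s\n]*To:[\s\n]*(?P<to>.*)[\s\n]*Subject:[\s\n]*(?P<subject>.*)[\s\n]*(?:Importance:)?[\s\n]*.*[\s\n]*(?P<message>(.|\n)*)', allEmailsString):
    print("From: " + hit.group('from'))
    print("To: " + hit.group('to'))
    print("Date: " + hit.group('date'))
    print("Subject: " + hit.group('subject'))
    print("Message: " + hit.group('message'))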
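The problem is that the message group captures everything. I correctly get the from/to/etc. of the first email header, and then a message that contains that email's message plus all the email headers and messages that follow. I need to grab "everything up to the next email header/regex match, or the end of the string".
I already have a workaround: I can drop the message capture group and match only the headers, then iterate over the match objects and slice the string based on their start/end positions. For example, message 1 runs from match1.end() up to match2.start().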
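The workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the asker's actual code: HEADER, split_emails and the sample text are invented here, the subject is assumed to fit on one line, and an optional Importance line would simply stay at the top of the message in this simplified version.

```python
import re

# Header-only pattern. The field labels come from the question; the names
# HEADER and split_emails are hypothetical, chosen for this sketch.
HEADER = re.compile(
    r'From:\s*(?P<from>.*?)\s*'
    r'Sent:\s*(?P<date>.*?)\s*'
    r'To:\s*(?P<to>.*?)\s*'
    r'Subject:\s*(?P<subject>[^\r\n]*)',
    re.DOTALL)


def split_emails(text):
    """Return one dict per email; the message is the text between headers."""
    matches = list(HEADER.finditer(text))
    emails = []
    for i, m in enumerate(matches):
        # The message runs from the end of this header to the start of the
        # next header, or to the end of the string for the last email.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        email = m.groupdict()
        email['message'] = text[m.end():end].strip()
        emails.append(email)
    return emails


sample = ("From: A, B\nSent: Monday, July 15th, 2011, 9:36 AM\n"
          "To: C, D\nSubject: blah\nfirst message\n"
          "From: E, F\nSent: Thursday, July 18th, 2011, 10:45 AM\n"
          "To: G, H\nSubject: blahblah\nsecond message\n")
emails = split_emails(sample)
```

Because only the headers are matched, the message never has a chance to swallow the next email; the price is a second pass over the match objects.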
So, I'm asking:
- Is there a way to do this inside the regex, with a capture group?
- Is there a better workaround?
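Some assumptions behind the regex that follows:
- I assumed that the "Sent" part always contains the name of a day of the week.
- I assumed that if an "Importance" line is present, a single word describes the importance, hence [^ \t\r\n]+
- I assumed that the subject cannot span several lines, hence [^\r\n]+
- [ \t\r\n]*(?P<name>.*?[^ \t\r\n])[ \t\r\n]* has a stripping effect on the captured group.
- If a message consists only of blank lines, the match gives '' as the message.
- The \Z is needed in case no other lines follow the last message, so that the last email is caught, as in my text example.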
import re
emailsString = (u' From:\n'
' Lastname, Firstname\n'
' Sent:\n'
' Monday, June 24, 2013 1:48 PM\n'
' To:\n'
' Othername, Name\n'
' Subject:\n'
' RE: Center update\n'
' Message message message.\n'
' Such a lovely message\n'
' Take care,\n'
' Firstname Lastname, MS\n'
' Long signature\n'
' in this email\n'
' \n'
' E-mail:\n'
' email@email.com\n'
' Web\n'
' my blog\n'
' From:\n'
' Lastname, Firstname\n'
' Sent:\n'
' Monday, June 24, 2013 9:33 AM\n'
' To:\n'
' Othername, Name\n'
' Subject:\n'
' Center update\n'
' Importance:\n'
' High\n'
' Good Morning Name,\n'
' I hope this finds you doing well.\n'
' I wanted to inform you of some changes. The Center will be closing August 30\n'
' th\n'
' . or September 1\n'
' st\n'
' . I\u2019ve enjoyed my experience. ')
allEmailsString = '''
From: FirstLastname, FirstFirstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: TheOne
Subject: blah
Importance: High
Message message message
second line of message
second para of message
From: MidLastname, MidFirstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: TWOTWO
Subject: once upon
From: LastLastname, LastFirstname
Sent: Saturday, July 20th, 2011, 12:51 AM
To: Mr Three
Subject: blobloblo
Nothing to say. '''
dispat = ("* from: {from}\n"
"* to: {to}\n"
"* date: {date}\n"
"* subject: {subject}\n"
"** message (beginning on next line):\n{message}\n"
"-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-")
regx = re.compile('From:[ \t\r\n]*(?P<from>.*?[^ \t\r\n])'
'[ \t\r\n]*'
'Sent:[ \t\r\n]*'
'(?P<date>.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?[^ \t\r\n])'
'[ \t\r\n]*'
'To:[ \t\r\n]*(?P<to>.*?[^ \t\r\n])'
'[ \t\r\n]*'
'Subject:[ \t\r\n]*(?P<subject>[^\r\n]+)'
'[ \t\r\n]*'
'(?:Importance:[ \t\r\n]*(?P<importance>[^ \t\r\n]+))?'
'[ \t\r\n]*'
'(?P<message>.*?)'
'(?=[ \t\r\n]*From:.*?'
'Sent:.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?'
'To.*?Subject:|\Z)',
re.DOTALL)
for s in (emailsString, allEmailsString):
    print(''.join(dispat.format(**d)
                  for d in (ma.groupdict('') for ma in regx.finditer(s))))
    print('\n#######################################\n')
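The core trick in the regx pattern above is a lazy .*? for the message followed by a lookahead that also accepts \Z. The stripped-down sketch below contrasts it with a greedy message group; the sample text and the names greedy and lazy are made up for illustration, and the real pattern checks a full header rather than just "From:".

```python
import re

text = "From: a\nbody one\n\nFrom: b\nbody two\n"

# Greedy: the message group runs to the end of the string, so only one
# match is found and it swallows the second email as well.
greedy = re.compile(r'From:[ \t]*(?P<sender>[^\r\n]*)\s*(?P<message>.*)',
                    re.DOTALL)

# Lazy + lookahead: stop right before the next "From:" (which is not
# consumed), or at the end of the string (\Z) for the last email.
lazy = re.compile(r'From:[ \t]*(?P<sender>[^\r\n]*)\s*(?P<message>.*?)'
                  r'(?=\s*From:|\s*\Z)',
                  re.DOTALL)

greedy_messages = [m.group('message') for m in greedy.finditer(text)]
lazy_messages = [m.group('message') for m in lazy.finditer(text)]
```

Since the lookahead consumes nothing, the next call made by finditer can start matching at the following "From:".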
I'd simply divide (split) and conquer (re.match):
This one may be painful to read, so it is expanded for clarity.
It uses multiline mode and no DotAll.
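@mobabo - Edited this after your first comment.
There has to be a clear delineation of your keywords, and there isn't one. Your comment that
"I can't count on things like '^From' to work"
shows that you didn't look at the previous regex, which is identical to this one in that part: it uses ^[^\S\n]*From: rather than ^From.
Also, there is no clear boundary between the subject and the message, or between the importance and the message. The subject only has an end point if "Importance" is part of the email.
I made a regex that handles both your dirty emails and your clean ones; at the bottom is a Perl program that exercises it. The output is included. See if this solves your problem (see below).
Unfortunately, this is the best you can hope for. Good luck!
(Note: if Python had recursion, this regex would be 1/4 of this size.)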
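Python's built-in re module indeed has no recursion or subroutine calls, but the repeated sub-pattern does not have to be typed by hand: the regex can be assembled from strings. The sketch below is only an illustration; FIELDS, LABEL, value, header and the sample text are names and data invented here, and the assembled pattern is intended to be the same as the compressed regex of this answer, compiled with re.MULTILINE.

```python
import re

FIELDS = ('From', 'Sent', 'To', 'Subject', 'Importance')

# Start of any header line: optional horizontal whitespace, then a label.
LABEL = r'^[^\S\n]*(?:%s):' % '|'.join(FIELDS)


def value():
    """Any characters, stopping right before the next header label."""
    return r'(?:(?!\s*%s)[\S\s])*' % LABEL


def header(name):
    """A label at the start of a line followed by its captured value."""
    return r'^[^\S\n]*%s:\s*(?P<%s>%s)' % (name, name.lower(), value())


PATTERN = re.compile(
    header('From')
    + r'(?:\s*' + header('Sent') + r')?'
    + r'(?:\s*' + header('To') + r')?'
    + r'(?:\s*' + header('Subject')
    + r'(?:\s*' + header('Importance') + r')?)?',
    re.MULTILINE)

sample = ("From: A, B\nSent: Monday\nTo: C\nSubject: blah\n"
          "Importance: High\n\nbody line\n\n"
          "  From: E\nSent: Tuesday\nTo: G\nSubject: hi\nbody two\n")
matches = [m.groupdict() for m in PATTERN.finditer(sample)]
```

As in the Perl output, the message ends up inside the last header group that matched: importance when that line exists, subject otherwise.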
Output of the regx script above:
* from: Lastname, Firstname
* to: Othername, Name
* date: Monday, June 24, 2013 1:48 PM
* subject: RE: Center update
** message (beginning on next line):
Message message message.
Such a lovely message
Take care,
Firstname Lastname, MS
Long signature
in this email
E-mail:
email@email.com
Web
my blog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-* from: Lastname, Firstname
* to: Othername, Name
* date: Monday, June 24, 2013 9:33 AM
* subject: Center update
** message (beginning on next line):
Good Morning Name,
I hope this finds you doing well.
I wanted to inform you of some changes. The Center will be closing August 30
th
. or September 1
st
. I\u2019ve enjoyed my experience.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#######################################
* from: FirstLastname, FirstFirstname
* to: TheOne
* date: Monday, July 15th, 2011, 9:36 AM
* subject: blah
** message (beginning on next line):
Message message message
second line of message
second para of message
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-* from: MidLastname, MidFirstname
* to: TWOTWO
* date: Thursday, July 18th, 2011, 10:45 AM
* subject: once upon
** message (beginning on next line):
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-* from: LastLastname, LastFirstname
* to: Mr Three
* date: Saturday, July 20th, 2011, 12:51 AM
* subject: blobloblo
** message (beginning on next line):
Nothing to say.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#######################################
import re

# `data` holds the contents of your text file (it is not defined here)
delimiter = r'(^|\n)From:'
capturer = re.compile(r'From:[\n\s]*(?P<from>.*)[\n\s]*'
                      r'Sent:[\n\s]*(?P<date>.*)[\n\s]*'
                      r'To:[\n\s]*(?P<to>.*)[\n\s]*'
                      r'Subject:[\n\s]*(?P<subject>.*)[\n\s]*'
                      # Caveat: without an Importance line, the `.*` here
                      # swallows the first line of the message.
                      r'(?:Importance:)?[\n\s]*.*[\n\s]*'
                      r'(?P<message>(\n|.)*)')
# re.split drops the "From:" delimiter, so put it back on each chunk
raw_emails = ['From:' + d for d in re.split(delimiter, data) if d.strip()]
emails = []
for raw_email in raw_emails:
    parts = capturer.match(raw_email)
    emails.append(parts.groupdict())
[{'date': 'Monday, July 15th, 2011, 9:36 AM',
'from': 'Lastname, Firstname',
'message': 'Message message message\nsecond line of message\n\nsecond para of message\n',
'subject': 'blah',
'to': 'Othername, Name'},
{'date': 'Thursday, July 18th, 2011, 10:45 AM',
'from': 'Lastname, Firstname',
'message': '...\n',
'subject': 'blahblah',
'to': 'Othername, Name'}]
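The delimiter above relies on each "From:" starting a line, which the edited question says cannot be counted on. One possible variation, sketched here as an assumption rather than as part of the original answer, is to split on a zero-width lookahead instead. It needs Python 3.7 or later, where re.split accepts patterns that match the empty string, and it would also split on a literal "From: " inside a message body. The one-line sample data is invented.

```python
import re

# One-line "dirty" input: the second header does not start a line.
data = ("From: A Sent: Monday To: B Subject: one  first body "
        "From: C Sent: Tuesday To: D Subject: two  second body")

# Zero-width split point in front of every "From:" (Python 3.7+).
# The delimiter is not consumed, so there is nothing to glue back on.
chunks = [c for c in re.split(r'(?=From:\s)', data) if c.strip()]
```

Each chunk can then be handed to a per-email pattern such as capturer.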
# Compressed
# -------------------
# ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
# Expanded
# -------------------
#
^ [^\S\n]* From: \s*
(?P<from>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
(?:
\s* ^ [^\S\n]* Sent: \s*
(?P<sent>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
)?
(?:
\s* ^ [^\S\n]* To: \s*
(?P<to>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
)?
(?:
\s* ^ [^\S\n]* Subject: \s*
(?P<subject>
(?:
(?!
\s* ^ [^\S\n]*
(?:
(?: From | Sent | To | Subject | Importance )
)
:
)
[\S\s]
)*
)
(?:
\s* ^ [^\S\n]* Importance: \s*
(?P<importance>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
)?
)?
# // Output from Perl sample code (below)
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Monday, July 15th, 2011, 9:36 AM
# // To:
# // Othername, Name
# // Subject:
# // blah
# // Importance/Message:
# // High
# //
# // Message message message
# // second line of message
# //
# // second para of message
# //
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Thursday, July 18th, 2011, 10:45 AM
# // To:
# // Othername, Name
# // Subject/Message:
# // blahblah
# //
# // message
# //
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Monday, June 24, 2013 1:48 PM
# // To:
# // Othername, Name
# // Subject/Message:
# // RE: Center update
# // Message message message.
# // Such a lovely message
# // Take care,
# // Firstname Lastname, MS
# // Long signature
# // in this email
# //
# // E-mail:
# // email@email.com
# // Web
# // my blog
# //
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Monday, June 24, 2013 9:33 AM
# // To:
# // Othername, Name
# // Subject:
# // Center update
# // Importance/Message:
# // High
# // Good Morning Name,
# // I hope this finds you doing well.
# // I wanted to inform you of some changes. The Center will be closing August 30
# //
# // th
# // . or September 1
# // st
# // . I've enjoyed my experience.
# //
# ------------------------------------------------------------
# # Perl sample code
# use strict;
# use warnings;
#
# $/ = undef;
#
# my $str = <DATA>;
#
#
#
# while ( $str =~ /
# ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
# /xmg)
#
# {
# print "\n\n======================\n";
# print "From: \n\t$+{from}\n";
# if (defined $+{sent})
# {
# print "Sent: \n\t$+{sent}\n";
# }
# if (defined $+{to})
# {
# print "To: \n\t$+{to}\n";
# }
# if (defined $+{importance})
# {
# print "Subject: \n\t$+{subject}\n";
# print "Importance/Message: \n\t$+{importance}\n";
# }
# elsif (defined $+{subject})
# {
# print "Subject/Message: \n\t$+{subject}\n";
# }
# }
#
#
# __DATA__
#
# From: Lastname, Firstname
# Sent: Monday, July 15th, 2011, 9:36 AM
# To: Othername, Name
# Subject: blah
# Importance: High
#
# Message message message
# second line of message
#
# second para of message
#
# From: Lastname, Firstname
# Sent: Thursday, July 18th, 2011, 10:45 AM
# To: Othername, Name
# Subject: blahblah
#
# message
#
#
#
#
#
# From:
# Lastname, Firstname
# Sent:
# Monday, June 24, 2013 1:48 PM
# To:
# Othername, Name
# Subject:
# RE: Center update
# Message message message.
# Such a lovely message
# Take care,
# Firstname Lastname, MS
# Long signature
# in this email
#
# E-mail:
# email@email.com
# Web
# my blog
# From:
# Lastname, Firstname
# Sent:
# Monday, June 24, 2013 9:33 AM
# To:
# Othername, Name
# Subject:
# Center update
# Importance:
# High
# Good Morning Name,
# I hope this finds you doing well.
# I wanted to inform you of some changes. The Center will be closing August 30
# th
# . or September 1
# st
# . I've enjoyed my experience.
#
#