Python: Using a regular expression to find the full text only after a specific word in a string
text = list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no
vill bitta ta naliya abadasa despatched through destination march 18 terms of
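Goal:
I want to extract the text after the word "invoice", in particular after its second occurrence.
My approach:
txt = re.findall('invoice (.*)',text)
With the approach above, I expect a list of strings like this: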
txt = ['in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered','parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment
taluka ..... #rest of the string]
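But what I get is everything in text that follows the first "invoice", not two separate strings.
If I use text.partition('invoice'), I do not get the strings shown in txt either.
Any help would be appreciated.

If you want the 2 matches from the question, you can use 2 capture groups. First match up to the first occurrence of invoice. Then capture in group 1 everything up to the second occurrence of invoice. Then match invoice again and capture the rest of the string in group 2.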
^.*? invoice (.*?) invoice (.*)
For example:
import re
text = "list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of"
regex = r"^.*? invoice (.*?) invoice (.*)"
matches = re.search(regex, text)
if matches:
    print(matches.group(1))
    print('\n')
    print(matches.group(2))
Output:
in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered
parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of
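One caveat worth sketching: if the input really does contain line breaks (the question's text is shown wrapped), `.` does not match `\n` by default, so the two-group pattern above finds nothing. The sample below is a shortened, hypothetical stand-in for the real document, and `re.DOTALL` is one possible way to handle it.

```python
import re

# Hypothetical shortened sample with line breaks (not the original document).
sample = (
    "check 01 original invoice in favour of company z\n"
    "will not be considered invoice parth enterprise invoice no dated\n"
    "terms of"
)

pattern = r"^.*? invoice (.*?) invoice (.*)"

# Without re.DOTALL, "." stops at "\n", so the pattern cannot span lines.
no_flag = re.search(pattern, sample)

# With re.DOTALL, "." also matches "\n" and both groups are captured.
with_flag = re.search(pattern, sample, flags=re.DOTALL)

print(no_flag)
if with_flag:
    print(repr(with_flag.group(1)))
    print(repr(with_flag.group(2)))
```

If the text is known to be a single line, as in the example above, the flag is not needed.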
This can be done easily with the split() method. For example:
myText="jhon is going abroad jhon is thinking about future jhon is absent"
1) print(myText.split('jhon',1)[1])
output -> is going abroad jhon is thinking about future jhon is absent
2) print(myText.split('jhon',2)[2])
output -> is thinking about future jhon is absent
3) print(myText.split('jhon',3)[3])
output -> is absent
1 -> it will print text after first occurrence of jhon
2 -> it will print text after second occurrence of jhon
3 -> it will print text after third occurrence of jhon
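The three calls above follow one pattern, so they could be wrapped in a small helper. This is only a sketch: `text_after_nth` is a hypothetical name, and raising `ValueError` when there are too few occurrences is an assumption about the desired behavior.

```python
def text_after_nth(text: str, word: str, n: int) -> str:
    """Return everything after the n-th occurrence of word (1-based)."""
    # Splitting at most n times leaves the remainder untouched in parts[n].
    parts = text.split(word, n)
    if len(parts) <= n:
        raise ValueError(f"{word!r} occurs fewer than {n} times")
    return parts[n]


my_text = "jhon is going abroad jhon is thinking about future jhon is absent"
second = text_after_nth(my_text, "jhon", 2)
print(second)
```

Note that `str.split` matches substrings, so a word such as "invoices" would also count as an occurrence of "invoice".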
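Your regex invoice (.*) matches the first literal invoice followed by a space, and then (.*) greedily captures the rest of the text into group 1. That is what is happening, and it is the expected, correct behavior.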
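A minimal sketch of that behavior, using a shortened, hypothetical sample string rather than the full document. It also shows why `partition` gives the same kind of result: both stop at the first occurrence only.

```python
import re

# Hypothetical shortened sample with three occurrences of "invoice".
sample = "original invoice in favour of z invoice parth invoice no dated"

# The greedy group swallows everything after the first "invoice ",
# so findall returns a single element.
greedy = re.findall(r"invoice (.*)", sample)

# partition splits only at the first occurrence of the separator.
before, sep, after = sample.partition("invoice")

print(greedy)
print(repr(after))
```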
But if you want the output you mentioned, you have to write your regex accordingly. You can use the following regex to get the desired result:
invoice (.*?)(?=(?:(?:invoice.*){2,}|$))
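Regex explanation:
invoice (with a trailing space) - matches the literal text invoice followed by a space
(.*?) - lazily matches the text and captures it in group 1
(?=(?:(?:invoice.*){2,}|$)) - positive lookahead that stops the lazy match at a position from which invoice occurs at least two more times, or at the end of the whole input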
import re
s = '''list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of'''
print(re.findall(r'invoice (.*?)(?=(?:(?:invoice.*){2,}|$))', s))
The output is as you wanted:
['in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered ', 'parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of']
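To make the lookahead easier to follow, here is the same pattern on two short, hypothetical strings. As far as I can trace it, the last two "invoice" segments always end up merged into the final match, which happens to be what the question asks for when there are exactly three occurrences.

```python
import re

pattern = r"invoice (.*?)(?=(?:(?:invoice.*){2,}|$))"

# Three occurrences: the second match runs to the end of the string.
three = "original invoice in favour of z invoice parth invoice no dated"
result_three = re.findall(pattern, three)

# Four occurrences: only the last two segments are merged.
four = "a invoice b invoice c invoice d invoice e"
result_four = re.findall(pattern, four)

print(result_three)
print(result_four)

# Captures that stop before another "invoice" keep a trailing space.
cleaned = [part.strip() for part in result_three]
print(cleaned)
```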
Update: The regex I use relies on a positive lookbehind and a positive lookahead. It prints:
Match 1:
in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered
Match 2:
parth enterprise â
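The regex from this update is not preserved here. The pattern below is an assumption about what a lookbehind/lookahead version could look like, not the answerer's original, and the sample is a shortened, hypothetical string with a line break to mimic the wrapped input.

```python
import re

# Hypothetical shortened sample; the line break mimics the wrapped input.
sample = (
    "original invoice in favour of company z\n"
    "will not be considered invoice parth enterprise invoice no dated"
)

# Assumed pattern, not the answerer's original regex:
# (?<=invoice )  positive lookbehind: the match must follow "invoice "
# .*?            lazy match; re.DOTALL lets "." cross line breaks
# (?= invoice)   positive lookahead: the match must be followed by " invoice"
pattern = r"(?<=invoice ).*?(?= invoice)"

matches = re.findall(pattern, sample, flags=re.DOTALL)
for i, match in enumerate(matches, start=1):
    print(f"Match {i}:\n{match}")
```

A pattern of this kind only returns text that sits between two occurrences of invoice, which is why the second match stops at "parth enterprise" instead of running to the end of the string.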
It may be more efficient to solve this by splitting the input with a simpler regex:
import re
text= r"""list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no
vill bitta ta naliya abadasa despatched through destination march 18 terms of"""
#matches = re.split(r'\b\s*invoice\s*\b', text)[1:-1] # if arbitrary white space can come before and after "invoice"
matches = re.split(r'\b ?invoice ?\b', text)[1:-1]
for i, match in enumerate(matches):
    print(f'\nMatch {i + 1}:\n', match, sep='')
Prints:
Match 1:
in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered
Match 2:
parth enterprise â
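It looks like you did not read the OP's post correctly, so your output is not what the OP expects. Check carefully that the OP's first string ends with "...will not be considered". Also, there are no line breaks in the OP's text; they are only a formatting problem that must have come from pasting while creating the post. The OP used only .* and the captured text covers all of it, which confirms that the text contains no line breaks. In fact, the output in your post is already the output of the regex the OP mentioned in the post.
@Silvanas Sorry for the late reply. I see what you mean; I overlooked that the expected result has two outputs. He only wants Match 1 and Match 2, not Match 3. Basically, his Match 2 is the combination of Match 2 + Match 3.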
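Following up on that comment, one way to keep the split approach and still get "Match 2 + Match 3" as a single string might be to limit the number of splits. This is a sketch on a shortened, hypothetical sample; `maxsplit=2` is the only change compared with the split-based answer.

```python
import re

# Hypothetical shortened sample with three occurrences of "invoice".
sample = "original invoice in favour of z invoice parth invoice no dated"

# maxsplit=2 splits at the first two occurrences only, so the third
# "invoice" stays inside the last piece.
parts = re.split(r"\s*\binvoice\b\s*", sample, maxsplit=2)

if len(parts) == 3:
    # parts[0] is the text before the first "invoice"; drop it.
    wanted = parts[1:]
else:
    # Fewer than two occurrences were found.
    wanted = []

print(wanted)
```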