Python regular expression for capturing dates that follow a specific pattern
I am trying to extract data from several PDFs. One data point is a date, and the string that precedes the date differs between PDFs. I checked that each regex statement works on its own; however, when I combine them into a single statement inside a for loop, no date is extracted. Below are the strings I am trying to match, each followed by the regex statement that extracts the date after "DATE OF BIRTHDAY":
DATE OF BIRTHDAY\n01/11/2011
date_of_birthday1 = re.search('(?<=DATE OF BIRTHDAY \\n)(.*)', img).groups()
DATE OF BIRTHDAY\n\n02/14/2015
date_of_birthday2 = re.search('(?<=DATE OF BIRTHDAY \\n\\n)(.*)', img).groups()
DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018
date_of_birthday3 = re.search('(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups()
Combined into one statement, it looks like this:
date_of_birthdays = re.search('(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(.*)|(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups
df = pd.DataFrame({"Birthdays": ['01/11/2011', '02/14/2015', '05/07/2018']})
df
However, I cannot extract any of the dates. Any idea what I am doing wrong?

This works:
>>> import re
>>> re.findall(
... r"(?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)",
... (
... r'DATE OF BIRTHDAY\n01/11/2011' + "\n"
... r'DATE OF BIRTHDAY\n\n02/14/2015' + "\n"
... r'DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018' + "\n"
... ))
['01/11/2011', '02/14/2015', '05/07/2018']
>>>
Regex expanded:
(?: DATE [ ] OF [ ] BIRTHDAY )
(?:
\\ n
(?: \\ n )?
| [ ] GIRL [ ] \\ n \\ ni [ ] : [ ] Pll [ ] i [ ] ii \\ n i [ ] \\ n \\ n Pll [ ]
)?
( .* ) # (1)
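The expanded layout above is close to what Python's `re.VERBOSE` flag accepts, since every literal space is already written as `[ ]`. The following is a minimal sketch of that idea, assuming the same raw-string test input as the answer (literal backslash-n sequences rather than real newlines); the names `PATTERN`, `text` and `dates` are chosen here for illustration only.

```python
import re

# The answer's pattern in verbose form: whitespace and "#" comments inside
# the pattern are ignored, and [ ] stands for one literal space.
PATTERN = re.compile(
    r"""
    (?: DATE [ ] OF [ ] BIRTHDAY )
    (?:
        \\n (?: \\n )?     # one or two literal backslash-n sequences
      | [ ] GIRL [ ] \\n \\n i [ ] : [ ] Pll [ ] i [ ] ii \\n i [ ] \\n \\n Pll [ ]
    )?
    ( .* )                 # (1) the date
    """,
    re.VERBOSE,
)

# Same three samples as in the answer, joined by real newlines.
text = "\n".join([
    r"DATE OF BIRTHDAY\n01/11/2011",
    r"DATE OF BIRTHDAY\n\n02/14/2015",
    r"DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
])

dates = PATTERN.findall(text)
print(dates)  # ['01/11/2011', '02/14/2015', '05/07/2018']
```

If the text extracted from the PDFs contains real newlines instead, the `\\n` parts would need to become `\n`.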
Just a fair warning: using that expression with lookbehind assertions poses a problem in these two alternatives:
(?<= DATE [ ] OF [ ] BIRTHDAY \\ n \\ n )
( .* ) # (1)
| (?<= DATE [ ] OF [ ] BIRTHDAY \\ n )
( .* ) # (2)
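Well, it can be done, so a corrected lookbehind version and some benchmarks are included for consideration (this is not the ideal way to go about it).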
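Comments:
Can you show an example of the text you are trying to match?
It seems to work.
You should probably use + instead of *. Otherwise it matches an empty line, so the first regexp will match the second example input.
I didn't say that was the problem. Your regexp works for me; it just has one extra match.
re.search finds only one match; you should use re.findall to find all matches.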
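The two points raised in the comments (`re.search` versus `re.findall`, and `+` versus `*`) can be illustrated with a short sketch. It assumes the extracted text contains real newlines, and the variable names are hypothetical.

```python
import re

# Two records with real newlines; the second has a blank line before the date.
text = "DATE OF BIRTHDAY\n01/11/2011\nDATE OF BIRTHDAY\n\n02/14/2015\n"

pattern_star = r"(?<=DATE OF BIRTHDAY\n)(.*)"
pattern_plus = r"(?<=DATE OF BIRTHDAY\n)(.+)"

# re.search stops at the first match.
first = re.search(pattern_star, text).group(1)

# re.findall returns every match; with * the blank line in the second
# record produces an empty match.
all_star = re.findall(pattern_star, text)

# With + an empty line can no longer match.
all_plus = re.findall(pattern_plus, text)

print(first)     # 01/11/2011
print(all_star)  # ['01/11/2011', '']
print(all_plus)  # ['01/11/2011']
```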
The same two alternatives with a negative lookahead added to (2), so that it cannot match when another \\n follows:
(?<= DATE [ ] OF [ ] BIRTHDAY \\ n \\ n )
( .* ) # (1)
| (?<= DATE [ ] OF [ ] BIRTHDAY \\ n )
(?! \\ n )
( .* ) # (2)
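To see what the added `(?! \\ n )` changes, the two-alternative pattern can be run with and without it on the double `\n` sample. This is a small sketch using the answer's raw-string convention (literal backslash-n); `naive`, `fixed` and the result names are illustrative.

```python
import re

# Literal backslash-n sequences, as in the answer's test strings.
text = r"DATE OF BIRTHDAY\n\n02/14/2015"

naive = r"(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(.*)"
fixed = r"(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(?!\\n)(.*)"

# Without the lookahead, alternative (2) succeeds right after the first \n
# and swallows the second \n together with the date.
naive_groups = re.search(naive, text).groups()

# With the lookahead, alternative (2) is rejected at that position and
# alternative (1) matches after both \n sequences.
fixed_groups = re.search(fixed, text).groups()

print(naive_groups)  # (None, '\\n02/14/2015')
print(fixed_groups)  # ('02/14/2015', None)
```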
Regex1: (?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 0.29 s, 294.80 ms, 294801 µs
Matches per sec: 508,817
Regex2: (?:(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ]))(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 2.27 s, 2268.42 ms, 2268417 µs
Matches per sec: 66,125
Regex3: (?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 2.76 s, 2760.81 ms, 2760809 µs
Matches per sec: 54,331
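The figures above come from a separate benchmarking tool. A rough way to repeat the comparison in Python is sketched below with `timeit`; the helper `extract` and the dictionary names are assumptions made for this example, and absolute timings from Python's `re` engine will not match the numbers listed above.

```python
import re
import timeit

text = "\n".join([
    r"DATE OF BIRTHDAY\n01/11/2011",
    r"DATE OF BIRTHDAY\n\n02/14/2015",
    r"DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
])

girl = r"[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ]"
head = r"DATE[ ]OF[ ]BIRTHDAY"

patterns = {
    # Regex1: consume the prefix, capture the date.
    "Regex1": re.compile(rf"(?:{head})(?:\\n(?:\\n)?|{girl})?(.*)"),
    # Regex2: lookbehinds grouped, one shared capture group.
    "Regex2": re.compile(
        rf"(?:(?<={head}\\n\\n)|(?<={head}\\n)(?!\\n)|(?<={head}{girl}))(.*)"
    ),
    # Regex3: one capture group per lookbehind alternative.
    "Regex3": re.compile(
        rf"(?<={head}\\n\\n)(.*)|(?<={head}\\n)(?!\\n)(.*)|(?<={head}{girl})(.*)"
    ),
}

def extract(pattern, s):
    # Take whichever capture group took part in each match.
    return [m.group(m.lastindex) for m in pattern.finditer(s)]

results = {name: extract(p, text) for name, p in patterns.items()}

for name, p in patterns.items():
    seconds = timeit.timeit(lambda: p.findall(text), number=1000)
    print(f"{name}: {results[name]} in {seconds:.4f} s per 1000 runs")
```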