Extracting dates in different formats from a string using regular expressions in Python

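I need to extract dates from strings in Python using regular expressions. Each date can be in one of several formats and may be surrounded by arbitrary text.

The date formats are: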

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
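After extracting the dates, I need to sort them in ascending order.

I have tried these six regex patterns, but they do not seem to handle every case.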

pattern1 = r'((?:\d{1,2}[- ,./]*)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[- ,./]*\d{4})'

pattern2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{1,2}[ ,./-]*\d{4})'

pattern3 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{4})'

pattern4 = r'((?:\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})))'

pattern5 = r'(?:(\s\d{2}[/-](?:\d{4})))'

pattern6 = r'(?:\d{4})'
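
A quick check of two of these patterns hints at where the gaps are. This is only a small sketch using made-up sample strings (the strings are assumptions, not data from the question): pattern5 insists on leading whitespace and a two-digit month, and pattern6 accepts any run of four digits.

```python
import re

pattern5 = r'(?:(\s\d{2}[/-](?:\d{4})))'
pattern6 = r'(?:\d{4})'

# pattern5 requires leading whitespace and exactly two month digits,
# so a date at the start of the string or with a one-digit month is missed
no_space = re.findall(pattern5, '12/2009')
one_digit = re.findall(pattern5, 'in 6/2008')
with_space = re.findall(pattern5, 'in 12/2009')

# pattern6 accepts any run of four digits, including part of a longer number
year_only = re.findall(pattern6, '04/20/2009 and 123456')

print(no_space, one_digit, with_space, year_only)
```

Running the six patterns independently also means the same date can be reported by more than one of them, which is one reason a single combined pattern is easier to reason about.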

It may be useful to set up some intermediate variables:

import re

short_month_names = (
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
)

long_month_names = (
    'January', 'February', 'March', 'April', 'May', 'June', 'July',
    'August', 'September', 'October', 'November', 'December'
)

short_month_cap = '(?:' + '|'.join(short_month_names) + ')'
long_month_cap = '(?:' + '|'.join(long_month_names) + ')'
# Numeric months 1-12, without and with a leading zero
short_num_month_cap = '(?:[1-9]|1[0-2])'
long_num_month_cap = '(?:0[1-9]|1[0-2])'

long_day_cap = '(?:0[1-9]|[12][0-9]|3[01])'
short_day_cap = '(?:[1-9]|[12][0-9]|3[01])'

long_year_cap = '(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3})'
short_year_cap = '(?:[0-9][0-9])'

ordinal_day = '(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st)'

formats = (
    # 04/20/2009; 4/3/09
    r'(?P<month_0>{lnm}|{snm})/(?P<day_0>{ld}|{sd})/(?P<year_0>{sy}|{ly})',
    # Mar-20-2009
    r'(?P<month_1>{sm})\-(?P<day_1>{ld}|{sd})\-(?P<year_1>{ly})',
    # Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
    r'(?P<month_2>{sm}|{lm})(?:\.\s+|\s*)(?P<day_2>{ld}|{sd})(?:,\s+|\s*)(?P<year_2>{ly})',
    # 20 Mar 2009; 20 Mar. 2009; 20 March, 2009
    r'(?P<day_3>{ld}|{sd})(?:[\.,]\s+|\s*)(?P<month_3>{lm}|{sm})(?:[\.,]\s+|\s*)(?P<year_3>{ly})',
    # Feb 2009
    r'(?P<month_4>{lm}|{sm})\s+(?P<year_4>{ly})',
    # 6/2008
    r'(?P<month_5>{lnm}|{snm})/(?P<year_5>{ly})',
    # 2009
    r'(?P<year_6>{ly})',
    # Mar 20th, 2009 (the lookahead validates the ordinal suffix)
    r'(?P<month_6>{sm})\s+(?P<day_4>(?={od})[0-9][0-9]?)..,\s*(?P<year_7>{ly})'
)

_pattern = '|'.join(
    i.format(
        sm=short_month_cap, lm=long_month_cap, snm=short_num_month_cap,
        lnm=long_num_month_cap, ld=long_day_cap, sd=short_day_cap,
        ly=long_year_cap, sy=short_year_cap, od=ordinal_day
    ) for i in formats
)

pattern = re.compile(_pattern)


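def get_fields(match):
    """Return the matched month/day/year strings, or None if there is no match."""
    if not match:
        return None
    return {
        # Group names end in a two-character suffix such as '_0'; strip it
        k[:-2]: v
        for k, v in match.groupdict().items()
        if v is not None
    }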

tests = r'''04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010'''

for test_line in tests.split('\n'):
    for test in test_line.split('; '):
        print('{!r}: {!r}'.format(test, get_fields(pattern.fullmatch(test))))
    print('')
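The main part is the formats variable, where all the different formats are defined. It matches slightly more than the formats listed in the question, and it can easily be extended.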
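
As an illustration of extending it, here is a minimal sketch that adds one more alternative. The ISO-style 2009-03-20 format is a hypothetical addition that the question does not ask for, and the building blocks are simplified stand-ins for the ones above:

```python
import re

# Simplified stand-ins for the building blocks (assumptions for this sketch)
year = '(?:[0-9]{4})'
month = '(?:0[1-9]|1[0-2])'
day = '(?:0[1-9]|[12][0-9]|3[01])'

formats = (
    r'(?P<month_0>{m})/(?P<day_0>{d})/(?P<year_0>{y})',
    # Hypothetical extra format: ISO-style 2009-03-20
    r'(?P<year_1>{y})-(?P<month_1>{m})-(?P<day_1>{d})',
)

pattern = re.compile('|'.join(f.format(m=month, d=day, y=year) for f in formats))


def get_fields(match):
    if not match:
        return None
    # Strip the two-character suffix ('_0', '_1') from each group name
    return {k[:-2]: v for k, v in match.groupdict().items() if v is not None}


us_fields = get_fields(pattern.fullmatch('04/20/2009'))
iso_fields = get_fields(pattern.fullmatch('2009-03-20'))
print(us_fields)
print(iso_fields)
```

The only rule to keep in mind is that every named group needs a unique suffix, since a Python regex cannot define the same group name twice.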

The overall pattern is:

'(?P<month_0>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<day_0>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))/(?P<year_0>(?:[0-9][0-9])|(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_1>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\-(?P<day_1>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))\\-(?P<year_1>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_2>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?:January|February|March|April|May|June|July|August|September|October|November|December))(?:\\.\\s+|\\s*)(?P<day_2>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:,\\s+|\\s*)(?P<year_2>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<day_3>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:[\\.,]\\s+|\\s*)(?P<month_3>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))(?:[\\.,]\\s+|\\s*)(?P<year_3>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_4>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<year_4>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_5>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<year_5>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<year_6>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_6>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<day_4>(?=(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st))[0-9][0-9]?)..,\\s*(?P<year_7>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))'
This would be almost impossible to write by hand.
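
The question also asks for ascending order, which the code above does not cover. One possible way to get there is to turn each dictionary returned by get_fields into a datetime.date and sort those. The sketch below is not part of the original answer; the helper name to_date and the sample dictionaries are hypothetical, and it assumes that two-digit years mean 20xx and that a missing day or month defaults to 1.

```python
from datetime import date

MONTHS = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')


def to_date(fields):
    """Convert a get_fields-style dict of strings into a datetime.date."""
    year = int(fields['year'])
    if year < 100:
        # Assumption: two-digit years belong to the 2000s
        year += 2000
    month = fields.get('month', '1')
    if month.isdigit():
        month = int(month)
    else:
        # The first three letters identify both 'Mar' and 'March'
        month = MONTHS.index(month[:3]) + 1
    # Assumption: a missing day (or month) defaults to 1
    day = int(fields.get('day', '1'))
    return date(year, month, day)


# Hypothetical dictionaries shaped like the output of get_fields
samples = [
    {'month': 'Mar', 'day': '20', 'year': '2009'},
    {'month': '6', 'year': '2008'},
    {'year': '2010'},
    {'month': '04', 'day': '3', 'year': '09'},
]

ordered = sorted(to_date(f) for f in samples)
print(ordered)
```

Whether defaulting to the first day of the month or year is appropriate depends on the data, so treat those two choices as placeholders.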

Boundaries for the "in between some random text" case can be added around _pattern.


I suggest using
_pattern = r'\b(?:{})\b'.format(_pattern)
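
To see why the boundaries matter once the pattern is used to search inside longer text instead of with fullmatch, here is a reduced sketch. It uses only three numeric formats with simplified building blocks, and the sample sentence is made up; none of this is the full pattern from the answer.

```python
import re

# Simplified numeric-only building blocks (assumptions for this sketch)
month = '(?:0?[1-9]|1[0-2])'
day = '(?:0?[1-9]|[12][0-9]|3[01])'
short_year = '(?:[0-9]{2})'
long_year = '(?:[0-9]{4})'

_pattern = '|'.join((
    r'(?P<month_0>{m})/(?P<day_0>{d})/(?P<year_0>{sy}|{ly})',
    r'(?P<month_1>{m})/(?P<year_1>{ly})',
    r'(?P<year_2>{ly})',
)).format(m=month, d=day, sy=short_year, ly=long_year)

unbounded = re.compile(_pattern)
# Same alternatives, wrapped in word boundaries as suggested above
bounded = re.compile(r'\b(?:{})\b'.format(_pattern))

text = 'Seen on 04/20/2009, follow-up 6/2008, ref 123456.'

loose = [m.group(0) for m in unbounded.finditer(text)]
strict = [m.group(0) for m in bounded.finditer(text)]
print(loose)
print(strict)
```

Without the boundaries the two-digit year alternative stops early at 04/20/20 and the bare-year alternative picks 1234 out of a longer number. With them, the engine has to backtrack to the four-digit year and rejects the reference number.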

Thank you very much!