Using a Python regular expression to extract a year preceded by a month


python regex


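I have thousands of datasets from which I want to extract the year that is preceded by a month. For example:

Dataset 1: September 1980

Dataset 2: October 1978

I wrote the following regular expression:

^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)

It does the job in an online regex tester. However, when I try to use it in Python code, I get the following error: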

  File "<ipython-input-216-a995358d0957>", line 1, in <module>
    runfile('C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py', wdir='C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data')
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py", line 76, in <module>
    year_data = re.findall('^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)', tokenized_string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 691, in _parse
    len(char) + 2)
error: unknown extension ?<m

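import re
year = re.compile(r'(\b\d{1,2}\d{0,3})\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\d…
print(year.match('September 1980').group(3))
print(year.match('October 1978').group(3))

Output: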
1980
1978

In Python, a named capture group is written (?P<name>...)
and not (?<name>...)

Usage: ^(?P<fmonth>\w+),\s[0-9]{4}$
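As a minimal sketch of the point above, the snippet below uses Python's (?P<name>...) syntax with an optional comma so that one pattern covers both "Month Year" and "Month, Year". The names MONTH_YEAR and extract_year and the group names are made up for illustration, and the pattern assumes each string holds nothing but the month and the year; it does not check that the word is a real month name.

```python
import re

# (?P<month>...) is Python's named-group syntax; (?<month>...) raises re.error.
# The optional comma covers both "Month Year" and "Month, Year".
MONTH_YEAR = re.compile(r"^(?P<month>[A-Za-z]+),?\s(?P<year>[0-9]{4})$")

def extract_year(line):
    """Return the 4-digit year for 'Month Year' or 'Month, Year', else None."""
    m = MONTH_YEAR.match(line)
    return m.group("year") if m else None

print(extract_year("September 1980"))   # 1980
print(extract_year("October, 1978"))    # 1978
print(extract_year("12 October 1978"))  # None
```

Because the groups are named, the year can be read with group("year") instead of counting parentheses.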

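I really appreciate your contributions. @Joan Lara Ganau's solution gave me a guide to what the regexp could look like. @Joan, your regexp will match any year that is preceded by a month and a day. Also, it does not look for the comma and the whitespace. As I mentioned, I have thousands of datasets from which I want to extract the year that is preceded by a month. I am looking for the following formats:
a.) Month Year
b.) Month, Year
Anyway, after a lot of experimenting I found a solution to my problem. The solution is: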
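import re

# Month name (full or abbreviated), an optional comma, then whitespace and a 4-digit year.
year_result = re.compile(
                    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|"
                    r"Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|"
                    r"Dec(ember)?)(,?)(\s\d{4})")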

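Also, if the pattern does not match, the match() method returns None. In that case, calling group() raises an AttributeError along the lines of "'NoneType' object has no attribute 'group'". So I fixed it in the following way: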
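def matched(document):
    # search() scans the whole document; match() would only test its start.
    year = year_result.search(document)
    if year is None:
        return '0'
    # Group 14 is (\s\d{4}); strip() drops the leading whitespace.
    return year.group(14).strip()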
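To see why the year ends up in group 14 and how match() differs from search(), here is a small sketch that rebuilds the pattern from this answer. The sample string is hypothetical.

```python
import re

year_result = re.compile(
    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|"
    r"Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|"
    r"Dec(ember)?)(,?)(\s\d{4})")

# Every pair of parentheses is a capturing group, so the year lands in group 14.
print(year_result.groups)             # 14

sample = "Issued October, 1978"       # hypothetical sample text
print(year_result.match(sample))      # None: match() only tries position 0
found = year_result.search(sample)    # search() scans the whole string
print(repr(found.group(1)), repr(found.group(13)), repr(found.group(14)))
# 'October' ',' ' 1978'
```

Group 14 carries the whitespace that precedes the year, so the caller may want to strip it.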

Now you can pass the text document from which you want to extract the year to the function above.
Thanks
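As a possible variation on this approach (not the code from the answer), the month suffixes can be made non-capturing and the year given a name, so the result does not depend on a group number. YEAR_AFTER_MONTH and year_or_default are hypothetical names, and the '0' default simply mirrors the function above.

```python
import re

# (?:...) groups do not capture, so the only captured group is the named year.
YEAR_AFTER_MONTH = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?),?\s(?P<year>\d{4})\b"
)

def year_or_default(document, default="0"):
    """Return the first year that follows a month name, or the default."""
    found = YEAR_AFTER_MONTH.search(document)
    return found.group("year") if found else default

print(year_or_default("Published in October, 1978 by the press."))  # 1978
print(year_or_default("no date"))                                   # 0
```

Checking the match object before calling group() avoids the AttributeError described above.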
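This is not a valid Python regexp. You probably tested it with PHP selected (under "Flavor"). Try r'\w+,?\s+[0-9]{4}(?!\d)'
Hello Wiktor, it works, but it extracts many years from the document that I do not want. I only want the years that are preceded by a month, on lines that start with a month. That is why I put the "^" (caret) symbol at the start of the regular expression.
Glad I helped you :)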
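The requirement in the comment above, keeping only the years on lines that begin with a month, can be sketched with the re.MULTILINE flag, which makes "^" and "$" anchor at every line of a multi-line document. This is an illustration under assumptions rather than code from the thread: MONTHS, LINE_YEAR and the sample document are made up, and month names are assumed to be English, written in full or as three-letter abbreviations.

```python
import re

# Assumption: English month names, written in full or as three-letter abbreviations.
MONTHS = ("January February March April May June July "
          "August September October November December").split()
month_alt = "|".join(sorted({form for name in MONTHS for form in (name, name[:3])},
                            key=len, reverse=True))

# re.MULTILINE lets ^ and $ anchor at every line instead of only the whole string.
LINE_YEAR = re.compile(rf"^(?:{month_alt}),?\s(\d{{4}})$", re.MULTILINE)

document = """September 1980
Reprinted in March 1999 with corrections
October, 1978
Copyright 2001"""  # hypothetical sample text

print(LINE_YEAR.findall(document))  # ['1980', '1978']
```

The year in the second line is skipped because that line does not start with a month, and the last line is skipped because no month precedes the year.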