Python regex to match the company name in a copyright notice under more conditions

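For a while now I have been trying to find a robust regex that extracts the company name from a copyright notice (and I don't know much about regex).

In an earlier question I was given this regex: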

(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)
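But as I tried more examples, I found that it is not enough. I would like to change it so that it also meets the following conditions, while still working for all the previous cases: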

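  • Consider that anything else might appear before the "©" or "Copyright" (whichever comes last), and ignore it
  • Example:

  • Consider that there might be no year after the "©" or "Copyright", with the company name following directly
  • Example:

  • Consider that the year might appear before the "Copyright" or "©" (I think condition 1 also covers this)
  • Example:

  • If there is a match before a "|", ignore everything after it:
  • Example:

    Copyright 2019 ComputerEase Construction Software | 1-800-544-2530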
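To make the gap concrete, here is a minimal sketch that runs the regex quoted in the question against a few notices. The helper name `original_extract` is hypothetical and the sample lines are taken from the test string used later in the answers; the point is only to show which of the new conditions the pattern misses.

```python
import re

# The regex quoted in the question, unchanged.
ORIGINAL_RX = re.compile(
    r'(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*'
    r'(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)'
)

def original_extract(line):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    m = ORIGINAL_RX.search(line)
    return m.group(1) if m else None

samples = [
    "© 2019 Walmart. All Rights Reserved.",                    # earlier case
    "2018 © Power Tools LLC",                                  # year before ©
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",  # © after the years
    "Copyright 2019 ComputerEase Construction Software | 1-800-544-2530",
]

for line in samples:
    print(repr(original_extract(line)))
```

The first sample still yields the company name. The second does not match at all, because the pattern requires digits after the symbol. The third and fourth match, but the captured text carries extra material (the rest of the year range, or everything after the "|").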

    I believe this will get you what you need. The explanation follows:

    (?i)                                # make the regex case insensitive
    (?:Copyright\s*©?|©\s*(Copyright)?) # Look for Copyright and/or © to get us started
    ([\d\s—-]+)?                        # There might be some digits, spaces, and dashes, but not necessarily
    (©|Copyright)?\s*                   # Copyright or © could be separated by dates, so look for them again
    (.+?)                               # This is the sugar we're looking for
    (?=All rights reserved|\||$)        # If you find "All rights reserved" a | or end of string, stop capturing the text
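The answer gives the pattern only in annotated form, so the following is a sketch of one way it could be used from Python. It assumes the pieces are joined into a single non-verbose pattern (so the literal spaces in "All rights reserved" stay significant) and that the company name is capture group 4, which follows from counting the parentheses. The helper name `first_answer_extract` is hypothetical.

```python
import re

# The annotated pattern above, joined piece by piece without the comments.
FIRST_RX = re.compile(
    r'(?i)'
    r'(?:Copyright\s*©?|©\s*(Copyright)?)'   # group 1: optional "Copyright"
    r'([\d\s—-]+)?'                          # group 2: optional dates
    r'(©|Copyright)?\s*'                     # group 3: marker after the dates
    r'(.+?)'                                 # group 4: the company name
    r'(?=All rights reserved|\||$)'
)

def first_answer_extract(line):
    """Return the stripped company name, or None (hypothetical helper)."""
    m = FIRST_RX.search(line)
    # The lazy group can end with the space that precedes "|" or "All".
    return m.group(4).strip() if m else None

for line in [
    "Copyright 2019 ComputerEase Construction Software | 1-800-544-2530",
    "2018 © Power Tools LLC",
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",
]:
    print(first_answer_extract(line))
```

Note that this pattern has no rule for stopping at a period, so the last sample keeps its trailing dot ("Iflexion."); stripping punctuation would be an extra step on the caller's side.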
    
    You can use

    (?i)(?:©(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*Copyright)?|Copyright(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*©)?)(?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*(.*?(?=\s*[.|]|\W*All\s+rights\s+reserved)|.*\b)
    

    Python code:

    import re
    s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved.\r\n602-226-2389 ©2019 Endurance International Group.\r\nCopyright 1999 — 2019 © Iflexion. All rights reserved.\r\nISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019\r\n© 2019 Copyright arcadia.io.\r\n2018 © Power Tools LLC\r\nCopyright 2019 ComputerEase Construction Software | 1-800-544-2530\r\n© 2019 3M. 3M Health Information Systems Privacy Policy"
    rx = r'''(?xi)
    (?:©                                        # Start of a group: © symbol
    (?:\s*                                      #  Start of optional group: 0+ whitespaces
      (?:\d{4}                                  #   Start of optional group: 4 digits
        (?:\s*[-—–]\s*\d{4})?                   #     0+ spaces, dashes, spaces, 4 digits
      )?                                        #   End of group
      \s*Copyright                              #  Spaces and Copyright
    )?                                          #  End of group 
    |                                           #  OR 
    Copyright                                   
     (?:\s*                                     #  Start of optional group: 0+ whitespaces
       (?:\d{4}                                 #   Start of optional group: 4 digits
         (?:\s*[-—–]\s*\d{4})?                  #     0+ spaces, dashes, spaces, 4 digits
       )?\s*©                                   #   End of group, 0+ spaces, ©
     )?                                         #  End of group
    )                                           # End of group
    (?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)?          # Optional group, 9999 optionally followed with dash enclosed with whitespaces and then 9999
    \s*                                         # 0+ whitespaces
    (                                           # Start of a capturing group:
       .*?                                      # any 0+ chars other than linebreak chars, as few as possible, up to...
        (?=\s*[.|]|                             # 0+ spaces and then | or ., or
            \W*All\s+rights\s+reserved)         # All rights reserved with any 0+ non-word chars before it
      |                                         # or
       .*\b                                     # any 0+ chars other than linebreak chars, as many as possible
    )'''
    
    for m in re.findall(rx, s):
        print(m)
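When notices arrive one at a time rather than as one large string, the same pattern can be wrapped in a small function. This is only a sketch: `extract_company` is a hypothetical helper, and it uses `re.search` so that only the first notice in a line is considered.

```python
import re

# The answer's pattern, compiled once. Group 1 is the company name.
COPYRIGHT_RX = re.compile(r'''(?xi)
    (?:©(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*Copyright)?
      |Copyright(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*©)?)
    (?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)?
    \s*
    (.*?(?=\s*[.|]|\W*All\s+rights\s+reserved)|.*\b)
''')

def extract_company(line):
    """Return the company name from the first notice in `line`, or None."""
    m = COPYRIGHT_RX.search(line)
    if m is None or not m.group(1):
        return None  # no notice found, or nothing captured after it
    return m.group(1)

for line in [
    "602-226-2389 ©2019 Endurance International Group.",
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",
    "2018 © Power Tools LLC",
    "Copyright 2019 ComputerEase Construction Software | 1-800-544-2530",
]:
    print(extract_company(line))
```

Returning `None` for lines without a notice keeps the caller from confusing "no match" with an empty company name.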
    
    Output (the full listing appears at the end of this page):


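    I know this is an old question, but I wanted to post a better solution. I trained a spaCy model on 5k+ copyright text samples.
    Here is the model, and it works.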

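    So, given the new rule 4, what is the expected output for "All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved"?
    @WiktorStribiżew It should be "Kroger".
    There is a rule to stop before a dot, but what about "© 2019 Copyright arcadia.io."?
    Yes, that one is tricky. But in that case we can keep "arcadia".
    Check.
    Last question: what if there is a number in the company name? For example, "© 2019 3M. 3M Health Information Systems Privacy Policy".
    @FilipeAleixo What is the expected result? "3M"?
    "3M". I tried adding {4} to the \d, as you have in your first answer, but I could not get it to work.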
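The "3M" case from the comments can be illustrated with two deliberately simplified patterns (both are assumptions written for this sketch, not the patterns from the answers). A loose date class made of digits, spaces and dashes also swallows the leading digit of the company name, while a year restricted to exactly four digits leaves it alone.

```python
import re

line = "© 2019 3M. 3M Health Information Systems Privacy Policy"

# Loose: any run of digits, whitespace and dashes counts as "the dates".
loose = re.search(r'©\s*[\d\s—-]*(.+?)(?=\.|$)', line)

# Strict: an optional four-digit year, optionally followed by a second one.
strict = re.search(
    r'©\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*(.+?)(?=\.|$)', line
)

print(loose.group(1))   # the "3" was consumed as part of the dates
print(strict.group(1))  # the name keeps its leading digit
```

With the loose class the capture starts at "M", because "2019 3" is consumed as date material; the strict version captures "3M".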
    Output of the Python code above:

    Apple Inc
    Quid, Inc
    Database Designs
    Rediker Software
    EVOSUS, INC
    Walmart
    Exxon Mobil Corporation
    Berkshire Hathaway Inc
    McKesson Corporation
    UnitedHealth Group
    CVS Health
    General Motors
    Ford Motor Company
    AT&T Intellectual Property
    GENERAL ELECTRIC
    AmerisourceBergen Corporation
    Verizon
    Fannie Mae
    Jonas Construction Software Inc
    Kroger
    Express Scripts Holding Company
    JPMorgan Chase & Co
    Boeing
    Bank of America Corporation
    Wells Fargo
    Cardinal Health
    Quid, Inc
    Endurance International Group
    Iflexion
    Mobikasa 2019
    arcadia
    Power Tools LLC
    ComputerEase Construction Software
    3M