Python regex: splitting a long string on tabs and newlines

Tags: python, r, regex

Consider the following string:

08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet,  Member of the Executive Board of the ECB,  conducted by Pascal Dendooven and Goele De Cort on 3 July 2017,  published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré,  Member of the Executive Board of the ECB,  conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa),  on 3 July,  published on 7 July 2017ENGLISH"
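I would like to extract the two sentences it contains, namely:

  • "08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH"

  • "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"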

I tried using `[\w]+(?!\\t)`, but this captures the `t` in `t(1`.

What is the correct syntax here?
Thanks!
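
The behaviour described above can be reproduced with a short check. This is only a sketch, and it assumes the text contains a literal backslash followed by `t` rather than a real tab character; the two-escape sample string is made up for illustration.

```python
import re

# Assumption: the input holds literal backslash + "t" pairs, not real tabs.
text = r"\t\t(1)"

# The attempt from the question: a run of word characters that is not
# followed by a literal backslash + "t".
pattern = re.compile(r"[\w]+(?!\\t)")

matches = pattern.findall(text)
print(matches)
```

The backslash is not a word character, so each word run starts at the `t` that follows it. The lookahead only inspects what comes after the run, so the last `t` (followed by `(`) is accepted and returned as a match.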

Assuming the \n and \t characters are actual newline and tab characters, try:

([^\n\t]*)

Then extend it to get rid of OTHER LANGUAGES, and so on.
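
One way this suggestion might be applied is sketched below. It assumes real newline and tab characters, uses a shortened stand-in for the string in the question, and discards the widget labels through a hypothetical `noise` set; none of these names come from the original answer.

```python
import re

# Shortened, hypothetical stand-in for the string in the question,
# with real newline and tab characters.
sample = (
    "08/07/2017Peter Praet: Interview with De StandaardENGLISH"
    "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\t"
    "Select your language\n\t\t"
    "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le MondeENGLISH"
)

# "+" is used instead of "*" so that empty matches between separators
# are not returned.
runs = re.findall(r"[^\n\t]+", sample)

# Hypothetical set of page-widget labels to throw away.
noise = {"OTHER LANGUAGES", "(1)", "+", "Select your language"}
entries = [run for run in runs if run not in noise]

for entry in entries:
    print(entry)
```

Listing the unwanted labels explicitly is more fragile than a length filter, but it makes clear which pieces are being dropped.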

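In Python, you can split the string on tabs and newlines, then filter out anything that is too short:

import re

# long_string is the text from the question, containing real newlines and tabs
[x for x in re.split(r'\n\t+', long_string) if len(x) > 20]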
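
For a self-contained run of the same idea, the sketch below wraps the split in a helper. The function name `split_entries`, the `min_len` parameter, and the shortened sample string are assumptions made for illustration.

```python
import re


def split_entries(text, min_len=20):
    """Split on a newline followed by tabs and keep pieces longer than min_len."""
    pieces = re.split(r"\n\t+", text)
    return [piece for piece in pieces if len(piece) > min_len]


# Shortened, hypothetical stand-in for the string in the question,
# with real newline and tab characters.
sample = (
    "08/07/2017Peter Praet: Interview with De StandaardENGLISH"
    "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\n\t\t\t\n\t\t\t"
    "Select your language\n\t\t\t\n\t\t"
    "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le MondeENGLISH"
)

entries = split_entries(sample)
for entry in entries:
    print(entry)
```

The threshold is a heuristic: "Select your language" is exactly 20 characters long, so `len(x) > 20` only just removes it, and a shorter real entry would be lost as well.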

Here you go, split on this:

r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'

Explanation

 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      )*                            # ---------- 0 to many times
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times

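Note
The regex above splits the text into at most two parts.

If the stretch between escapes can contain ordinary (non-escaped) text, you can still allow multiple splits by absorbing such text into the separator only when it is below a certain length.

@Physicist suggested a length of 20. I used 40, by giving the regex a range in this section:
(?:(?!\\[\\ntr]).){0,40}

The new regex is

r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'

Explanation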

 (?s)                          # Modifiers:  dot-all
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      \s*                           # Optional whitespace
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      ){0,40}?                      # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
      \s*                           # Optional whitespace
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
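
To see the range-limited pattern in action, the sketch below passes it to `re.split`. It assumes the input contains literal backslash sequences (a backslash followed by `n` or `t`), which is why the sample is built from raw strings; the sample itself is a shortened, hypothetical stand-in for the string in the question.

```python
import re

# Assumption: "\n" and "\t" are literal two-character sequences here,
# so raw strings are used to keep the backslashes.
sample = (
    r"08/07/2017Peter Praet: Interview with De StandaardENGLISH"
    r"\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\t"
    r"Select your language\n\t\t\t\n\t\t"
    r"NederlandsNL07/07/2017Benoît Cœuré: Interview with Le MondeENGLISH"
)

# A block of escapes, then any number of short (0 to 40 character)
# stretches of ordinary text, each of which must end in another escape.
separator = re.compile(
    r"(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*"
)

parts = separator.split(sample)
for part in parts:
    print(part)
```

Short labels such as "OTHER LANGUAGES" are swallowed by the separator because another escape follows them within 40 characters. The second sentence is longer than that and has no escape after it, so the separator stops just before it.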

Comments

Which programming language are you using? Regex syntax differs slightly from one language to another. Also, are you trying to extract from this one string only, or does the regex have to work for other, similar cases too?

Try `\w{4,}` instead of `[\w]+`. This eliminates matches of three characters or fewer. Also, I'm pretty sure `\w` is already a character class: the brackets are not needed.

In Python, you can `re.split` on tabs and newlines and filter out anything shorter than 20 characters.

This is like honing your swordsmanship by going to a shooting gallery :) Find a problem that genuinely benefits from a pure-regex solution and hone your skills on that.

Now, split on this: r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'

I guess they are not newlines and tabs, just literal text that looks like them.

Why do you have three `(?:(?:(?!`? This really is regex hell :D

@Noobie - It is the equivalent of code-block scoping in a programming language. If you look at the formatted version, it is easy to decipher. I have added an updated explanation for you.

Amazing.