Python regex: splitting a long string on tabs and newlines

python r regex


Consider the following string:

08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet,  Member of the Executive Board of the ECB,  conducted by Pascal Dendooven and Goele De Cort on 3 July 2017,  published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré,  Member of the Executive Board of the ECB,  conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa),  on 3 July,  published on 7 July 2017ENGLISH"

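I want to extract the two sentences in it, namely:

"08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH"

"NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"

I tried using

[\w]+(?!\\t)

but this also matches within t(1.

What is the correct syntax here?

Thanks!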

Assuming the \n and \t characters are actual newline and tab characters, try:

([^\n\t]*)

Then extend it to get rid of OTHER LANGUAGES and so on.
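A minimal sketch of that suggestion in Python, assuming the string contains real newline and tab characters. The sample string is a shortened stand-in for the one in the question, and the noise set is a hypothetical way of dropping the language-selector labels; neither comes from the answer itself.

```python
import re

# Shortened stand-in for the string in the question. The "\n" and "\t" here
# are real newline and tab characters, as this answer assumes.
long_string = (
    "08/07/2017Peter Praet: Interview with De Standaard, published on 8 July 2017ENGLISH"
    "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t"
    "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde, published on 7 July 2017ENGLISH"
)

# "+" instead of "*" avoids the empty matches that "*" would produce
# at every newline or tab position.
runs = re.findall(r"[^\n\t]+", long_string)

# Hypothetical filter: drop the known widget labels by name.
noise = {"OTHER LANGUAGES", "(1)", "+", "Select your language"}
sentences = [run for run in runs if run not in noise]

for sentence in sentences:
    print(sentence)
```

Filtering by name is only one option; it needs the list of labels to be known in advance, which may not hold for other pages.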

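In Python, you can split the string on newlines and tabs, then filter out the pieces that are too short:

import re

# long_string holds the text from the question
[x for x in re.split(r'\n\t+', long_string) if len(x) > 20]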
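For reference, a self-contained version of the same idea might look like the sketch below. The sample string is a shortened stand-in for the one in the question, and it assumes the separators are real newline and tab characters.

```python
import re

# Shortened stand-in for the question's string, with real "\n" and "\t".
long_string = (
    "08/07/2017Peter Praet: Interview with De Standaard, published on 8 July 2017ENGLISH"
    "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t"
    "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde, published on 7 July 2017ENGLISH"
)

# Split on every "newline followed by one or more tabs" separator.
pieces = re.split(r"\n\t+", long_string)

# Keep only the pieces longer than 20 characters.
sentences = [piece for piece in pieces if len(piece) > 20]

for sentence in sentences:
    print(sentence)
```

Note that "Select your language" is exactly 20 characters long, so the > 20 threshold only just excludes it. A different page layout may need a different cut-off.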

Here you go, split on this:

r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'

Explained

 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      )*                            # ---------- 0 to many times
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
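A small sketch of how this pattern could be used with re.split, assuming the text contains literal backslash-letter pairs rather than real control characters. The sample text is made up for illustration.

```python
import re

# Raw string: "\n" and "\t" below are two-character sequences
# (a backslash followed by a letter), not real newlines or tabs.
text = (
    r"First sentence ENGLISH"
    r"\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t"
    r"Second sentence ENGLISH"
)

# The pattern from the answer, compiled once.
delimiter = re.compile(r"(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*")

# Everything from the first escape through the last escape is one delimiter.
parts = delimiter.split(text)
print(parts)
```

Because all groups are non-capturing, re.split returns only the text around the delimiter and not the delimiter itself.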

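Note
The regex above splits the text into at most two pieces.

If the region being split on contains text other than the escaped r, n, t sequences, you can allow multiple splits by consuming that text only when it is below some length threshold.

@Physicist suggested a length of 20. I gave it 40. This is done by giving the regex a range in this section:

(?:(?!\\[\\ntr]).){0,20}

The new regex is

r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'

Explained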

 (?s)                          # Modifiers:  dot-all
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      \s*                           # Optional whitespace
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      ){0,40}?                      # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
      \s*                           # Optional whitespace
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
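To illustrate the difference between the two patterns, here is a rough sketch on a made-up text with three sentences separated by literal escape sequences. The sentences are hypothetical stand-ins, each longer than 40 characters.

```python
import re

first = "08/07/2017 first interview, published on 8 July 2017ENGLISH"
second = "NederlandsNL07/07/2017 second interview, conducted on 3 July, published on 7 July 2017ENGLISH"
third = "06/07/2017 third interview, published on 6 July 2017ENGLISH"

# Raw strings: the separators are literal backslash-letter pairs.
text = (
    first
    + r"\n\t\tOTHER LANGUAGES\n\t\t(1)\n\t\t"
    + second
    + r"\n\t\tOTHER LANGUAGES\n\t"
    + third
)

# Unbounded version: swallows everything between the first and last escape.
unbounded = re.compile(r"(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*")

# Bounded version: text between escapes is swallowed only if it is short
# (at most 40 characters, plus surrounding whitespace).
bounded = re.compile(
    r"(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*"
)

print(len(unbounded.split(text)))  # the middle sentence is lost
print(len(bounded.split(text)))    # all three sentences survive
```

The short labels such as OTHER LANGUAGES fall under the 40-character limit and are absorbed into the delimiter, while the long sentences are not.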

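Which programming language are you using? The regex syntax differs slightly from language to language. Also, are you trying to extract content from this string only, or does the regex have to work for other similar cases too?

Try \w{4,} instead of [\w]+. That will eliminate matches of three or fewer characters, and I'm pretty sure \w is already a character class: no brackets needed.

In Python, you can do re.split on tabs and newlines and filter out anything shorter than 20 characters.

That's like honing your swordsmanship by going to a shooting gallery :) Find a problem that really benefits from a pure regex solution and hone away on that.

Now, split on this: r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'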

I think they are not newlines and tabs, just literal text that looks like them.

Why do you have three (?:(?:(?! ? This really is regex hell :D

@Noobie - That is the equivalent of code block scoping in a language. If you look at the formatted version, it is easy to decipher. I added an update note for you.

Amazing.