Python regex: splitting a long string on tabs and newlines
Consider the following string:
08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"
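I would like to extract two of the sentences in it, namely:
"08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH"
"NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"
I tried
[\w]+(?!\\t)
but this captures the t in t(1.
What is the correct syntax here?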
Thanks!
Assuming the \n and \t characters are actually newlines and tabs, try:
([^\n\t]*)
Then extend it to get rid of OTHER LANGUAGES, and so on.
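As a minimal sketch of that suggestion, assuming the string really contains newline and tab characters: the pattern below uses + instead of * so that re.findall does not return empty matches. The shortened sample text, the NOISE set and the length cutoff of 3 are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened stand-in for the string in the question, with REAL newlines and tabs.
text = (
    "08/07/2017Peter Praet: Interview with De StandaardENGLISH"
    "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\t"
    "Select your language\n\t\t\t\n\t\t"
    "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaENGLISH"
)

# Every maximal run of characters that are neither newline nor tab.
runs = re.findall(r"[^\n\t]+", text)

# Hypothetical cleanup step: drop known widget labels and very short tokens.
NOISE = {"OTHER LANGUAGES", "Select your language"}
kept = [run for run in runs if run not in NOISE and len(run) > 3]

for entry in kept:
    print(entry)
```

Listing the unwanted labels explicitly is only one way to "get rid of OTHER LANGUAGES"; a length threshold works as well when the wanted entries are reliably long.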
In Python, you can split the string on tabs and newlines and then filter out anything that is too short:
import re
[x for x in re.split('\n\t+', long_string) if len(x) > 20]
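A self-contained version of the same idea might look like the sketch below. The function name split_entries, the min_len parameter and the shortened sample are assumptions made for illustration; the pattern and the threshold of 20 come from the snippet above.

```python
import re

def split_entries(text, min_len=20):
    """Split on newline-plus-tabs runs and keep pieces longer than min_len."""
    return [piece for piece in re.split(r"\n\t+", text) if len(piece) > min_len]

# Shortened stand-in for the string in the question, with REAL newlines and tabs.
sample = (
    "08/07/2017Peter Praet: Interview with De StandaardENGLISH"
    "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\n\t\t\t\n\t\t\t"
    "Select your language\n\t\t\t\n\t\t"
    "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaENGLISH"
)

entries = split_entries(sample)
for entry in entries:
    print(entry)
```

Note that "Select your language" is exactly 20 characters long, so the strict comparison len(piece) > min_len is what drops it here; with a different threshold it would survive.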
Here you go, split on this:
r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'
Explained:
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster, optional
      (?:                      # ----------
           (?! \\ [\\ntr] )    # Not an escaped \ or n or t or r ahead
           .                   # This is ok, consume this
      )*                       # ---------- 0 to many times
      \\ [\\ntr]               # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
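Note
The regex above splits the text into at most two parts. Multiple splits can be allowed by absorbing the text between escapes into the delimiter only when it is below a certain length threshold. @Physicist suggested a length of 20.
This is done by giving the regex a range in this section:
(?:(?!\\[\\ntr]).){0,20}
I gave it 40, as used in the new regex:
r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'
Explained: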
 (?s)                          # Modifiers: dot-all
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster, optional
      \s*                      # Optional whitespace
      (?:                      # ----------
           (?! \\ [\\ntr] )    # Not an escaped \ or n or t or r ahead
           .                   # This is ok, consume this
      ){0,40}?                 # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
      \s*                      # Optional whitespace
      \\ [\\ntr]               # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
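To see how the two patterns differ in practice, here is a small sketch using re.split. It assumes the \n and \t in the text are literal backslash sequences (which is what both patterns target), and it uses a made-up three-entry sample rather than the string from the question. The names GREEDY and BOUNDED are just labels chosen for this example.

```python
import re

# Both patterns as given in the answer, written with the tempered dot in place.
GREEDY = r"(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*"
BOUNDED = r"(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*"

# Made-up sample; raw strings keep \n and \t as literal two-character escapes.
sample = (
    r"first long entry that is well over forty characters long"
    r"\n\tshort\n\t"
    r"second long entry that is well over forty characters long"
    r"\n\t(1)\n"
    r"third long entry that is well over forty characters long"
)

# GREEDY swallows everything between the first and the last escape,
# so the middle entry disappears into the delimiter.
greedy_parts = re.split(GREEDY, sample)

# BOUNDED only absorbs runs of up to 40 characters between escapes,
# so the long middle entry survives as its own piece.
bounded_parts = re.split(BOUNDED, sample)

print(len(greedy_parts), len(bounded_parts))
```

With this sample the first pattern yields two pieces and the second yields three, which matches the "at most two parts" remark in the note.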
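Which programming language are you using? Regex syntax differs slightly from one language to another. Also, are you trying to extract content from this string only, or does the regex also have to work for other, similar cases?
Try \w{4,} instead of [\w]+. That eliminates matches of three or fewer characters, and I'm pretty sure \w is already a character class, so the brackets are not needed.
In Python, you can do a re.split on tabs and newlines, then filter out anything shorter than 20 characters. This is like trying to hone your swordsmanship by going to a shooting gallery :) Find a problem that really benefits from a pure regex solution, and hone on that.
Now, split on this: r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'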
I guess they are not newlines and tabs, just literal text that looks like that.
Why do you have three (?:(?:(?! ? This really is regex hell :D
@Noobie - This is the equivalent of code-block scoping in a language. If you look at the formatted version, it's easy to decipher. I added an updated explanation for you.
Amazing.
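For reference, the \w{4,} suggestion from the comments can be checked quickly. This sketch assumes the escapes are literal backslash sequences, as one commenter guessed, and uses a made-up fragment of the string; the variable names are illustrative only.

```python
import re

# Made-up fragment; the raw string keeps \n and \t as literal backslash pairs.
fragment = r"ENGLISH\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)"

# [\w]+ and \w+ are the same pattern: the brackets around \w add nothing.
same = re.findall(r"[\w]+", fragment) == re.findall(r"\w+", fragment)

# Requiring four or more word characters drops the lone n and t tokens,
# but the t of the last literal \t sits directly before OTHER and is still captured.
long_words = re.findall(r"\w{4,}", fragment)

print(same, long_words)
```

So the length filter removes the isolated letters, yet a letter that directly touches a word still leaks into the match, which is one reason the answers split on the escapes instead.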