Extracting match groups from a string with a Python regular expression

python regex python-3.x

I am trying to extract match groups from a Python string, but I am running into some problems.

The string looks like this:

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

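I need every heading that starts with a number followed by capital letters, together with the content that belongs to that heading.

This is the result I expect: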

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried the following regular expression:

(\d\.\s[A-Z\s]*\s)

and got the following:

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

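If I try adding .* to the end of the regex, the matched groups are affected. I think I am missing something simple. I have tried everything I know, but I cannot solve it.

Any help is much appreciated.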

You can use re.findall together with re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
pattern = r'\d+\.\s[A-Z]+'
t = re.findall(pattern, s)  # numbered headings such as '1. TITLE'
c = list(filter(None, re.split(pattern, s)))  # text between headings, minus the empty leading piece
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
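As a possible alternative to pairing re.findall with re.split, the sketch below relies on the documented behaviour that re.split also returns the text of a capturing group. The variable names (parts, result) are illustrative assumptions, not part of the original answer.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

# With a capturing group, re.split keeps the delimiters in the result:
# ['', '1. TITLE', ' ABC ...', '2. TITLE', ' BCD ...', '3. TITLE', ' CDC ...']
parts = re.split(r'(\d+\.\s[A-Z]+)', s)

# Odd indexes hold the headings, even indexes (from 2) hold the text after each one.
result = [heading + body for heading, body in zip(parts[1::2], parts[2::2])]

for item in result:
    print(item)
```

This scans the string once instead of twice, and the pieces join back into the original string.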

In your regex the character class contains no lowercase letters, so it only matches the uppercase words.

You can simply use this:
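A small sketch of what happens with the pattern from the question, using the sample string from the question. The variable names are illustrative assumptions.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

original = r'(\d\.\s[A-Z\s]*\s)'

# [A-Z\s]* stops at the first lowercase letter, so only the headings are captured.
headings_only = re.findall(original, s)
print(headings_only)

# Appending .* lets the first match swallow the rest of the line,
# so the later headings are never seen as separate matches.
with_dot_star = re.findall(original + r'.*', s)
print(with_dot_star)
```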

(\d\.[\s\S]+?)(?=\d+\.|$)

Sample code:

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
# Capture lazily from a digit and '.' up to the next number and '.', or the end of the string
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

Note: you can even replace [\s\S]+? with .*? if you use the single-line flag (re.DOTALL in Python), so that . also matches newlines.
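A minimal sketch of the single-line flag mentioned in the note, comparing the character-class form with a . form under re.DOTALL. The sample text is built from the example in this answer; the variable names are assumptions.

```python
import re

text = ("1. TITLE ABC Contents of 14 title ABC and some other text "
        "2. TITLE BCD This would have contents on \n"
        "title BCD and maybe something else 3. TITLE CDC Contents of title cdc")

# [\s\S] matches any character, newlines included, without any flag.
with_class = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)

# With re.DOTALL, '.' also matches newlines, so the two patterns agree.
with_dotall = re.findall(r'(\d\..+?)(?=\d+\.|$)', text, flags=re.DOTALL)

# Without the flag, '.' stops at the newline and the second section is lost.
without_flag = re.findall(r'(\d\..+?)(?=\d+\.|$)', text)

print(with_class == with_dotall)
print(len(without_flag))
```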

Use this pattern:

(\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Here is the corresponding code:

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Adjacent raw string literals are concatenated into a single pattern.
result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present include any additional numbers/letters
                    r'(?:\.[\da-z])*' # Match additional subpoints.
                                      # Each of these subpoints must start with a '.'
                                      # And then have a single number or letter
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')'
                    , text)
print(result)

Each part of the regex is explained in the comments in the code above.
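The commented pattern above could also be written with the re.VERBOSE flag, which allows whitespace and comments inside a single pattern string. This is only a sketch: the name SECTION and the short sample text with a 1.a subpoint are assumptions made for illustration.

```python
import re

SECTION = re.compile(
    r"""
    (
        \d+\.              # a number and a '.'
        [\da-z]*           # optional extra digits/letters, as in '1.2' or '1.a'
        (?:\.[\da-z])*     # optional further subpoints, each a '.' plus one digit/letter
        [ ]                # a literal space (bare spaces are ignored under re.VERBOSE)
        [A-Z]+             # at least one capital letter
        [\s\S]*?           # everything after it, lazily, newlines included
        (?=\d+\.|$)        # stop before the next number and '.', or at the end
    )
    """,
    re.VERBOSE,
)

text = ("1. TITLE ABC Contents of title ABC "
        "1.a SUB Some sub text "
        "2. TITLE BCD more text")

sections = SECTION.findall(text)
for section in sections:
    print(section)
```

The one thing to watch in verbose mode is the literal space: it has to be written as [ ] (or escaped), because unescaped whitespace in the pattern is ignored.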

import re
a = '1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# The character class now includes lowercase letters and digits
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)

Output:

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title
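In the output above the last line ends at "title" and the final word "cdc" is missing, because the pattern requires a whitespace character after the matched text. One possible variant, offered as a sketch rather than as the answerer's method, also accepts the end of the string at that position.

```python
import re

a = ('1. TITLE ABC Contents of 2title ABC and some other text '
     '2. TITLE BCD This would have contents on title BCD and maybe something else '
     '3. TITLE CDC Contents of title cdc')

# (?:\s|$) accepts either a trailing whitespace character or the end of the string,
# so the last section is no longer cut off before its final word.
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*(?:\s|$))', a)
for e in res:
    print(e)
```

Like the pattern it is based on, this still stops at any character outside the class, such as punctuation inside the content.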

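The character class is missing lowercase letters.

The string has no titles as such. Anything that starts with a number followed by all-uppercase text is assumed to be a title. This data is only a demo; in my case there can be up to 1000 titles.

This solution is very good and works for most cases, but it has a problem when the content contains numbers. For example, it fails if the text is "1. TITLE ABC Contents of title ABC and some other text for 14 days".

I have edited my answer to handle numbers in the title.

It does not work if the content has subpoints such as 2.1: "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 2.2 Text part 2.3 Text part 3. TITLE CDC Contents of title cdc". Any pointers?

I have made it work with multiple subpoints as well, so 1., 1.1, 1.a, 1.2.a and 1.2.3.4.5 are all valid.

I think you meant "not needed".

Got it.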
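The claim about subpoints can be checked in isolation. The sketch below takes only the numbering part of the commented pattern and tests it against the labels listed in the comment; the names NUMBERING, labels and accepted are assumptions.

```python
import re

# Numbering part only: a number and '.', optional digits/letters,
# then any number of '.x' subpoints with a single digit or letter each.
NUMBERING = re.compile(r'\d+\.[\da-z]*(?:\.[\da-z])*')

labels = ['1.', '1.1', '1.a', '1.2.a', '1.2.3.4.5']
accepted = [label for label in labels if NUMBERING.fullmatch(label)]
print(accepted)

# A label that does not start with a number is rejected.
print(NUMBERING.fullmatch('a.1'))
```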
