Python: Using a regular expression and a while loop to separate paragraphs in a PDF

python regex

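I have a PDF file containing 82 paragraphs, and my goal is to use Python to break each paragraph out into its own block of text. I have already extracted the text with PyPDF2.

Every paragraph starts with a number followed by a period (1., 42., 76., and so on). The code below works for most paragraphs, but it does not always take the period into account. For example, the match for number 18 is "18 (06/". That should not be picked up, because the number is not followed by a period. Any suggestions?

Code that finds the positions: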

i = 1
all_positions = []
found = "found"

while found == "found":
    matches = []
    matches_positions =[]
    standard_length = 0
    substring = str(i) + "."
    matches = re.finditer(substring, text, re.IGNORECASE)
    matches_positions = [match.start() for match in matches]
    standard_length = len(matches_positions)
    if standard_length > 0:
        all_positions.append(matches_positions[0])
        i += 1
    else:
        found = "not found"

Code that prints the output:

for i in range(0,len(all_positions)):
     print('---')
     print(text[all_positions[i]:all_positions[i+1]])

You can use the following regular expression to get the result you want:

^\d+\. ?(.*)

Explanation of the regular expression above:

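^
- anchors the match at the start of the string; with re.MULTILINE, as used in the code, it also matches at the start of every line

\d+
- matches a digit [0-9] one or more times

\.
- matches a literal dot

 ?
- a space followed by ?, which matches zero or one space character

(.*)
- a capturing group that greedily captures the rest of the line as the paragraph text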
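
As a small illustration of why the ^ anchor depends on re.MULTILINE, the sketch below runs the same pattern with and without the flag. The sample string and the variable names are made up for this example and are not taken from the question's PDF.

```python
import re

# Hypothetical sample text; the last line has a number that is not followed by a dot.
sample = "1. First paragraph\n2. Second paragraph\n90 (89. Not a paragraph start"

# Without re.MULTILINE, ^ only anchors at the very start of the string.
single = re.findall(r"^\d+\. ?(.*)", sample)

# With re.MULTILINE, ^ anchors at the start of every line.
multi = re.findall(r"^\d+\. ?(.*)", sample, re.MULTILINE)

print(single)  # ['First paragraph']
print(multi)   # ['First paragraph', 'Second paragraph']
```

The third line is skipped in both cases because "90" is followed by a space rather than a literal dot.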

You can find a demo of the regex here.

Implementation in Python:

import re

# re.MULTILINE makes ^ match at the start of every line, not only the start of the string.
pattern = re.compile(r"^\d+\. ?(.*)", re.MULTILINE)
matches = pattern.findall("1. Hellow World\n"
    "23. This is loremIpsum text\n"
    "9001. Some random textbcjsbcskcbksck sbcksbcksckscsk\n"
    "90 (89. Some other")  # no match: "90" is not followed by a dot
print(matches)
# Output - ['Hellow World', 'This is loremIpsum text', 'Some random textbcjsbcskcbksck sbcksbcksckscsk']

You can find a running version of the code above here.
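
Since (.*) stops at the end of a line, a paragraph that spans several lines would be cut short. One possible way to get whole paragraphs is to find each "number + dot" at the start of a line and slice the text between consecutive matches. This is only a sketch: the function name split_numbered_paragraphs and the sample text are hypothetical, and it assumes the text extracted from the PDF keeps each paragraph number at the start of a line, which may not hold for every PDF.

```python
import re

def split_numbered_paragraphs(text):
    """Return a list of (number, body) tuples, one per numbered paragraph."""
    # A paragraph start is one or more digits and a literal dot at the start of a line.
    starts = list(re.finditer(r"^(\d+)\.", text, re.MULTILINE))
    paragraphs = []
    for idx, m in enumerate(starts):
        # The body runs up to the next paragraph start, or to the end of the text.
        end = starts[idx + 1].start() if idx + 1 < len(starts) else len(text)
        body = text[m.end():end].strip()
        paragraphs.append((int(m.group(1)), body))
    return paragraphs

# Hypothetical sample standing in for the extracted PDF text.
sample = (
    "1. First paragraph,\n"
    "continued on a second line.\n"
    "2. Mentions 18 (06/2020) in passing.\n"
    "3. Last paragraph."
)

result = split_numbered_paragraphs(sample)
for number, body in result:
    print("---")
    print(number, body)
```

Slicing up to the next start, with the end of the text as a fallback, also avoids indexing past the end of the list of positions on the last paragraph. A body line that itself begins with digits and a dot (for example a year ending a sentence) would still be treated as a new paragraph, so the result is worth checking against the expected count of 82.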

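Can you add a sample of the text to the question? I think the problem is that you use a number plus a dot as the substring. Passed to re.finditer, it becomes a regular expression that matches the number followed by any character except a newline.

You can omit the lazy ? after the digit quantifier, because matching digits cannot run past the dot anyway.

Thanks for that, @Thefourthbird. I have updated the answer accordingly.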
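
The point about the unescaped dot can be checked with a short sketch. The sample text is invented for illustration; re.escape is the standard-library function for turning a literal string into a pattern that matches only that string.

```python
import re

# Hypothetical text: "18" appears inside paragraph 17 before paragraph 18 starts.
text = "17. Some paragraph mentioning 18 (06/2020).\n18. The real paragraph 18."

# Unescaped: "." is a wildcard, so "18." matches "18" plus any character.
unescaped = re.search(str(18) + ".", text)

# Escaped: re.escape turns "18." into a pattern that requires a literal dot.
escaped = re.search(re.escape(str(18) + "."), text)

print(repr(unescaped.group()))  # '18 '
print(repr(escaped.group()))    # '18.'
```

In the question's loop, building the pattern as re.escape(str(i) + ".") would be one way to require the period, although a stray "18." in the middle of a sentence would still match unless the pattern is also anchored to the start of a line.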