Parsing docx files with Python


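I'm trying to read the headings from several docx files. Annoyingly, the headings have no identifiable paragraph style: every paragraph uses the "Normal" style, so I'm falling back on a regular expression. The headings are formatted in bold and are structured like this: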

A. Cat

B. Dog

C. Pig

D. Fox

If a file contains more than 26 headings, the headings are prefixed with "AA.", "BB.", and so on.
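The prefix format described above can be checked on its own, without any Word file. This is a minimal sketch using the pattern `^[A-Z]+[.]\s` from the code in this question; the sample strings other than "A. Cat" are made up for illustration.

```python
import re

# One or more capital letters, a literal period, then one whitespace character.
HEADING_PREFIX = re.compile(r'^[A-Z]+[.]\s')

# Made-up sample lines; only "A. Cat" comes from the question itself.
samples = ["A. Cat", "AA. Zebra", "a. cat", "A.Cat", "Fig. 2 shows the result"]

# Map each sample to True if it starts with a heading prefix, else False.
matches = {text: bool(HEADING_PREFIX.match(text)) for text in samples}

for text, is_heading in matches.items():
    print(f"{text!r}: {is_heading}")
```

Single and doubled letters both match, while lowercase letters or a missing space after the period do not. The pattern alone cannot tell a heading from body text that happens to start the same way, which is why the bold check is still needed.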

I have the code below, which sort of works, except that the heading prefixed with "D." is printed twice, e.g. [Cat, Dog, Pig, Fox, Fox]

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):
    document = Document(directory+file)
    head1s = []
    for paragraph in document.paragraphs:
        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)
        for run in paragraph.runs:
            if run.bold:
                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
    print(head1s)

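Can anyone tell me whether something in the code is causing this? As far as I can tell, there is nothing unusual about the formatting or structure of these particular headings in the Word file.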

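What happens is that the loop keeps going after it has reached D. Fox. The append sits inside the loop over the paragraph's runs, so on the next iteration the current value of head1, D. Fox, is added again even though no new heading has matched.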
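The duplication can be reproduced without a Word file. This is a minimal sketch that assumes the "D. Fox" paragraph is stored as two bold runs; the actual file was not shown, so that split is a guess. `SimpleNamespace` objects stand in for python-docx paragraphs and runs, and `fake_paragraph` is a hypothetical helper.

```python
import re
from types import SimpleNamespace


def fake_paragraph(*runs):
    """Stand-in for a python-docx paragraph, built from (text, bold) pairs."""
    return SimpleNamespace(
        text="".join(text for text, _ in runs),
        runs=[SimpleNamespace(text=text, bold=bold) for text, bold in runs],
    )


paragraphs = [
    fake_paragraph(("C. Pig", True)),              # one bold run
    fake_paragraph(("D. ", True), ("Fox", True)),  # assumed: two bold runs
]

head1s = []
for paragraph in paragraphs:
    heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)
    for run in paragraph.runs:
        # Same logic as the question: one append per bold run, not per paragraph.
        if run.bold and heading:
            head1s.append(paragraph.text.split('.')[1])

print(head1s)
```

With real files, printing `repr(run.text)` and `run.bold` for each run of the affected paragraph would show whether it really is split this way.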

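I think the "for run in paragraph.runs:" loop is somehow executing twice for that paragraph; perhaps there is a second run in it that isn't visible.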

Adding a break once the first match is found may be enough to keep the second run from triggering:

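import os
import re
from docx import Document

# directory is the folder path read with input() in the question's code.
for file in os.listdir(directory):
    # os.path.join supplies the path separator that directory+file can miss.
    document = Document(os.path.join(directory, file))
    head1s = []
    for paragraph in document.paragraphs:
        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)
        for run in paragraph.runs:
            if run.bold:
                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
                    # this break stops the run loop if a match was found.
                    break
    print(head1s)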
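The same idea can be written so that each paragraph is tested once rather than once per run. This is only a sketch and not part of the original answer: `extract_headings` is a hypothetical helper, the sample paragraphs are made up, and `SimpleNamespace` objects stand in for python-docx paragraphs so that the example runs without a Word file.

```python
import re
from types import SimpleNamespace

HEADING_PREFIX = re.compile(r'^[A-Z]+[.]\s')


def extract_headings(paragraphs):
    """Return heading titles from objects exposing .text and .runs."""
    titles = []
    for paragraph in paragraphs:
        if not HEADING_PREFIX.match(paragraph.text):
            continue
        # any() stops at the first bold run, so a paragraph is added at most once.
        # run.bold may be True, False or None (inherited); None counts as not bold.
        if any(run.bold for run in paragraph.runs):
            # maxsplit=1 keeps later periods in the title; strip() drops the space.
            titles.append(paragraph.text.split('.', 1)[1].strip())
    return titles


def fake_paragraph(*runs):
    """Stand-in for a python-docx paragraph, built from (text, bold) pairs."""
    return SimpleNamespace(
        text="".join(text for text, _ in runs),
        runs=[SimpleNamespace(text=text, bold=bold) for text, bold in runs],
    )


sample = [
    fake_paragraph(("D. ", True), ("Fox", True)),       # two bold runs
    fake_paragraph(("B. vitamins dissolve in water", None)),  # not bold
    fake_paragraph(("E. St. Bernard", True)),           # period inside the title
]
titles = extract_headings(sample)
print(titles)
```

With python-docx installed, the helper would be called as `extract_headings(Document(path).paragraphs)`. Unlike the loop above, it also trims the leading space and keeps titles that contain a period, which `split('.')[1]` would cut short.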