String extracted with python-docx is missing a word


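I don't understand why the word "Delaware" is not extracted by the code below. Every other character is extracted. Can anyone provide code that extracts the word "Delaware" from the Docx file below without manually changing the file?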

Input:

import docx
import io
import requests

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)

for text in docx.Document(file).paragraphs:
    print(text.text)

Output:

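APPLICABLE LAW

This Agreement is to be construed and interpreted according to the laws of the State of , excluding its conflict of laws provisions.  The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.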

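The strangest part is that if I do anything to the word "Delaware" in the document (e.g., bold/unbold it, or type over the word) and then save it, "Delaware" is no longer missing the next time I run the code. Simply saving the file without changing the word, however, does not fix the problem. You might say the solution is to edit the Word file by hand, but I am actually dealing with thousands of these documents, so fixing each one manually is not practical.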

The current answers seem to explain why "Delaware" is not extracted, but they do not provide a solution. Thanks.

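I believe @smci is right, and the behaviour is most likely explained by the question they point to. However, that does not provide a solution.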
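To make the suspected cause concrete, here is a minimal sketch using only the standard library. It assumes the missing word sits in a run that is nested inside a wrapper element (a `w:smartTag` is used here purely as an illustration; the real file's XML has not been inspected in this thread, and the sample paragraph is hypothetical). As far as I understand, python-docx builds `paragraph.text` from the paragraph's runs, so a reader that only looks at runs directly under `w:p` would skip such a nested run:

```python
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Hypothetical paragraph XML: the middle run is wrapped in a w:smartTag
# element instead of being a direct child of w:p (assumed structure).
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">the State of </w:t></w:r>'
    '<w:smartTag w:uri="urn:example" w:element="State">'
    '<w:r><w:t>Delaware</w:t></w:r>'
    '</w:smartTag>'
    '<w:r><w:t>, excluding</w:t></w:r>'
    '</w:p>'
)

def text_from_direct_runs(p):
    """Join w:t text only from runs that are direct children of the paragraph."""
    return ''.join(
        t.text or ''
        for r in p.findall(W + 'r')
        for t in r.findall(W + 't')
    )

def text_from_all_descendants(p):
    """Join w:t text from every descendant, however deeply it is nested."""
    return ''.join(t.text or '' for t in p.iter(W + 't'))

paragraph = XML(SAMPLE)
direct_text = text_from_direct_runs(paragraph)
full_text = text_from_all_descendants(paragraph)
print(repr(direct_text))  # 'the State of , excluding'
print(repr(full_text))    # 'the State of Delaware, excluding'
```

If this assumption holds, it would also fit the observation that editing the word in Word makes it reappear, since re-typing or re-formatting the word may cause Word to rewrite that run without the wrapper.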

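I think that in this case our only option is to fall back to reading the XML file directly. Consider, for example, this (simplified) function taken from a web page: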

from xml.etree.ElementTree import XML
import zipfile
import io
import requests

def get_docx_text(path):
    """Take the path of a docx file (or a file-like object), return its text as a str."""

    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'

    # A .docx file is a zip archive; the document body lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for paragraph in tree.iter(PARA):
        # Collect every w:t descendant, including text inside nested elements.
        texts = [n.text for n in paragraph.iter(TEXT) if n.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)
print(get_docx_text(file))

We get:

APPLICABLE LAW

This Agreement is to be construed and interpreted according to the laws of the State of Delaware, excluding its conflict of laws provisions.  The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
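Since the question mentions thousands of documents, the same XML-based approach can be wrapped in a small batch helper. The following is only a sketch: `docx_paragraphs`, `extract_folder`, and the folder layout are hypothetical names introduced here, and files that are not valid .docx archives are simply recorded as `None` rather than handled in any smarter way.

```python
import zipfile
from pathlib import Path
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_paragraphs(source):
    """Return the non-empty paragraph texts of a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = []
    for p in tree.iter(W + 'p'):
        # Join all w:t descendants so text in nested elements is kept.
        text = ''.join(t.text for t in p.iter(W + 't') if t.text)
        if text:
            paragraphs.append(text)
    return paragraphs

def extract_folder(folder):
    """Map each .docx file name in `folder` to its text, or None on failure."""
    results = {}
    for path in sorted(Path(folder).glob('*.docx')):
        try:
            results[path.name] = '\n\n'.join(docx_paragraphs(path))
        except (zipfile.BadZipFile, KeyError):
            # Corrupt zip archive, or an archive without word/document.xml.
            results[path.name] = None
    return results
```

A call such as `extract_folder('contracts')` (the folder name is made up) would return a dictionary keyed by file name, which makes it easy to spot the documents that could not be read.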

I was also trying to find email addresses using python-docx, but had no luck.

pip install docx2txt

This worked for me. There may be some unnecessary '\n' characters in the result; replace them with spaces if needed.

import docx2txt
string = docx2txt.process("filepathandname.docx")
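The newline cleanup mentioned above can be sketched with the standard `re` module. This assumes the extracted text is an ordinary `str`; `collapse_whitespace` and the sample string are hypothetical and only stand in for real extractor output.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Stand-in for the string returned by the extractor (made-up sample).
sample = "APPLICABLE LAW\n\nThis Agreement is to be construed\n\n\naccording to the laws."
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Note that this also collapses the double spaces after full stops that appear in the original document, which may or may not be what you want.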

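Is it a merge field? Which tool generated the .docx file? MS Word? LibreOffice Writer? Which version? Have you tried opening and re-saving it in another tool to check?

Possible duplicate. I have had problems with docx and non-Unicode or non-ASCII characters; although "Delaware" looks normal, there may be something hidden in it.

@dshefman You're welcome. I think you will run into more problems :). Good luck.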