
Extracting DOCX comments with Python


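I'm a teacher. I want a list of all the students who commented on an essay I assigned, along with what they said. The Drive API was too challenging for me, but I figured I could download the essays as a zip file and parse the XML.

Comments are marked up in w:comment tags, with w:t used for the comment text. This should be easy, but XML (etree) is killing me.

Following a tutorial (and the official Python docs):

Then I do this: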

children = tree.iter()  # getiterator() was removed in Python 3.9; iter() is the equivalent
for c in children:
    print(c.attrib)

which gives:

{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}

After this I'm completely stuck. I tried element.get() and element.findall(), with no luck. Even when I copy and paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.

Can anyone help?

You got remarkably far, considering that OOXML is such a complex format.

Here is some sample Python code showing how to access the comments of a DOCX file via XPath:

from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
  docxZip = zipfile.ZipFile(docxFileName)
  commentsXML = docxZip.read('word/comments.xml')
  et = etree.XML(commentsXML)
  comments = et.xpath('//w:comment',namespaces=ooXMLns)
  for c in comments:
    # attributes:
    print(c.xpath('@w:author',namespaces=ooXMLns))
    print(c.xpath('@w:date',namespaces=ooXMLns))
    # string value of the comment:
    print(c.xpath('string(.)',namespaces=ooXMLns))
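
The same lookup can also be sketched with only the standard library, which may help explain the None seen in the question: get() reads the attributes of the element it is called on, so it has to be called on each w:comment element, with the attribute name written in full {namespace}name form. The helper name comments_by_author and the hand-written sample XML are illustrative assumptions, not part of the answer above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}

def comments_by_author(comments_xml):
    """Group comment texts by author, given the content of word/comments.xml."""
    root = ET.fromstring(comments_xml)
    grouped = defaultdict(list)
    # w:comment elements are direct children of the w:comments root.
    for c in root.findall('w:comment', NS):
        # Attribute names are namespace-qualified, so get() needs '{uri}author'.
        author = c.get('{%s}author' % W)
        # A comment's text is spread over its w:t descendants.
        text = ''.join(t.text or '' for t in c.iter('{%s}t' % W))
        grouped[author].append(text)
    return dict(grouped)

# Hypothetical stand-in for the content of word/comments.xml
sample = (
    '<w:comments xmlns:w="%s">'
    '<w:comment w:id="1" w:author="Joe Shmoe" w:date="2017-11-17T16:58:27Z">'
    '<w:p><w:r><w:t>Nice </w:t></w:r><w:r><w:t>intro.</w:t></w:r></w:p>'
    '</w:comment>'
    '</w:comments>' % W
)
print(comments_by_author(sample))
```

With a real file, the XML string would come from zipfile.ZipFile(docxFileName).read('word/comments.xml'), as in the function above.
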
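
I used the Word object model to extract comments, together with their replies, from a Word document. The Comments object is documented in the Word object model reference. That documentation uses Visual Basic for Applications (VBA), but I was able to use the same functions from Python with only slight modifications. The one problem with the Word object model is that I had to use the win32com package from pywin32, which works fine on a Windows PC, but I'm not sure whether it will work on macOS.

Here is the sample code I used to extract comments and their associated replies: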

    import win32com.client as win32
    from win32com.client import constants

    word = win32.gencache.EnsureDispatch('Word.Application')
    word.Visible = False 
    filepath = r"path\to\file.docx"  # raw string, so \t and \f are not read as escape sequences

    def get_comments(filepath):
        doc = word.Documents.Open(filepath) 
        doc.Activate()
        activeDoc = word.ActiveDocument
        for c in activeDoc.Comments: 
            if c.Ancestor is None: #checking if this is a top-level comment
                print("Comment by: " + c.Author)
                print("Comment text: " + c.Range.Text) #text of the comment
                print("Regarding: " + c.Scope.Text) #text of the original document where the comment is anchored 
                if len(c.Replies)> 0: #if the comment has replies
                    print("Number of replies: " + str(len(c.Replies)))
                    for r in range(1, len(c.Replies)+1):
                        print("Reply by: " + c.Replies(r).Author)
                        print("Reply text: " + c.Replies(r).Range.Text) #text of the reply
        doc.Close()

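Thanks to @kjhughes for the amazing answer that extracts all the comments from a document file. Like others in this thread, I was facing the problem of getting the text that a comment relates to. I took the code from @kjhughes as a base and tried to solve this using python-docx. Here is my take.

Sample document.

I will extract each comment together with the paragraph it references in the document.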

from docx import Document
from lxml import etree
import zipfile
ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
#Function to extract all the comments of document(Same as accepted answer)
#Returns a dictionary with comment id as key and comment string as value
def get_document_comments(docxFileName):
    comments_dict={}
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment',namespaces=ooXMLns)
    for c in comments:
        comment=c.xpath('string(.)',namespaces=ooXMLns)
        comment_id=c.xpath('@w:id',namespaces=ooXMLns)[0]
        comments_dict[comment_id]=comment
    return comments_dict
#Function to fetch all the comments in a paragraph
def paragraph_comments(paragraph,comments_dict):
    comments=[]
    for run in paragraph.runs:
        comment_reference=run._r.xpath("./w:commentReference")
        if comment_reference:
            comment_id=comment_reference[0].xpath('@w:id',namespaces=ooXMLns)[0]
            comment=comments_dict[comment_id]
            comments.append(comment)
    return comments
#Function to fetch all comments with their referenced paragraph
#This will return list like this [{'Paragraph text': [comment 1,comment 2]}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict=get_document_comments(docxFileName)
    comments_with_their_reference_paragraph=[]
    for paragraph in document.paragraphs:  
        if comments_dict: 
            comments=paragraph_comments(paragraph,comments_dict)  
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph
if __name__=="__main__":
    document="test.docx"  #filepath for the input document
    print(comments_with_reference_paragraph(document))

The output for the sample document looks like this.

I did this at the paragraph level. It could also be done at the python-docx run level.
Hope it helps.
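
To narrow the result from the whole paragraph down to the exact text a comment is anchored to, one possible approach is to read word/document.xml directly and collect the w:t text that lies between each w:commentRangeStart and its matching w:commentRangeEnd. The sketch below uses only the standard library. The function name commented_text and the sample XML are illustrative assumptions. It also assumes every comment of interest has range markers: a comment that has only a w:commentReference will not appear, and text spanning several paragraphs is joined without separators.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def commented_text(document_xml):
    """Map comment id -> document text between its range start and end markers."""
    root = ET.fromstring(document_xml)
    open_ids = set()   # ids of comment ranges currently open
    anchored = {}      # comment id -> collected text
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag == '{%s}commentRangeStart' % W:
            cid = el.get('{%s}id' % W)
            open_ids.add(cid)
            anchored.setdefault(cid, '')
        elif el.tag == '{%s}commentRangeEnd' % W:
            open_ids.discard(el.get('{%s}id' % W))
        elif el.tag == '{%s}t' % W:
            # Text inside overlapping ranges is credited to every open comment.
            for cid in open_ids:
                anchored[cid] += el.text or ''
    return anchored

# Hypothetical stand-in for the content of word/document.xml
sample = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">The </w:t></w:r>'
    '<w:commentRangeStart w:id="1"/>'
    '<w:r><w:t>quick fox</w:t></w:r>'
    '<w:commentRangeEnd w:id="1"/>'
    '<w:r><w:commentReference w:id="1"/></w:r>'
    '<w:r><w:t xml:space="preserve"> jumps.</w:t></w:r>'
    '</w:p></w:body></w:document>' % W
)
print(commented_text(sample))
```

The ids returned here are the same w:id values used as keys in comments_dict above, so the two dictionaries can be joined to pair each comment with its anchored text.
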

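The text content of an element is in its text attribute. Does print(c.text) produce anything of interest?

a = tree.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val') followed by print(a.text) results in an AttributeError. for c in children: print(c.text) results in the comments! Do you know how I would access the other fields?

Answered, and kudos! Thank you very much.

How do we get the text that the comment relates to?

I second the above: how do we get the text that the comment relates to?

@Pythonic and lucid_dreamer: If you need help beyond the answer here, please ask a new question.

This should be the real answer, since the previous one does not handle the referenced text. But it is Windows-focused.

@lucid_dreamer: Word automation requires Microsoft Word and is unreliable in server deployments. If you need the referenced text in a pure Python solution that does not require Word, please ask a new question.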