Extract text from pptx, ppt, docx, doc and msg files with Python on Windows

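Is there a way to extract text from pptx, ppt, docx, doc and msg files on a Windows machine? I have a few hundred of these files and need a programmatic way to do it. I would prefer Python, but I am open to other suggestions.

I searched online and found some discussions, but they apply to Linux machines.

Word

I tried something for Word with python-docx; to install it, run pip install python-docx. I had a Word document called example.docx with 4 lines of text in it, and they were extracted correctly, as the output below shows.

Output (example.docx content):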

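Join the text of all docx files in a folder

As above, but with a choice

In this code you are asked to choose which files to join from a list of the docx files found in the folder.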

import os
from docx import Document

# List the .docx files in the current folder
files = [f for f in os.listdir() if f.endswith(".docx")]

# Show a numbered menu of the files found
for n, f in enumerate(files, start=1):
    print(n, f)
print()
print("Write the numbers of the files you need, separated by spaces")
inp = input("Which files do you want to join? ")

# Convert the 1-based menu numbers back into file names
list_to_join = [files[int(n) - 1] for n in inp.split()]

# Collect the text of every paragraph in the selected files
text_collector = []
for f in list_to_join:
    doc = Document(f)
    for par in doc.paragraphs:
        text_collector.append(par.text)

# Put each paragraph on its own line
whole_text = "".join(text + "\n" for text in text_collector)

print(whole_text)
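
The selection step above trusts whatever is typed: a non-numeric entry raises ValueError, a number past the end of the list raises IndexError, and 0 silently picks the last file because index -1 is valid in Python. A minimal sketch of a more forgiving parser follows; the name parse_selection and the choice to skip bad entries and duplicates are assumptions made for illustration, not part of the original answer.

```python
def parse_selection(inp, files):
    """Map 1-based menu numbers typed by the user to file names.

    Hypothetical helper: invalid, out-of-range and repeated entries are skipped.
    """
    selected = []
    for token in inp.split():
        try:
            index = int(token)
        except ValueError:
            continue  # not a number, ignore it
        # Accept only numbers that appear in the menu, and only once each
        if 1 <= index <= len(files) and files[index - 1] not in selected:
            selected.append(files[index - 1])
    return selected
```

With this helper, the line that builds list_to_join could become list_to_join = parse_selection(inp, files).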

I think Visual Basic for Applications would be easier to implement, since it gives you access to Microsoft's document object model? Or python-docx?
Titolo
Paragrafo 1 a titolo di esempio
This is an example of text
This is the final part, just 4 rows
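
import os
from docx import Document

# Join the text of every .docx file in the current folder
files = [f for f in os.listdir() if f.endswith(".docx")]

# Collect the text of every paragraph, file by file
text_collector = []
for f in files:
    doc = Document(f)
    for par in doc.paragraphs:
        text_collector.append(par.text)

# Put each paragraph on its own line
whole_text = "".join(text + "\n" for text in text_collector)

print(whole_text)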
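
If installing python-docx is not an option, the same paragraphs can be read with the standard library alone, because a .docx file is a ZIP archive whose main text is stored in word/document.xml. The sketch below is an alternative technique, not the python-docx approach used above. The name docx_paragraphs is made up for illustration, and the function reads only the main document part (no headers, footers or footnotes). Unlike doc.paragraphs, it also returns paragraphs that sit inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for par in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in par.iter(W_NS + "t")))
    return paragraphs
```

Calling docx_paragraphs("example.docx") returns a list of strings that can be joined with "\n" in the same way as text_collector above.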