Cleaning up a text file after parsing a PDF with Python

I have parsed a PDF file and cleaned it up as best I can, but I still cannot get the information in the text file to line up.

My output looks like this:

Zone
1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.
I would like it to look like this:

Zone:    1
Report Name:    ARREST
Incident Time:    01:41
Location of Occurrence:    1300 block Liverpool St
Neighborhood:    Highland Park
Incident:    14081898
Age:    27
Gender:    M
Section, Description:
3921(a): Theft by Unlawful Taking or Disposition - Movable item
3925: Receiving Stolen Property.
903: Criminal Conspiracy.
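I tried enumerating over the list, but the problem is that some fields are missing, so the wrong information gets extracted.

Here is the code that parses the PDF: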

import os
import sys  # needed for the sys.stdout fallback in parsePDF
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):

    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)

    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout

    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):

        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  


# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
# A PDF is binary data, so write it in binary mode
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()
This is what I have so far:

import os

OddsnEnds = [ "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:", "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]    


if not os.path.exists("../out/"):
    os.makedirs("../out/")  
with open("../txt/20140731.txt", 'r') as file:
    blotterList = file.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any ([o in line for o in OddsnEnds]):
            cleanList.append(line)
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print ('Incident:%s' % cleanList[i])
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print ('Time:%s' % cleanList[i+1])
But the output I get from the enumeration is:

Time:16:20

Time:17:40

Time:17:53

Time:18:05

Time:Location of Occurrence
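because that incident has no time, so the line after the label is the next label. As a side note, every string ends with \n.

Thanks for any ideas and help.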

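One of my favorite ways to scrape a PDF file to text is the pdftotext utility with its -layout option. It does a very good job of preserving the original layout of the document.

You can use it from Python with the subprocess module.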
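As a rough sketch of what that could look like, assuming the pdftotext executable is installed and on the PATH: the helper names `pdftotext_command` and `pdf_to_text` below are made up for illustration, and the file paths are whatever your own script uses.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for: pdftotext -layout <pdf> <txt>."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if it exits with an error.

    Assumes the pdftotext binary is available on the PATH.
    """
    subprocess.check_call(pdftotext_command(pdf_path, txt_path))
```

Calling `pdf_to_text("../pdf/20140731.pdf", "../txt/20140731.txt")` would then replace the `parsePDF` step. Passing the arguments as a list avoids shell quoting problems with unusual file names.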

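Generally speaking, extracting text from PDF files, especially when you want to keep the formatting, spacing and layout of the text, is a task that cannot always be done with 100% accuracy. I learned this from a technical support person at a company that makes xpdf, a popular library for extracting text from PDF, when I was working on a project in this area some time ago. At that time I explored several libraries for extracting text from PDF, including xpdf and a few others. Although they give perfect results in many cases, there are definite technical reasons why they cannot always do so; those reasons have to do with the nature of the PDF format and with how PDFs are generated. When you extract text from some PDFs, the layout and spacing may not be preserved, even if you use library options such as keep_format=True or an equivalent.

The only permanent solution to this problem is not to have to extract text from PDF files at all. Instead, always try to work with the data format and data source that the PDF was generated from, and do your text extraction or manipulation on that. Of course, this is easier said than done if you do not have access to those sources.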

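I am not the one who downvoted, but I do not quite understand your question either. Is printing the time the problem? Or do you want the PDF output to look like the second block?

My current output is the top example. I am trying to iterate over the list of strings and change the output into the second example. I tried using enumerate to walk the list, restructure it, and write it to a text file, but it breaks when a field is empty.

Could you please add the code that generates the top example? I think it would be best to change it there.

I adjusted the question to include the parse code.

Poppler, which Roland Smith mentions in a sibling answer, is a fork of xpdf, which I mention in my answer, according to the Wikipedia article Roland linked to.

I understand the problem; that is why I am trying to develop an algorithm that modifies the output into something I can use. Modifying the parser is a moot point.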
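For the restructuring the asker describes, one possible approach is to key on the field labels instead of on list positions, so that a label with no value produces an empty field rather than picking up the next label. The sketch below is only an illustration under several assumptions: the labels appear verbatim on their own lines as in the sample output, the "Report Name" lines have not been filtered out beforehand, each Section code and each Description fits on one line, and a new incident starts whenever a label repeats. The names `LABELS`, `parse_records` and `format_record` are hypothetical.

```python
# Field labels exactly as they appear in the question's sample output.
LABELS = ["Zone", "Report Name", "Incident Time", "Location of Occurrence",
          "Neighborhood", "Incident", "Age", "Gender", "Section", "Description"]
LABEL_SET = set(LABELS)


def parse_records(lines):
    """Group value lines under the label that precedes them."""
    records, record, current = [], {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line in LABEL_SET:
            if line in record:        # label seen again: a new incident begins
                records.append(record)
                record = {}
            current = line
            record[current] = []      # stays empty if no value follows
        elif current is not None:
            record[current].append(line)
    if record:
        records.append(record)
    return records


def format_record(record):
    """Render one record in the 'Label:    value' layout from the question."""
    out = []
    for label in LABELS[:-2]:         # every field except Section/Description
        value = " ".join(record.get(label, []))
        out.append(("%s:    %s" % (label, value)).rstrip())
    out.append("Section, Description:")
    # Pair the n-th section code with the n-th description line.
    pairs = zip(record.get("Section", []), record.get("Description", []))
    for section, description in pairs:
        out.append("%s: %s" % (section, description))
    return "\n".join(out)
```

With the lines read by `readlines()`, something like `for rec in parse_records(blotterList): outfile.write(format_record(rec) + "\n\n")` would write one block per incident. An incident without a time comes out as a bare `Incident Time:` line instead of shifting the following fields.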