Python: parsing PDF data into a table format

python parsing pdf web-scraping

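I am trying to replicate the data from a table in a PDF:

My current code only pulls the second page of the first table, which is page 11 of the document (labeled page 2). Here is the code I am using: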

import io, re
import PyPDF2
import requests

url = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(10).extractText()

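# Insert a space before every capital letter, then split on whitespace.
# Note that this also breaks all-caps words into single letters.
data = re.sub(r"([A-Z])", r" \1", contents).split()

# Write the raw extracted text to a file.
with open('AWStest.csv', 'w') as out:
    out.write(contents)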

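Currently I am able to extract the data in a rough CSV format, but I have not been able to figure out how to parse it so that I can store it in a form that matches the table it was pulled from. Here is what it currently looks like, with every break being a newline in the CSV: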

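Col Element Data Element Name Date Modified FormatLength Description Element Submission Guideline Comments Condition (Denominator) Recommended Threshold Member Eligibility Data Contents Guide 12/5/2013 3ME003 Insurance Type Code/Product 4/1/2013 Lookup Table

Text 2 Type / Product Identification Code Report the code that defines the type of insurance under which this member's eligibility is maintained. Example: HM = HMO Code Description 9Self pay 11 Other Non-Federal Programs * (use of this value requires disclosure toDataManager prior to submission) 12 Preferred Provider Organization (PPO) * 13 Point of Service (POS) * 14 Exclusive Provider Organization (EPO) * 15 Indemnity Insurance 16 Health Maintenance Organization (HMO) Medicare Risk (use to report Medicare PartC/Medicare Advantage Plans) 17 Dental Maintenance Organization (DMO) * 96 Husky Health A97Husky Health B98Husky Health C99Husky Health D AM Automobile Medical * Champus (now TRICARE) * Disability * Health Maintenance Organization * Liability Medical MA Medicare PartA (Medicare Fee for Service only) Medicare Part B * (Medicare Fee for Service only) MCMedicaid * Medicare PartDOFOther Federal Program (use of this value requires disclosure toDataManager prior to submission) TV Title VVAVeterans Affairs Plan * WC Workers' Compensation * Mutually Defined * (use of this value requires disclosure toDataManager prior to submission) All 96.0%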

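This sample data represents the header row and the first row of data. I have been able to break the words apart based on capital letters, but unfortunately that also breaks fully capitalized words into individual letters. I used the following code: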

import re

with open('AWStest.csv', 'r') as fcsv:
    for line in fcsv:
        line = line.strip()
        # Start a new token at every capital letter. Runs of capitals such as
        # "HMO" therefore come out as single letters.
        print(re.findall('[A-Z][^A-Z]*', line))
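One possible way around the all-caps problem, offered as a sketch rather than a complete fix: try to match a run of capitals as a single token before falling back to ordinary capitalized words. The helper name `split_words` is hypothetical, and the pattern cannot separate words that the extraction glued together without any capital letter between them.

```python
import re

# Alternatives are tried left to right, so acronyms win over ordinary words.
TOKEN = re.compile(
    r"[A-Z]+(?![a-z])"    # run of capitals not followed by lowercase (HMO)
    r"|[A-Z]?[a-z]+"      # optional capital followed by lowercase letters
    r"|\d+(?:[./]\d+)*"   # numbers, including dates such as 4/1/2013
    r"|[^\sA-Za-z\d]+"    # any other run of non-space characters
)


def split_words(text):
    """Split fused text into tokens while keeping acronyms intact."""
    return TOKEN.findall(text)


if __name__ == '__main__':
    print(split_words('HealthMaintenanceOrganization(HMO)'))
```

The negative lookahead makes the capitals run stop one letter early when a capitalized word follows, so `HMOPlan` splits into `HMO` and `Plan` rather than `HMOP` and `lan`.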

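I need help figuring out the best way to replicate this complete table in a format that lets me load it into a NoSQL database and query individual rows as needed to generate reports. What is the best way to extend the code to do this? Is there a better way to scrape the PDF into a more accurate format?

It sounds like the position of the text on the page would help you a lot. I suggest using PyMuPDF to extract the text together with its position data, so that you can work out which row each piece of text belongs to.

Here is a code sample that produces a *.csv text file with positions. It should get you started on mining the information with Python.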

#!python3.3
""" Use PyMuPDF to extract text to *.csv file. """
import csv
import json
import os
import sys

import fitz

assert len(sys.argv) == 2, 'Pass file name as parameter'

srcfilename = sys.argv[1]
assert os.path.isfile(srcfilename), 'File {} does not exist'.format(srcfilename)

dstfilename = '{}.csv'.format(srcfilename)
with open(dstfilename, 'w', encoding='utf-8', errors='ignore', newline='') as dstfile:
    writer = csv.writer(dstfile)
    writer.writerow([
        'PAGE',
        'X1',
        'Y1',
        'X2',
        'Y2',
        'TEXT',
    ])
    document = fitz.open(srcfilename)
    for page_number in range(document.pageCount):
        text_dict = json.loads(document.getPageText(page_number, output='json'))
        for block in text_dict['blocks']:
            if block['type'] != 'text':
                continue
            for line in block['lines']:
                for span in line['spans']:
                    writer.writerow([
                        page_number,
                        span['bbox'][0],
                        span['bbox'][1],
                        span['bbox'][2],
                        span['bbox'][3],
                        span['text'],
                    ])
    document.close()
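Once the positions are in a CSV file, one possible next step is to group spans that share roughly the same top edge into visual lines. The sketch below assumes the column names written by the script above (PAGE, X1, Y1, TEXT). The function name `group_into_lines` and the 3-point tolerance are illustrative assumptions, not values tuned to this PDF.

```python
import csv
import io


def group_into_lines(rows, tolerance=3.0):
    """Group span rows into visual lines.

    rows: iterable of dicts with PAGE, X1, Y1 and TEXT keys, for example
    from csv.DictReader on the positions file.
    Returns a list of (page, y, [texts ordered left to right]) tuples.
    """
    spans = sorted(
        (int(r['PAGE']), float(r['Y1']), float(r['X1']), r['TEXT'])
        for r in rows
    )
    lines = []
    for page, y, x, text in spans:
        last = lines[-1] if lines else None
        # A span joins the current line if it is on the same page and its
        # top edge is within the tolerance of the line's first span.
        if last and last[0] == page and abs(y - last[1]) <= tolerance:
            last[2].append((x, text))
        else:
            lines.append((page, y, [(x, text)]))
    return [(p, y, [t for _, t in sorted(items)]) for p, y, items in lines]


if __name__ == '__main__':
    # Made-up sample rows in the same layout as the positions file.
    sample = io.StringIO(
        'PAGE,X1,Y1,X2,Y2,TEXT\n'
        '10,75.0,121.1,110.0,130.0,ME003\n'
        '10,30.0,120.4,40.0,130.0,3\n'
        '10,30.0,300.0,40.0,310.0,4\n'
    )
    for line in group_into_lines(csv.DictReader(sample)):
        print(line)
```

Grouping by a tolerance rather than by exact equality serves the same purpose as truncating coordinates to integers: text whose y-coordinates differ slightly still ends up on the same line.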

Here is some code I wrote to mine your PDF file and put the data into a better-formatted *.csv file:

#!python3.3
import collections
import csv
import json
import os

import fitz  # PyMuPDF package


class MemberEligibility(object):

    """ Row in Member Eligibility Data Contents Guide table. """

    def __init__(self):
        """
        Initialize object. I've made all fields strings but you may want some to
        be dates or integers.
        """
        self.col = ''
        self.element = ''
        self.data_element_name = ''
        self.date_modified = ''
        self.fmt = ''
        self.length = ''
        self.description = ''
        self.comments = ''
        self.condition = ''
        self.recommended_threshold = ''


def get_sorted_list(document, page_number):
    """
    Get text on specified page of document in sorted list. Each list item is a
    (top-left y-coordinate, top-left x-coordinate, text) tuple. List sorted
    top-to-bottom and then left-to-right. Coordinates converted to integers so
    text with slightly different y-coordinates line up.
    """
    text_dict = json.loads(document.getPageText(page_number, output='json'))
    text_list = []
    for block in text_dict['blocks']:
        if block['type'] == 'text':
            for line in block['lines']:
                for span in line['spans']:
                    text_list.append((
                        int(span['bbox'][1]),  # Top-left y-coordinate
                        int(span['bbox'][0]),  # Top-left x-coordinate
                        span['text'],          # Text itself
                    ))
    text_list.sort()
    return text_list


def main():
    # Downloaded PDF to same folder as this script
    script_dir = os.path.dirname(os.path.abspath(__file__))
    pdf_filename = os.path.join(
        script_dir,
        'CT_DSG_-12132014_version_1.2_(with_clarifications).pdf'
    )

    # Mine PDF for data
    document = fitz.open(pdf_filename)
    # Using OrderedDict so iteration will occur in same order as rows appear in
    # PDF
    member_eligibility_dict = collections.OrderedDict()
    for page_number in range(document.pageCount):
        # Page numbers are zero-based. I'm only looking at p. 11 of PDF here.
        if 10 <= page_number <= 10:
            text_list = get_sorted_list(document, page_number)
            for y, x, text in text_list:
                if 115 < y < 575:
                    # Only look at text whose y-coordinates are within the data
                    # portion of the table
                    if 25 < x < 72:
                        # Assuming one row of text per cell in this column but
                        # this doesn't appear to hold on p. 10 of PDF so may
                        # need to be modified if you're going to do whole table
                        row = MemberEligibility()
                        row.col = text
                        member_eligibility_dict[row.col] = row
                    elif 72 < x < 118:
                        row.element += text
                    elif 118 < x < 175:
                        row.data_element_name += text
                    elif 175 < x < 221:
                        row.date_modified += text
                    elif 221 < x < 268:
                        row.fmt += text
                    elif 268 < x < 315:
                        row.length += text
                    elif 315 < x < 390:
                        row.description += text
                    elif 390 < x < 633:
                        row.comments += text
                    elif 633 < x < 709:
                        row.condition += text
                    elif 709 < x < 765:
                        row.recommended_threshold += text
    document.close()

    # Write data to *.csv
    csv_filename = os.path.join(script_dir, 'EligibilityDataContentsGuide.csv')
    with open(csv_filename, 'w', encoding='utf-8', errors='ignore', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([
            'Col',
            'Element',
            'Data Element Name',
            'Date Modified',
            'Format',
            'Length',
            'Description',
            'Element Submission Guideline Comments',
            'Condition (Denominator)',
            'Recommended Threshold'
        ])
        for row in member_eligibility_dict.values():
            writer.writerow([
                row.col,
                row.element,
                row.data_element_name,
                row.date_modified,
                row.fmt,
                row.length,
                row.description,
                row.comments,
                row.condition,
                row.recommended_threshold
            ])


if __name__ == '__main__':
    main()
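The long if/elif chain in main() could also be expressed as data. The sketch below reuses the x boundaries from the code above and looks the column up with `bisect`. `column_for_x` is a hypothetical helper. Unlike the strict comparisons above, it assigns a span whose x falls exactly on a boundary to the column on its right.

```python
import bisect

# Left edge of every column plus the right edge of the last one, taken from
# the x ranges used in main() above.
BOUNDARIES = [25, 72, 118, 175, 221, 268, 315, 390, 633, 709, 765]
FIELDS = [
    'col', 'element', 'data_element_name', 'date_modified', 'fmt',
    'length', 'description', 'comments', 'condition',
    'recommended_threshold',
]


def column_for_x(x):
    """Return the field name for an x-coordinate, or None outside the table."""
    index = bisect.bisect_right(BOUNDARIES, x) - 1
    if 0 <= index < len(FIELDS):
        return FIELDS[index]
    return None


if __name__ == '__main__':
    for x in (30, 100, 400, 800):
        print(x, column_for_x(x))
```

With a helper like this, the body of the loop could shrink to a lookup followed by `setattr(row, field, getattr(row, field) + text)`, and adjusting a column edge becomes a one-number change.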

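I'm new to Python and was asked to try to build a solution without any real background in the language. If the amount of text in each cell can vary, what good does having the position do? I don't see how to extract the data dynamically and load it into a new database with a similar structure. Thanks for your help!

The x-coordinate of the text tells you which column the text belongs to. The y-coordinate and some logic will tell you which row the text belongs to. What structure do you envision for the database?

I updated my answer to demonstrate how to mine the PDF. It looks like text in one column is sometimes in the same text block as text in another column. You may need to add some code to my example to get what you want.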
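To make the column and row idea from these comments concrete, here is a minimal sketch that turns (y, x, text) tuples into one dictionary per table row and serializes each as a JSON document, a shape most document stores accept. It assumes a new row starts whenever text appears in the first column, which the answer notes does not hold everywhere in this PDF. The shortened column layout and the function names are hypothetical.

```python
import json

# Hypothetical, shortened layout: (field name, left x, right x).
COLUMNS = [('col', 25, 72), ('element', 72, 118), ('description', 118, 400)]


def rows_to_documents(spans):
    """Build one dict per table row from (y, x, text) tuples."""
    documents = []
    for y, x, text in sorted(spans):  # top to bottom, then left to right
        field = next((n for n, lo, hi in COLUMNS if lo <= x < hi), None)
        if field is None:
            continue  # text outside the table
        if field == 'col':
            # Text in the first column starts a new row.
            documents.append({name: '' for name, _, _ in COLUMNS})
        if not documents:
            continue  # continuation text before any row has started
        current = documents[-1]
        # Wrapped lines of the same cell are joined with a space.
        current[field] = (current[field] + ' ' + text).strip()
    return documents


def to_json_lines(documents):
    """Serialize the documents as one JSON object per line."""
    return '\n'.join(json.dumps(doc) for doc in documents)


if __name__ == '__main__':
    # Made-up spans: one row with a wrapped description, then a second row.
    spans = [
        (120, 30, '3'), (120, 75, 'ME003'), (120, 120, 'Type / Product'),
        (131, 120, 'Identification Code'), (200, 30, '4'), (200, 75, 'ME004'),
    ]
    print(to_json_lines(rows_to_documents(spans)))
```

Because each cell only accumulates text until the next first-column entry appears, cells of different heights need no special handling, which is the practical benefit of keeping the positions.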