Python 3.x: Converting PDF to CSV - unexpected argument

python-3.x csv pdf

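I have a PDF table that I want to convert to a CSV file. The project I am using already includes some of the CSVs I want, but unfortunately it has no data available for 2017. When I run the provided script, I get the following error for every year: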

  File "convert-pdfs.py", line 54, in <module>
    main()
  File "convert-pdfs.py", line 50, in main
    df = parse_pdf(pdf_path, year)
  File "convert-pdfs.py", line 41, in parse_pdf
    df = pd.concat([ parse_page(page, year) for page in pdf.pages ])
  File "convert-pdfs.py", line 41, in <listcomp>
    df = pd.concat([ parse_page(page, year) for page in pdf.pages ])
  File "convert-pdfs.py", line 29, in parse_page
    table = page.extract_table(v=v, h="gutters", **t)
TypeError: extract_table() got an unexpected keyword argument 'y_tolerance'

Thanks for your help.

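Check the parameters of page.extract_table. My guess is that y_tolerance is not a valid argument.

It doesn't seem to accept any of the arguments:
TypeError: extract_table() got an unexpected keyword argument 'x_tolerance'
TypeError: extract_table() got an unexpected keyword argument 'h'
TypeError: extract_table() got an unexpected keyword argument 'v'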
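
One way to check which keywords a method accepts, rather than removing arguments one error at a time, is to inspect its signature. The sketch below uses a hypothetical stand-in class (`StandInPage`) so that it runs without pdfplumber installed; with the library available, the same helper could be pointed at `pdfplumber.page.Page.extract_table` instead. The helper name `accepted_keywords` is also an assumption made for illustration.

```python
import inspect


def accepted_keywords(func):
    """Return the parameter names that func accepts, excluding self."""
    params = inspect.signature(func).parameters
    return [name for name in params if name != "self"]


# Hypothetical stand-in that mimics a method taking one settings dict.
class StandInPage:
    def extract_table(self, table_settings=None):
        return []


print(accepted_keywords(StandInPage.extract_table))  # prints ['table_settings']
```

If `v`, `h`, `x_tolerance` and `y_tolerance` are missing from the returned list, passing them as keywords raises the TypeError shown above.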

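Did you try calling the method like this:

page.extract_table(table_settings={})

using the settings defined here:

Yes, it seems the code is outdated, because pdfplumber's table extraction was thoroughly redesigned for v0.5.0, which introduced breaking changes.

It has to be adjusted slightly.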

#!/usr/bin/env python
import pdfplumber
import pandas as pd
import re
import sys, os

COLUMNS = [
    "club", "last_name", "first_name",
    "position", "base_salary", "guaranteed_compensation"
]

V_SEPARATORS = {
    "narrow": [ 0, 90, 185, 280, 330, 425, 540 ],
    "wide": [ 0, 110, 203, 300, 340, 425, 540 ]
}

NON_MONEY_CHAR_PAT = re.compile(r"[^\d\.]")

def parse_money(money_str):
    stripped = re.sub(NON_MONEY_CHAR_PAT, "", money_str)
    if len(stripped):
        return float(stripped)
    else:
        return None

def parse_page(page, year):
    t = dict(x_tolerance=5, y_tolerance=5)
    v = V_SEPARATORS["narrow" if year == 2007 else "wide"]
    table = page.extract_table(v=v, h="gutters", **t)
    df = pd.DataFrame(table)
    header_i = df[df[0] == "Club"].index[0]
    footer_i = df[df.fillna("").apply(lambda x: "Source" in "".join(x), axis=1)].index[0]
    main = df.loc[header_i + 1:footer_i-1].copy()
    main.columns = COLUMNS
    main["base_salary"] = main["base_salary"].apply(parse_money)
    main["guaranteed_compensation"] = main["guaranteed_compensation"].apply(parse_money)
    return main

def parse_pdf(path, year):
    with pdfplumber.open(path) as pdf:
        df = pd.concat([ parse_page(page, year) for page in pdf.pages ])
    return df

def main():
    HERE = os.path.dirname(os.path.abspath(__file__))
    for year in range(2007, 2018):
        print(year)
        pdf_path = os.path.join(HERE, "../pdfs/mls-salaries-{0}.pdf".format(year))
        csv_path = os.path.join(HERE, "../csvs/mls-salaries-{0}.csv".format(year))
        df = parse_pdf(pdf_path, year)
        df.to_csv(csv_path, index=False)

if __name__ == "__main__":
    main()
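
The script above still calls `extract_table` with the pre-0.5.0 keywords (`v`, `h`, `x_tolerance`, `y_tolerance`). A minimal sketch of the adjustment follows. It assumes that the old explicit `v` list corresponds to `vertical_strategy="explicit"` with `explicit_vertical_lines`, and that the old `"gutters"` strategy is closest to `horizontal_strategy="text"`. That mapping is an assumption, not something confirmed in the thread, and `build_table_settings` and `extract_rows` are hypothetical helper names.

```python
# Column boundaries copied from the script above.
V_SEPARATORS = {
    "narrow": [0, 90, 185, 280, 330, 425, 540],
    "wide": [0, 110, 203, 300, 340, 425, 540],
}


def build_table_settings(year):
    """Build a table_settings dict for pdfplumber >= 0.5.0 (assumed mapping)."""
    v = V_SEPARATORS["narrow" if year == 2007 else "wide"]
    return {
        # Assumption: the old v=[...] list becomes explicit vertical lines.
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": v,
        # Assumption: "text" is the nearest replacement for h="gutters".
        "horizontal_strategy": "text",
        # The old x_tolerance / y_tolerance values, under their newer names.
        "text_x_tolerance": 5,
        "text_y_tolerance": 5,
    }


def extract_rows(page, year):
    """Call extract_table with one settings dict instead of loose keywords."""
    return page.extract_table(build_table_settings(year))
```

In `parse_page`, the line `table = page.extract_table(v=v, h="gutters", **t)` would then become `table = extract_rows(page, year)`. The strategies and tolerances may need tuning against the actual PDFs.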