Python 3.x: Loading a PDF file into a program using a tkinter GUI

python-3.x pandas pdf tkinter

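I am writing a program that takes in a PDF, finds all the non-stop words in it, displays those words in a table along with how often each occurs in the PDF, and then shows the table in a web browser. So far the program can do this when the PDF being read is in the same folder as the program being executed. I want to streamline my code so that the user can decide which PDF the program reads, regardless of where the PDF actually is. To do this I tried tkinter, since I could not get any other GUI toolkit to work. I can get the window and buttons to show up and the file explorer to open, but I don't know how to actually execute the code when I double-click the PDF file I want to read.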

import word_bag_GUI
import PyPDF2
import pandas
import webbrowser
import os
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#Method that a pdf that is read into the program goes through to eliminate any unwanted words or symbols#
def preprocess(text):
    #Tokenizes the text into word tokens, which drops the punctuation#
    tokenizer = RegexpTokenizer(r'\w+')
    result = tokenizer.tokenize(text)

    #Makes all words lowercase#
    words = [item.lower() for item in result]

    #Removes all remaining tokens that are not alphabetic#
    result = [word for word in words if word.isalpha()]

    #Imports stopwords to be removed from paragraph#
    stop_words = set(stopwords.words("english"))

    #Removes the stop words from the paragraph#
    filtered_sent = []
    for w in result:
        if w not in stop_words:
            filtered_sent.append(w)

    #Return word to root word/chop-off derivational affixes#
    ps = PorterStemmer()
    stemmed_words = []
    for w in filtered_sent:
        stemmed_words.append(ps.stem(w))

    #Lemmatization, which reduces word to their base word, which is linguistically correct lemmas#
    lem = WordNetLemmatizer()
    lemmatized_words = ' '.join([lem.lemmatize(w,'n') and lem.lemmatize(w,'v') for w in filtered_sent])

    #Re-tokenize lemmatized words string#
    tokenized_word = word_tokenize(lemmatized_words)
    return tokenized_word

#Loads in PDF into program#
PDF_file = word_bag_GUI.open_PDF
read_pdf = PyPDF2.PdfFileReader(PDF_file)

#Determines number of pages in PDF file and initializes the document content to an empty string#
number_of_pages = read_pdf.getNumPages()
doc_content = ""

#Extract text from the PDF file#
for i in range(number_of_pages):
    page = read_pdf.getPage(i)
    page_content = page.extractText()
    doc_content += page_content

#Turns the text drawn from the PDF file into data the remaining code can understand#
tokenized_words = preprocess(doc_content)

#Determine frequency of words tokenized + lemmatized text#
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_words)
final_list = fdist.most_common(len(fdist))

#Organize data into two columns and export the data to an html that automatically opens#
df = pandas.DataFrame(final_list, columns = ["Word", "Frequency"])
df.to_html('word_frequency.html')
webbrowser.open('file://' + os.path.realpath('word_frequency.html'))

-------------------------------------------------------------------------

#Creates the GUI that will be used to select inputs#
import tkinter as tk
from tkinter import ttk, filedialog

window = tk.Tk()
window.geometry("300x300")
window.resizable(0,0)
window.title("Word Frequency Program")

#Browse through file directory and select PDF to be used in code#
def open_PDF():
    filedialog.askopenfile(initialdir = "/",title = "Select file",filetypes = (("PDF files","*.pdf"),("all files","*.*")))
button1 = ttk.Button(window, text = "Browse Files", command = open_PDF)
button1.grid()

#Quits out of the program when certain button clicked#
button2 = ttk.Button(window, text = "Quit Program", command = window.quit)
button2.grid()

window.mainloop()
window.destroy()

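I expected the program to behave the same way it did before I added the tkinter GUI, printing a table of words and frequencies in the web browser, but nothing happens when I select and open the PDF I want the program to read.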

Edit: I seem to have gotten some parts of it working, but now I get an exception that states:

TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper

Edit 2: I now get the error:

AttributeError: 'function' object has no attribute 'seek'

The only change I made was changing the open_PDF() method to:

def open_PDF():
    filename = filedialog.askopenfile(initialdir = "/", title = "Select file", filetypes = (("PDF files","*.pdf"), ("all files","*.*")))
    return filename

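I was able to load the PDF into the program correctly and display it in the browser. I only had to change two small pieces of code, one in each of the two programs. For the first, I changed the open_PDF function to: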

def select_PDF():
    filename = filedialog.askopenfilename(initialdir = "/",title = "Select file",filetypes = (("pdf files","*.pdf"),("all files","*.*")))
    return filename

Then I changed the "Loads in PDF into program" part of the code to:

filepath = word_bag_GUI.select_PDF()
PDF_file = open(filepath, 'rb')
read_pdf = PyPDF2.PdfFileReader(PDF_file)
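The reason this change helps can be sketched with the standard library alone. filedialog.askopenfile returns an already opened file object, in text mode by default, whereas filedialog.askopenfilename returns only the path string, which can then be opened in binary mode ('rb'). The earlier TypeError is the message open() produces when it is handed a file object instead of a path, so it likely came from passing the askopenfile result somewhere a path was expected. The temporary file and its placeholder bytes below are assumptions made purely for illustration; no real PDF is involved.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 placeholder bytes")
    path = tmp.name

# Text mode (the default mode of askopenfile) yields a TextIOWrapper.
text_handle = open(path, "r", encoding="latin-1")
# Binary mode yields a buffered reader that hands back raw bytes.
binary_handle = open(path, "rb")

text_kind = type(text_handle).__name__
binary_kind = type(binary_handle).__name__
first_bytes = binary_handle.read(4)

print(text_kind, binary_kind, first_bytes)

# Clean up the handles and the throwaway file.
text_handle.close()
binary_handle.close()
os.remove(path)
```

A PDF is binary data, so a reader needs the raw bytes rather than decoded text, which is why the path from askopenfilename is opened with 'rb' above.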

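So I was able to get it to execute correctly, but now two directory windows appear instead of one, so that is the problem I am going to solve next.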
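One plausible cause of the second dialog, offered as an assumption since the final code is not shown, is that the dialog is opened once by the "Browse Files" button's command and again by the direct call to word_bag_GUI.select_PDF(). A minimal sketch of one way to avoid that is to let the button callback ask for the path exactly once and pass it straight to the processing step. The names make_browse_handler, build_window and process are hypothetical and are not part of the original programs.

```python
def make_browse_handler(ask_path, process):
    """Return a button callback that asks for a path once and processes it."""
    def handler():
        path = ask_path()
        if path:  # an empty result means the dialog was cancelled
            process(path)
        return path
    return handler


def build_window(process):
    """Build the window; tkinter is imported here so importing this module opens nothing."""
    import tkinter as tk
    from tkinter import ttk, filedialog

    def ask_path():
        return filedialog.askopenfilename(
            initialdir="/",
            title="Select file",
            filetypes=(("PDF files", "*.pdf"), ("all files", "*.*")),
        )

    window = tk.Tk()
    window.title("Word Frequency Program")
    ttk.Button(window, text="Browse Files",
               command=make_browse_handler(ask_path, process)).grid()
    ttk.Button(window, text="Quit Program", command=window.destroy).grid()
    return window
```

With this shape, the extraction steps (reading the PDF, preprocessing, building the table) would live in a single function that takes a file path, and the program would be started with something like build_window(that_function).mainloop() under an if __name__ == "__main__": guard, so that importing the GUI module no longer opens a window by itself.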

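This is because the statement PDF_file = word_bag_GUI.open_PDF assigns the function reference to PDF_file, not the function result you expect.

@acw1668 So how would I go about executing the word_bag_extraction program after selecting the file to run it on (if that makes sense)? My thought last night was that I could use another tkinter button that, when clicked, executes another method, which would cause the word_bag_extraction program to run. (I hope this makes sense; I'm new to all the coding lingo and to Python.)

@acw1668 Put more simply, how would I assign the result of the function to PDF_file?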
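The difference between a function reference and a function result, as raised in the comments, can be shown in a few lines. The open_pdf function here is a hypothetical stand-in that returns a fixed path instead of opening a file dialog.

```python
def open_pdf():
    """Hypothetical stand-in for the file dialog: returns a fixed path."""
    return "example.pdf"


reference = open_pdf   # no parentheses: the function object itself
result = open_pdf()    # parentheses: the value the function returns

# A function object has no file methods such as seek(), which is consistent
# with the AttributeError reported in the question.
has_seek = hasattr(reference, "seek")

print(callable(reference))
print(result)
print(has_seek)
```

Adding the parentheses, as in PDF_file = word_bag_GUI.open_PDF(), is what assigns the result rather than the function.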