Python 3.x: Loading a PDF file into a program using a tkinter GUI

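The program I am writing takes in a PDF, finds every word in it that is not a stop word, builds a table of those words and how often each occurs in the PDF, and then displays that table in a web browser. So far the program can do this when the PDF being read is in the same folder as the program being run. I want to streamline my code so that the user can decide which PDF the program reads, regardless of where the PDF is stored. To do this I tried tkinter, since I am not familiar with any other GUI toolkit. I can get the window and the buttons I need to show up, and I can open the file explorer, but I do not know how to actually run the code once I double-click the PDF file I want read.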

import word_bag_GUI
import PyPDF2
import pandas
import webbrowser
import os
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#Cleans the text read from a PDF by removing unwanted words and symbols#
def preprocess(text):
    #Tokenizes the text into words, which drops the punctuation#
    tokenizer = RegexpTokenizer(r'\w+')
    result = tokenizer.tokenize(text)

    #Makes all words lowercase#
    words = [item.lower() for item in result]

    #Removes all remaining tokens that are not alphabetic#
    result = [word for word in words if word.isalpha()]

    #Imports stopwords to be removed from paragraph#
    stop_words = set(stopwords.words("english"))

    #Removes the stop words from the paragraph#
    filtered_sent = []
    for w in result:
        if w not in stop_words:
            filtered_sent.append(w)

    #Return word to root word/chop-off derivational affixes#
    ps = PorterStemmer()
    stemmed_words = []
    for w in filtered_sent:
        stemmed_words.append(ps.stem(w))

    #Lemmatization reduces each word to its linguistically correct base form (lemma); 'a and b' evaluates to b when a is non-empty, so the verb lemma is the one kept#
    lem = WordNetLemmatizer()
    lemmatized_words = ' '.join([lem.lemmatize(w,'n') and lem.lemmatize(w,'v') for w in filtered_sent])

    #Re-tokenize lemmatized words string#
    tokenized_word = word_tokenize(lemmatized_words)
    return tokenized_word

#Loads in PDF into program#
PDF_file = word_bag_GUI.open_PDF
read_pdf = PyPDF2.PdfFileReader(PDF_file)

#Determines the number of pages in the PDF file and initializes the document content to an empty string#
number_of_pages = read_pdf.getNumPages()
doc_content = ""

#Extract text from the PDF file#
for i in range(number_of_pages):
    page = read_pdf.getPage(i)  #Use the loop index; getPage(0) would re-read the first page every time#
    page_content = page.extractText()
    doc_content += page_content

#Turns the text drawn from the PDF file into data the remaining code can understand#
tokenized_words = preprocess(doc_content)

#Determines the frequency of each word in the tokenized, lemmatized text#
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_words)
final_list = fdist.most_common(len(fdist))

#Organize data into two columns and export the data to an html that automatically opens#
df = pandas.DataFrame(final_list, columns = ["Word", "Frequency"])
df.to_html('word_frequency.html')
webbrowser.open('file://' + os.path.realpath('word_frequency.html'))

-------------------------------------------------------------------------

import tkinter as tk
from tkinter import ttk, filedialog

#Creates the GUI that will be used to select inputs#
window = tk.Tk()
window.geometry("300x300")
window.resizable(0,0)
window.title("Word Frequency Program")

#Browse through file directory and select PDF to be used in code#
def open_PDF():
    filedialog.askopenfile(initialdir = "/",title = "Select file",filetypes = (("PDF files","*.pdf"),("all files","*.*")))
button1 = ttk.Button(window, text = "Browse Files", command = open_PDF)
button1.grid()

#Quits the program when the "Quit Program" button is clicked#
button2 = ttk.Button(window, text = "Quit Program", command = window.quit)
button2.grid()

window.mainloop()
window.destroy()
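
I expected the program to behave the way it did before I added the tkinter GUI and show a table of words and frequencies in the web browser, but nothing happens when I select and open the PDF I want the program to read.

Edit: I seem to have gotten parts of it working, but now I get an exception stating:

TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper

Edit 2: I now get the error:

AttributeError: 'function' object has no attribute 'seek'

The only change I made was to change the open_PDF() method to: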

def open_PDF():
    filename = filedialog.askopenfile(initialdir = "/", title = "Select file", filetypes = (("PDF files","*.pdf"), ("all files","*.*")))
    return filename
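
A minimal sketch of where the two exceptions above most likely come from, with no tkinter or PyPDF2 involved. Here fake_open_pdf is a hypothetical stand-in for open_PDF(), and a temporary file stands in for the chosen PDF. filedialog.askopenfile returns an already-open file object (text mode by default), so handing that object to open() fails. A function name written without parentheses is the function itself, which has no seek method.

```python
import os
import tempfile

# Throwaway file standing in for the PDF the user would pick (assumption).
fd, path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)


def fake_open_pdf():
    """Hypothetical stand-in for open_PDF(): returns an open file object,
    as filedialog.askopenfile does (text mode by default)."""
    return open(path)


# Case 1: passing an already-open file object to open() raises TypeError.
handle = fake_open_pdf()
try:
    open(handle, "rb")
except TypeError as exc:
    type_error = str(exc)
finally:
    handle.close()

# Case 2: the function itself (no parentheses) is used where a stream is expected.
not_a_stream = fake_open_pdf
try:
    not_a_stream.seek(0)
except AttributeError as exc:
    attribute_error = str(exc)

os.remove(path)
print(type_error)
print(attribute_error)
```

The exact wording of the messages varies a little between Python versions, but the first names the file-object type and the second names the missing seek attribute.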

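I was able to load the PDF into the program correctly and display the result in the browser. I only had to change two small bits of code across the two programs. For the first, I changed the open_PDF function to: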

def select_PDF():
    filename = filedialog.askopenfilename(initialdir = "/",title = "Select file",filetypes = (("pdf files","*.pdf"),("all files","*.*")))
    return filename

Then I changed the "Loads in PDF into program" part of the code to:

filepath = word_bag_GUI.select_PDF()
PDF_file = open(filepath, 'rb')
read_pdf = PyPDF2.PdfFileReader(PDF_file)

So it now runs correctly, but two directory windows appear instead of one, and that is the problem I am working on now.
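
One plausible cause of the second dialog, assuming word_bag_GUI builds its window at import time: the Browse button opens the dialog once, and the main program then calls select_PDF() again. A sketch of one way around that is to run the processing from inside the button callback, so the dialog opens exactly once. The names make_browse_callback, ask_path, process and main are hypothetical, and the print call is only a placeholder for the word-frequency code.

```python
def make_browse_callback(ask_path, process):
    """Return a button callback that asks for a path once, then processes it."""
    def on_browse():
        path = ask_path()
        if path:  # an empty result means the dialog was cancelled
            process(path)
    return on_browse


def main():
    # tkinter is imported here so that merely importing this module
    # does not create a window or open a dialog.
    import tkinter as tk
    from tkinter import ttk, filedialog

    window = tk.Tk()
    window.title("Word Frequency Program")

    def ask_path():
        return filedialog.askopenfilename(
            title="Select file",
            filetypes=(("PDF files", "*.pdf"), ("all files", "*.*")),
        )

    def process(path):
        # Placeholder (assumption): the PDF reading and word counting go here.
        print("Selected:", path)

    ttk.Button(window, text="Browse Files",
               command=make_browse_callback(ask_path, process)).grid()
    ttk.Button(window, text="Quit Program", command=window.destroy).grid()
    window.mainloop()
```

Calling main() opens the window. Because nothing runs at import time, importing this module from another script opens neither a window nor a dialog.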

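
This is because the statement
PDF_file = word_bag_GUI.open_PDF
assigns a reference to the function itself to
PDF_file
rather than the function's result, which is what you expected.

@acw1668 So how would I go about running the word_bag_extraction program after the file has been selected (if that makes sense)? My thought last night was to use another tkinter button that runs a second method when clicked, which would in turn run the word_bag_extraction program. (I hope this makes sense; I'm new to all the coding lingo and to Python.)

@acw1668 More simply, how do I assign the result of the function to PDF_file?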
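
A small sketch of the difference the first comment describes. It uses a hypothetical stand-in function, open_pdf, that returns a fixed path instead of opening a dialog. Without parentheses the name refers to the function object; with parentheses the function is called and its return value is assigned.

```python
def open_pdf():
    """Hypothetical stand-in for word_bag_GUI.open_PDF; returns a path string."""
    return "example.pdf"


reference = open_pdf   # no parentheses: the function object itself
result = open_pdf()    # parentheses: the value the function returns

print(callable(reference))  # True
print(result)               # example.pdf
```

Applied to the question's code, that would mean writing the call with parentheses, provided the function actually returns the selected file or path.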