How do I get PyPDF2 to extract text from multiple consecutive pages in a range?
I'm trying to get PyPDF2 to extract specific text throughout a document with the code below. It pulls exactly what I need and eliminates duplicates, but instead of giving me a list built from every page, it seems to show only the text from the last page. What am I doing wrong?
#import PyPDF2 and set extracted text as the page_content variable
import PyPDF2
pdf_file = open('enme2.pdf','rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
#for loop to get number of pages and extract text from each page
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content = page.extractText()
#initialize the user_input variable
user_input = ""
#function to get the AFE numbers from the pdf document
def get_afenumbers(Y):
    #initialize the afe and afelist variables
    afe = "A"
    afelist = ""
    x = ""
    #while loop to get only 6 digits after the "A"
    while True:
        if user_input.upper().startswith("Y") == True:
            #Return a list of AFE's
            import re
            afe = re.findall('[A][0-9]{6}', page_content)
            set(afe)
            print(set(afe))
            break
        else:
            afe = "No AFE numbers found..."
        if user_input.upper().startswith("N") == True:
            print("HAVE A GREAT DAY - GOODBYE!!!")
            break
#Build a while loop for initial question prompt (when Y or N is not True):
while user_input != "Y" and user_input != "N":
    user_input = input('List AFE numbers? Y or N: ').upper()
    if user_input not in ["Y","N"]:
        print('"',user_input,'"','is an invalid input')
get_afenumbers(user_input)
#FIGURE OUT HOW TO EXTRACT FROM ALL PAGES AND NOT JUST ONE
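I'm very new to this; I had an earlier question answered today and have only just learned about regex. Thank you for your help.
It seems to work fine if you make a small change. In your loop, page_content is reassigned on every iteration, so only the last page's text is left when the loop ends. Initialize it once before the loop and append each page's text instead: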
page_content = ""  # initialize the variable before the loop
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()  # append the text of each page
I just changed the code back to your original range(number_of_pages) from range(5), which I had used for testing with my own PDF file. Thanks :-)
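For anyone who wants the whole thing in one place, here is a minimal sketch of the regex step applied across all pages. It is a variation on the answer, not the same technique: instead of concatenating everything into one string, it matches page by page and merges the results into a set, which also avoids a match accidentally spanning a page boundary. The function name collect_afe_numbers and the sample_pages strings are made up for illustration; in the real script the strings would come from page.extractText().

```python
import re

# Same pattern as in the question: an "A" followed by six digits.
AFE_PATTERN = re.compile(r"A[0-9]{6}")


def collect_afe_numbers(page_texts):
    """Return the sorted, de-duplicated AFE numbers found in any page text."""
    found = set()
    for text in page_texts:
        # findall returns every match on this page; the set drops duplicates.
        found.update(AFE_PATTERN.findall(text))
    return sorted(found)


# Hypothetical stand-ins for the strings returned by page.extractText().
sample_pages = [
    "Invoice for AFE A123456 and A654321",
    "Continuation: A123456 again, plus A000001",
    "No numbers on this page",
]

print(collect_afe_numbers(sample_pages))
```

With the reader from the question, the call would presumably look like collect_afe_numbers(read_pdf.getPage(i).extractText() for i in range(number_of_pages)).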