Python: How do I make sure my PDF-reading code doesn't return NaN rows and duplicate rows?
I am reading CVs in PDF format into a pandas DataFrame. After reading the files, however, I find that the row for the last CV is NaN and another row is duplicated (the files are read in alphabetical order). Is something in the code causing this? I can't work out why. I have tried changing the iloc index [0] parts and the fileIndex value, but haven't found a solution. Any help is appreciated.
dataset = []
pdf_dir = "C:/Users/user/Documents/CV ML Test/CVs/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
dataset = pd.DataFrame(columns = ['FileName','Text'])
fileIndex = 0
for file in pdf_files:
    pdfFileObj = open(file,'rb') #'rb' for read binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    startPage = 0
    text = ''
    cleanText = ''
    while startPage <= pdfReader.numPages-1:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
    newRow.iloc[0]['FileName'] = file
    newRow.iloc[0]['Text'] = text
    dataset = pd.concat([output_data, newRow], ignore_index=True)
Fixed by changing the following part of the code:
dataset = pd.DataFrame(columns = ['FileName','Text'])
for file in pdf_files:
    pdfFileObj = open(file,'rb') #'rb' for read binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    startPage = 0
    text = ''
    cleanText = ''
    # read the text of every page in the file
    while startPage <= pdfReader.numPages-1:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    # drop newline characters, then split the text into words
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
    newRow.iloc[0]['FileName'] = file
    newRow.iloc[0]['Text'] = text
    # append to dataset itself (the original code concatenated onto output_data)
    dataset = pd.concat([dataset, newRow], ignore_index= True )
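As an alternative to growing the frame with pd.concat inside the loop, the rows can be collected in a list and the DataFrame built once, which leaves no second variable to mix up. The following is only a sketch under assumptions: build_dataset, check_dataset and the extract_text callable are hypothetical names that do not appear in the original code, and the fake_pages dictionary stands in for real PDF files so the example runs without PyPDF2.

```python
import pandas as pd


def build_dataset(pdf_files, extract_text):
    """Build a FileName/Text frame; extract_text(path) must return a str.

    extract_text is a hypothetical hook: in the real script it would wrap
    the PyPDF2 page loop shown above.
    """
    rows = []
    for path in pdf_files:
        raw = extract_text(path)
        # Same cleaning as the original loop: drop '\n', then split on whitespace.
        rows.append({"FileName": path, "Text": raw.replace("\n", "").split()})
    # One constructor call instead of one pd.concat per file.
    return pd.DataFrame(rows, columns=["FileName", "Text"])


def check_dataset(df):
    """Return (rows containing NaN, repeated file names) as plain ints."""
    nan_rows = int(df.isna().any(axis=1).sum())
    dup_rows = int(df["FileName"].duplicated().sum())
    return nan_rows, dup_rows


# Stand-in data instead of real PDFs (an assumption for this sketch).
fake_pages = {"a.pdf": "Jane Doe\n Python", "b.pdf": "John Smith\n SQL"}
dataset = build_dataset(sorted(fake_pages), fake_pages.get)
print(check_dataset(dataset))  # prints (0, 0)
```

If check_dataset returns anything other than (0, 0), the frame contains the kind of NaN or repeated row described in the question.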