Python: How can I make sure my PDF-reading code does not return NaN rows and duplicate rows?

Tags: python, pandas, machine-learning, nlp

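I am reading PDF files containing CVs into a pandas DataFrame. However, after reading the files, I find that one row for the last CV is NaN and another row is a duplicate (the files are read in alphabetical order). Is there something in the code that causes this? I can't work out why. I have tried changing the iloc index [0] part and the fileIndex value, but found no solution. Any help is appreciated.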

import glob
import pandas as pd
import PyPDF2

dataset = []

pdf_dir = "C:/Users/user/Documents/CV ML Test/CVs/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

dataset = pd.DataFrame(columns = ['FileName','Text'])
fileIndex = 0

for file in pdf_files:

  pdfFileObj = open(file,'rb')     #'rb' for read binary mode
  pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

  startPage = 0
  text = ''
  cleanText = ''
  while startPage <= pdfReader.numPages-1:
    pageObj = pdfReader.getPage(startPage)
    text += pageObj.extractText()
    startPage += 1
  pdfFileObj.close()
  for myWord in text:
    if myWord != '\n':
      cleanText += myWord
  text = cleanText.split()
  newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
  newRow.iloc[0]['FileName'] = file
  newRow.iloc[0]['Text'] = text
  dataset = pd.concat([output_data, newRow], ignore_index=True)

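Fixed by changing the following part of the code (the pd.concat call now appends to dataset instead of output_data, and the unused fileIndex is dropped):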

dataset = pd.DataFrame(columns = ['FileName','Text'])
for file in pdf_files:

  pdfFileObj = open(file,'rb')     #'rb' for read binary mode
  pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

  startPage = 0
  text = ''
  cleanText = ''
  while startPage <= pdfReader.numPages-1:
    pageObj = pdfReader.getPage(startPage)
    text += pageObj.extractText()
    startPage += 1
  pdfFileObj.close()
  for myWord in text:
    if myWord != '\n':
      cleanText += myWord
  text = cleanText.split()
  newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
  newRow.iloc[0]['FileName'] = file
  newRow.iloc[0]['Text'] = text
  dataset = pd.concat([dataset, newRow], ignore_index= True )
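As a possible alternative to calling pd.concat inside the loop, the rows can be collected in a plain list and turned into a DataFrame once at the end, which leaves no accumulator variable to mistype. This is only a sketch under stated assumptions: collect_rows, to_dataframe and the extract_text callable are hypothetical names that do not appear in the original post, and extract_text stands in for the PyPDF2 page loop shown above.

```python
def collect_rows(pdf_files, extract_text):
    """Return one {'FileName', 'Text'} record per input file.

    extract_text is a hypothetical callable (path -> str) standing in for
    the PyPDF2 page loop used in the answer; it is not a PyPDF2 function.
    """
    rows = []
    for file in pdf_files:
        raw = extract_text(file)
        # Mirror the original cleanup: drop newline characters,
        # then split the remaining text on whitespace.
        words = raw.replace("\n", "").split()
        rows.append({"FileName": file, "Text": words})
    return rows


def to_dataframe(rows):
    """Build the DataFrame in one step (pandas is imported lazily here)."""
    import pandas as pd

    return pd.DataFrame(rows, columns=["FileName", "Text"])
```

It could be used as dataset = to_dataframe(collect_rows(pdf_files, my_extract)), where my_extract is whatever function wraps the PDF text extraction. Each file contributes exactly one row, so a stale or misnamed frame cannot leak extra rows into the result.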