Python: How do I make sure my PDF-reading code does not return NaN rows and duplicate rows?


python pandas machine-learning nlp


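I am reading PDF files (CVs) into a pandas DataFrame. However, after reading the files, I find that one row for the last CV is NaN and another row is a duplicate (the rows are in alphabetical order). Is there something in the code that causes this? I can't work out why. I have tried changing the iloc index [0] part and the fileIndex value, but have not found a solution. Any help is appreciated.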

dataset = []

pdf_dir = "C:/Users/user/Documents/CV ML Test/CVs/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

dataset = pd.DataFrame(columns = ['FileName','Text'])
fileIndex = 0

for file in pdf_files:

  pdfFileObj = open(file,'rb')     #'rb' for read binary mode
  pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

  startPage = 0
  text = ''
  cleanText = ''
  while startPage <= pdfReader.numPages-1:
    pageObj = pdfReader.getPage(startPage)
    text += pageObj.extractText()
    startPage += 1
  pdfFileObj.close()
  for myWord in text:
    if myWord != '\n':
      cleanText += myWord
  text = cleanText.split()
  newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
  newRow.iloc[0]['FileName'] = file
  newRow.iloc[0]['Text'] = text
  dataset = pd.concat([output_data, newRow], ignore_index=True)

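I fixed it by changing the following part of the code. The key change is in the last line: pd.concat now receives dataset itself rather than output_data, a name that is never defined in the original snippet, so each new row is appended to the frame that is actually being built. The unused fileIndex counter is also gone.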

dataset = pd.DataFrame(columns = ['FileName','Text'])
for file in pdf_files:

  pdfFileObj = open(file,'rb')     #'rb' for read binary mode
  pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

  startPage = 0
  text = ''
  cleanText = ''
  while startPage <= pdfReader.numPages-1:
    pageObj = pdfReader.getPage(startPage)
    text += pageObj.extractText()
    startPage += 1
  pdfFileObj.close()
  for myWord in text:
    if myWord != '\n':
      cleanText += myWord
  text = cleanText.split()
  newRow = pd.DataFrame(index = [0], columns = ['FileName', 'Text'])
  newRow.iloc[0]['FileName'] = file
  newRow.iloc[0]['Text'] = text
  dataset = pd.concat([dataset, newRow], ignore_index= True )
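
As a complement to the fix above, here is a minimal sketch of another way to build the same two-column frame: collect one dict per file in a plain list and create the DataFrame once at the end, then count NaN rows and repeated file names. This is not the asker's code. The names clean_text, build_dataset and check_dataset are hypothetical, and the extract_text argument is an assumed stand-in for whatever function returns the raw text of one PDF (for example, a wrapper around the PyPDF2 page loop shown above).

```python
import pandas as pd


def clean_text(raw):
    """Drop newline characters, then split on whitespace.

    This mirrors the character loop in the answer: words separated
    only by a newline are fused, exactly as in the original code.
    """
    return raw.replace("\n", "").split()


def build_dataset(pdf_files, extract_text):
    """Build a FileName/Text frame with one row per distinct path.

    extract_text is an assumed callable: path -> raw text of that PDF.
    """
    rows = []
    # sorted(set(...)) removes repeated paths and gives a stable order
    for path in sorted(set(pdf_files)):
        rows.append({"FileName": path, "Text": clean_text(extract_text(path))})
    # Creating the frame once avoids placeholder rows and chained assignment
    return pd.DataFrame(rows, columns=["FileName", "Text"])


def check_dataset(dataset):
    """Return (rows containing NaN, rows repeating an earlier FileName)."""
    nan_rows = int(dataset.isna().any(axis=1).sum())
    # Compare on FileName only: the Text cells hold lists, which are unhashable
    duplicate_rows = int(dataset.duplicated(subset="FileName").sum())
    return nan_rows, duplicate_rows


# Demonstration with fake text instead of real PDF files
fake_pages = {
    "a.pdf": "Jane Doe \nPython",
    "b.pdf": "John Smith \nSQL",
}
dataset = build_dataset(["b.pdf", "a.pdf", "a.pdf"], fake_pages.get)
nan_rows, duplicate_rows = check_dataset(dataset)
print(dataset)
print("NaN rows:", nan_rows, "duplicate rows:", duplicate_rows)
```

With real files, extract_text would open each path and join the text of its pages. Running check_dataset right after the loop is a quick way to confirm that no empty or repeated rows slipped in.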
