Python pdf2txt cleaning issue

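I am extracting text from a PDF file, but I am running into some post-extraction issues.

This is what the extraction gives me:

s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'

Stripping the newlines with s.replace('\n','') fuses everything together:

Our offer is 44ProcessingPipelinePipeline2AmazonEC23

But what I would like to get is: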

Our offer is 44 Processing Pipeline 2 Amazon EC 2 3

My code:

# Note: these imports follow the legacy pdfminer layout used in the question
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
import warnings
warnings.filterwarnings("ignore")

# fp: the PDF, already opened as a binary file object (not shown in the question)
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager()
# layout analysis parameters; both margins are raised to 13.0
laparams = LAParams()
laparams.char_margin = 13.0
laparams.word_margin = 13.0
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
extracted_text = ''

# collect the text of every text box / text line on every page
for page in doc.get_pages():
    interpreter.process_page(page)
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            extracted_text += lt_obj.get_text()

print(extracted_text)

from nltk import tokenize
# split by sentence
newtext = tokenize.sent_tokenize(extracted_text)
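Then I run a replace on \n over the output.

My idea is to look at the neighbors of each \n and evaluate them.

If

\n has no previous neighbor (whitespace) but has a following one, replace "(\n + whitespace)" with (whitespace)

\n has neighbors on both sides, replace "(\n)" with (whitespace)

\n is followed by an uppercase neighbor and has no previous neighbor (whitespace), replace "(\n + 'uppercase')" with (whitespace)

I think I am digging too deep into this, and probably someone has already done it before.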


“我们的报价是IO)\n4\n4\n处理\n\nPipeline\nPipeline\n2\nA\nm\nA\nz\no\nn\nE\nC\n2\n”

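I think one solution is to use regular expressions. I tried to write a suitable pattern, but I am no expert in regex or patterns, and I don't know why it doesn't work. This is the closest I could get. The code: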

import re

s = "Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n"
w = s.replace('\n', ' ')
print(w)
# Replacing with a space keeps the words apart, but "Amazon" and "EC"
# stay split into single letters

# An uppercase letter followed by single lowercase letters, or by single
# uppercase letters, each on its own line
pattern = '([A-Z](\n[a-z])+[\n])|([A-Z](\n[A-Z])+[\n])'

result = re.findall(pattern, s)

# Start index of every match
indices = [m.start(0) for m in re.finditer(pattern, s)]

print(result)
print(indices)
Output:

$ python3 a.py 
Our offer is  4 4 Processing  Pipeline Pipeline 2 A m a z o n E C 2 
[('A\nm\na\nz\no\nn\n', '\nn', '', ''), ('', '', 'E\nC\n', '\nC')]
[50, 62]

Good luck.
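
One possible way to build on the pattern above is to hand it to re.sub with a callback that strips the newlines inside each matched run, and then normalize the remaining whitespace. This is only a sketch, under the assumption that every spelled-out word starts with an uppercase letter as in the sample string; the name join_spelled_words is made up for illustration.

```python
import re

# Same idea as the pattern above, with non-capturing groups:
# an uppercase letter followed by one-letter lines that are all lowercase
# ("A\nm\na\nz\no\nn\n") or all uppercase ("E\nC\n").
SPELLED_WORD = re.compile(r'[A-Z](?:\n[a-z])+\n|[A-Z](?:\n[A-Z])+\n')


def join_spelled_words(text):
    """Glue letter-per-line runs back into words, then collapse whitespace."""
    # Drop the newlines inside each run but keep one trailing separator.
    glued = SPELLED_WORD.sub(lambda m: m.group(0).replace('\n', '') + '\n', text)
    # str.split() with no arguments splits on any run of whitespace.
    return ' '.join(glued.split())


s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'
print(join_spelled_words(s))
# Our offer is 4 4 Processing Pipeline Pipeline 2 Amazon EC 2
```

Digits stay separate here ("4 4") and the repeated "Pipeline" is left untouched. A word that ends in an uppercase letter and is followed by one-letter lines would be glued incorrectly, so the pattern would need adjusting for other documents.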

Doing this with regular expressions seems fairly hard. I came up with the solution below; it is not elegant, but it works.

s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'

prev_c = '\0'  # character before the current one ('\0' at the start)

out = ''
for ii, cc in enumerate(s):
    # character after the current one ('\0' at the end)
    c = s[ii+1] if ii < len(s)-1 else '\0'
    if cc == '\n':
        # drop the newline when it follows whitespace or splits a token
        if prev_c == ' ' or \
           prev_c == '\n' or \
           prev_c.isdigit() and c.isdigit() or \
           prev_c.islower() and c.islower() or \
           prev_c.isupper() and c.isupper() or \
           prev_c.isupper() and c.islower():
            pass
        else:
            # otherwise the newline separates two tokens
            out += ' '
    else:
        out += cc

    prev_c = cc

print(out)

Hi, thanks. For my use case this is a very simple solution, but as you mentioned, it may break down in other cases.
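
The neighbor rules from the loop in the answer above can also be written as regular-expression lookarounds, which may be easier to adjust when other cases turn up. This is a minimal sketch, assuming ASCII letters only ([a-z] and [A-Z] stand in for str.islower and str.isupper); the name clean_newlines is hypothetical.

```python
import re

# A newline is dropped when:
#   - the previous character is a space or another newline, or
#   - it sits between two digits, or
#   - it sits between two lowercase letters, or
#   - it follows an uppercase letter and precedes any letter.
DROP_NEWLINE = re.compile(
    r'(?<=[ \n])\n'
    r'|(?<=\d)\n(?=\d)'
    r'|(?<=[a-z])\n(?=[a-z])'
    r'|(?<=[A-Z])\n(?=[A-Za-z])'
)


def clean_newlines(text):
    """Remove newlines that split a token; turn the remaining ones into spaces."""
    return DROP_NEWLINE.sub('', text).replace('\n', ' ')


s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'
print(repr(clean_newlines(s)))
# 'Our offer is 44 Processing Pipeline Pipeline 2 Amazon EC 2 '
```

Each rule is a separate alternative, so a case that misbehaves on another document can be removed or tightened without touching the rest.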