Python pdf2txt cleanup problem
I am extracting text from PDF files, but I am running into some problems after extraction.
This is what I get:
s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'
s.replace('\n','')
Our offer is 44ProcessingPipelinePipeline2AmazonEC23
But what I would like to get is:
Our offer is 44 Processing Pipeline 2 Amazon EC 2 3
My code:
import warnings

# Legacy pdfminer API: PDFDocument is imported from pdfminer.pdfparser here.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from nltk import tokenize

warnings.filterwarnings("ignore")

# fp is not defined in the post; it is assumed to be the PDF opened in binary mode.
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')

rsrcmgr = PDFResourceManager()
laparams = LAParams()
laparams.char_margin = 13.0
laparams.word_margin = 13.0
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Collect the text of every text box and text line on every page.
extracted_text = ''
for page in doc.get_pages():
    interpreter.process_page(page)
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            extracted_text += lt_obj.get_text()
print(extracted_text)

# split by sentence
newtext = tokenize.sent_tokenize(extracted_text)
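Then, on that output, I run a replace on \n.
My idea is to find the neighbors of each \n and evaluate them:
When \n has no previous neighbor (blank) but has a following one, replace "\n + blank" with (blank)
When \n has neighbors on both sides, replace "\n" with (blank)
When \n is followed by an uppercase neighbor and has no previous neighbor (blank), replace "\n + uppercase" with (blank)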
I feel I am getting too deep into this problem, and someone has probably solved it before.
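The neighbor idea described above can be explored before committing to any rule. The following is a minimal sketch, not part of the original post: the helper name `newline_neighbors` is hypothetical, and it only reports the character on each side of every \n so the cases can be inspected.

```python
def newline_neighbors(text):
    """Yield (index, previous_char, next_char) for every newline in text.

    A missing neighbor at either end of the string is reported as ''.
    """
    for i, ch in enumerate(text):
        if ch == '\n':
            prev_ch = text[i - 1] if i > 0 else ''
            next_ch = text[i + 1] if i + 1 < len(text) else ''
            yield i, prev_ch, next_ch


# Shortened version of the string from the question.
sample = 'Our offer is \n4\n4\nProcessing\n\nPipeline'
for index, prev_ch, next_ch in newline_neighbors(sample):
    print(index, repr(prev_ch), repr(next_ch))
```

Listing the pairs this way makes it easier to see which combinations (space/digit, digit/digit, letter/newline, and so on) actually occur in the extracted text.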
I think one solution is to use regular expressions. I tried to write a suitable pattern, but I am no expert in regular expressions or patterns, and I do not know why it does not work. This is the closest I could get. The code:
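import re

s = "Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n"

# Naive attempt: replacing every newline with a space leaves "A m a z o n" split up.
w = s.replace('\n', ' ')
print(w)

# Match a capital followed by single lowercase letters, or a run of single capitals,
# with a newline between each character and after the last one.
pattern = '([A-Z](\n[a-z])+[\n])|([A-Z](\n[A-Z])+[\n])'
result = re.findall(pattern, s)
# Start index of every match (the loop variable no longer shadows the built-in iter).
indices = [m.start(0) for m in re.finditer(pattern, s)]
print(result)
print(indices)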
Output:
$ python3 a.py
Our offer is 4 4 Processing Pipeline Pipeline 2 A m a z o n E C 2
[('A\nm\na\nz\no\nn\n', '\nn', '', ''), ('', '', 'E\nC\n', '\nC')]
[50, 62]
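The matches above show that the pattern does locate the letter runs at indices 50 and 62; what is missing is a step that rewrites them. One possible way, sketched here under assumptions, is `re.sub` with a replacement function that strips the newlines inside each run. The helper name `join_single_char_runs`, the added digit alternative, and the use of `re.MULTILINE` anchors are adaptations for illustration, not the original pattern.

```python
import re

# Runs of one-character lines: with re.MULTILINE, ^ and $ match at line
# boundaries, so every character in a run must sit alone on its line.
SINGLE_CHAR_RUN = re.compile(
    r'^[A-Z](?:\n[a-z])+$'     # a capital followed by single lowercase letters
    r'|^[A-Z](?:\n[A-Z])+$'    # a run of single capitals
    r'|^[0-9](?:\n[0-9])+$',   # a run of single digits
    re.MULTILINE,
)


def join_single_char_runs(text):
    """Glue one-character lines back together, then normalise whitespace."""
    joined = SINGLE_CHAR_RUN.sub(lambda m: m.group(0).replace('\n', ''), text)
    # split() with no argument collapses any mix of spaces and newlines.
    return ' '.join(joined.split())


s = "Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n"
print(join_single_char_runs(s))
# Our offer is 44 Processing Pipeline Pipeline 2 Amazon EC 2
```

Because only complete one-character lines are merged, ordinary line breaks such as 'Hello\nworld' become a single space instead of being glued together.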
Good luck. Using regular expressions seems rather difficult. I came up with the solution below; it is not elegant, but it works.
s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'
prev_c = '\0'
out = ''
for ii, cc in enumerate(s):
    # c is the character after the current one ('\0' at the end of the string).
    c = s[ii+1] if ii < len(s)-1 else '\0'
    if cc == '\n':
        # Drop the newline when it follows whitespace or sits inside a run of
        # digits or letters; otherwise turn it into a space.
        if prev_c == ' ' or \
           prev_c == '\n' or \
           prev_c.isdigit() and c.isdigit() or \
           prev_c.islower() and c.islower() or \
           prev_c.isupper() and c.isupper() or \
           prev_c.isupper() and c.islower():
            pass
        else:
            out += ' '
    else:
        out += cc
    prev_c = cc
print(out)
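For comparison, the same character-class rules can be written with regular expression lookarounds. This is only a sketch: the name `collapse_newlines` is hypothetical, and the character classes are ASCII-only, whereas `str.isdigit`, `str.islower` and `str.isupper` in the loop also accept non-ASCII characters.

```python
import re

# Newlines to delete outright: after a space or another newline, or between
# digit/digit, lower/lower, upper/upper or upper/lower neighbors.
DROP_NEWLINE = re.compile(
    r'(?<=[ \n])\n'
    r'|(?<=[0-9])\n(?=[0-9])'
    r'|(?<=[a-z])\n(?=[a-z])'
    r'|(?<=[A-Z])\n(?=[A-Za-z])'
)


def collapse_newlines(text):
    """Delete newlines inside runs, then turn the remaining ones into spaces."""
    without_inner_breaks = DROP_NEWLINE.sub('', text)
    return without_inner_breaks.replace('\n', ' ')


s = 'Our offer is \n4\n4\nProcessing\n\nPipeline\nPipeline\n2\nA\nm\na\nz\no\nn\nE\nC\n2\n'
print(collapse_newlines(s))
```

Both versions share one limitation: a line ending in a lowercase letter followed by a line starting with one is merged, so 'Hello\nworld' becomes 'Helloworld'.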
Hi, thanks. This is a very simple solution for my use case, but as you mentioned, it may fail in other cases.