Extract specific sections of a PDF (such as the abstract or introduction) in Python?

Tags: python, regex, text-mining, pdfminer

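I am trying to extract specific sections of a PDF file. I read the PDF with the code below, but it extracts everything at once, including all the junk text. I only want the abstract, introduction, methods, and conclusion. I have also tried regular expressions. Is there any way to extract just these sections?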

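import io
import re
import string

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdfparser(pdffile):
    with open(pdffile, mode='rb') as f:
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.get_pages(f):
            interpreter.process_page(page)
        device.close()
        # retstr accumulates the text of all pages, so read it once at the end.
        data = retstr.getvalue()

    # Cleaning the data
    data = data.lower()
    # Drop bracketed spans such as [12]; the original pattern '\[*?\]' lacked the '.'.
    data = re.sub(r'\[.*?\]', '', data)
    data = re.sub('[%s]' % re.escape(string.punctuation), '', data)
    # Drop every token that contains a digit.
    data = re.sub(r'\w*\d\w*', '', data)
    # Note: this removes all line breaks, so line-based matching no longer works afterwards.
    data = data.replace("\n", "")

    print(data)

    return data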

Regex code:

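import re

paragraph = "Abstract"


def abstractExtraction(text, paragraph):
    # Compile the patterns once instead of on every loop iteration.
    start = re.compile(r'(?<!\S)' + re.escape(paragraph), re.IGNORECASE)
    skip = re.compile('abstract')  # lines starting with lowercase "abstract" are skipped
    numbered_intro = re.compile(r'\d.*\s*Introduction', re.IGNORECASE)
    # The numeral alternatives are grouped; ungrouped, 'X' or 'IV' alone matched any line.
    roman_intro = re.compile(r'(?:X|IV|V?I{0,3}).*\s*Introduction', re.IGNORECASE)

    found = False
    para = ""
    # str.replace() treats its argument literally, so regex patterns need re.sub().
    text = re.sub(r'\n\n+', '\n', text)
    text = re.sub(r'\s\s\s+', '\n', text)
    for line in re.split(r'\n+', text):
        if skip.match(line):
            continue
        if start.match(line):
            found = True
        if found:
            # Stop at the "Introduction" heading (arabic or roman numbering).
            if numbered_intro.match(line) or roman_intro.match(line):
                return para
            para = para + line
    if len(para) > 1000:
        return 'None'
    return para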
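
For comparison, here is one possible way to pull out all four sections in a single pass. This is only a minimal sketch, not a tested solution for real papers. It assumes each section heading sits on a line of its own, optionally preceded by an arabic or roman number, and that the headings use exactly the names listed. `SECTION_NAMES`, `HEADING_RE`, `split_sections`, and the sample text are all hypothetical names and data made up for this illustration.

```python
import re

# Assumed heading names; real papers vary ("Methodology", "Related Work", ...).
SECTION_NAMES = ["abstract", "introduction", "methods", "conclusion"]

# Assumption: a heading is a whole line, optionally numbered ("1.", "II.").
HEADING_RE = re.compile(
    r"^[ \t]*(?:(?:\d+|[IVX]+)\.?[ \t]+)?("
    + "|".join(SECTION_NAMES)
    + r")s?[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Return {section name: body text} for each heading found in text."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs until the next recognised heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = m.group(1).lower()
        body = text[m.end():end].strip()
        # Keep the first occurrence if a heading name repeats.
        sections.setdefault(name, body)
    return sections


sample = """Some Title

Abstract
We study X.

1. Introduction
X matters because Y.

2. Methods
We measure Z.

3. Conclusion
Z works.
"""

sections = split_sections(sample)
print(sections["abstract"])  # We study X.
```

A function like this needs the raw extracted text, with its line breaks intact. The cleaning step in `pdfparser` lowercases the text and strips punctuation, digits, and newlines, which removes exactly the cues this approach relies on.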
