Extracting specific sections of a PDF (such as the abstract or introduction) in Python?

Tags: python, regex, text-mining, pdfminer


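I am trying to extract specific sections of a PDF file. I read the PDF with the code below. This code extracts everything at once, including all the junk values. I only want to focus on the abstract, introduction, methods, and conclusion. I have also tried using regular expressions. Please let me know if there is any way to extract the sections mentioned above.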

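import io
import re
import string

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdfparser(pdffile):
    with open(pdffile, mode='rb') as f:
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.get_pages(f):
            interpreter.process_page(page)
        # Read the accumulated text once, after all pages are processed.
        data = retstr.getvalue()
        device.close()

        # Cleaning the data
        data = data.lower()
        # Remove bracketed text such as "[12]". The original pattern '\[*?\]'
        # had no '.', so it matched little more than the closing bracket;
        # '\[.*?\]' is presumably what was intended.
        data = re.sub(r'\[.*?\]', '', data)
        data = re.sub('[%s]' % re.escape(string.punctuation), '', data)
        # Remove every token that contains a digit.
        data = re.sub(r'\w*\d\w*', '', data)
        # Note: this also discards the line breaks that separate headings
        # from body text.
        data = data.replace("\n", "")

        print(data)

        return data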

Regex code:

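import re

paragraph = "Abstract"

def abstractExtraction(text, paragraph):
    # Compile the patterns once instead of on every loop iteration.
    heading = re.compile(r'(?<!\S)' + paragraph, re.IGNORECASE)
    lowercase_abstract = re.compile('abstract')
    # Headings such as "1. Introduction" or "II. Introduction".
    numbered_intro = re.compile(r'\d.*\s*Introduction', re.IGNORECASE)
    # The alternation is grouped; ungrouped, 'X|IV|...' also matched any
    # line that merely starts with "X" or "IV".
    roman_intro = re.compile(r'(?:X|IV|V?I{0,3}).*\s*Introduction', re.IGNORECASE)

    found = False
    para = ""
    # str.replace() treats its argument literally, so the regex patterns
    # need re.sub() to have any effect.
    text = re.sub(r'\n\n+', '\n', text)
    text = re.sub(r'\s\s\s+', '\n', text)
    for line in re.split(r'\n+', text):
        # Lines starting with lowercase "abstract" are skipped, as in the
        # original logic.
        if lowercase_abstract.match(line) is not None:
            continue
        if heading.match(line) is not None:
            found = True
        if found:
            if numbered_intro.match(line) or roman_intro.match(line):
                return para
            para = para + line
    # As in the original, the string 'None' (not the None object) is
    # returned when no Introduction heading was found and the text is long.
    if len(para) > 1000:
        return 'None'
    return para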
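
One possible way to generalise the attempt above from the abstract to all four sections is to split the text at heading lines. The following is only a minimal sketch, not a definitive solution: the helper name `split_sections`, the `SECTION_PATTERNS` table, and the assumption that each heading sits on its own line (optionally prefixed by a number or Roman numeral) are all illustrative choices, not something taken from the question. It has to run on the raw extracted text, before line breaks are stripped, and any section whose heading is not listed (for example "Related Work") ends up inside the preceding section.

```python
import re

# Hypothetical heading names and spellings; real papers vary, so adjust these.
SECTION_PATTERNS = {
    "abstract": r"abstract",
    "introduction": r"introduction",
    "methods": r"methods?|methodology",
    "conclusion": r"conclusions?",
}

# Assumption: a heading is a line that holds only an optional "1." / "2" /
# "IV." style prefix followed by one of the section names.
HEADING_RE = re.compile(
    r"^\s*(?:(?:\d+|[IVX]+)[.)]?\s+)?(?:"
    + "|".join(f"(?P<{name}>{pat})" for name, pat in SECTION_PATTERNS.items())
    + r")\s*:?\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Return {section_name: body_text} for the headings found in text."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # lastgroup is the name of the alternative that matched.
            current = match.lastgroup
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    # Join the collected lines of each section with single spaces.
    return {name: " ".join(lines) for name, lines in sections.items()}


if __name__ == "__main__":
    sample = (
        "Some Title\nAuthors\nAbstract\nWe study X.\nIt works.\n"
        "1. Introduction\nPDFs are hard.\n2 Methods\nWe parse.\n"
        "IV. Conclusions\nDone.\n"
    )
    for name, body in split_sections(sample).items():
        print(name, "->", body)
```

Text before the first recognised heading (title, authors) is dropped. Cleaning steps such as lower-casing or removing punctuation can then be applied to each section separately.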