Extracting information from scientific papers with Python?

python dictionary text web-scraping

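I'm new to Python, and I need to extract some information from several scientific papers.

Given plain text such as:

Introduction
some long writings

Methodology
some long writings

Results
some long writings

how can I put a paper into a dictionary like the one below?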

paper_1 = {
           'Introduction': 'some long writings',
           'Methodology': 'some long writings',
           'Results': 'some long writings'
          }

Thanks a lot! :-)

After some attempts I got the following code running, but it does not work well:

text = 'introduction This is the FIRST part.' \
       'Methodologies This is the SECOND part.' \
       'results This is the THIRD part.'

import re
from re import finditer

d={}
first =[]
second =[]
title_list=[]
all =[]

for match in finditer("Methodology|results|methodologies|introduction|", text, re.IGNORECASE):
    if match.group() != '':
        title = match.group()
        location = match.span()
        first.append(location[0])
        second.append(location[1])
        title_list.append(title)

all.append(first)
all.append(second)

a=[]
for i in range(2):
    j = i+1
    section = text[all[1][i]:all[0][j]]
    a.append(section)

for i in zip(title_list, a):
    d[i[0]] = i[1]
print (d)

This produces the following result:

{
'introduction': ' This is the FIRST part.', 
'Methodologies': ' This is the SECOND part.'
}

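However:

(i) It fails to extract the last part, which is the Results section.

(ii) In the loop I passed 2 to range() because I know there are only 3 sections (Introduction, Methodology and Results), but in some papers people add more sections. How can I automatically give range() the right value? For example, some papers may have the following sections:

Introduction
some long writings

General background about something
some long writings

Some kind of section title
some long writings

Methodology
some long writings

Results
some long writings

(iii) Is there a more efficient way to build the dictionary inside the first loop, so that I don't need a second loop?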

Update, 30 March 2018:

The code has been updated as follows:

def section_detection(text):
    title_list=[]
    all =[[],[]]
    dic={}
    count = 0
    pattern = r'\d\. [A-Z][a-z]*'

    for match in finditer(pattern, text, re.IGNORECASE):
        if match.group() != '':
            all[0].append(match.span()[0])
            all[1].append(match.span()[1])
            title_list.append(match.group())
            count += 1

    for i in range(count):
        j = i+1
        try:
            dic[title_list[i]]=text[all[1][i]:all[0][j]]
        except IndexError:
            dic[title_list[i]]=text[all[1][i]:]

    return dic

import re
from re import finditer
text = '1. introduction This is the FIRST part.' \
       '2. Methodologies This is the SECOND part.' \
       '3. results This is the THIRD part.'\
       '4. somesection This SOME section'

dic = section_detection(text)
print(dic)

This gives:

{'1. introduction': ' This is the FIRST part.', '2. Methodologies': ' This is the SECOND part.', '3. results': ' This is the THIRD part.', '4. somesection': ' This SOME section'}
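The index bookkeeping in the updated function can be avoided by pairing each heading match with the one that follows it. The sketch below is only one possible way to do that: the name `section_detection_compact` is made up for illustration, and the pattern is slightly widened (`\d+` and `[A-Za-z]+`) on the assumption that headings look like "12. Title".

```python
import re

def section_detection_compact(text, pattern=r'\d+\. [A-Za-z]+'):
    """Map each numbered heading to the text that follows it."""
    matches = list(re.finditer(pattern, text))
    dic = {}
    # Pair each heading with the next one; the last heading is paired with None.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        dic[current.group()] = text[current.end():end]
    return dic

text = ('1. introduction This is the FIRST part.'
        '2. Methodologies This is the SECOND part.'
        '3. results This is the THIRD part.'
        '4. somesection This SOME section')

dic = section_detection_compact(text)
print(dic)
```

Because the number of sections is taken from the matches themselves, nothing has to be passed to range(), and the last section is handled without a try/except.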

Thank you all very much! :-)

Try this:

text = 'introduction This is the FIRST part. ' \
       'Methodologies This is the SECOND part. ' \
       'results This is the THIRD part. '

import re

kw = ['methodology', 'results', 'methodologies', 'introduction']

pat = re.compile(r'(%s)' % '|'.join(kw), re.IGNORECASE)

sp = [x for x  in re.split(pat, text) if x]
dic = {k:v for k,v in zip(sp[0::2],sp[1::2])}

print(dic)

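But this only works for your example; for real-world documents, not so much. You haven't specified what text comes before "Introduction", or what should happen when someone mentions "results" in the body text.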
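One way to reduce false matches on a word like "results" inside the body text is to accept a heading only when it stands alone on its own line, as in the plain-text layout from the question. This is a minimal sketch under that assumption; the function name `split_sections` and the sample text are illustrative, not part of the original answer.

```python
import re

def split_sections(text, headings):
    """Split text into {heading: body}, matching a heading only when it fills a whole line."""
    alternatives = '|'.join(re.escape(h) for h in headings)
    # re.MULTILINE makes ^ and $ match at the start and end of every line.
    pat = re.compile(r'^[ \t]*(%s)[ \t]*$' % alternatives,
                     re.IGNORECASE | re.MULTILINE)
    matches = list(pat.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs up to the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections

sample = (
    "Introduction\n"
    "We describe the results of earlier work.\n"
    "\n"
    "Methodology\n"
    "Some long writings.\n"
    "\n"
    "Results\n"
    "Some more long writings.\n"
)

paper_1 = split_sections(sample, ['introduction', 'methodology', 'methodologies', 'results'])
print(paper_1)
```

The word "results" in the first body line is not treated as a heading because it does not fill the whole line. Text before the first heading is simply ignored in this sketch.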

I really like the regex written by @Franz Forstmayr. I just want to point out a few ways to break it.

text = '''
introduction This is the FIRST part.
introductionMethodologies This is the SECOND part.
results This is the THIRD part.
'''

import re
#### Regex based on https://stackoverflow.com/a/49546458/8083313
kw = ['methodology', 'results', 'methodologies', 'introduction']
pat = re.compile(r'(%s)' % '|'.join(kw), re.IGNORECASE)

sp = [x for x  in re.split(pat, text) if x]
print(sp)
dic = {k:v for k,v in zip(sp[0::2],sp[1::2])}

print(dic)


# {'\n': 'introduction',
#  'Methodologies': ' This is the SECOND part.\n',
#  ' This is the FIRST part.\n': 'introduction', 
#  'results': ' This is the THIRD part.\n'}

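You can see that the list is shifted because of the leading \n character, and the dictionary is corrupted. I therefore suggest using a hard slice that drops the first element: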

out = re.split(pat, text)
lead = out[0:1]  # Keep the lead available in case it is needed
sp = out[1:]

print(sp)
dic = {k:v for k,v in zip(sp[0::2],sp[1::2])}

print(dic)

# {'introduction': '',
#  'Methodologies': ' This is the SECOND part.\n',
#  'results': ' This is the THIRD part.\n'}
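Note that in the output above the body of the first "introduction" has been overwritten by the empty body of the second one, because dictionary keys are unique. A possible refinement, assuming repeated headings should be merged rather than overwritten and that keys may be lower-cased, is sketched below; the helper name `pair_sections` is hypothetical.

```python
import re

def pair_sections(text, keywords):
    """Return (lead, sections): the text before the first heading, and merged bodies per heading."""
    pat = re.compile(r'(%s)' % '|'.join(map(re.escape, keywords)), re.IGNORECASE)
    parts = pat.split(text)
    # With one capturing group, parts is [lead, heading, body, heading, body, ...].
    lead, rest = parts[0], parts[1:]
    sections = {}
    for title, body in zip(rest[0::2], rest[1::2]):
        key = title.lower()
        # Append instead of assigning, so a repeated heading does not erase earlier text.
        sections[key] = sections.get(key, '') + body
    return lead, sections

text = '''
introduction This is the FIRST part.
introductionMethodologies This is the SECOND part.
results This is the THIRD part.
'''

kw = ['methodology', 'results', 'methodologies', 'introduction']
lead, dic = pair_sections(text, kw)
print(dic)
```

Here the empty pieces are kept rather than filtered out, so headings and bodies stay aligned even when two headings are adjacent.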

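Sounds like an interesting problem: start coding. If you run into problems, come back with code and we may be able to help. SO is about fixing your code, not implementing your ideas. Please re-read, and if you have a problem, provide your code. If you get errors, copy and paste the error message verbatim into your question. Avoid screenshots; we cannot copy and paste from those to fix your code.

Hi Patrick, thank you very much for the advice. I have uploaded my code.

This is called - look for questions and articles on the topic. If you have a specific coding problem, come back. Also, arxiv.org might be a good place to start.

Quite interesting that you just imported finditer and left it there...

Thanks for the hint. I started from the example above. Removed it :)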