Identifying in-text citations (APA, MLA, Harvard, Vancouver, etc.) with Python


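I am trying to identify every sentence that contains an in-text citation in journal articles in PDF format. I converted the .pdf to .txt and want to find all sentences that contain a citation, which may take one of the following forms: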

Smith (1990) stated that ...

An agreement was reached on ... (Smith, 1990).

An agreement was reached on ... (April, 2005; Smith, 1990).

A mix of the above

I first tokenized the text into sentences:

import nltk
from nltk.tokenize import sent_tokenize
# text holds the contents of the converted .txt file
ss = sent_tokenize(text)

This produces a list (checked with type(ss)), so I converted the list to a str in order to use re.findall:

def listtostring(s):
    str1 = ' '
    return str1.join(s)
ee = listtostring(ss)

My idea was then to identify the sentences that contain a four-digit number:

import re
for sentence in ee:
    zz = re.findall(r'\d{4}', ee)
    if zz:
        print (zz)

However, this only extracts the years, not the sentences that contain them.

import re
l = ['This is 1234','Hello','Also 1234']

for sentence in l:
    if re.findall(r'\d{4}',sentence):
        print(sentence)

Output

This is 1234
Also 1234
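The same loop can be applied directly to the list returned by sent_tokenize, without joining it into one string first. In the question's version, `for sentence in ee` iterates over the characters of the joined string and searches the whole string on every pass, which is why only the years are printed. A minimal sketch, assuming a small hand-written list in place of the real `ss`; the name `sentences_with_year` is made up for illustration:

```python
import re

# Hypothetical sample standing in for the list returned by sent_tokenize(text).
ss = [
    "Smith (1990) stated that the method works.",
    "No citation appears in this sentence.",
    "An agreement was reached (April, 2005; Smith, 1990).",
]

# Same four-digit pattern as in the question.
YEAR = re.compile(r"\d{4}")


def sentences_with_year(sentences):
    """Return the sentences that contain at least one four-digit number."""
    return [s for s in sentences if YEAR.search(s)]


for sentence in sentences_with_year(ss):
    print(sentence)
```

Keeping the sentences as a list means each match can be traced back to the sentence it came from, so the join step in the question is not needed for this purpose.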

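Using regex, something that should have decent recall while trying to avoid inappropriate matches (\d{4} on its own might give you a few) is

\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)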

A Python example (using spaCy instead of NLTK) would then be:

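import re
import spacy

# Load the small English pipeline; it must be installed beforehand
nlp = spacy.load('en_core_web_sm')

doc = nlp("One statement. Then according to (Smith, 1990) everything will be all right. Or maybe not.")

# Split the document into sentences
l = [sent.text for sent in doc.sents]

for sentence in l:
    # Keep sentences with a 19xx or 20xx year inside parentheses
    if re.findall(r'\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)', sentence):
        print(sentence)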
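To see how the parenthesised-year pattern differs from a bare \d{4}, the two can be run side by side on a few sentences. This is only a sketch: the sample sentences and the helper name `matches` are invented for illustration, and plain `re` is used so that it runs without spaCy.

```python
import re

# The parenthesised-year pattern from the example above, and the bare
# four-digit pattern from the question, for comparison.
PAREN_YEAR = re.compile(r'\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)')
BARE_YEAR = re.compile(r'\d{4}')

# Hypothetical sentences chosen to show where the two patterns differ.
samples = [
    "Smith (1990) stated that the method works.",
    "An agreement was reached (April, 2005; Smith, 1990).",
    "The survey collected 1234 responses in 2019.",
    "Nothing to see here.",
]


def matches(pattern, sentences):
    """Return the sentences in which the compiled pattern is found."""
    return [s for s in sentences if pattern.search(s)]


print(matches(PAREN_YEAR, samples))  # first two sentences only
print(matches(BARE_YEAR, samples))   # first three sentences
```

The parenthesised pattern skips the survey sentence, but it is still a heuristic: a parenthetical such as (n = 2000) would also match, and a citation with a year before 1900 would be missed.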
