Extracting noun phrases with Stanza and CoreNLPClient in Python


I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done with the CoreNLPClient module in Stanza.

# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')
Below is an example sentence, on which I use the client's tregex function to get all noun phrases. tregex returns a dict of dicts in Python, so I need to process the output of tregex before passing it to NLTK's Tree.fromstring function in order to correctly extract the noun phrases as strings.

pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
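For reference, the response tregex returns is shaped roughly like the mock below. This is a sketch reduced to the two fields read in this post ('match' and 'spanString'); the real response carries additional fields per match:

```python
# A mocked-up tregex response: one dict per input sentence, keyed by
# match id; each match holds a bracketed parse and a plain span string.
mock_matches = {
    'sentences': [
        {
            '0': {'match': '(NP (NNP Albert) (NNP Einstein))\n',
                  'spanString': 'Albert Einstein'},
            '1': {'match': '(NP (DT a) (JJ German-born) (JJ theoretical) (NN physicist))\n',
                  'spanString': 'a German-born theoretical physicist'},
        },
        {
            '0': {'match': '(NP (PRP He))\n', 'spanString': 'He'},
        },
    ]
}

# Flattening out just the span strings:
spans = [m['spanString']
         for sentence in mock_matches['sentences']
         for m in sentence.values()]
print(spans)
```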
So I came up with the method stanza_phrases, which has to loop through the dict of dicts that tregex outputs and format it correctly for NLTK's Tree.fromstring:
def stanza_phrases(matches):
    Nps = []
    # matches['sentences'] is a list with one dict per sentence,
    # keyed by match id
    for sentence in matches['sentences']:
        for match_id, match in sentence.items():
            # wrap the bracketed parse so Tree.fromstring accepts it
            s = '(ROOT\n' + match['match'] + ')'
            Nps.extend(extract_phrase(s, pattern))
    return set(Nps)
This generates the tree that NLTK uses:

from nltk.tree import Tree

def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                # join the leaf tokens back into a single phrase string
                phrases.append(' '.join(subtree.leaves()))
    return phrases
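As a side note, what Tree.fromstring(...).leaves() recovers here can also be sketched with the standard library alone. This is a minimal sketch, assuming a well-formed Penn-style bracketed parse with no empty constituents:

```python
import re

def leaves(bracketed):
    # In Penn-style bracketing such as "(NP (NNP Albert) (NNP Einstein))",
    # every leaf token is immediately followed by a closing parenthesis,
    # so a regex can recover the tokens in order; category labels are
    # followed by whitespace and are skipped.
    return re.findall(r'([^\s()]+)\)', bracketed)

print(' '.join(leaves('(NP (DT the) (NN theory) (PP (IN of) (NP (NN relativity))))')))
```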
Here is my output:

{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity',  'the theory', 'the theory of relativity'}

Is there a way to make this code more efficient, with fewer lines (in particular the stanza_phrases and extract_phrase methods)?

This should work with Stanford CoreNLP 4.0.0 and Stanza 1.0.1:
from stanza.server import CoreNLPClient

# get noun phrases with tregex
def noun_phrases(_client, _text, _annotators=None):
    pattern = 'NP'
    matches = _client.tregex(_text,pattern,annotators=_annotators)
    print("\n".join(["\t"+sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))

# English example
with CoreNLPClient(timeout=30000, memory='16G') as client:
    englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print('---')
    print(englishText)
    noun_phrases(client,englishText,_annotators="tokenize,ssplit,pos,lemma,parse")

# French example
with CoreNLPClient(properties='french', timeout=30000, memory='16G') as client:
    frenchText = "Je suis John."
    print('---')
    print(frenchText)
    noun_phrases(client,frenchText,_annotators="tokenize,ssplit,mwt,pos,lemma,parse")
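The comprehension inside noun_phrases above does all the flattening in one pass, using the 'spanString' field instead of re-parsing the bracketed trees with NLTK. With a mocked response of the same shape (field names as in the code above), it behaves like this:

```python
# Mocked tregex response with the shape noun_phrases expects:
matches = {'sentences': [
    {'0': {'spanString': 'Albert Einstein'},
     '1': {'spanString': 'a German-born theoretical physicist'}},
    {'0': {'spanString': 'He'}},
]}

# Same flattening as in noun_phrases: one tab-indented line per match.
lines = ["\t" + sentence[match_id]['spanString']
         for sentence in matches['sentences']
         for match_id in sentence]
print("\n".join(lines))
```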