Python spaCy: look behind for a noun chunk (before a reference)


I'm using spaCy for an NLP project. When you create a doc with spaCy, you can find the noun chunks (also known as "noun phrases") in the text like this:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"The companies building cars do not want to spend more money in improving diesel engines because the government will not subsidise such engines anymore.")
for chunk in doc.noun_chunks:
    print(chunk.text)

This prints the list of noun phrases.

For example, in this case the first noun phrase is "The companies".

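Suppose you have a text in which noun chunks are followed by a numeric reference, like:

the Window (23) is closed because the wall (34) of the beautiful building (45) is not covered by the insurance (45)

Suppose I have code that identifies the references, for example by tagging them:

myprocessedtext=the Window <ref>(23)</ref> is closed because the wall <ref>(34)</ref> of the beautiful building <ref>(45)</ref> is not covered by the insurance <ref>(45)</ref>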

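How can I get the noun chunk (noun phrase) immediately preceding each reference?

My idea: pass the 10 words before each reference to a spaCy doc object, extract the noun chunks, and take the last one. This is very inefficient, because creating doc objects is time-consuming.

Are there any other ideas that don't require creating extra nlp objects?

Thanks.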

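You can analyse the whole document once and then find the noun chunk before each reference by token position or by character offset. The token offset of the last token in a noun chunk is noun_chunk[-1].i, and the character offset of the start of that last token is noun_chunk[-1].idx. (Check that the analysis is not affected by the reference strings; references in the style of your example, (1), seem to be analysed as appositives, which is fine.)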
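A minimal sketch of the character-offset variant described above, assuming the references are left in the text and look like "(23)". The names REF_PATTERN, noun_chunk_offsets and chunk_before_each_ref are hypothetical helpers, not part of spaCy; only doc.noun_chunks, chunk[-1].idx and chunk.text are spaCy attributes.

```python
import re
from bisect import bisect_left

# Assumed reference format: a number in parentheses, e.g. "(23)".
REF_PATTERN = re.compile(r"\(\d+\)")


def noun_chunk_offsets(doc):
    """Collect (start offset of the chunk's last token, chunk text) pairs.

    `doc` is a spaCy Doc produced by a single nlp(text) call.
    """
    return [(chunk[-1].idx, chunk.text) for chunk in doc.noun_chunks]


def chunk_before_each_ref(text, chunks):
    """Pair each reference in `text` with the noun chunk preceding it.

    `chunks` is the list returned by noun_chunk_offsets, in document order.
    A reference with no chunk before it is paired with None.
    """
    starts = [idx for idx, _ in chunks]
    pairs = []
    for match in REF_PATTERN.finditer(text):
        # Number of chunks whose last token starts before the reference.
        pos = bisect_left(starts, match.start())
        pairs.append((match.group(), chunks[pos - 1][1] if pos else None))
    return pairs
```

With the setup from the question, this would be called as chunk_before_each_ref(text, noun_chunk_offsets(nlp(text))), so the text is parsed only once. If the parser absorbs a reference into a noun chunk, that chunk's last token no longer starts before the reference and the lookup falls back to an earlier chunk, which is why the check mentioned above matters.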

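If the analysis is affected by the reference strings, remove them from the document while keeping track of their character offsets, analyse the whole document, and then find the noun chunk before each saved position.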
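The removal approach could be sketched roughly as follows, assuming the <ref>(N)</ref> markup from the question. The helpers strip_references, chunk_ending_before and noun_chunks_before_refs are hypothetical names introduced for illustration; from spaCy, only doc.noun_chunks, chunk.end_char and chunk.text are used.

```python
import re
from bisect import bisect_right

# Assumed markup from the question: "<ref>(23)</ref>", plus any whitespace before it.
REF_PATTERN = re.compile(r"\s*<ref>\((\d+)\)</ref>")


def strip_references(text):
    """Remove the references and remember where each one was.

    Returns (clean_text, refs), where refs is a list of
    (character offset in clean_text, reference id) pairs.
    """
    pieces, refs = [], []
    clean_len = 0
    last = 0
    for match in REF_PATTERN.finditer(text):
        piece = text[last:match.start()]
        pieces.append(piece)
        clean_len += len(piece)
        refs.append((clean_len, match.group(1)))
        last = match.end()
    pieces.append(text[last:])
    return "".join(pieces), refs


def chunk_ending_before(chunk_spans, offset):
    """Return the text of the last chunk that ends at or before `offset`.

    `chunk_spans` is a list of (end_char, text) pairs in document order.
    """
    ends = [end for end, _ in chunk_spans]
    pos = bisect_right(ends, offset)
    return chunk_spans[pos - 1][1] if pos else None


def noun_chunks_before_refs(doc, refs):
    """Pair each saved reference with the noun chunk preceding it.

    `doc` is the spaCy Doc obtained by parsing clean_text once.
    """
    spans = [(chunk.end_char, chunk.text) for chunk in doc.noun_chunks]
    return [(ref_id, chunk_ending_before(spans, offset)) for offset, ref_id in refs]
```

A typical call sequence would be clean_text, refs = strip_references(myprocessedtext), then doc = nlp(clean_text), then noun_chunks_before_refs(doc, refs). The parser never sees the reference strings, and the document is still analysed only once.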

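Thanks @abb, I was thinking of something similar, but you nailed it. Using the positions of all the noun chunks together with the positions of the references gives the noun chunk preceding each reference! Great, thank you.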
