Python: spaCy rule-based Matcher for measurement units before or after a number

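I'm new to spaCy and I'm trying to match some measurements in text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In other cases the unit has a different name. Here is some code: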

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"

pattern = [
    {"IS_STOP": True}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"}, 
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True}, 
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"} 
]

doc = nlp(text)

matcher = Matcher(nlp.vocab) 

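# spaCy v2 call signature; spaCy v3 expects a list of patterns instead:
# matcher.add("Surface", [pattern])
matcher.add("Surface", None, pattern)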

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

I have two problems:

1 - The pattern should match all of cases 1 to 5, but for case 1 the output is

4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq 

which looks like a duplicate match to me.

2 - Case 6 should not match, but it does match my pattern. Any suggestions on how to improve this?

Edit: Is it possible to build an OR condition inside a pattern? Something like

pattern = [
    {"POS": "DET", "OP": "?"}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    [
      [{"LOWER": "sq", "OP": "?"},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True}]
     OR
      [{"LIKE_NUM": True},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"LOWER": "sq", "OP": "?"} ]
    ]
]

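You cannot use OR like that, but you can define separate patterns for the same label. So you need two patterns: one that matches a number preceded by "sq", "square", "meters", or a combination of these words, and another that matches a number followed by at least one of these words.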

Code snippet:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

texts = ["the surface is 31 sq", "the surface is sq 31", "the surface is square meters 31",
     "the surface is 31 square meters", "the surface is about 31 square meters", "the surface is 31 kilograms"]
pattern1 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
      {"LIKE_NUM": True}
    ]
pattern2 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True},
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
    ]

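matcher = Matcher(nlp.vocab, validate=True)
# Both patterns share the "Surface" label. These are spaCy v2 calls; in spaCy v3
# the equivalent is matcher.add("Surface", [pattern1, pattern2]).
matcher.add("Surface", None, pattern1)
matcher.add("Surface", None, pattern2)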

for text in texts:
  doc = nlp(text)
  matches = matcher(doc)
  for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Output:

4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
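
The {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens that match the regex (because of "OP": "+"):

  • ^ - start of the string (here, the token)
  • (?i: - start of a case-insensitive modifier group:
    • sq(?:uare)? - sq or square
    • | - or
    • m(?:et(?:er|re)s?)? - m, meter, metre, meters or metres
  • ) - end of the group
  • $ - end of the string (here, the token)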
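
To see which token texts the regex above accepts, it can be tried outside spaCy with Python's standard re module. This is only a minimal sketch: the names UNIT_TOKEN and is_unit_token are made up for illustration, and the check is applied to one token string at a time, which mirrors how the "REGEX" operator is applied to a single token's text.

```python
import re

# Same regex as in the token patterns; it is tested against ONE token's text.
# UNIT_TOKEN and is_unit_token are illustrative names, not part of spaCy.
UNIT_TOKEN = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")


def is_unit_token(token_text: str) -> bool:
    """Return True if a single token's text is one of the unit words."""
    return UNIT_TOKEN.match(token_text) is not None


# "square meters" is two tokens, so as a single string it is rejected.
for candidate in ["sq", "Square", "m", "meters", "metres",
                  "kilograms", "square meters"]:
    print(f"{candidate!r}: {is_unit_token(candidate)}")
```

Because the anchors ^ and $ tie the regex to a whole token, multi-word units have to be covered by repeating the token with "OP": "+" rather than by putting a space into the regex.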

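Hi, thanks a lot. I accepted the answer and it works, but something is still unclear. When I try the example, cases 4 and 5 do not show the full "square meters" string, only "square". So I tried removing "meters" from the input string, but as expected it still matches. Is there a way to force a match on the full "square meters" string?

@DarioB Add , "OP": "+" to the regex token in pattern2, the same as in pattern1.

I see, yes, it works. Is there a way to remove duplicate matches, or a good way to handle them? My new output is
4898162435462687487 Surface 0 6 the surface is about 31 square meters
4898162435462687487 Surface 0 6 the surface is about 31 square meters
thanks

@DarioB Since the matches have the same start index, I think this will help here as well.

@DarioB It is just the regular Python re regex. It only matches the token text, not the document or the sentence. Use ^(?:metri|quadrati|m?q)$. The point is that you cannot match multiple tokens with this regex.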
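
For the overlapping matches raised in the comments, one possible way to handle them is to post-process the (match_id, start, end) tuples and keep only the longest match for each label and start index. The sketch below is plain Python and does not call spaCy; the helper name keep_longest_per_start is hypothetical, and the sample tuples are the two case 1 matches quoted in the question.

```python
from typing import Dict, List, Tuple

# (match_id, start, end): the tuple shape the question's loop unpacks.
Match = Tuple[int, int, int]


def keep_longest_per_start(matches: List[Match]) -> List[Match]:
    """Keep only the longest match for each (match_id, start) pair.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    best: Dict[Tuple[int, int], Match] = {}
    for match_id, start, end in matches:
        key = (match_id, start)
        # Replace the stored match only if this one ends later.
        if key not in best or end > best[key][2]:
            best[key] = (match_id, start, end)
    # Return the survivors in document order.
    return sorted(best.values(), key=lambda m: (m[1], m[2]))


# The two overlapping matches reported for "the surface is 31 sq".
matches = [
    (4898162435462687487, 0, 4),
    (4898162435462687487, 0, 5),
]
print(keep_longest_per_start(matches))
# prints [(4898162435462687487, 0, 5)]
```

spaCy also ships spacy.util.filter_spans, which keeps the longest of overlapping Span objects, and in spaCy v3 Matcher.add accepts a greedy="LONGEST" argument; either may be a better fit than a hand-written filter, depending on the installed version.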