Python: extracting values from sentences matched by the spaCy rule-based Matcher

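I have a custom rule-based match in spaCy, and I am able to match some sentences in a document. I now want to extract some numbers from the matched sentences. However, the matched sentences do not always have the same shape and form. What is the best way to do this?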

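# case 1:
import spacy
from spacy.matcher import Matcher

# Assumption: the post does not name the model; any English pipeline
# with a lemmatizer (needed for the LEMMA attribute) should do.
nlp = spacy.load("en_core_web_sm")

texts = [
    "the surface is 31 sq",
    "the surface is sq 31",
    "the surface is square meters 31",
    "the surface is 31 square meters",
    "the surface is about 31,2 square",
    "the surface is 31 kilograms",
]

# Unit before the number: "surface is sq 31"
pattern = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
]

# Number before the unit: "surface is 31 sq"
# NOTE: "OP" is nested inside the "TEXT" dict here, as in the original post,
# so it does not act as a token-level operator on the unit token.
pattern_1 = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$", "OP": "+"}},
]

matcher = Matcher(nlp.vocab)

# spaCy v2 signature; in spaCy v3 this is matcher.add("Surface", [pattern, pattern_1])
matcher.add("Surface", None, pattern, pattern_1)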

for index, text in enumerate(texts):
    print(f"Case {index}")
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)

My output is:

Case 0
4898162435462687487 Surface 1 5 surface is 31 sq
Case 1
4898162435462687487 Surface 1 5 surface is sq 31
Case 2
4898162435462687487 Surface 1 6 surface is square meters 31
Case 3
4898162435462687487 Surface 1 5 surface is 31 square
Case 4
4898162435462687487 Surface 1 6 surface is about 31,2 square
Case 5

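I want to return only the numbers (the square meters), something like [31, 31, 31, 31, 31.2], rather than the full text. What is the correct way to do this in spaCy?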

Since each match contains a single LIKE_NUM token, you can simply walk the subtree of the matched span and return the first such token:

value = [token for token in span.subtree if token.like_num][0]
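The expression above returns a spaCy token rather than a number, so its text still has to be converted to get values such as 31.2. A minimal sketch of that conversion, assuming the comma in "31,2" is a decimal separator and not a thousands separator; the helper name `to_float` is hypothetical and not part of spaCy:

```python
def to_float(token_text: str) -> float:
    """Convert a number-like token text such as "31" or "31,2" to a float.

    Assumption: a comma is a decimal separator, so "31,2" means 31.2.
    """
    return float(token_text.replace(",", "."))


# Token texts as they appear in the matched spans of the question
token_texts = ["31", "31", "31", "31", "31,2"]
values = [to_float(text) for text in token_texts]
print(values)  # [31.0, 31.0, 31.0, 31.0, 31.2]
```

With spaCy this would be called as `to_float(value.text)`. Note that `like_num` also accepts spelled-out numbers such as "ten", which `float()` rejects, so those would need separate handling.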

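FYI: in this case you can always rely on regex. Even a simple approach would do, such as

re.search(r'[-+]?\d+(?:\.\d+)?', span.text).group()

How do you get the token out of the match?

@DarioB Not sure what you mean.

Another challenge, if you are interested :)

Test: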

results = []
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        results.append([token for token in span.subtree if token.like_num][0])

print(results) # => [31, 31, 31, 31, 31,2]
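The regex route mentioned in the comments can be sketched without spaCy, working directly on the text of each matched span. This is only an illustration under two assumptions: the pattern is extended to accept "," as well as "." before the decimals (the pattern in the comment handles "." only), and the comma is read as a decimal separator. The names `NUMBER_RE` and `first_number` are hypothetical.

```python
import re

# Optional sign, digits, then an optional "." or "," decimal part.
NUMBER_RE = re.compile(r"[-+]?\d+(?:[.,]\d+)?")


def first_number(text: str):
    """Return the first number found in text as a float, or None if absent."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    # Assumption: "," is a decimal separator, so "31,2" means 31.2.
    return float(match.group().replace(",", "."))


# Span texts copied from the matcher output in the question
span_texts = [
    "surface is 31 sq",
    "surface is sq 31",
    "surface is square meters 31",
    "surface is 31 square",
    "surface is about 31,2 square",
]
print([first_number(text) for text in span_texts])
# [31.0, 31.0, 31.0, 31.0, 31.2]
```

Inside the matching loop the same helper would be called as `first_number(span.text)`. Unlike `like_num`, this only recognises digits, so a spelled-out number would return None.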