Stanford NLP: For TokensRegex, do rules need to be of type tokens in order to use Annotate?

stanford-nlp

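I'm looking at some old TokensRegex code, and I'm facing a situation where some characters are not being tokenized by the PTBTokenizer. I'm specifically concerned with currency symbols. So, for example, ₱ will not be a token, whereas some others, such as $, would.

I wanted to try writing a rule of type text, rather than type tokens, to capture this symbol in a capture group and then do something like

Annotate($0, ner, "MONEY")

to capture a string such as ₱240.

When I try this, I get: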

... 49 more
Caused by: java.lang.ClassCastException: edu.stanford.nlp.ling.tokensregex.TokenSequencePattern cannot be cast to java.lang.String
    at edu.stanford.nlp.ling.tokensregex.SequenceMatchRules$TextPatternExtractRuleCreator.create(SequenceMatchRules.java:666)
    at edu.stanford.nlp.ling.tokensregex.SequenceMatchRules.createExtractionRule(SequenceMatchRules.java:331)
    at edu.stanford.nlp.ling.tokensregex.SequenceMatchRules.createRule(SequenceMatchRules.java:321)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.Rule(TokenSequenceParser.java:141)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.RuleList(TokenSequenceParser.java:125)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.updateExpressionExtractor(TokenSequenceParser.java:32)
    at edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor.createExtractorFromFiles(CoreMapExpressionExtractor.java:292)
    ... 52 more

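So is there a way to do the above, creating a MONEY ner annotation, when the tokenizer is missing the currency symbol?

Examples

A text rule attempting to do what I want (create an annotation called CURRENCY for a string containing a peso currency value)

A tokens rule that successfully does what I want (because the yen sign is a recognized token). This creates a CURRENCY annotation for a yen currency string: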

ENV.defaults["ruleType"] = "tokens"
ENV.defaults["matchWithResults"] = TRUE

# Set default string pattern flags (to case-insensitive)
ENV.defaultStringPatternFlags = 2

ENV.defaults["stage"] = 0

# Ex: ¥3000
{   
pattern:  ([{ word: "¥" }] $NUMBER_COMMA_SEP $LARGE_NUMBERS?),
action: (Annotate($0, ner, "CURRENCY"))
}

where ner is defined as follows:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

and then:

$NUMBER_COMMA_SEP = "$NUMBER_NON_CD | ([{ tag: /CD/ } & $VALID_NUMERIC_CHARS] [{ tag: /CD/; word: /,\d+(\.\d+)?/ }]*)"
$LARGE_NUMBERS = "/thousand|million|mil|mn|billion|bil|bn|trillion/"

You need to make sure the tokenizer is not deleting untokenizable characters.

Command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,tokensregex -tokensregex.rules example-rules.txt -props StanfordCoreNLP-spanish.properties -tokenize.options "untokenizable=allKeep" -file example.txt -outputFormat text

example-rules.txt

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: ( /₱/ /[0-9]+/ ) , action: (Annotate($0, ner, "CURRENCY") ) }

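If you run a properly configured tokenizer over text containing that symbol, it will create a separate token for the symbol, and the tokens rule above can then match it.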
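To illustrate why the rule can only fire when the symbol survives tokenization, here is a toy sketch in plain Java. It is not CoreNLP code: `tokenize`, its `keepUnknown` flag, and `findPesoAmount` are hypothetical stand-ins for the tokenizer, the untokenizable option, and the ( /₱/ /[0-9]+/ ) pattern. The peso sign is written as the escape \u20B1 so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

public class UntokenizableDemo {

    // Symbols this toy tokenizer "knows". Anything else that is not a letter,
    // digit, or whitespace is treated as untokenizable (assumption for the demo).
    static final String KNOWN_SYMBOLS = "$\u00A5.,";

    // Splits text into alphanumeric runs and single-character symbol tokens.
    // Unknown symbols are kept only when keepUnknown is true.
    static List<String> tokenize(String text, boolean keepUnknown) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else {
                if (KNOWN_SYMBOLS.indexOf(c) >= 0 || keepUnknown) {
                    tokens.add(String.valueOf(c));
                }
                i++;
            }
        }
        return tokens;
    }

    // Token-level analogue of the rule pattern: a peso-sign token (U+20B1)
    // followed by an all-digit token. Returns the index of the first match, or -1.
    static int findPesoAmount(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals("\u20B1") && tokens.get(i + 1).matches("[0-9]+")) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Cuesta \u20B1240 hoy";
        // Symbol dropped: there is no token for the pattern to match.
        System.out.println(findPesoAmount(tokenize(text, false)));
        // Symbol kept: the pattern matches starting at the peso-sign token.
        System.out.println(findPesoAmount(tokenize(text, true)));
    }
}
```

With the symbol dropped, the token sequence contains only the number, so a token-level pattern has nothing to anchor on; keeping it yields a separate one-character token in front of the number.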

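Please post the full rules file you are trying to use.

I don't think I can do that, but I will include more examples in the question above.

Thanks, that's what I figured: no token, no ner annotation generated. I will try the allKeep option (and any other options above that I may not be using) when I run tokensregex with my rules file, and report back. By the way, I checked and saw that I was previously using noneDelete for the untokenizable option.

Yes, that works well, thank you. Since this is part of a pipeline, I'll have to analyze what the tradeoffs are. I think there was a reason, long ago, that we wanted to throw away the untokenizables.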
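If changing the tokenizer options for the whole pipeline turns out to be too costly, one possible alternative is to scan the raw text for the currency string at the character level, outside TokensRegex. The following is only a sketch using java.util.regex under that assumption: the class `CurrencyPreScan`, the `Span` type, and the number format in the pattern are hypothetical, and mapping the resulting character offsets back onto pipeline annotations is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CurrencyPreScan {

    // A match in the raw text, as [begin, end) character offsets.
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    // Peso sign (U+20B1), optional whitespace, digits with optional
    // thousands separators and an optional decimal part (assumed format).
    static final Pattern PESO =
        Pattern.compile("\u20B1\\s?[0-9]+(?:,[0-9]{3})*(?:\\.[0-9]+)?");

    // Returns every peso amount found in the raw, untokenized text.
    static List<Span> findPesoAmounts(String rawText) {
        List<Span> spans = new ArrayList<>();
        Matcher m = PESO.matcher(rawText);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end(), m.group()));
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : findPesoAmounts("Pago \u20B1240 y \u20B11,500.50.")) {
            System.out.println(s.begin + "-" + s.end);
        }
    }
}
```

Because this works on the raw string, it does not depend on how the tokenizer treats the symbol, at the price of maintaining a second matching step alongside the rules file.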
