How do I set a whitespace tokenizer on an NER model?

machine-learning nlp stanford-nlp


I am using CoreNLP 3.6.0 to create a custom NER model.

My properties file is:

# location of the training file 
trainFile = /home/damiano/stanford-ner.tsv 
# location where you would like to save (serialize) your 
# classifier; adding .gz at the end automatically gzips the file, 
# making it smaller, and faster to load 
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that 
# the word is in column 0 and the correct answer is in column 1 
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features 
# apply at most to a class pair of previous class and current class 
# or current class and next class. 
maxLeft=1

# these are the features we'd like to train with 
# some are discussed below, the rest can be 
# understood by looking at NERFeatureFactory 
useClassFeature=true 
useWord=true 
# word character ngrams will be included up to length 6 as prefixes 
# and suffixes only  
useNGrams=true 
noMidNGrams=true 
maxNGramLeng=6 
usePrev=true 
useNext=true 
useDisjunctive=true 
useSequences=true 
usePrevSequences=true 
# the last 4 properties deal with word shape features 
useTypeSeqs=true 
useTypeSeqs2=true 
useTypeySequences=true 
wordShape=chris2useLC

I build the model with the following command:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop /home/damiano/stanford-ner.prop

The problem appears when I use this model to retrieve entities from a text file. The command is:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt

where file.txt is:

Hello!
my
name
is
John.

The output is:

Hello/O !/O my/O name/O is/O John/PERSON ./O

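As you can see, it splits "Hello!" into two tokens. The same happens with "John."

I have to use a whitespace tokenizer.

How can I set it?

Why does CoreNLP split these words into two tokens?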
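To make the difference in the question concrete, here is a minimal Python sketch that contrasts plain whitespace splitting with a punctuation-aware split. The regular expression is only a rough stand-in chosen for illustration; it is not CoreNLP's tokenizer and does not reproduce its rules.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


def punctuation_aware_tokenize(text):
    """Rough stand-in for a punctuation-splitting tokenizer (NOT CoreNLP's rules):
    runs of word characters and single punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hello!\nmy\nname\nis\nJohn.\n"

print(whitespace_tokenize(text))
# ['Hello!', 'my', 'name', 'is', 'John.']
print(punctuation_aware_tokenize(text))
# ['Hello', '!', 'my', 'name', 'is', 'John', '.']
```

The second list has the same shape as the output shown in the question: "Hello!" and "John." each become two tokens.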

Upd. If you want to use a whitespace tokenizer here, simply add

tokenize.whitespace=true

to your properties file.

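However, regarding your second question, "Why does CoreNLP split these words into two tokens?", I would suggest keeping the default tokenizer (PTBTokenizer), because it simply gives better results. The usual reasons for switching to whitespace tokenization are a strong need for processing speed or (usually "and") a low need for tokenization quality. Since you plan to use the output for further processing, I doubt that this is your situation.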

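Even in your example, if tokenization leaves you with the token "John." (period attached), it cannot be matched by a gazetteer or by the training examples.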
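The gazetteer point can be sketched in a few lines of Python. The gazetteer contents and the helper names below are hypothetical and exist only for illustration; real gazetteer features in Stanford NER are more involved than an exact set lookup.

```python
import re

# Hypothetical gazetteer of person names, for illustration only.
GAZETTEER = {"John", "Mary"}


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached."""
    return text.split()


def punctuation_aware_tokenize(text):
    """Rough stand-in for a punctuation-splitting tokenizer (not CoreNLP's)."""
    return re.findall(r"\w+|[^\w\s]", text)


def gazetteer_hits(tokens, gazetteer):
    """Return the tokens that match a gazetteer entry exactly."""
    return [token for token in tokens if token in gazetteer]


sentence = "my name is John."

print(gazetteer_hits(whitespace_tokenize(sentence), GAZETTEER))
# []  ("John." with the period attached is not in the gazetteer)
print(gazetteer_hits(punctuation_aware_tokenize(sentence), GAZETTEER))
# ['John']
```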

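Tokenization is not as simple as it looks, and there are more details and reasons behind that than this one example shows.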

You can set your own tokenizer by giving its class name to the tokenizerFactory flag/property:

tokenizerFactory=edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory

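You can specify any class that implements the Tokenizer interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For example, if you also specify: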

tokenizerOptions=tokenizeNLs=true

then the newlines in your input are preserved (for output options that do not always convert things to a one-token-per-line format).
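As a sketch of how the two properties above might be passed as command-line flags to the classification command from the question, the Python helper below assembles the argument list and prints a shell-safe version of it. The helper name build_ner_command is hypothetical, and the sketch assumes each property can be given as a "-name value" flag, as the "flag/property" wording suggests. The quoting step matters because the class name contains a "$", which a shell would otherwise try to expand.

```python
import shlex

# Class name taken from the answer above; the "$" separates the nested class.
WHITESPACE_FACTORY = (
    "edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory"
)


def build_ner_command(model, text_file, tokenizer_options=None,
                      classpath="stanford-ner.jar:lib/*"):
    """Build the argument list for running CRFClassifier with a whitespace tokenizer.

    Hypothetical helper: it only assembles strings and does not run Java.
    """
    argv = [
        "java", "-classpath", classpath,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model,
        "-tokenizerFactory", WHITESPACE_FACTORY,
    ]
    if tokenizer_options:
        argv += ["-tokenizerOptions", tokenizer_options]
    argv += ["-textFile", text_file]
    return argv


argv = build_ner_command("ner-model.ser.gz", "file.txt", "tokenizeNLs=true")

# shlex.quote wraps arguments containing "$" or "*" in single quotes,
# so the printed line can be pasted into a POSIX shell unchanged.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the list directly to subprocess.run(argv) would avoid shell quoting altogether, since no shell is involved in that case.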

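Notes:

Options such as tokenize.whitespace=true apply at the CoreNLP level. If you give them to an individual component (such as CRFClassifier), they are not interpreted (you get a warning saying that the option was ignored).
As Nikita Astrakhantsev points out, this is not necessarily a good thing to do. Doing it at test time is only correct if your training data is also whitespace-separated; otherwise it will hurt performance. Moreover, tokens obtained by splitting on whitespace are bad for subsequent NLP processing such as parsing.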
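One rough way to see the train/test mismatch described in the second note is to measure how many test-time tokens never occur in column 0 of the training file. The Python sketch below uses a tiny hypothetical training sample in the word<TAB>answer layout from the question; the PERSON label, the sample data, and the helper names are assumptions for illustration, and an unseen-token rate is only a crude proxy for the actual effect on the model.

```python
import re


def read_training_words(tsv_text):
    """Collect column 0 (the word) from a word<TAB>answer training file."""
    words = set()
    for line in tsv_text.splitlines():
        if line.strip():  # blank lines separate sentences or documents
            words.add(line.split("\t")[0])
    return words


def unseen_rate(tokens, vocabulary):
    """Fraction of tokens that never appear in the training vocabulary."""
    if not tokens:
        return 0.0
    unseen = [token for token in tokens if token not in vocabulary]
    return len(unseen) / len(tokens)


# Hypothetical training data in which punctuation is a separate token.
training_tsv = "Hello\tO\n!\tO\nmy\tO\nname\tO\nis\tO\nJohn\tPERSON\n.\tO\n"
vocabulary = read_training_words(training_tsv)

test_text = "Hello! my name is John."
whitespace_tokens = test_text.split()
split_tokens = re.findall(r"\w+|[^\w\s]", test_text)  # rough stand-in, not CoreNLP

print(unseen_rate(whitespace_tokens, vocabulary))  # 0.4 ("Hello!" and "John." are unseen)
print(unseen_rate(split_tokens, vocabulary))       # 0.0
```

If the training file itself had been whitespace-tokenized, the comparison would flip, which is the consistency requirement the note describes.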
I spent an hour on this. Thank you, Chris.