Gender identification in natural language processing


nlp stanford-nlp


I wrote the following code using the Stanford NLP package:

GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);

But for the sentence "Annie goes to school", it fails to determine Annie's gender.

The application's output is:

     [Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON] 
     [Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O] 
     [Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O] 
     [Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O] 
     [Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]

What is the correct way to get the gender?

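If your named entity recognizer outputs PERSON for a token, you can use (or build, if you don't have one) a gender classifier based on first names. For an example, see the relevant section of the NLTK library's tutorial pages. They use the following features: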

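The last letter of the name
The first letter of the name
The length of the name (number of characters)
Character unigram presence (a boolean per character: whether it occurs in the name)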
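The four feature groups listed above can be sketched as a small feature extractor. This is only an illustration, not the NLTK tutorial's own code: the function name `name_features` and the feature keys are made up here, and the sketch assumes a non-empty name written in ASCII letters.

```python
import string


def name_features(name):
    """Build the four feature groups listed above for one first name.

    Hypothetical helper for illustration; assumes `name` is non-empty.
    """
    lowered = name.lower()
    features = {
        "last_letter": lowered[-1],   # last letter of the name
        "first_letter": lowered[0],   # first letter of the name
        "length": len(lowered),       # number of characters
    }
    # Character unigram presence: one boolean per letter of the alphabet.
    for letter in string.ascii_lowercase:
        features["has(%s)" % letter] = letter in lowered
    return features


feats = name_features("Annie")
print(feats["last_letter"], feats["first_letter"], feats["length"])
```

The result is a plain dictionary, so it can be inspected directly before being handed to whatever classifier you choose.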

I have a hunch, though, that using character n-gram frequencies (possibly up to character trigrams) will give you pretty good results.
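As a rough sketch of what character n-gram frequency features could look like, the following counts every unigram, bigram, and trigram in a name. The function name `char_ngram_counts` and the `^`/`$` boundary markers are assumptions made for this example (the markers let prefixes and suffixes show up as distinct n-grams); nothing here comes from a specific library.

```python
from collections import Counter


def char_ngram_counts(name, max_n=3):
    """Count character n-grams of length 1..max_n in a name.

    Hypothetical helper: '^' and '$' mark the start and end of the name.
    """
    padded = "^" + name.lower() + "$"
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the padded name.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


counts = char_ngram_counts("Anna")
print(counts["a"], counts["nn"], counts["na$"])
```

Whether raw counts, binary presence, or normalized frequencies work best is something to check on your own data.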

The gender annotator doesn't add its information to the text output, but you can still access it through code, as the following snippet shows:

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Annie goes to school");
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
  for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(token.value());
    System.out.print(", Gender: ");
    System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
  }
}

Output:

Annie, Gender: FEMALE
goes, Gender: null
to, Gender: null
school, Gender: null

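There are many approaches. Basically, you can build a classifier that extracts some features from the name (the first and last letters, the first two and last two letters, and so on) and makes a prediction based on those features.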

import nltk
import random

def extract_features(name):
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2]
    }

f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
random.shuffle(all_names)

test_set = all_names[500:]
train_set= all_names[:500]

test_set_feat = [(extract_features(n), g) for n, g in test_set]
train_set_feat= [(extract_features(n), g) for n, g in train_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

print(nltk.classify.accuracy(classifier, test_set_feat))

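This basic test gives an accuracy of about 77%.

Although the previous answer is close to what is expected, it doesn't seem to work with the current version of Stanford NLP.

An updated, working version of that snippet is shown below: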

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation("Annie goes to school");

pipeline.annotate(document);

for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
  for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(token.value());
    System.out.print(", Gender: ");
    System.out.println(token.get(CoreAnnotations.GenderAnnotation.class));
  }
}

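I put a "#" in front of each of the five features, e.g. "# 'last_char': name[-1]", so no features should be extracted at all, yet running the code still gives an accuracy of 62-63%. Why does a prediction with no features do so much better than a coin flip (50%)?

@KubiK888 The reason is probably that the dataset is imbalanced: about 63% of the names belong to one gender (in the NLTK names corpus the larger class is the female list). With no features to go on, NaiveBayes learns only the class priors and decides that the best approach is to always pick the majority class.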
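The effect discussed in this exchange can be reproduced without any classifier: always predicting the most frequent training label scores exactly that label's share of the test set. The sketch below uses made-up labels "A" and "B" in a 63/37 split as a stand-in. It is not the real corpus, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Hypothetical labels with a 63/37 class split (illustrative data only).
labels = ["A"] * 63 + ["B"] * 37
label, accuracy = majority_baseline_accuracy(labels, labels)
print(label, accuracy)
```

This majority-class figure, rather than 50%, is the baseline a feature-based classifier needs to beat on an imbalanced dataset.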