Python: Creating an information content corpus for use with WordNet from a custom dump


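I am using the Brown corpus file ic-brown.dat to compute the information content of words with NLTK's WordNet library, but the results are not good. How can I build my own custom .dat information content file?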

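In (…)/nltk_data/corpora/wordnet_ic/ you will find IC-compute.sh, which contains calls to several Perl scripts that generate the IC .dat files from a given corpus. I found the instructions complicated and did not have the required Perl scripts, so I decided to write a Python script instead, based on an analysis of the .dat file structure and of the wordnet.ic() function.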

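You can compute your own IC counts by calling the wordnet.ic() function on a corpus reader object. In fact, all you need is an object with a words() method that returns all the words in the corpus. For more details, see the ic function in the file ..../nltk/corpus/reader/wordnet.py (lines 1729 to 1789).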
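To illustrate the point that only a words() method is needed, here is a minimal sketch. The class name TokenListCorpus and the helper compute_ic are hypothetical names introduced for illustration; the call to wn.ic() assumes that NLTK and its WordNet data are installed.

```python
class TokenListCorpus:
    """Hypothetical stand-in for a corpus reader: it only exposes words()."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def words(self):
        # wordnet.ic() only iterates over whatever this method returns
        return self._tokens


def compute_ic(corpus, smoothing=1.0):
    """Count IC over any object with a words() method (needs NLTK's WordNet data)."""
    # Imported here so the class above can be tried without NLTK installed
    from nltk.corpus import wordnet as wn

    return wn.ic(corpus, weight_senses_equally=False, smoothing=smoothing)


corpus = TokenListCorpus("the dog chased the cat across the garden".split())
print(corpus.words()[:3])
# my_ic = compute_ic(corpus)  # uncomment once the WordNet data is available
```

A list of tokens taken from a custom dump can be wrapped this way without writing a full corpus reader.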

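For example, the counts can be computed this way for the XML version of the BNC corpus (2007); they are the bnc_ic object used below.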
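A sketch of what such a call might look like, under stated assumptions: the helper name build_bnc_ic, the directory passed as bnc_root, and the file pattern are all illustrative and depend on where and how the BNC XML files are unpacked on your machine.

```python
def build_bnc_ic(bnc_root, fileids=r"[A-K]/\w*/\w*\.xml", smoothing=1.0):
    """Compute IC counts over the BNC XML texts found under bnc_root.

    bnc_root and fileids are assumptions about the local corpus layout.
    """
    # Imported here so the helper can be defined before NLTK data is set up
    from nltk.corpus import wordnet as wn
    from nltk.corpus.reader import BNCCorpusReader

    reader = BNCCorpusReader(root=bnc_root, fileids=fileids)
    # wn.ic() walks reader.words() and accumulates noun and verb counts
    return wn.ic(reader, weight_senses_equally=False, smoothing=smoothing)


# Hypothetical location; replace with the real path to the BNC "Texts" folder
# bnc_ic = build_bnc_ic("/path/to/BNC/Texts/")
```

Counting over the whole BNC can take a long time, so it may be worth trying a narrower file pattern first.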

To generate the .dat file, I wrote the following functions:

def is_root(synset_x):
    # A synset is a root when it is its own (first) root hypernym
    return synset_x.root_hypernyms()[0] == synset_x

def generate_ic_file(IC, output_filename):
    """Dump in output_filename the IC counts.
    The expected format of IC is a dict 
    {'v':defaultdict, 'n':defaultdict, 'a':defaultdict, 'r':defaultdict}"""
    with codecs.open(output_filename, 'w', encoding='utf-8') as fid:
        # Hash code of WordNet 3.0
        fid.write("wnver::eOS9lXC6GvMWznF1wkZofDdtbBU"+"\n")

        # Only verbs and nouns are written, because those are the only POS tags
        # that the wordnet.ic() function supports
        for tag_type in ['v', 'n']:
            for key, value in IC[tag_type].items():
                if key != 0:
                    synset_x = wn.of2ss(of="{:08d}".format(key)+tag_type)
                    if is_root(synset_x):
                        fid.write(str(key)+tag_type+" "+str(value)+" ROOT\n")
                    else:
                        fid.write(str(key)+tag_type+" "+str(value)+"\n")
    print("Done")

generate_ic_file(bnc_ic, "../custom.dat")

Then simply load the file with wordnet_ic.ic():

custom_ic = wordnet_ic.ic('../custom.dat')

The required imports are:

import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
import codecs
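To sanity-check a generated file without going through NLTK, the sketch below parses the same line layout that generate_ic_file writes (offset plus POS letter, a count, and an optional ROOT marker) and computes information content as -log(count / root total). This is my reading of how the file is interpreted, not a copy of NLTK's loader; the function names and the sample counts are made up for illustration.

```python
import math
from collections import defaultdict


def load_ic_counts(lines):
    """Parse IC .dat lines into {'n': {...}, 'v': {...}}; key 0 holds the root total."""
    counts = {"n": defaultdict(float), "v": defaultdict(float)}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("wnver"):
            continue  # skip blank lines and the WordNet version header
        fields = line.split()
        offset, pos = int(fields[0][:-1]), fields[0][-1]
        value = float(fields[1])
        if len(fields) == 3 and fields[2] == "ROOT":
            counts[pos][0] += value  # root counts add up to the normalising total
        if value != 0:
            counts[pos][offset] = value
    return counts


def information_content(counts, pos, offset):
    """Return -log(count / root total), or infinity for an unseen synset."""
    count = counts[pos].get(offset, 0.0)
    if count == 0:
        return math.inf
    return -math.log(count / counts[pos][0])


# Made-up counts, using the same layout as the generated file
sample = [
    "wnver::eOS9lXC6GvMWznF1wkZofDdtbBU",
    "1740n 100.0 ROOT",
    "1930n 25.0",
]
sample_counts = load_ic_counts(sample)
print(information_content(sample_counts, "n", 1930))
```

To check a real file, pass an open file handle instead of the sample list, for example load_ic_counts(open('../custom.dat', encoding='utf-8')).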
