Python: Unknown symbols in NLTK POS tagging of Arabic text

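I tagged some Arabic text with NLTK.

However, I ended up with results such as

(u'an Arabic character/word', '``') or (u'an Arabic character/word', ':')

However, the documentation does not describe the `` or : tags.

So I would like to know what they mean.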

import nltk
# PunktWordTokenizer is only available in older NLTK releases
from nltk.tokenize.punkt import PunktWordTokenizer

z = "أنا تسلق شجرة"
tkn = PunktWordTokenizer()
sent = tkn.tokenize(z)
tokens = nltk.pos_tag(sent)

print(tokens)

The default NLTK POS tagger is trained on English text and is meant for processing English text. From the documentation:

An off-the-shelf tagger is available.  It uses the Penn Treebank tagset:

    >>> from nltk.tag import pos_tag  # doctest: +SKIP
    >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
    [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
    'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
    ('.', '.')]
And the code for pos_tag:

from nltk.data import load


# Standard treebank POS tagger
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
def pos_tag(tokens):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag # doctest: +SKIP
        >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)
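
Because the default tagger uses the Penn Treebank tagset, the `` and : tags in the question are punctuation tags from that tagset rather than Arabic-specific labels. The sketch below is a small, hypothetical lookup table (PENN_PUNCT_TAGS and describe_tag are illustrative names, not part of NLTK) covering only a few of the punctuation tags:

```python
# Hypothetical helper, not part of NLTK: a partial table of the
# punctuation tags in the Penn Treebank tagset.
PENN_PUNCT_TAGS = {
    "``": "opening quotation mark",
    "''": "closing quotation mark",
    ":": "mid-sentence punctuation such as a colon, semicolon, or ellipsis",
    ",": "comma",
    ".": "sentence-final punctuation",
}


def describe_tag(tag):
    """Return a short description for a punctuation tag, if it is in the table."""
    return PENN_PUNCT_TAGS.get(tag, "not in this punctuation table")


if __name__ == "__main__":
    for tag in ("``", ":", "NN"):
        print(repr(tag), "->", describe_tag(tag))
```

An English-trained model assigning these tags to Arabic words is a sign that it is guessing on input it was never trained for.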
This worked for me to get the Stanford tools working from Python on Ubuntu 14.04.1:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-01-29.zip
$ unzip stanford-postagger-full-2015-01-29.zip
$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-01-29.zip
$ unzip stanford-segmenter-2015-01-29.zip
$ python
Then:

from nltk.tag.stanford import POSTagger
path_to_model= '/home/alvas/stanford-postagger-full-2015-01-30/models/arabic.tagger'
path_to_jar = '/home/alvas/stanford-postagger-full-2015-01-30/stanford-postagger-3.5.1.jar'

artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = '/'
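# Note: tag() expects a list of tokens; passing a plain string makes it
# tag each character separately, which is what the output below shows.
tagged_sent = artagger.tag(u"أنا تسلق شجرة")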
print(tagged_sent)
[out]:

$ python3 test.py
[('أ', 'NN'), ('ن', 'NN'), ('ا', 'NN'), ('ت', 'NN'), ('س', 'RP'), ('ل', 'IN'), ('ق', 'NN'), ('ش', 'NN'), ('ج', 'NN'), ('ر', 'NN'), ('ة', 'PRP')]
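
The output above has one tag per character because the sentence was passed as a single string, and iterating over a string yields its characters. The sketch below uses only plain Python and no Stanford tools to show the difference. The whitespace split is a crude stand-in, assumed here for illustration only; a real pipeline would more likely use the Stanford segmenter downloaded earlier.

```python
# -*- coding: utf-8 -*-
sentence = u"أنا تسلق شجرة"


def units_seen_by_tagger(tokens):
    """Mimic how a tagger walks its input: one unit per iteration step."""
    return [unit for unit in tokens]


# Passing the raw string: the tagger would see individual characters.
chars = units_seen_by_tagger(sentence)

# Passing a token list: the tagger would see whole words.
# (Whitespace splitting is an assumption, not proper Arabic segmentation.)
words = units_seen_by_tagger(sentence.split())

if __name__ == "__main__":
    print(len(chars), "units when passing the string")
    print(len(words), "units when passing sentence.split()")
```

Under that assumption, a call such as artagger.tag(sentence.split()) would receive three word tokens instead of a stream of characters.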

If you run into Java problems when using the Stanford POS tagger, see the DELPH-IN wiki.

Can you post the real Arabic characters and the input text you tried, along with the POS tagger in NLTK you tried and the code you used?

@alvas I have done that.

So I checked whether NLTK actually supports Arabic. Is there a way to set it to Arabic? I know Stanford NLP has an Arabic implementation, and NLTK is something like a Python wrapper for it.

NLTK has a wrapper for the Stanford Arabic tools, but it has no native Arabic tagger. pos_tag in NLTK only works for English. I will try to write an example script for the wrapper when I am free in the evening.