Python: word sense disambiguation for Arabic text with NLTK
NLTK lets me disambiguate text with nltk.wsd.lesk, for example:
>>> from nltk.corpus import wordnet as wn
>>> from nltk.wsd import lesk
>>> sent = "I went to the bank to deposit money"
>>> ambiguous = "deposit"
>>> lesk(sent, ambiguous, pos='v')
Synset('deposit.v.02')
Other tools do the same, but only for English text.
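For readers unfamiliar with Lesk, here is a minimal sketch of the overlap idea behind it: pick the sense whose gloss shares the most words with the context. It is a toy, not NLTK's implementation. The sense ids, glosses, stopword list, and the names `TOY_SENSES`, `tokens` and `simplified_lesk` are all invented for illustration, and nothing here touches WordNet.

```python
# Toy sense inventory: sense id -> gloss. These glosses are invented for
# illustration and are NOT taken from WordNet.
TOY_SENSES = {
    "deposit.v.bank": "put money into a bank account",
    "deposit.v.place": "put something down in a specific place",
    "deposit.v.entrust": "give something to another person for safekeeping",
}

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "to", "in", "into", "for", "of", "i"}


def tokens(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(context, senses, candidates=None):
    """Return the sense id whose gloss shares the most words with `context`.

    `candidates` optionally restricts which senses are considered. Ties are
    broken by sense id (alphabetically smallest wins), so the result is
    deterministic. Returns None when there is nothing to choose from.
    """
    context_words = tokens(context)
    ids = sorted(candidates if candidates is not None else senses)
    if not ids:
        return None
    # max() keeps the first maximal element, i.e. the smallest id on a tie.
    return max(ids, key=lambda s: len(context_words & tokens(senses[s])))


print(simplified_lesk("I went to the bank to deposit money", TOY_SENSES))
# deposit.v.bank
```

The context shares "bank" and "money" with the first gloss and nothing with the other two, so the first sense wins. When no gloss overlaps at all, the choice falls back to the tie-break and is essentially arbitrary, which is one reason Lesk output should be treated with caution.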
NLTK supports the Arabic wordnet from the Open Multilingual WordNet, and the synsets are also indexed by Arabic lemma:
>>> wn.synsets(u'أَوْدَعَ', lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
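But how can I disambiguate Arabic text and extract concepts from queries using NLTK?
Can the Lesk algorithm be used to process Arabic text through NLTK?
It's a little tricky, but this might work: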
alvas@ubi:~$ wget -O translate.sh http://pastebin.com/raw.php?i=aHgFzmMU
--2015-08-05 23:32:46-- http://pastebin.com/raw.php?i=aHgFzmMU
Resolving pastebin.com (pastebin.com)... 190.93.241.15, 190.93.240.15, 141.101.112.16, ...
Connecting to pastebin.com (pastebin.com)|190.93.241.15|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘translate.sh’
[ <=> ] 212 --.-K/s in 0s
2015-08-05 23:32:47 (9.99 MB/s) - ‘translate.sh’ saved [212]
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> text = 'لديه يودع المال في البنك'
>>> cmd = 'echo "{}" | bash translate.sh'.format(text)
>>> translation = os.popen(cmd).read()
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 193 0 40 100 153 21 83 0:00:01 0:00:01 --:--:-- 83
>>> translation
'He has deposited the money in the bank. '
>>> ambiguous = u'أَوْدَعَ'
>>> wn.synsets(ambiguous, lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
>>> # translation_stems is not defined in this session; presumably the (stemmed) tokens of `translation`.
>>> nltk.wsd.lesk(translation_stems, '', synsets=wn.synsets(ambiguous, lang='arb'))
Synset('entrust.v.02')
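But as you can see, there are many limitations:
- Access to an MT system is not always easy (the bash script above, which uses an IBM API, will not last forever)
- Machine translation will never be 100% accurate
- Finding the right lemma in the Open Multilingual WordNet is not as easy as the example suggests; lemmas have inflections and other morphemic variants
- WordNet will never be complete, especially when it is not English
- WSD is never 100% what a human would expect (even among humans our "senses" differ; in the example above, some might say the WSD output is correct and others that Synset('deposit.v.02') would be the better choice)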
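On the lemma-matching point, one small and partial mitigation is to compare words with their diacritics removed, since the wordnet lemma above is fully vocalised while running Arabic text usually is not. The following is a minimal sketch under that assumption. The helper names `strip_diacritics`, `build_index` and `lookup` are hypothetical, and the one-entry lemma list stands in for whatever lemma inventory is actually available. It does nothing about inflection, so a form like يودع in the example sentence would still not match.

```python
import unicodedata


def strip_diacritics(text):
    """Remove Unicode combining marks (category Mn), e.g. Arabic harakat."""
    # NFC first, so a decomposed alef + hamza becomes one base letter
    # instead of losing its hamza when the marks are stripped.
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if unicodedata.category(ch) != "Mn")


def build_index(lemmas):
    """Map each undiacritised form to the vocalised lemmas that produce it."""
    index = {}
    for lemma in lemmas:
        index.setdefault(strip_diacritics(lemma), []).append(lemma)
    return index


def lookup(word, index):
    """Return the vocalised lemmas matching `word`, ignoring diacritics."""
    return index.get(strip_diacritics(word), [])


# The vocalised lemma from the example, written with escapes: أَوْدَعَ
VOCALISED = "\u0623\u064e\u0648\u0652\u062f\u064e\u0639\u064e"

# Hypothetical one-entry lemma list; a real one would come from the wordnet data.
INDEX = build_index([VOCALISED])

# An unvocalised spelling (أودع) now finds the vocalised lemma.
print(lookup("\u0623\u0648\u062f\u0639", INDEX) == [VOCALISED])
```

The recovered lemma can then be passed to `wn.synsets(..., lang='arb')` as in the session. Dropping diacritics also merges words that differ only in vocalisation, so this trades precision for recall.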