Python Polyglot does not detect multiple languages


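I am testing the polyglot package in Python to detect the languages in mixed-language documents.

I do not expect the most accurate predictions, but out of the box it returns only one language as the answer, even for text containing 2 or 3 languages.

The texts I use average about 20 words, like the one below: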

from polyglot.detect import Detector

# Mixed-language sample: two French sentences followed by two English ones.
text = 'Je travaillais en France. Je suis tres heureux. I work in London. I grew up in Manchester.'

# Detector runs language detection on the whole string at once.
answer = Detector(text)

# Printing the detector shows the reliability flag and up to three candidate languages.
print(answer)

I always get results like the following, with no multi-language answer:

Prediction is reliable: True
Language 1: name: English     code: en       confidence:  98.0 read bytes:   682
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

This is nowhere near what its documentation shows:

> China (simplified Chinese: 中国; traditional Chinese: 中國),
> 
> name: English     code: en       confidence:  71.0 read bytes:   887
> name: Chinese     code: zh_Hant  confidence:  11.0 read bytes:  1755
> name: un          code: un       confidence:   0.0 read bytes:     0

Although, to be honest, when I run the detector on the Chinese-English example above, I do get a mixed-language answer.


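Why does this happen?

P.S.

Also, polyglot is very bad at detecting the language of a single word, even a very common one. For example, for the word quantita (Italian), it returns English.

I know that many of these packages mostly succeed when given a large amount of text, but it is surprising that they cannot even catch simple words like this.

Textblob seems to do well on single words too, but you can only send it a very limited number of requests (probably, in both cases, because it uses the Google API).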

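I think Polyglot detects the language by reading the characters used in the text. The examples you mention above are all written with English (Latin) letters. Whether the words are French, Italian, Spanish, transliterated Chinese or any other language, they will all be detected as English, because they are written using the character set of the English language.

So, as far as I can tell, Polyglot only works well for languages that use non-Latin characters, such as Greek, Russian, Arabic or Chinese.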
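To make the character-set argument concrete, here is a minimal sketch that tallies the letters of a text by script. It only illustrates the hypothesis above and makes no claim about how polyglot works internally. `script_counts` is a hypothetical helper, and taking the first word of the Unicode character name as a script label is a rough assumption, not an official script property.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by a rough script label taken from the Unicode character name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        # The first word of the Unicode name serves as a rough script label,
        # e.g. 'LATIN SMALL LETTER A' -> 'LATIN',
        #      'CJK UNIFIED IDEOGRAPH-4E2D' -> 'CJK'.
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    return counts


# The mixed French/English sample from the question.
mixed = ("Je travaillais en France. Je suis tres heureux. "
         "I work in London. I grew up in Manchester.")
# The mixed English/Chinese sample from the polyglot documentation.
china = "China (simplified Chinese: 中国; traditional Chinese: 中國),"

print(script_counts(mixed))  # only one script label appears
print(script_counts(china))  # two script labels appear
```

The French and English sentences share a single script, so a purely script-based signal cannot tell them apart, while the second sample contains two scripts.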

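That would also explain why, in the documentation example below, Chinese is reported too, but with low confidence: there are few Chinese characters and many more Latin ones: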

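China (simplified Chinese: 中国; traditional Chinese: 中國),

name: English     code: en       confidence:  71.0 read bytes:   887
name: Chinese     code: zh_Hant  confidence:  11.0 read bytes:  1755
name: un          code: un       confidence:   0.0 read bytes:     0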
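One possible workaround for mixed-language text, offered only as a sketch, is to split the text into sentences and run detection on each sentence separately. Everything below is illustrative: `split_sentences`, `detect_per_sentence`, `toy_detect` and the tiny `STOPWORDS` table are hypothetical names, and the word-counting detector is only a stand-in so that the example runs without polyglot installed.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def detect_per_sentence(text, detect):
    """Apply `detect` (a callable: str -> language code) to each sentence."""
    return [(sentence, detect(sentence)) for sentence in split_sentences(text)]


# Hypothetical stand-in detector: scores a sentence by counting a few
# hand-picked function words per language. Not a real language model.
STOPWORDS = {
    "fr": {"je", "suis", "en", "tres", "le", "la", "et"},
    "en": {"i", "in", "up", "the", "and", "work"},
}


def toy_detect(sentence):
    words = re.findall(r"[a-z]+", sentence.lower())
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # Mirror polyglot's 'un' (unknown) code when nothing matches.
    return best if scores[best] > 0 else "un"


text = ("Je travaillais en France. Je suis tres heureux. "
        "I work in London. I grew up in Manchester.")

results = detect_per_sentence(text, toy_detect)
for sentence, code in results:
    print(code, sentence)
```

To try this with polyglot itself, `toy_detect` could be swapped for something like `lambda s: Detector(s, quiet=True).language.code`, assuming the `quiet` argument and the `language.code` attribute behave as described in the polyglot documentation. Very short sentences may still be detected unreliably.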

Can you share the code?

@satishsilveri, sure, but it is nothing special.