Python - can I detect unicode string language code?

python unicode internationalization

I'm in a situation where I'm reading in a series of texts and need to detect their language codes (en, de, fr, es, etc.).

Is there a simple way to do this in Python?

Try a port of the chardet module from Firefox to Python.

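If the number of possible languages is limited, you could use a set of dictionaries, possibly containing only the most common words of each language, and then check the words in your input against the dictionaries.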
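The dictionary idea can be sketched in a few lines of Python 3. This is only an illustration: the function name `guess_language`, the `COMMON_WORDS` table and the ten words chosen per language are assumptions, and real word lists would need to be much larger to be reliable.

```python
import re

# Tiny hand-picked samples of very common words (illustrative only).
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "with", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "zu", "ich"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "dans", "pour", "que"},
    "es": {"el", "los", "las", "y", "es", "un", "una", "para", "que", "con"},
}


def guess_language(text, word_lists=COMMON_WORDS):
    """Return the code whose word list matches the most input words, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        code: sum(1 for word in words if word in vocab)
        for code, vocab in word_lists.items()
    }
    best = max(scores, key=scores.get)
    # No hit in any dictionary means there is nothing to base a guess on.
    return best if scores[best] > 0 else None


print(guess_language("Der Hund ist nicht mit der Katze"))
```

Short inputs and words shared between languages (such as "un" or "que" above) make the count ambiguous, so longer texts give steadier results.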

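Have a look at guess-language:

Attempts to determine the natural language of a selection of Unicode (utf-8) text.

But as the name says, it guesses the language. You can't expect 100% correct results.

Edit:

guess-language is unmaintained, but there is a fork that supports Python 3.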

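There are existing resources worth looking at for ideas.

I'd like to know whether a Bayesian filter can get the language right, but I can't write a proof of concept right now.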
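The Bayesian idea can at least be tried on a small scale. Below is a minimal Python 3 sketch, not evidence that it works on real data: a naive Bayes classifier over character trigrams with add-one smoothing and equal priors. The class name, the choice of trigrams as features and the three tiny training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """Split text into overlapping character trigrams, padded with spaces."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLanguageGuesser:
    def __init__(self):
        self.counts = {}         # language code -> Counter of trigram counts
        self.vocabulary = set()  # every trigram seen during training

    def train(self, code, text):
        grams = trigrams(text)
        self.counts.setdefault(code, Counter()).update(grams)
        self.vocabulary.update(grams)

    def guess(self, text):
        if not self.counts:
            raise ValueError("train() must be called before guess()")
        grams = trigrams(text)
        size = len(self.vocabulary)
        scores = {}
        for code, counter in self.counts.items():
            total = sum(counter.values())
            # Add-one smoothing keeps unseen trigrams from zeroing the score;
            # priors are equal, so they are left out of the sum.
            scores[code] = sum(
                math.log((counter[g] + 1) / (total + size)) for g in grams
            )
        return max(scores, key=scores.get)


guesser = NaiveBayesLanguageGuesser()
guesser.train("en", "the quick brown fox jumps over the lazy dog "
                    "and the cat sleeps in the house")
guesser.train("de", "der schnelle braune fuchs springt über den faulen hund "
                    "und die katze schläft im haus")
guesser.train("fr", "le renard brun rapide saute par dessus le chien "
                    "paresseux et le chat dort dans la maison")
print(guesser.guess("the cat sleeps in the house"))
```

With this little training text the classifier only recognises phrases close to what it has seen; a usable model would need far more text per language.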

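If you need to detect language in response to a user action, you could use the Google Translate API v2. The default limit is 100,000 characters per day, with no more than 5,000 characters per request: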

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2

from operator import itemgetter

def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks] 

    url = 'https://www.googleapis.com/language/translate/v2'

    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t 
           for t in chunks],
        key=api_key,
        target="en"), doseq=1)

    # the encoded request must be no longer than 5000 characters
    if len(data) > 5000:
        raise ValueError("request is too long, see "
            "http://code.google.com/apis/language/translate/terms.html")

    #NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
        headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    if u'error' in d:
        raise IOError(d)
    return map(itemgetter('detectedSourceLanguage'), d['data']['translations'])

In my case I only need to tell two languages apart, so I just check the first character:

import unicodedata

def is_greek(term):
    return 'GREEK' in unicodedata.name(term.strip()[0])


def is_hebrew(term):
    return 'HEBREW' in unicodedata.name(term.strip()[0])
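The same trick can be generalised a little. The following Python 3 sketch (the helper name `first_letter_script` is hypothetical) returns the first word of the Unicode name of the first letter, skipping leading whitespace, digits and punctuation, and returns `None` instead of raising on empty input.

```python
import unicodedata


def first_letter_script(term):
    """Return the first word of the Unicode name of the first letter, or None."""
    for ch in term:
        if ch.isalpha():
            # unicodedata.name() takes a default for characters without a name.
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else None
    return None


print(first_letter_script("  αβγ"))
```

This identifies a script, not a language: it works for the Greek versus Hebrew case above, but many languages share one alphabet, so `LATIN` alone says nothing about en, de, fr or es.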

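According to the points made here, this is the best option for detecting language in Python.

This article compares the speed and accuracy of three solutions, one of which also has a Python port.

I wasted my time with langdetect and am now switching to CLD, which is 16x faster than langdetect and has 98.8% accuracy.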

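It's a nice library, but it gives me the encoding rather than the locale, which I don't need. Thanks, though.

You can map the encoding to a locale.

@İsmail 'cartman' Dönmez: That is only possible when a language has its own character set. Many languages share the same alphabet. Which locale does ascii map to?

@pafcu, yes, but from a piece of text you can only detect the encoding, not the locale, which depends on the system.

I assume sa125 means language, not locale.

+1: nice use of the power of some good existing tools.

@ShimonDoodkin: you could try similar services from different providers.

Do you know whether langdetect has improved since you answered this question?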

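A variant of detect_language_v2 that calls the detect endpoint directly and keeps the highest-confidence detection for each chunk (it relies on the same imports as the version above):

def detect_language_v2(chunks, api_key):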
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks] 

    url = 'https://www.googleapis.com/language/translate/v2/detect'

    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key), doseq=True)

    # the encoded request must be no longer than 5000 characters
    if len(data) > 5000:
        raise ValueError("request is too long, see "
            "http://code.google.com/apis/language/translate/terms.html")

    #NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
        headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))

    return [sorted(L, key=itemgetter('confidence'))[-1]['language']
            for L in d['data']['detections']]

Usage:

print detect_language_v2(
    ["Python - can I detect unicode string language code?",
     u"матрёшка",
     u"打水"], api_key=open('api_key.txt').read().strip())

Output:

[u'en', u'ru', u'zh-CN']
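detect_language_v2 raises an error when the encoded request is longer than 5,000 characters. One way to stay under that limit is to split the input into batches first. The Python 3 sketch below makes no network call; the helper names `encoded_length` and `batch_chunks` and the `max_len` parameter are hypothetical, and the length is measured on the same `q`/`key` form body that the detect variant builds.

```python
from urllib.parse import urlencode


def encoded_length(chunks, api_key):
    """Length of the urlencoded request body for these chunks."""
    return len(urlencode({"q": chunks, "key": api_key}, doseq=True))


def batch_chunks(chunks, api_key, max_len=5000):
    """Yield lists of chunks whose encoded request stays within max_len."""
    batch = []
    for chunk in chunks:
        if encoded_length([chunk], api_key) > max_len:
            raise ValueError("a single chunk exceeds the request limit")
        # Start a new batch when adding this chunk would exceed the limit.
        if batch and encoded_length(batch + [chunk], api_key) > max_len:
            yield batch
            batch = []
        batch.append(chunk)
    if batch:
        yield batch


texts = ["a" * 2000, "b" * 2000, "c" * 2000, "d" * 10]
for batch in batch_chunks(texts, api_key="k"):
    print(len(batch), encoded_length(batch, "k"))
```

Each batch could then be passed to a detection function separately. Non-ASCII text grows when percent-encoded, which is why the sketch measures the encoded body rather than the raw strings.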
