Unicode Python 2.7: Detecting emoji in text


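I want to be able to detect emoji in text and look up their names.

I'm having no luck with the unicodedata module, and I suspect I don't understand the UTF-8 conventions.

I think I need to load my document as UTF-8, then break the unicode "string" into individual Unicode symbols, iterate over those, and look each one up.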

# new example loaded using pandas and encoding UTF-8
test = u'A man tried to get into my car\U0001f648'

type(test)  # unicode

import unicodedata as uni
uni.name(test[0])
Out[89]: 'LATIN CAPITAL LETTER A'

uni.name(test[-3])
Out[90]: 'LATIN SMALL LETTER R'    

uni.name(test[-1])
ValueError                                Traceback (most recent call last)
<ipython-input-105-417c561246c2> in <module>()
----> 1 uni.name(test[-1])
ValueError: no such name

# just to be clear
uni.name(u'\U0001f648')
ValueError: no such name


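I looked the Unicode symbol up on Google, and it is a legitimate symbol. Maybe the unicodedata module isn't very comprehensive.

I'm considering building my own lookup table.

I'm interested in other ideas, but this one seems doable.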

Here's a way to read the link you provided. It's translated from Python 2, so there might be a glitch or two.

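import re
from urllib.request import urlopen  # Python 3 location of urllib2.urlopen

# Group 1: hex code point after "U+"; group 2: name after the parenthesised glyph
rexp = re.compile(r'U\+([0-9A-Za-z]+)[^#]*# [^)]*\) *(.*)')
mapping = {}
for line in urlopen('ftp://ftp.unicode.org/Public/emoji/1.0/emoji-data.txt'):
    line = line.decode('utf-8')
    m = rexp.match(line)
    if m:
        # chr() accepts any code point in Python 3 (Python 2's chr() stops at 255)
        mapping[chr(int(m.group(1), 16))] = m.group(2)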
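To see what that regular expression captures without touching the network, here is a small Python 3 sketch that applies the same pattern to a single line. The sample line is invented to fit the pattern and is only an assumption about the file's layout; the real emoji-data.txt may be formatted differently. The helper name `parse_line` is also made up for illustration.

```python
import re

# Same pattern as in the answer above.
rexp = re.compile(r'U\+([0-9A-Za-z]+)[^#]*# [^)]*\) *(.*)')

# Hypothetical line, invented to fit the pattern; the real file may differ.
sample = 'U+1F648 ; emoji ; L1 ; none ; j # V6.0 (\U0001f648) SEE-NO-EVIL MONKEY'

def parse_line(line):
    """Return (character, name) for a matching line, or None otherwise."""
    m = rexp.match(line)
    if m is None:
        return None
    # Group 1 is the hex code point, group 2 the name after the ")".
    return chr(int(m.group(1), 16)), m.group(2).strip()

parsed = parse_line(sample)
skipped = parse_line('# a comment line')  # does not start with "U+"
print(parsed, skipped)
```

Because `rexp.match` anchors at the start of the line, comment and blank lines simply return None and never reach the mapping.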

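My problem was using the unicodedata module with Python 2.7. I used Conda to create a Python 3.3 environment, and now unicodedata works as expected, so I have dropped all the weird hacks I was working on.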

# using python 3.3
import unicodedata as uni

In [2]: uni.name('\U0001f648')
Out[2]: 'SEE-NO-EVIL MONKEY'
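A likely reason for the difference is that each Python release bundles a fixed version of the Unicode Character Database, reported by `unicodedata.unidata_version`. As far as I know, Python 2.7 ships version 5.2.0, while U+1F648 was only added in Unicode 6.0. The Python 3 sketch below checks what the running interpreter knows; the helper names `ucd_version` and `knows` are made up for illustration.

```python
import unicodedata

def ucd_version():
    """Return the bundled Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split('.'))

def knows(char):
    """True if this interpreter's unicodedata has a name for char."""
    # The second argument is a default returned instead of raising ValueError.
    return unicodedata.name(char, None) is not None

print(ucd_version())
print(knows('\U0001f648'))
```

If `ucd_version()` is older than the Unicode version that introduced a character, `knows` should return False for it, which matches the "no such name" error in the question.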

Thanks to Mark Ransom for pointing out that I originally had mojibake from not importing my data correctly. Thanks again for your help.
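For anyone who hits the same thing, here is a minimal Python 3 sketch of how mojibake arises: the same UTF-8 bytes are decoded once with the right codec and once with the wrong one. Latin-1 is only an assumed stand-in for whatever wrong encoding was actually applied to the original data.

```python
import unicodedata

# Bytes as they might sit in a UTF-8 encoded file.
raw = 'car\U0001f648'.encode('utf-8')

good = raw.decode('utf-8')     # correct codec: the emoji is one character
bad = raw.decode('latin-1')    # wrong codec: each byte becomes its own character

print(len(good), len(bad))
print(unicodedata.name(good[-1]))
# The four trailing characters of `bad` are the emoji's bytes, misread one by one.
print([unicodedata.name(c, '<unnamed>') for c in bad[3:]])
```

The wrongly decoded string still has type `str`, so the type alone does not tell you the data is sound; printing it or checking its length is a quicker sanity check.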

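That string doesn't contain what you think it does. Try printing it.

Yes, I wasn't looking for an emoji here, I just grabbed something...

But my point is that it's not a Unicode string. It's a byte string, and it looks like it contains some mojibake. unicodedata won't work if you feed it garbage.

OK, that's it... thanks again. I'll update with a better example. I also have to avoid the mojibake in the first place.

Possibly unicodedata has no record of some recently added characters. You may need to put a try/except around it.

Thanks Mark, this regex is useful. I've been having a hard time getting my experiments to handle unicode. I'll look into this later today and hopefully add more.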
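The try/except suggestion could look roughly like the Python 3 sketch below. The function names are made up for illustration, and treating the general categories 'So' (Symbol, other) and 'Cn' (unassigned in this database version) as "probably emoji" is only a rough assumption, not an official definition of emoji.

```python
import unicodedata

def safe_name(char, fallback='UNKNOWN'):
    """Look up a character name, tolerating characters the database lacks."""
    try:
        return unicodedata.name(char)
    except ValueError:
        return fallback

def find_symbols(text):
    """Return (char, name) pairs for characters that look like symbols.

    Assumption: 'So' covers pictographs the database knows about, and 'Cn'
    catches code points too new for the bundled database.
    """
    return [(c, safe_name(c)) for c in text
            if unicodedata.category(c) in ('So', 'Cn')]

found = find_symbols('A man tried to get into my car\U0001f648')
print(found)
```

On an interpreter whose database predates a character, the pair would come back with the fallback name instead of raising, so the rest of the text can still be processed.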