Python: Encode characters but still work with a list

python unicode

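For a text-mining task, we are trying to collect tweets (specifically their text) and run the Stanford NER tagger on them to determine whether any people or locations are mentioned. This could also be done by checking the hashtags, but the idea is to use a text-mining tool.

Suppose we have loaded the data from a cPickle file, where it was saved, loaded, and split on whitespace: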

hil_text = [[u'Man', u'is', u'not', u'a', u'issue', u'cah', u'me', u'pum', u'pum', u'tun', u'up', u'#InternationalWomensDay', u'#cham', u'#empowerment', u'#Clinton2016', u'#PiDay2016'], [u'Shonda', u'Land', u'came', u'out', u'with', u'a', u'great', u'ad', u'for', u'Clinton:https://t.co/Vfg9lAKNaH#Clinton2016'], [u'RT', u'@BeaverforBernie:', u'Trump', u'and', u'the', u"#Clinton's", u'are', u'the', u'same.', u'They', u'worship', u'$$$$$.', u'https://t.co/yUXoJaL6mJ'], [u'.@GloriaLaRiva', u'on', u'#Clinton,', u'Reagans', u'&amp;', u'#AIDS:', u'\u201cClinton', u'just', u're-wrote', u'history\u201d', u'https://t.co/L3YuIyFjxo', u'Clinton', u'incapable', u'of', u'telling', u'truth.'], [u'#KKK', u'Leader', u'Gets', u'Behind', u'This', u'Democratic', u'Candidate', u'https://t.co/p9yTQ2sXmV', u'How', u'fitting!', u'#Hillary2016', u'#HillaryClinton', u'#Hillary', u'#Killary', u'#tcot'], [u'#KKK', u'Leader', u'Gets', u'Behind', u'This', u'Democratic', u'Candidate', u'https://t.co/p9yTQ2sXmV', u'How', u'fitting!', u'#Hillary2016', u'#HillaryClinton', u'#Hillary', u'#Killary', u'#tcot'], [u'RT', u'@jvlibrarylady:', u'President', u'Clinton', u'at', u'rally', u'for', u'Hillary', u'at', u'Teamsters', u'Local', u'245', u'in', u'Springfield,', u'Mo.', u'#HillaryClintonForPresident', u'https://t.\u2026'], [u'RT', u'@jvlibrarylady:', u'President', u'Clinton', u'at', u'rally', u'for', u'Hillary', u'at', u'Teamsters', u'Local', u'245', u'in', u'Springfield,', u'Mo.', u'#HillaryClintonForPresident', u'https://t.\u2026']]

The tagger does not accept unicode, so to get it working we tried the following:

for word in hil_text:
    for x in word:
        print x.encode('utf-8', errors='ignore')
        print tagger.tag(x.encode('utf-8', errors='ignore'))

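As a result, x is printed as a whole word, but the tagger tags each letter separately.

Is there a way to encode it and send it through the tagger as a single word? In other words, can the items in the list be encoded while still being kept in the list?

Why does the tagger tag each letter rather than the whole of x?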

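It looks like tagger.tag expects a sequence of strings. But you are passing it a single string, which Python treats as a sequence of characters. To fix this, try the following: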

for section in hil_text:
    # encode each word in the section, and put them in a new list
    words = [word.encode('utf-8') for word in section]
    # pass the list of encoded words to the tagger
    print tagger.tag(words)
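
To see why the original loop tagged every letter, here is a small sketch that does not need the Stanford tagger installed. The function fake_tag is a hypothetical stand-in (not the real tagger's implementation) that simply labels each item it iterates over, which is assumed to be how tagger.tag consumes its argument. It is written to run under both Python 2 and Python 3.

```python
from __future__ import print_function


def fake_tag(tokens):
    """Hypothetical stand-in for tagger.tag: label every item iterated over."""
    return [(token, 'O') for token in tokens]


# A single string is itself a sequence, so iterating it yields characters.
per_letter = fake_tag(u'Clinton')

# A list holding one string yields that string as a single token.
per_word = fake_tag([u'Clinton'])

print(len(per_letter), per_letter[0])
print(len(per_word), per_word[0])

# Encoding inside a nested comprehension keeps the list-of-lists shape,
# so each tweet is still a list of (now encoded) words.
sample = [[u'President', u'Clinton'], [u'\u201cClinton', u'just']]
encoded = [[word.encode('utf-8') for word in section] for section in sample]
print([len(section) for section in encoded])
```

The string call produces one tag per character, while the list call produces one tag for the whole word. That is the same difference as between passing x.encode(...) and passing a list of encoded words to the tagger.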