Python NLTK tokenize: TypeError: unhashable type: 'list'

Tags: python, pandas, nltk

Following this example:

CSV read into df['Country','Responses']:

'Country'
Italy
Italy
France
Germany

'Responses' 
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
  • Tokenize the text in 'Responses'
  • Remove the 100 most common words (based on brown.corpus)
  • Find the 100 most common of the remaining words
  • I can complete steps 1 and 2, but on step 3 I get the error:

    TypeError: unhashable type: 'list'
    
    I believe this is because I am working in a DataFrame, and made this (possibly misguided) modification:

    Original example:

    #divide to words
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(tweets)
    
    My code:

    #divide to words
    tokenizer = RegexpTokenizer(r'\w+')
    df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
    
    My full code:

    import pandas as pd
    import nltk
    from nltk import FreqDist
    from nltk.corpus import brown
    from nltk.tokenize import RegexpTokenizer

    df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)
    
    tokenizer = RegexpTokenizer(r'\w+')
    df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
    
    words =  df['tokenized_sents']
    
    #remove 100 most common words based on Brown corpus
    fdist = FreqDist(brown.words())
    mostcommon = fdist.most_common(100)
    mclist = []
    for i in range(len(mostcommon)):
        mclist.append(mostcommon[i][0])
    words = [w for w in words if w not in mclist]
    
    Out: ['the',
     ',',
     '.',
     'of',
     'and',
    ...]
    
    #keep only most common words
    fdist = FreqDist(words)
    mostcommon = fdist.most_common(100)
    mclist = []
    for i in range(len(mostcommon)):
        mclist.append(mostcommon[i][0])
    words = [w for w in words if w not in mclist]
    
    TypeError: unhashable type: 'list'
    
    There are many questions here on unhashable lists, but none, as far as I can tell, that are quite the same. Any suggestions? Thanks.


    Traceback:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-164-a0d17b850b10> in <module>()
      1 #keep only most common words
    ----> 2 fdist = FreqDist(words)
      3 mostcommon = fdist.most_common(100)
      4 mclist = []
      5 for i in range(len(mostcommon)):
    
    /home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
        104         :type samples: Sequence
        105         """
    --> 106         Counter.__init__(self, samples)
        107 
        108     def N(self):
    
    /home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
        521             raise TypeError('expected at most 1 arguments, got %d' % len(args))
        522         super(Counter, self).__init__()
    --> 523         self.update(*args, **kwds)
        524 
        525     def __missing__(self, key):
    
    /home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
        608                     super(Counter, self).update(iterable) # fast path when counter is empty
        609             else:
    --> 610                 _count_elements(self, iterable)
        611         if kwds:
        612             self.update(kwds)
    
    TypeError: unhashable type: 'list'
    
    
    The function FreqDist takes an iterable of hashable objects (intended to be strings, but it probably works with anything). The error you are getting is because you pass in an iterable of lists. As you suggested, this is because of the change you made:

    df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
    
    If I understand correctly, that line applies the nltk.word_tokenize function to a Series. word_tokenize returns a list of words, so the result is a Series of lists rather than a flat sequence of strings.
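
    A minimal sketch (with made-up tokens, not your data) of the failure and the fix:

    from nltk import FreqDist

    nested = [['Loren', 'ipsum'], ['Loren']]    # one token list per response
    # FreqDist(nested)                          # raises TypeError: unhashable type: 'list'
    flat = [w for sub in nested for w in sub]   # flatten to one list of strings
    print(FreqDist(flat).most_common(1))        # [('Loren', 2)]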

    As a solution, just add the lists together before trying to apply FreqDist, like this:

    allWords = []
    for wordList in words:
        allWords += wordList
    FreqDist(allWords)
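
    An equivalent way to flatten, using the itertools approach that comes up in the comments below, assuming the same Series of token lists:

    from itertools import chain

    allWords = list(chain.from_iterable(words))   # chain the per-response lists together
    FreqDist(allWords)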
    
    If you only need to identify the second set of 100, note that mclist will contain it the second time around:

    df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)
    
    tokenizer = RegexpTokenizer(r'\w+')
    df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
    
    lists =  df['tokenized_sents']
    words = []
    for wordList in lists:
        words += wordList
    
    #remove 100 most common words based on Brown corpus
    fdist = FreqDist(brown.words())
    mostcommon = fdist.most_common(100)
    mclist = []
    for i in range(len(mostcommon)):
        mclist.append(mostcommon[i][0])
    words = [w for w in words if w not in mclist]
    
    Out: ['the',
     ',',
     '.',
     'of',
     'and',
    ...]
    
    #keep only most common words
    fdist = FreqDist(words)
    mostcommon = fdist.most_common(100)
    mclist = []
    for i in range(len(mostcommon)):
        mclist.append(mostcommon[i][0])
    # mclist contains second-most common set of 100 words
    words = [w for w in words if w in mclist]
    # this will keep ALL occurrences of the words in mclist
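
    As a side note, filtering with w not in mclist does a linear scan of the list for every token, which can get slow on large corpora (possibly related to the long run times mentioned in the comments). A common speed-up, sketched here, is to test membership against a set instead:

    mcset = set(mclist)                           # hash-based lookups instead of list scans
    words = [w for w in words if w not in mcset]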
    

    What is the type of the elements in words? You can print them and check whether they are strings (English sentences). It seems you have lists in there. Note also that the line words = [w for w in words if w not in mclist] should not be inside the for loop.

    words shows up as a list: type(words) = list.

    Thanks, I'll give it a try. I wonder if this will let me strip out the 100 most common English words and then find the 100 most frequently used words in the text itself? I'll try it...

    Yes. It doesn't make the words into a set, it just puts them all in one list, so with a list comprehension words = [w for w in words if w not in most_common_english_words] it is easy, and then the most_common function works just as you had it.

    Thanks. I tried adding the lists together, but got an empty set: words[]. My understanding is that the first block removes the most common words according to Brown and returns the remaining words to the second block, which identifies the 100 most common again, this time putting them into mclist[]. I'll keep trying... I did try words = df['Responses'].apply(nltk.word_tokenize), but after a long time (e.g. 20 minutes) it returned no data. The itertools approach you mentioned in the other question may be another way to go.