Python: Creating a dictionary with Counter


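I have a list of word tokens, and I want to build a dictionary from it where the keys are the words and the values are the word frequencies.

The code is as follows: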

import pandas as pd
import numpy as np
import datetime
import sys
import codecs
import re
import urllib, urllib2

import nltk  # Natural Language Processing

from nltk.corpus import stopwords # list of words
import string  # list(string.punctuation) - produces a list of punctuations
from collections import Counter  # optimized way to do this

#wordToken = ['happy', 'thursday', 'from', 'my', 'big', 'sweater', 'and', 'this', 'ART', '@', 'East', 'Village', ',', 'Manhattan', 'https', ':', '//t.co/5k8PUInmqK', 'RT', '@', 'MayorKev', ':', 'IM', 'SO', 'HYPEE', '@', 'calloutband', '@', 'FreakLikeBex', '#', 'Callout', '#', 'TheBitterEnd', '#', 'Manhattan', '#', 'Music']

# this is the output from wordToken = [token.encode('utf-8') for tweetL in tweetList for token in nltk.tokenize.word_tokenize(tweetL)]

wordTokenLw = ' '.join(map(str, wordToken))
wordTokenLw = wordTokenLw.lower()

tweetD = {}

#c = Counter(wordTokenLw)

c = Counter(word.lower() for word in wordToken) # TRYING the suggested answer

#tweetD = dict(c.most_common())
tweetD = dict(c)

print tweetD

However, my output is completely wrong:

{'\x80': 2, 'j': 4, ' ': 192, '#': 21, "'": 1, '\xa6': 2, ',': 1, '/': 37, '.': 13, '1': 1, '0': 5, '3': 2, '2': 4, '5': 3, '7': 2, '9': 2, '8': 1, ';': 1, ':': 18, '@': 14, 'b': 17, 'a': 83, 'c': 36, '\xe2': 2, 'e': 63, 'd': 16, 'g': 10, 'f': 12, 'i': 37, 'h': 33, 'k': 12, '&': 1, 'm': 38, 'l': 22, 'o': 37, 'n': 49, 'q': 5, 'p': 33, 's': 32, 'r': 44, 'u': 20, 't': 104, 'w': 11, 'v': 14, 'y': 21, 'x': 8, 'z': 5}

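I think the problem lies in how I format the data (I use a space as the separator for the join call). The reason I use join is so that I can call lower() and get everything in lowercase. If there is a better way to reach my end result, I would love to hear it.

This is a new area for me, and I really appreciate your help.

Output after trying: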

c = Counter(word.lower() for word in wordToken)

{'over': 1, 'hypee': 1, '//t.co/0\xe2\x80\xa6': 1, ',': 1, 'thursday': 1, 'day': 1, 'to': 2, 'dreams': 1, 'main': 1, '@': 14, 'automotive': 1, 'tbt': 1, 'positivital': 1, '2ma': 1, 'amp': 1, 'traveiplaces': 1, '//t.co/vmbal\xe2\x80\xa6': 1, '//t.co/c9ezuknraq': 1, 'motorcycles': 1, 'river': 1, 'view': 1, '//t.co/kpeunlzoyf': 1, 'art': 1, 'reillyhunter': 1, '//t.co/5pcxnzpwhw': 1, 'mayorkev': 1, 'rt': 5, '#': 21, 'pinterest': 1, 'away': 1, 'traveltuesday': 1, 'ice': 1, '//t.co/simhceefqy': 1, 'state': 1, 'fog': 1, ';': 1, '3d': 1, 'be': 1, 'run': 1, '//t.co/xrqaa7cb3e': 1, 'taevision': 1, 'by': 1, 'on': 1, 'livemusic': 1, 'bmwmotorradusa': 1, 'taking': 1, 'calloutband': 1, 'jersey': 1, 'uber': 1, 'bell': 1, 'freaklikebex': 1, 'village': 1, '.': 1, 'from': 2, '//t.co/5k8puinmqk': 1, '//t.co/gappxrvuql': 1, '&': 1, '500px': 1, 'sweater': 1, 'callout': 1, 'next': 1, 'appears': 1, 'music': 1, 'https': 5, ':': 18, 'happy': 1, 'park': 1, 'mercedesbenz': 1, 'amcafee': 1, 'foggy': 1, 'east': 2, '7pm': 1, 'this': 2, 'of': 1, 'taxis': 1, 'my': 1, 'and': 2, 'bridge': 1, 'centralpark': 1, '//t.co/ujdzsywt0u': 1, 'toughrides': 1, '10/22': 1, 'am': 1, 'thebitterend': 1, 'bmwmotorrad': 1, 'im': 1, 'at': 2, 'in': 3, 'cream': 1, 'nj': 1, '//t.co/hnxktmvrsc': 1, 'ny': 2, 'big': 1, 'nyc': 3, 'rides': 1, 'manhattan': 10, 'nice': 1, 'week': 1, 'blue': 1, 'http': 7, 'effect': 1, 'paleteria': 1, "'m": 1, 'a': 1, '//t.co/ucgfcwp9j2': 1, 'i': 2, 'so': 1, 'bmw': 1}

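When you join the individual tokens back into a single string, the Counter starts counting letters instead of words (because you are giving it an iterable of letters). Instead, you should create the Counter directly from the wordToken list; you can use a generator expression to call lower on each item as it goes into the Counter: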
c = Counter(word.lower() for word in wordToken)
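
To make the difference concrete, here is a minimal sketch in Python 3 (the question's code is Python 2). The token list is a shortened, hypothetical sample based on the question's wordToken, and the variable names are only illustrative.

```python
from collections import Counter

# Hypothetical sample, shortened from the question's wordToken list.
word_token = ['happy', 'thursday', 'Manhattan', '#', 'Manhattan', '#', 'Music']

# Joining first produces one long string, so Counter iterates over characters.
joined = ' '.join(word_token).lower()
char_counts = Counter(joined)

# Passing the list (lowercased item by item) makes Counter iterate over words.
word_counts = Counter(word.lower() for word in word_token)

print(char_counts[' '])          # spaces are counted as "items" in the joined string
print(word_counts['manhattan'])  # whole words are counted in the list version
print(dict(word_counts))         # plain dict with word -> frequency
```

The joined version has keys such as ' ' and single letters, which matches the unexpected output shown in the question; the list version has one key per distinct lowercase token.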

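This is one of the common mistakes when working with strings. In Python, strings are iterable, so when a function accepts an iterable and ends up being given a string, it operates on the elements of that string, which are the characters that make it up.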
class collections.Counter([iterable-or-mapping])

In your case, you just need to run Counter on wordToken like this:
Counter(map(lambda w: w.lower(), wordToken))
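
As a rough check, the map-based call and the generator expression should produce the same counts, and neither modifies the original list. The sketch below is Python 3 and uses a hypothetical sample list (the trailing 'rt' is added only to show two spellings merging into one key).

```python
from collections import Counter

# Hypothetical sample; the final 'rt' is added for illustration.
word_token = ['RT', '@', 'MayorKev', ':', 'IM', 'SO', 'HYPEE', 'rt']

via_map = Counter(map(lambda w: w.lower(), word_token))
via_genexp = Counter(w.lower() for w in word_token)
via_method = Counter(map(str.lower, word_token))  # same idea without a lambda

# All three spellings build identical counters.
same = via_map == via_genexp == via_method

# The lowercase copies exist only inside the Counter;
# the original list keeps its original casing.
tweet_d = dict(via_map)
print(same, tweet_d['rt'], word_token[0])
```

Which spelling to use is mostly a matter of taste; the generator expression tends to read more naturally in modern Python.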

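I tried this. One question first: does this convert everything to lowercase permanently, or is it done only temporarily while building the Counter? The output makes more sense, but it is a bit odd. I added it to the code section of the question. Almost every word seems to appear only once. That is possible, but something feels off.

This only lowercases the words as they go into the Counter. If you want to lowercase everything, just change the initial list comprehension to token.encode(..).lower().

Awesome! Thank you for the great explanation!

Actually, for some reason it gives me a syntax error on the next line (I also tried changing w to word, but the error is still there).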
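
The suggestion in the comments to lowercase inside the initial list comprehension could look roughly like the following Python 3 sketch. Two assumptions: tweet_list is a made-up stand-in for the question's tweetList, and str.split() replaces nltk.tokenize.word_tokenize so the example runs without NLTK. In Python 3 the encode('utf-8') step is not needed for counting, so it is left out here.

```python
from collections import Counter

# Hypothetical stand-in for the question's tweetList.
tweet_list = ['Happy Thursday from Manhattan', 'RT Manhattan Music']

# str.split() is only a simple substitute for nltk.tokenize.word_tokenize,
# used so that this sketch runs without NLTK installed.
word_token = [token.lower() for tweet in tweet_list for token in tweet.split()]

# The list itself is now lowercase, so Counter needs no extra lower() call.
tweet_d = dict(Counter(word_token))

print(word_token)
print(tweet_d)
```

With this approach the lowercase form is stored in the list permanently, rather than being produced only while the Counter is built.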