
Python: word-tokenizing Twitter data (python, python-3.x, nltk)


I'm trying to extract the English words from a text file for a simple word-frequency task. How can I filter out the other strings in the list?

from nltk.tokenize import word_tokenize

# message holds the raw tweet text read from the file;
# join the lines with spaces before tokenizing
words = word_tokenize(message.replace('\n', ' '))
print(words)
which gives output like this:

['Amazon', 'b', 'maji_opai', 'am\\xcd\\x9ca\\xcd\\x89zon\\xe2\\x80\\xa6', '\\xcb\\x99\\xea\\x92\\xb3\\xe2\\x80\\x8b\\xcb\\x99', 'Amazon', "b'RT", 'WorkingGIrl', 'For', 'people', 'love', 'REAL', 'paperbacks', 'THE', 'PARIS', 'EFFECT', '10', 'right', 'https', '//', 'https', 'Amazon', "b'RT", 'AbsentiaSeries', 'ABSENTIA', 'IS', 'HERE', '\\xf0\\x9f\\x91\\x81', '\\xf0\\x9f\\x91\\x81', '\\xf0\\x9f\\x91\\x81', '\\xf0\\x9f\\x91\\x81', '\\xf0\\x9f\\x91\\x81', 'US', 'UK', 'Australia', 'Germany', 'Ireland', 'Italy', 'Netherlands', 'go', 'https', 'Amazon', "b'RT", 
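Tokens like "b'RT" and '\\xf0\\x9f\\x91\\x81' suggest the file was written by calling str() on bytes objects instead of decoding them first. If that is the case (an assumption about how the file was produced), you can recover the real text before tokenizing; here is a minimal sketch using ast.literal_eval on a hypothetical line shaped like the artifacts above:

import ast

# Hypothetical line shaped like the str(bytes) artifacts above
raw = "b'RT @AbsentiaSeries: ABSENTIA IS HERE \\xf0\\x9f\\x91\\x81'"

if raw.startswith(("b'", 'b"')):
    # Re-parse the bytes literal, then decode it as UTF-8
    message = ast.literal_eval(raw).decode('utf-8')
else:
    message = raw

print(message)  # RT @AbsentiaSeries: ABSENTIA IS HERE followed by the eye emoji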

If you have a specific list of words you're looking for, you can use a simple list comprehension like this:

words = word_tokenize(message.replace('\n', ' '))
word_list = ['amazon', 'b']
# Compare case-insensitively so 'Amazon' matches 'amazon'
filtered_words = [x for x in words if x.lower() in word_list]
If you use Python regularly, it's worth getting comfortable with list comprehensions; they come up a lot.
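For example, with a few of the tokens from the output above:

words = ['Amazon', 'b', 'maji_opai', 'For', 'people']
word_list = ['amazon', 'b']
print([x for x in words if x.lower() in word_list])  # ['Amazon', 'b']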


nltk has a tokenizer hand-crafted for tweets:

>>> from nltk.tokenize import TweetTokenizer
>>> tt = TweetTokenizer()
>>> tweet = 'Thanks to the historic TAX CUTS that I signed into law, your paychecks are going way UP, your taxes are going way DOWN, and America is once again OPEN FOR BUSINESS! #FakeNews'
>>> tt.tokenize(tweet)
['Thanks', 'to', 'the', 'historic', 'TAX', 'CUTS', 'that', 'I', 'signed', 'into', 'law', ',', 'your', 'paychecks', 'are', 'going', 'way', 'UP', ',', 'your', 'taxes', 'are', 'going', 'way', 'DOWN', ',', 'and', 'America', 'is', 'once', 'again', 'OPEN', 'FOR', 'BUSINESS', '!', '#FakeNews']
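TweetTokenizer also takes a few useful options: strip_handles=True drops @-mentions, reduce_len=True collapses runs of three or more repeated characters down to three, and preserve_case=False lowercases tokens. For example:

>>> tt = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> tt.tokenize("@remy: This is waaaaayyyy too much for you!!!!!!")
['This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']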

What criterion are you using to filter these words?

The purpose of the task is just to see which words are mentioned most often in tweets about a brand.
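Given that goal, here is a minimal end-to-end sketch (assuming message holds the decoded tweet text; the isalpha() filter is an illustrative choice for dropping punctuation, emoji, and URL fragments, not the only option):

from collections import Counter
from nltk.tokenize import TweetTokenizer

tt = TweetTokenizer(preserve_case=False, strip_handles=True)
tokens = tt.tokenize(message.replace('\n', ' '))

# Keep only purely alphabetic tokens and count them
word_counts = Counter(t for t in tokens if t.isalpha())
print(word_counts.most_common(20))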