Labelling tweets against training data in Python
I am trying to label tweets as positive or negative using nltk in Python. I have 3 files: "train_posi_tweets.txt" containing 4000 positive tweets, "train_nega_tweets.txt" containing 8000 negative tweets, and "unlabeled_tweets.txt" containing 51647 tweets that I need to label… one of the tweets is in Spanish.

Following victorneo on GitHub, I now have the code below, but it doesn't work. Can anybody help me fix it? I get an error on this line:

for (words, sentiment) in pos_tweets + neg_tweets:

with a "too many values to unpack" exception.
The line that raises "too many values to unpack" — for (words, sentiment) in pos_tweets + neg_tweets: — fails because of the read_tweets function. You pass it a file name fname as well as t_type, but t_type is never used anywhere in the function, so read_tweets returns a plain list of strings. Unpacking each string into (words, sentiment) then fails. Try pairing each tweet with its label, i.e. append (text, t_type) tuples instead of bare text.
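A minimal sketch of that fix, keeping the one-JSON-object-per-line reading from the code below but dropping the Python-2-era ASCII re-encode (which would turn text into a bytes object on Python 3):

```python
import json
import re


def read_tweets(fname, t_type):
    """Read one JSON tweet per line from fname, pairing each text with t_type."""
    tweets = []
    with open(fname, 'r') as f:
        for line in f:
            tweet = json.loads(line)
            text = re.sub(r"\n", " ", tweet['text'].strip())
            # pair each tweet with its label so callers can unpack (words, sentiment)
            tweets.append((text, t_type))
    return tweets
```

With this change, each element of pos_tweets + neg_tweets is a 2-tuple, so the unpacking in the training loop works as intended.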
# -*- coding: utf-8 -*-
"""
Created on Fri May 16 16:34:46 2014

@author: shyam
"""
import nltk
import json
from nltk.classify.naivebayes import NaiveBayesClassifier
import re


def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words


def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features


def read_tweets(fname, t_type):
    tweets = []
    f = open(fname, 'r')
    for line in f.readlines():
        tweet = json.loads(line)
        text = tweet['text'].strip().encode('ascii', errors='ignore')
        text = re.sub(r"\n", " ", text)  # remove newlines from text
        tweets.append(text)
    f.close()
    return tweets


def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


def classify_tweet(tweet):
    return \
        classifier.classify(extract_features(nltk.word_tokenize(tweet)))


# read in positive and negative training tweets
pos_tweets = read_tweets('train_posi_tweets.txt', 'positive')
neg_tweets = read_tweets('train_nega_tweets.txt', 'negative')

# filter away words that are less than 3 letters to form the training data
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

# extract the word features out from the training data
word_features = get_word_features(\
    get_words_in_tweets(tweets))

# get the training set and train the Naive Bayes Classifier
training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = NaiveBayesClassifier.train(training_set)

# read in the test tweets and check accuracy
# to add your own test tweets, add them in the respective files
test_tweets = read_tweets('unlabeled_tweetss.txt', 'unlabled')
total = accuracy = float(len(test_tweets))
for tweet in test_tweets:
    if classify_tweet(tweet[0]) != tweet[1]:
        accuracy -= 1
print('Total accuracy: %f%% (%d/20).' % (accuracy / total * 100, accuracy))