Python: classification accuracy too low (Word2Vec)
I am working on a multi-label emotion classification problem that I want to solve with word2vec. This is code I put together from several tutorials. Right now the accuracy is very low, about 0.02, which tells me there is a mistake in the code, but I cannot find it. I tried the same code with TF-IDF and BOW (without the word2vec part, obviously) and got better accuracy scores, such as 0.28, so something in this version seems to be wrong:
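import logging
import sys

import gensim
import numpy as np
import pandas
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
from sklearn.linear_model import LogisticRegression
from skmultilearn.problem_transform import LabelPowerset

np.set_printoptions(threshold=sys.maxsize)
wv = gensim.models.KeyedVectors.load_word2vec_format("E:\\GoogleNews-vectors-negative300.bin", binary=True)
wv.init_sims(replace=True)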

# Pre-Processor Function
pre_processor = TextPreProcessor(
    omit=['url', 'email', 'percent', 'money', 'phone', 'user',
          'time', 'url', 'date', 'number'],
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'url', 'date', 'number'],
    segmenter="twitter",
    corrector="twitter",
    unpack_hashtags=True,
    unpack_contractions=True,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[emoticons]
)

# Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list])

# Secondary Word-Averaging Method
def get_mean_vector(word2vec_model, words):
    # remove out-of-vocabulary words
    words = [word for word in words if word in word2vec_model.vocab]
    if len(words) >= 1:
        return np.mean(word2vec_model[words], axis=0)
    else:
        return []

#Loading data
raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = np.array(pandas.read_excel('E:\\train.xlsx').iloc[:,2:13]) #Loading corresponding train labels (11 emotions)
raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:,1] #Loading 300 test tweets
test_gold_labels = np.array(pandas.read_excel('E:\\test.xlsx').iloc[:,2:13]) #Loading corresponding test labels (11 emotions)
print("please wait")

# Pre-Processing
train_tweets = []
test_tweets = []
for tweets in raw_train_tweets:
    train_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_test_tweets:
    test_tweets.append(pre_processor.pre_process_doc(tweets))

#Vectorizing
train_array = word_averaging_list(wv,train_tweets)
test_array = word_averaging_list(wv,test_tweets)

# Predicting and Evaluating
clf = LabelPowerset(LogisticRegression(solver='lbfgs', C=1, class_weight=None))
clf.fit(train_array, train_labels)
predicted = clf.predict(test_array)

intersect = 0
union = 0
accuracy = []
for i in range(0, 3250):  # i have 3250 test tweets.
    for j in range(0, 11):  # 11 emotions
        if (predicted[i, j] & test_gold_labels[i, j]) == 1:
            intersect += 1
        if (predicted[i, j] | test_gold_labels[i, j]) == 1:
            union += 1
    # per-tweet score: intersection over union, 0.0 when both label sets are empty
    accuracy.append(intersect / union if union != 0 else 0.0)
    intersect = 0
    union = 0
print(np.mean(accuracy))

The result is:
0.4674498168498169
I printed the predicted variable (for tweets 0 to 10) to see what it looks like:
(0, 0) 1
(0, 2) 1
(2, 0) 1
(2, 2) 1
(3, 4) 1
(3, 6) 1
(4, 0) 1
(4, 2) 1
(5, 0) 1
(5, 2) 1
(6, 0) 1
(6, 2) 1
(7, 0) 1
(7, 2) 1
(8, 4) 1
(8, 6) 1
(9, 3) 1
(9, 8) 1
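As you can see, it only shows the 1s. For example, (6, 2) means that in tweet 6, emotion 2 is 1, and (9, 8) means that in tweet 9, emotion 8 is 1. The other emotions are taken to be 0. To better understand what I do in the accuracy calculation, you can picture it like this: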
gold emotion for tweet 0: [1 1 0 0 0 0 1 0 0 0 1]
predicted emotion for tweet 0: [1 0 1 0 0 0 0 0 0 0 0]
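I apply union and intersection to the indices one by one: 1 with 1, then 1 with 0, then 0 with 1, and so on until gold emotion 11 against predicted emotion 11. I do this for all tweets in the two for loops.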
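
The per-tweet score described above is the Jaccard index between the gold and predicted label sets. A minimal NumPy sketch of the same computation without explicit loops, assuming both inputs are dense 0/1 arrays of the same shape (the function name samplewise_jaccard is made up for illustration, and a sparse prediction matrix would first need .toarray()):

```python
import numpy as np

def samplewise_jaccard(gold, predicted):
    """Mean over rows of |gold AND predicted| / |gold OR predicted|.

    Rows where neither side has any label score 0.0, as in the loop version.
    """
    gold = np.asarray(gold, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    intersect = np.logical_and(gold, predicted).sum(axis=1)
    union = np.logical_or(gold, predicted).sum(axis=1)
    # Divide only where the union is non-empty; other rows keep the 0.0 default
    scores = np.divide(intersect, union,
                       out=np.zeros(len(union), dtype=float),
                       where=union != 0)
    return scores.mean()

# The gold and predicted rows shown above for tweet 0
gold = np.array([[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]])
pred = np.array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
print(samplewise_jaccard(gold, pred))  # 1 shared label out of 5 in the union
```

For tweet 0 the two rows share one label and cover five labels in total, so that tweet contributes 1/5 to the mean.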
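Creating Word2Vec vectors from my own tweets:
Now I want to use gensim to create Word2Vec vectors from my own tweet dataset. I changed some parts of the code above as follows: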

# Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list])

#Loading data
raw_aggregate_tweets = pandas.read_excel('E:\\aggregate.xlsx').iloc[:,0] #Loading all train tweets
raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = np.array(pandas.read_excel('E:\\train.xlsx').iloc[:,2:13]) #Loading corresponding train labels (11 emotions)
raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:,1] #Loading 300 test tweets
test_gold_labels = np.array(pandas.read_excel('E:\\test.xlsx').iloc[:,2:13]) #Loading corresponding test labels (11 emotions)
print("please wait")

# Pre-Processing
aggregate_tweets = []
train_tweets = []
test_tweets = []
for tweets in raw_aggregate_tweets:
    aggregate_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_train_tweets:
    train_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_test_tweets:
    test_tweets.append(pre_processor.pre_process_doc(tweets))

print(len(aggregate_tweets))

#Vectorizing
w2v_model = gensim.models.Word2Vec(aggregate_tweets, min_count = 10, size = 300, window = 8)
print(w2v_model.wv.vectors.shape)
train_array = word_averaging_list(w2v_model.wv,train_tweets)
test_array = word_averaging_list(w2v_model.wv,test_tweets)

But I got this error:
TypeError Traceback (most recent call last)
<ipython-input-1-8a5fe4dbf144> in <module>
110 print(w2v_model.wv.vectors.shape)
111
--> 112 train_array = word_averaging_list(w2v_model.wv,train_tweets)
113 test_array = word_averaging_list(w2v_model.wv,test_tweets)
114
<ipython-input-1-8a5fe4dbf144> in word_averaging_list(wv, text_list)
70
71 def word_averaging_list(wv, text_list):
---> 72 return np.vstack([word_averaging(wv, post) for post in text_list ])
73
74 #Averaging Words Vectors to Create Sentence Embedding
<ipython-input-1-8a5fe4dbf144> in <listcomp>(.0)
70
71 def word_averaging_list(wv, text_list):
---> 72 return np.vstack([word_averaging(wv, post) for post in text_list ])
73
74 #Averaging Words Vectors to Create Sentence Embedding
<ipython-input-1-8a5fe4dbf144> in word_averaging(wv, words)
58 mean.append(word)
59 elif word in wv.vocab:
---> 60 mean.append(wv.syn0norm[wv.vocab[word].index])
61 all_words.add(wv.vocab[word].index)
62
TypeError: 'NoneType' object is not subscriptable
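
It is not clear what your TextPreProcessor or SocialTokenizer classes do. You should edit your question to either show their code or show a few examples of the resulting text, to make sure it is what you expect. (For example: show the first few and last few entries of all_tweets.)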
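
Your line all_tweets = train_tweets.append(test_tweets) is unlikely to do what you expect. (It puts the entire list test_tweets in as the last element of train_tweets, but then returns None, which is what gets assigned to all_tweets. Your Word2Vec model may then be empty. You should enable INFO-level logging to watch its progress and review the output for anomalies, and add code after training that prints some details about the model to confirm that useful training occurred.)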
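
The append pitfall described above is easy to reproduce with toy lists. A small sketch (the token lists and the name broken are made up for illustration):

```python
train_tweets = [["i", "am", "happy"], ["so", "sad"]]
test_tweets = [["what", "a", "day"]]

# list.append mutates the list in place and returns None,
# so the name on the left ends up bound to None.
# (A copy is appended to here so that train_tweets itself stays unchanged.)
broken = list(train_tweets).append(test_tweets)
print(broken)  # None

# Concatenation builds a new list that holds every tweet from both sets
all_tweets = train_tweets + test_tweets
print(len(all_tweets))  # 3
```

Concatenating with + leaves both input lists untouched, which also keeps the train/test split intact for the later classification step.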
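
Are you sure train_tweets is in the right format for the pipeline's .fit()? (The texts sent to Word2Vec training appear to have been tokenized via .split(), but the texts in the pandas.Series train_tweets may never have been tokenized.)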
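
One way to check the format question above is a small helper that confirms every document is a list of string tokens before it reaches Word2Vec or the averaging step. This is only a sketch: check_tokenized is a hypothetical helper, not part of gensim or pandas, and the sample texts are made up.

```python
def check_tokenized(docs, name="docs", show=2):
    """Count documents that are not a list of str tokens and print a short excerpt."""
    docs = list(docs)
    bad = [i for i, doc in enumerate(docs)
           if not isinstance(doc, list)
           or not all(isinstance(tok, str) for tok in doc)]
    print(f"{name}: {len(docs)} documents, {len(bad)} not tokenized")
    # Show the first and last few entries, as suggested above
    for doc in docs[:show] + docs[-show:]:
        print("   ", doc)
    return len(bad)

tokenized = [["i", "love", "this"], ["so", "angry"], ["fine"]]
raw = ["i love this", "so angry", "fine"]
check_tokenized(tokenized, "tokenized")
check_tokenized(raw, "raw")
```

The check matters because Word2Vec iterates over each document as a sequence of tokens, so a raw string would be treated as a sequence of single characters rather than words.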
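
In general, it is a good idea to enable logging and to add more code after each step that confirms, by checking property values or printing excerpts of longer collections, that the step had the intended effect.

Thank you for helping me. I have changed the source code (the embedding part). Please take a look. I have also printed all of my results. Any ideas?

Thanks for the added details. I am not sure of the main problem, but here are some things to consider: (1) Your word_averaging() may be doing more than you need. In particular, using the _norm vectors and then forcing the result to a unit vector may discard valuable magnitude information. (2) It is not clear why you change train_tweets and test_tweets back into pandas.Series; that is not the format the later steps need. (3) Are you sure all of your classifiers work with multiple labels? Can you look at, and show, the multi-label "predictions" being made? (4) Does accuracy_score() count a prediction as correct only when all labels are right, or does it treat each label as a separate prediction to test? (5) Is there a pattern to the errors: are some labels predicted well and others not? Does success correlate with items that have more or better training data?

Thank you for taking the time to help me; I am in a really difficult situation. Let me answer your questions. (1) To be honest, I found this averaging function in a tutorial. If you have a suggestion about leaving out some part of it, please let me know. (2) Forget this part. I tried the code without it and the result was the same. I have now removed the .Series part.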
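
Point (1) in the comments above can be sketched as a plain mean of the raw word vectors, with no unit-normalization step. The sketch assumes only that the vector store supports "word in kv" and "kv[word]", shown here with a toy dict standing in for trained vectors; average_vector, average_vectors and the toy values are made up for illustration. Because it never touches syn0norm, it also does not depend on init_sims() having been called, which is one plausible source of the 'NoneType' object is not subscriptable error in the question.

```python
import numpy as np

def average_vector(kv, words, dim):
    """Mean of the raw vectors of in-vocabulary words; zeros if none are known."""
    known = [kv[w] for w in words if w in kv]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)

def average_vectors(kv, docs, dim):
    """Stack one averaged vector per tokenized document into a 2-D array."""
    return np.vstack([average_vector(kv, doc, dim) for doc in docs])

# Toy stand-in for trained word vectors (values are illustrative only)
toy_kv = {
    "happy": np.array([1.0, 3.0], dtype=np.float32),
    "sad": np.array([3.0, 1.0], dtype=np.float32),
}
docs = [["happy", "sad", "unknown"], ["unknown"]]
features = average_vectors(toy_kv, docs, dim=2)
print(features.shape)  # one row per document
```

Returning a zero vector for documents with no known words keeps the output shape fixed, so np.vstack does not fail the way it would with an empty list.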