Python classification accuracy is too low (Word2Vec)


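I am working on a multi-label emotion classification problem that I want to solve with word2vec. This is code I put together from several tutorials. The accuracy is currently very low, around 0.02, which tells me something in the code is wrong, but I cannot find it. I tried the same code with TF-IDF and BOW (except for the word2vec part, obviously) and got better accuracy scores, such as 0.28, so this version seems to be wrong somewhere: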

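# Imports were omitted in the original post. The ekphrasis and skmultilearn
# module paths below are assumptions inferred from the class names used.
import logging
import sys

import gensim
import numpy as np
import pandas
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
from sklearn.linear_model import LogisticRegression
from skmultilearn.problem_transform import LabelPowerset

np.set_printoptions(threshold=sys.maxsize)
wv = gensim.models.KeyedVectors.load_word2vec_format("E:\\GoogleNews-vectors-negative300.bin", binary=True)
wv.init_sims(replace=True)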

#Pre-Processor Function
pre_processor = TextPreProcessor(
    omit=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'url', 'date', 'number'],
    
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'url', 'date', 'number'],
     
    segmenter="twitter", 
    
    corrector="twitter", 
    
    unpack_hashtags=True,
    unpack_contractions=True,
    
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    
    dicts=[emoticons]
)

#Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def  word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list ])

#Secondary Word-Averaging Method
def get_mean_vector(word2vec_model, words):
    # remove out-of-vocabulary words
    words = [word for word in words if word in word2vec_model.vocab]
    if len(words) >= 1:
        return np.mean(word2vec_model[words], axis=0)
    else:
        return []

#Loading data
raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = np.array(pandas.read_excel('E:\\train.xlsx').iloc[:,2:13]) #Loading corresponding train labels (11 emotions)

raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:,1] #Loading 300 test tweets
test_gold_labels = np.array(pandas.read_excel('E:\\test.xlsx').iloc[:,2:13]) #Loading corresponding test labels (11 emotions)
print("please wait")

#Pre-Processing
train_tweets=[]
test_tweets=[]
for tweets in raw_train_tweets:
    train_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_test_tweets:
    test_tweets.append(pre_processor.pre_process_doc(tweets))

#Vectorizing 
train_array = word_averaging_list(wv,train_tweets)
test_array = word_averaging_list(wv,test_tweets)

#Predicting and Evaluating    
clf = LabelPowerset(LogisticRegression(solver='lbfgs', C=1, class_weight=None))
clf.fit(train_array,train_labels)
predicted= clf.predict(test_array)
intersect=0
union=0
accuracy=[]
for i in range(0,3250): #i have 3250 test tweets.
    for j in range(0,11): #11 emotions
        if predicted[i,j]&test_gold_labels[i,j]==1:
            intersect+=1
        if predicted[i,j]|test_gold_labels[i,j]==1:
            union+=1
    
    accuracy.append(intersect/union) if union !=0 else accuracy.append(0.0)
    intersect=0
    union=0
print(np.mean(accuracy))

The result is:

0.4674498168498169

I printed the predicted variable (for tweets 0 to 10) to see what it looks like:

  (0, 0)    1
  (0, 2)    1
  (2, 0)    1
  (2, 2)    1
  (3, 4)    1
  (3, 6)    1
  (4, 0)    1
  (4, 2)    1
  (5, 0)    1
  (5, 2)    1
  (6, 0)    1
  (6, 2)    1
  (7, 0)    1
  (7, 2)    1
  (8, 4)    1
  (8, 6)    1
  (9, 3)    1
  (9, 8)    1

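As you can see, it only shows the 1s. For example, (6, 2) means that in tweet 6, emotion 2 is 1, and (9, 8) means that in tweet 9, emotion 8 is 1. The other emotions are considered 0. To better understand what I do in the accuracy method, you can picture it like this: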

gold emotion for tweet 0:      [1 1 0 0 0 0 1 0 0 0 1]
predicted emotion for tweet 0: [1 0 1 0 0 0 0 0 0 0 0]

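I apply union and intersection to the indices one by one: 1 with 1, 1 with 0, 0 with 1, and so on, until gold emotion 11 is compared with predicted emotion 11. I do this for all tweets in the two for loops.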
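The per-tweet intersection-over-union described above can also be computed without explicit loops. The following is a minimal sketch, assuming the gold and predicted labels are dense 0/1 arrays of the same shape; the function name jaccard_per_sample is made up for illustration. If predicted is a SciPy sparse matrix, as the printed output suggests, it would probably need converting first, for example with .toarray().

```python
import numpy as np

def jaccard_per_sample(gold, pred):
    """Mean over samples of |gold AND pred| / |gold OR pred| (0.0 for an empty union)."""
    gold = np.asarray(gold, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    intersect = np.logical_and(gold, pred).sum(axis=1)
    union = np.logical_or(gold, pred).sum(axis=1)
    # Rows with an empty union keep the default score of 0.0.
    scores = np.divide(intersect, union, out=np.zeros(len(gold)), where=union != 0)
    return scores.mean()

# The example vectors for tweet 0 from the question.
gold = [[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]]
pred = [[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
score = jaccard_per_sample(gold, pred)  # 1 shared label out of 5 in the union
```

For tweet 0 this gives one shared label out of five labels in the union, which matches what the two nested loops compute for that row.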

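Creating Word2Vec vectors on my own tweets: now I want to use gensim to create Word2Vec vectors on my own Twitter dataset. I changed some parts of the code above as follows: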

#Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def  word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list ])

#Loading data
raw_aggregate_tweets = pandas.read_excel('E:\\aggregate.xlsx').iloc[:,0] #Loading all train tweets

raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = np.array(pandas.read_excel('E:\\train.xlsx').iloc[:,2:13]) #Loading corresponding train labels (11 emotions)

raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:,1] #Loading 300 test tweets
test_gold_labels = np.array(pandas.read_excel('E:\\test.xlsx').iloc[:,2:13]) #Loading corresponding test labels (11 emotions)
print("please wait")

#Pre-Processing
aggregate_tweets=[]
train_tweets=[]
test_tweets=[]
for tweets in raw_aggregate_tweets:
    aggregate_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_train_tweets:
    train_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_test_tweets:
    test_tweets.append(pre_processor.pre_process_doc(tweets))
    
print(len(aggregate_tweets))
#Vectorizing 
w2v_model = gensim.models.Word2Vec(aggregate_tweets, min_count = 10, size = 300, window = 8)

print(w2v_model.wv.vectors.shape)

train_array = word_averaging_list(w2v_model.wv,train_tweets)
test_array = word_averaging_list(w2v_model.wv,test_tweets)

But I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-8a5fe4dbf144> in <module>
    110 print(w2v_model.wv.vectors.shape)
    111 
--> 112 train_array = word_averaging_list(w2v_model.wv,train_tweets)
    113 test_array = word_averaging_list(w2v_model.wv,test_tweets)
    114 

<ipython-input-1-8a5fe4dbf144> in word_averaging_list(wv, text_list)
     70 
     71 def  word_averaging_list(wv, text_list):
---> 72     return np.vstack([word_averaging(wv, post) for post in text_list ])
     73 
     74 #Averaging Words Vectors to Create Sentence Embedding

<ipython-input-1-8a5fe4dbf144> in <listcomp>(.0)
     70 
     71 def  word_averaging_list(wv, text_list):
---> 72     return np.vstack([word_averaging(wv, post) for post in text_list ])
     73 
     74 #Averaging Words Vectors to Create Sentence Embedding

<ipython-input-1-8a5fe4dbf144> in word_averaging(wv, words)
     58             mean.append(word)
     59         elif word in wv.vocab:
---> 60             mean.append(wv.syn0norm[wv.vocab[word].index])
     61             all_words.add(wv.vocab[word].index)
     62 

TypeError: 'NoneType' object is not subscriptable
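
A likely cause, as far as I can tell: in gensim 3.x the normalized vectors behind wv.syn0norm stay None until init_sims() has been called. The first script called wv.init_sims(replace=True) on the loaded vectors, but the freshly trained w2v_model.wv never goes through that step. The sketch below sidesteps the attribute and relies only on "word in wv" and "wv[word]", which KeyedVectors supports. The names average_vectors, average_vectors_list and toy_vectors are hypothetical, and a plain dict stands in for the trained vectors.

```python
import numpy as np

def average_vectors(wv, words, dim):
    """Average the raw vectors of in-vocabulary words; return zeros if none are known."""
    known = [wv[word] for word in words if word in wv]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)

def average_vectors_list(wv, token_lists, dim):
    """Stack one averaged vector per tokenized text into a 2-D feature matrix."""
    return np.vstack([average_vectors(wv, tokens, dim) for tokens in token_lists])

# Hypothetical 3-dimensional stand-in for trained word vectors.
toy_vectors = {
    "happy": np.array([1.0, 0.0, 0.0]),
    "sad": np.array([0.0, 1.0, 0.0]),
}
features = average_vectors_list(toy_vectors, [["happy", "sad"], ["unknown"]], dim=3)
```

With a real model, w2v_model.wv would take the place of toy_vectors. This version also leaves out the unit-length normalization, which may or may not help the classifier.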


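It's not clear what your TextPreProcessor or SocialTokenizer classes might do. You should edit your question to either show their code, or show a few examples of the resulting texts to make sure they are what you expect. (For example: show the first few and last few entries of all_tweets.)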

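Your line all_tweets = train_tweets.append(test_tweets) is unlikely to do what you expect. (It puts the entire list test_tweets in as the final element of train_tweets, then returns None, which is what gets assigned to all_tweets. Your Word2Vec model might then be empty. You should enable INFO logging to watch its progress and review the output for anomalies, and add code after training that prints some details about the model to confirm that useful training occurred.)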
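
The behaviour of list.append() described above is easy to confirm on toy data. This is a small sketch with made-up token lists; it only illustrates the difference between append and concatenation.

```python
train_tweets = [["good", "day"], ["so", "sad"]]
test_tweets = [["feeling", "great"]]

# list.append mutates the list in place and returns None.
nested = [list(tokens) for tokens in train_tweets]  # copy, so train_tweets stays intact
result = nested.append(test_tweets)

# Concatenation builds a new flat list of token lists instead.
all_tweets = train_tweets + test_tweets
```

Here result is None and nested ends with the whole test_tweets list as a single element, while all_tweets holds three separate token lists.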

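Are you sure train_tweets is in the right format for your pipeline's .fit()? (The texts sent to Word2Vec training seem to have been tokenized via .split(), but the texts in the pandas.Series train_tweets may never have been tokenized.)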
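
One quick way to check the format question raised above is to test whether every text is already a list of string tokens. The helper looks_tokenized below is a hypothetical name, and this is only a sanity check under the assumption that the downstream steps expect lists of tokens.

```python
def looks_tokenized(texts):
    """True if every item is a list/tuple of strings rather than a raw string."""
    return all(
        isinstance(text, (list, tuple)) and all(isinstance(tok, str) for tok in text)
        for text in texts
    )

raw = ["so happy today", "feeling sad"]
tokenized = [text.split() for text in raw]

# Iterating over a raw string yields single characters, not words.
chars = [ch for ch in raw[0]][:3]
```

If raw strings reach a step that loops over "words", each character ends up being treated as a word, so very few lookups would match the vocabulary.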

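In general, it's a good idea to enable logging and to add more code after each step that confirms the step had the intended effect, by checking property values or printing excerpts of longer collections.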
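
A minimal sketch of that advice: gensim reports progress through Python's standard logging module, so enabling INFO-level logging should make its training output visible. The excerpt() helper is a hypothetical name for printing the first and last few entries of a collection.

```python
import logging

# Show INFO-level messages, including the progress messages gensim emits.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

def excerpt(items, n=2):
    """Return the first and last n items of a collection for a quick visual check."""
    items = list(items)
    if len(items) <= 2 * n:
        return items
    return items[:n] + ["..."] + items[-n:]

tweets = [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]
logging.info("corpus size=%d excerpt=%s", len(tweets), excerpt(tweets))
```

Calling something like this after pre-processing, after training, and after vectorizing makes it easier to see at which step the data stops looking as expected.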

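Thank you for helping me. I have changed the source code (the embedding part); please take a look. I have also printed all my results. Any ideas?

Thanks for the follow-up. I'm not sure what the main problem is, but here are some things to consider:
(1) Your word_averaging() may be doing more than you need. In particular, using the _norm vectors and then forcing the result to a unit vector may lose valuable magnitude information.
(2) It's unclear why you change train_tweets and test_tweets back to pandas.Series; that isn't the format the following steps need.
(3) Are you sure all of the classifiers work with multiple labels? Can you review and show the multi-label "predictions" being made?
(4) Does accuracy_score() count a prediction as correct only when all of its labels are right, or does it treat each label as a separate prediction to be tested?
(5) Is there a pattern to the errors: are some labels predicted well and others poorly? Does success correlate with items that have more or better training data?

Thank you for taking the time to help me; I am in a really difficult situation. Let me answer your questions. (1) Honestly, I found this averaging function in a tutorial. If you have a suggestion about which part to leave out, please tell me. (2) Forget about this part. I tried the code without it and the result was the same. I have now removed the .Series part.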
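
Points (4) and (5) can be checked directly. As far as I know, scikit-learn's accuracy_score computes subset accuracy for multilabel input, meaning a sample counts as correct only if every label matches. The sketch below uses made-up labels and hypothetical helper names to contrast that with a per-label breakdown, which also shows whether some emotions are predicted much worse than others.

```python
def subset_accuracy(gold, pred):
    """Fraction of samples whose whole label vector matches exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def per_label_accuracy(gold, pred):
    """Accuracy computed separately for each label column."""
    n_labels = len(gold[0])
    return [
        sum(g[j] == p[j] for g, p in zip(gold, pred)) / len(gold)
        for j in range(n_labels)
    ]

# Made-up data: 4 samples, 3 labels.
gold = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]

exact = subset_accuracy(gold, pred)
by_label = per_label_accuracy(gold, pred)
```

In this toy case only one of four samples matches exactly, even though the first label is always predicted correctly, so a low exact-match score does not necessarily mean every label is predicted badly.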

是否仅当所有标签都正确时才将预测视为正确的，或者是否将每个标签视为一个要测试的预测？（5）错误是否有一种模式：某些标签预测得好，而其他标签预测得不好？成功是否与有更多/更好培训数据的项目相关？感谢您花时间帮助我，我的处境真的很困难。让我回答你的问题。（1）老实说，我从一个教程中找到了这个平均函数。如果你有关于省略某一部分的建议，请告诉我。（2）忘了这部分吧。我尝试了没有它的代码，结果是一样的。现在我删除了.Series部分。