Testing predictions on new data in Python


I trained a model on the Yelp Dataset Challenge data and got a pickle file, 399850by50reviews_words_index.pkl, but I am stuck on how to use this pickle file to test new data in Keras.

Here is the code I used to train the model and produce that pickle file.

How can I use this model to test new data?

I am using Keras 1.0.0 with Theano.

'''
train cnn model for sentiment classification on yelp data set
author: hao peng
'''
import pickle
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from Word2VecUtility import Word2VecUtility
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution1D, MaxPooling1D


def get_volcabulary_and_list_words(data):
    reviews_words = []
    volcabulary = []
    for review in data["text"]:
        review_words = Word2VecUtility.review_to_wordlist(
            review, remove_stopwords=True)
        reviews_words.append(review_words)
        for word in review_words:
            volcabulary.append(word)
    volcabulary = set(volcabulary)
    return volcabulary, reviews_words

def get_reviews_word_index(reviews_words, volcabulary, max_words, max_length):
    word2index = {word: i for i, word in enumerate(volcabulary)}
    # use w in volcabulary to limit index within max_words
    reviews_words_index = [[start] + [(word2index[w] + index_from) for w in review] for review in reviews_words]
    # in word2vec embedding, use (i < max_words + index_from) because we need the exact index for each word, in order to map it to its vector. And then its max_words is 5003 instead of 5000.
    reviews_words_index = [[i if (i < max_words) else oov for i in index] for index in reviews_words_index]
    # padding with 0, each review has max_length now.
    reviews_words_index = sequence.pad_sequences(reviews_words_index, maxlen=max_length, padding='post', truncating='post')
    return reviews_words_index

def vectorize_labels(labels, nums):
    labels = np.asarray(labels, dtype='int32')
    length = len(labels)
    Y = np.zeros((length, nums))
    for i in range(length):
        Y[i, (labels[i]-1)] = 1.
    return Y
# data processing para
max_words = 5000
max_length = 50

# model training parameters
batch_size = 32
embedding_dims = 100
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 2

# index trick parameters
index_from = 3
start = 1
# padding = 0
oov = 2

data = pd.read_csv(
    'review_sub_399850.tsv', header=0, delimiter="\t", quoting=3, encoding='utf-8')
print('get volcabulary...')
volcabulary, reviews_words = get_volcabulary_and_list_words(data)
print('get reviews_words_index...')
reviews_words_index = get_reviews_word_index(reviews_words, volcabulary, max_words, max_length)

print reviews_words_index[:20, :12]
print reviews_words_index.shape

labels = data["stars"]

pickle.dump((reviews_words_index, labels), open("399850by50reviews_words_index.pkl", 'wb'))

(reviews_words_index, labels) = pickle.load(open("399850by50reviews_words_index.pkl", 'rb'))

index = np.arange(reviews_words_index.shape[0])
train_index, valid_index = train_test_split(
    index, train_size=0.8, random_state=100)

labels = vectorize_labels(labels, 5)
train_data = reviews_words_index[train_index]
valid_data = reviews_words_index[valid_index]
train_labels = labels[train_index]
valid_labels = labels[valid_index]
print train_data.shape
print valid_data.shape
print train_labels[:10]

del(labels, train_index, valid_index)

print "start training model..."

model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_words + index_from, embedding_dims, \
                    input_length=max_length))
model.add(Dropout(0.25))

# we add a Convolution1D, which will learn nb_filter
# word group filters of size filter_length:

# filter_length is like filter size, subsample_length is like step in 2D CNN.
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# we use standard max pooling (halving the output of the previous layer):
model.add(MaxPooling1D(pool_length=2))

# We flatten the output of the conv layer,
# so that we can add a vanilla dense layer:
model.add(Flatten())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.25))
model.add(Activation('relu'))

# We project onto 5 unit output layer, and activate it with softmax:
model.add(Dense(5))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              class_mode='categorical')
model.fit(train_data, train_labels, batch_size=batch_size,
          nb_epoch=nb_epoch, show_accuracy=True,
          validation_data=(valid_data, valid_labels))
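
Note that the script above pickles only the indexed reviews and labels; the trained network itself is never written to disk, so it cannot be reloaded in a new session. A minimal sketch of persisting and restoring the model with Keras 1.0's to_json/save_weights (the file names here are illustrative):

# persist the architecture and the learned weights separately
# (save_weights needs h5py installed)
json_string = model.to_json()
with open('yelp_cnn_architecture.json', 'w') as f:
    f.write(json_string)
model.save_weights('yelp_cnn_weights.h5', overwrite=True)

# later, in a fresh process:
from keras.models import model_from_json
model = model_from_json(open('yelp_cnn_architecture.json').read())
model.load_weights('yelp_cnn_weights.h5')
# predict() should work without recompiling; compile again with the same
# settings as above if you also need evaluate() or further training.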
The shape of your test input data must be exactly the same as that of train_data and valid_data, except for the first dimension, which is the number of samples.

So you have to build a numpy array from the input data you want to test, and make sure that it is structured exactly like train_data, i.e. yourTestArray.shape[1:] must be identical to train_data.shape[1:], which is also equal to valid_data.shape[1:].
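
For example, with the settings above (max_length = 50) and a hypothetical test array named yourTestArray, a quick sanity check would be:

assert yourTestArray.shape[1:] == train_data.shape[1:]  # here both are (50,)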

Once you have that array, you should use

results = model.predict(yourTestArray)
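
Below is a sketch of the full pipeline for raw review texts, not taken from the original post: it assumes volcabulary, the index-trick constants (start, oov, index_from, max_words, max_length) and the trained model from the script above are still available. Note that word2index must be the exact mapping used during training; since it is built from a set, you should pickle it together with the model rather than rebuilding it in a new process.

import numpy as np
from keras.preprocessing import sequence
from Word2VecUtility import Word2VecUtility

# same mapping as inside get_reviews_word_index (only stable within one session)
word2index = {word: i for i, word in enumerate(volcabulary)}

def raw_reviews_to_index_array(raw_reviews):
    # apply exactly the same cleaning as in training
    reviews_words = [Word2VecUtility.review_to_wordlist(r, remove_stopwords=True)
                     for r in raw_reviews]
    # same index trick as in training, plus an oov guard for words
    # that never occurred in the training vocabulary
    index_lists = [[start] + [word2index[w] + index_from if w in word2index else oov
                              for w in words]
                   for words in reviews_words]
    index_lists = [[i if i < max_words else oov for i in lst] for lst in index_lists]
    # pad/truncate so shape[1:] matches train_data.shape[1:], i.e. (max_length,)
    return sequence.pad_sequences(index_lists, maxlen=max_length,
                                  padding='post', truncating='post')

new_reviews = ["The food was amazing and the staff was very friendly."]
test_array = raw_reviews_to_index_array(new_reviews)
probabilities = model.predict(test_array)               # shape (n_reviews, 5)
predicted_stars = np.argmax(probabilities, axis=1) + 1  # undo the -1 shift in vectorize_labels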