Python MovieLens数据中的类似用户

Python MovieLens数据中的类似用户,python,python-2.7,numpy,machine-learning,cosine-similarity,Python,Python 2.7,Numpy,Machine Learning,Cosine Similarity,我试图在python中使用numpy在电影镜头数据中找到类似的用户,以便所有计算都很快。然而,我不能得到最终的代码来使用矩阵、多重应用等来寻找相似性 代码如下: import pandas as pd import numpy as np # pass in column names for each CSV and read them using pandas. # Column names available in the readme file def train_test_spli

我试图在python中使用numpy在电影镜头数据中找到类似的用户,以便所有计算都很快。然而,我不能得到最终的代码来使用矩阵、多重应用等来寻找相似性

代码如下:

import pandas as pd
import numpy as np

# pass in column names for each CSV and read them using pandas. 
# Column names available in the readme file

def train_test_split(ratings):
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    for user in xrange(ratings.shape[0]):
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0],size=10,replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]
    # Test and training are truly disjoint
    assert(np.all((train * test) == 0)) 
    return train, test
def similarity(ratings, kind='user', epsilon=1e-9):
    # epsilon -> small number for handling dived-by-zero errors
    if kind == 'user':
        sim = ratings.dot(ratings.T) + epsilon
    elif kind == 'item':
        sim = ratings.T.dot(ratings) + epsilon
    #Till here I have found the matrix which contains users multplications and squares with themselves..
    #What shall be written to return from here ??


if __name__ == '__main__':

#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
encoding='latin-1')

#Reading ratings file:
r_cols = ['user_id', 'item_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
encoding='latin-1')

#Reading items file:
i_cols = ['item_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols,
encoding='latin-1')
r_cols = ['user_id', 'item_id', 'rating', 'unix_timestamp']
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')

n_users = ratings.user_id.unique().shape[0]
n_items = ratings.item_id.unique().shape[0]

rating_array = np.zeros((n_users, n_items))
for row in ratings.itertuples():
    rating_array[row[1]-1, row[2]-1] = row[3]
print rating_array
sparsity = float(len(rating_array.nonzero()[0]))
sparsity = sparsity / (rating_array.shape[0] * rating_array.shape[1])
sparsity = sparsity * 100
print 'Sparsity: {:4.2f}%'.format(sparsity)
train,test=train_test_split(rating_array)
train=train[:6][:6]
user_similarity = fast_similarity(train, kind='user')
#print user_similarity[:6][:6]

因此,在编写矩阵计算及其解释方面的任何帮助都将有助于理解如何求解此类方程。

代码取自,但我无法解开快速相似性代码。。这里的任何帮助都是值得赞赏的,因为代码的自我实现存在限制。有各种各样的库,可以很容易地进行计算。“重塑车轮”