Python: how do I correctly compute cosine similarity on a large dataset with low memory?
Tags: python, pandas, numpy, scikit-learn

I am following this tutorial just to learn a little about content recommenders: But when running the "content-based" part of the tutorial, I ran into a MemoryError. After some reading, I found that this is related to how large the dataset is. I could not find a definitive way to solve this specific case, that is, how to run it with limited memory, so I modified it a bit: I split the original DataFrame into 6 parts, ran the cosine similarity computation on each part, merged the results, and then ran it one last time to get the final result. Here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, indices, cosine_sim, final=False):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    if not final:
        return metadata.iloc[movie_indices, :]
    else:
        return metadata['title'].iloc[movie_indices]
# Load Movies Metadata
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
split_db = np.array_split(metadata, 6)
source_db = None
search_db = None
db_remove_idx = None
new_db_list = list()
for x, db in enumerate(split_db):
    search = db.loc[db['title'] == 'The Dark Knight Rises']
    if not search.empty:
        source_db = db
        new_db_list.append(source_db)
        search_db = search
        db_remove_idx = x
        break
split_db.pop(db_remove_idx)
for x, db in enumerate(split_db):
    new_db_list.append(db.append(search_db, ignore_index=True))
del(split_db)
refined_db = None
for db in new_db_list:
    small_db = db.reset_index()
    # Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(small_db['overview'])
    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    #cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    # Construct a reverse map of indices and movie titles
    indices = pd.Series(small_db.index, index=small_db['title']).drop_duplicates()
    result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim))
    if type(refined_db) != pd.core.frame.DataFrame:
        refined_db = result.append(search_db, ignore_index=True)
    else:
        refined_db = refined_db.append(result, ignore_index=True)
final_db = refined_db.reset_index()
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(final_db['overview'])
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
#Construct a reverse map of indices and movie titles
indices = pd.Series(final_db.index, index=final_db['title']).drop_duplicates()
final_result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim, final=True))
print(final_result)
I thought this would work, but the results do not even come close to the ones given in the tutorial:
11 Dracula: Dead and Loving It
13 Nixon
12 Balto
15 Casino
20 Get Shorty
18 Ace Ventura: When Nature Calls
14 Cutthroat Island
16 Sense and Sensibility
19 Money Train
17 Four Rooms
Name: title, dtype: object
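Can anyone explain what I am doing wrong? Because the dataset is too large, I assumed that splitting it, running this cosine similarity step on each part as a refinement pass, and then running the step again on the resulting data would give similar results. Why are the results I get so different from the expected ones?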
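One thing that may contribute to the mismatch: `get_recommendations` looks up `movie_indices` in the global `metadata` frame, while those values are positions within `small_db` or `final_db`. The titles printed above carry labels 11 to 20, which would be consistent with rows being pulled from the start of `metadata` rather than from the refined frame. The following is a minimal sketch of that effect using hypothetical toy titles (`full`, `chunk`, and `positions_in_chunk` are illustrative names, not part of the code above):

```python
import pandas as pd

# Hypothetical toy data standing in for the full metadata frame.
full = pd.DataFrame({"title": ["A", "B", "C", "D", "E", "F"]})

# A chunk taken from the second half, re-indexed from 0 like small_db.
chunk = full.iloc[3:].reset_index(drop=True)

# Positions that a similarity ranking over the chunk might return.
positions_in_chunk = [2, 0]

# Looking those positions up in the full frame returns unrelated rows ...
wrong = full["title"].iloc[positions_in_chunk].tolist()
# ... while looking them up in the chunk itself returns the intended rows.
right = chunk["title"].iloc[positions_in_chunk].tolist()

print(wrong)  # ['C', 'A']
print(right)  # ['F', 'D']
```

If this is indeed the cause, passing the frame that the similarity matrix was built from into the function, instead of relying on the global `metadata`, would be one way to keep positions and rows aligned.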
This is the data I am running it against:
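For the original MemoryError, a possible alternative to splitting the data is to avoid building the full n x n similarity matrix at all. The sketch below assumes that only the recommendations for one title are needed, and that the sparse TF-IDF matrix for the whole corpus fits in memory (the dense n x n product is usually the expensive part). It fits the vectorizer once, so every row shares one vocabulary and one set of IDF weights, and then compares a single row against all rows. The helper `top_similar` and the toy `docs` list are hypothetical and only serve as illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_similar(texts, query_pos, k=10):
    """Return positions of the k texts most similar to texts[query_pos]."""
    # Fit once on the whole corpus so all rows share vocabulary and IDF.
    tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # One row against all rows: a 1 x n result instead of an n x n matrix.
    # TF-IDF rows are L2-normalized by default, so this dot product
    # equals the cosine similarity.
    scores = linear_kernel(tfidf_matrix[query_pos], tfidf_matrix).ravel()
    # Exclude the query itself, then take the k highest scores.
    scores[query_pos] = -np.inf
    return np.argsort(-scores)[:k]


# Toy corpus standing in for metadata['overview'].
docs = [
    "batman fights crime in gotham city",
    "a romantic comedy set in paris",
    "gotham city is protected by batman",
    "a documentary about deep sea fish",
]
result = top_similar(docs, query_pos=0, k=2)
print(result)
```

With the real data, the returned positions would be looked up in the same frame that supplied the texts, for example `metadata['title'].iloc[result]` when `texts` is `metadata['overview']`.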