Python: how to correctly compute cosine similarity on a large dataset with low memory?


I'm following this tutorial just to learn a bit about content-based recommenders:

But while running the "content-based" part of the tutorial, I hit a MemoryError. After some reading, I found that it is related to how large the dataset being used is. I couldn't find a definitive way to get this specific case to run within my available memory, so I modified it a bit: I split the original dataframe into 6 parts, ran the cosine-similarity calculation on each split dataframe, merged the results together, and then ran the process one final time to get the end result. Here is my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, indices, cosine_sim, final=False):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    if not final:
        return metadata.iloc[movie_indices, :]
    else:
        return metadata['title'].iloc[movie_indices]

# Load Movies Metadata
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)

#Define a TF-IDF Vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

split_db = np.array_split(metadata, 6)

source_db = None
search_db = None
db_remove_idx = None
new_db_list = list()
# Find the chunk that contains the movie being searched for
for x, db in enumerate(split_db):
    search = db.loc[db['title'] == 'The Dark Knight Rises']
    if not search.empty:
        source_db = db
        new_db_list.append(source_db)
        search_db = search
        db_remove_idx = x
        break

# Drop that chunk from the list; it is already in new_db_list
split_db.pop(db_remove_idx)

# Add the movie's row to every remaining chunk so it is present in each one
for db in split_db:
    new_db_list.append(pd.concat([db, search_db], ignore_index=True))

del(split_db)

refined_db = None

for db in new_db_list:
    small_db = db.reset_index()
    #Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(small_db['overview'])
    
    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    #cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    
    #Construct a reverse map of indices and movie titles
    indices = pd.Series(small_db.index, index=small_db['title']).drop_duplicates()
    
    result = get_recommendations('The Dark Knight Rises', indices, cosine_sim)
    if refined_db is None:
        refined_db = pd.concat([result, search_db], ignore_index=True)
    else:
        refined_db = pd.concat([refined_db, result], ignore_index=True)

final_db = refined_db.reset_index()
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(final_db['overview'])

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

#Construct a reverse map of indices and movie titles
indices = pd.Series(final_db.index, index=final_db['title']).drop_duplicates()

final_result = get_recommendations('The Dark Knight Rises', indices, cosine_sim, final=True)
print(final_result)
I thought this would work, but the results don't even match the ones given in the tutorial:

11       Dracula: Dead and Loving It
13                             Nixon
12                             Balto
15                            Casino
20                        Get Shorty
18    Ace Ventura: When Nature Calls
14                  Cutthroat Island
16             Sense and Sensibility
19                       Money Train
17                        Four Rooms
Name: title, dtype: object
Can anyone explain what I'm doing wrong? My thinking was that since the dataset is too large, splitting it up, running this cosine-similarity process on each piece as a first refinement pass, and then running the process again on the combined results would produce a similar outcome. Why are the results I get so different from the expected ones?
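
For reference, here is a minimal sketch of the lower-memory alternative I'm considering instead of the splitting approach: fit the TF-IDF vectorizer once on the whole corpus and then compare only the one query movie against all rows, producing a 1 x n result instead of the full n x n matrix. If I understand correctly, the sparse TF-IDF matrix itself fits in memory and it's only the dense n x n similarity matrix that doesn't; I haven't verified that this gives the tutorial's exact output, and the names idx, sim_row and top_indices are just my own:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
metadata['overview'] = metadata['overview'].fillna('')

# Fit once on the full corpus so every row shares the same vocabulary and IDF weights
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])  # stays sparse

# Reverse map from title to row index, as in the tutorial
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
idx = indices['The Dark Knight Rises']

# Similarity of ONE movie against all movies: a (1, n) row, not the full (n, n) matrix
sim_row = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()
top_indices = sim_row.argsort()[::-1][1:11]  # position 0 is the movie itself, skip it
print(metadata['title'].iloc[top_indices])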

This is the data I'm working with: