Computing pairwise cosine distances of a DataFrame in Python

I have a pandas DataFrame (say df) of shape (70000 x 10). The head of the DataFrame looks like this:

                          0_x       1_x       2_x  ...       7_x       8_x       9_x
userid                                             ...                              
1000010249674395648  0.000007  0.999936  0.000007  ...  0.000007  0.000007  0.000007
1000282310388932608  0.000060  0.816790  0.000060  ...  0.000060  0.000060  0.000060
1000290654755450880  0.000050  0.000050  0.000050  ...  0.000050  0.191159  0.000050
1000304603840241665  0.993157  0.006766  0.000010  ...  0.000010  0.000010  0.000010
1000600081165438977  0.000064  0.970428  0.000064  ...  0.000064  0.000064  0.000064 
I want to find the pairwise cosine distances between userids. For example:

cosine distance(1000010249674395648, 1000282310388932608) = 0.9758776214797362

I have used the approaches mentioned below, but both run out of memory while computing the cosine distances: the full 70000 x 70000 float64 result alone takes about 39 GB, far beyond my 16 GB of RAM:

  • scikit-learn's cosine_similarity:

    from sklearn.metrics.pairwise import cosine_similarity
    # materializes the full 70000 x 70000 similarity matrix in memory at once
    cosine_sim = cosine_similarity(df)
    
  • A faster vectorized solution found online:

    import numpy as np
    import pandas as pd

    def get_cosine_sim_df(df):
        topic_vectors = df.values
        # L2-normalize each row so the dot product equals cosine similarity
        norm_topic_vectors = topic_vectors / np.linalg.norm(topic_vectors, axis=-1)[:, np.newaxis]
        # still builds the full 70000 x 70000 matrix in one shot
        cosine_sim = np.dot(norm_topic_vectors, norm_topic_vectors.T)
        cosine_sim_df = pd.DataFrame(data=cosine_sim, index=df.index, columns=df.index)
        return cosine_sim_df
    
    cosine_sim = get_cosine_sim_df(df)
    
  • System hardware overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro11,4
      Processor Name: Quad-Core Intel Core i7
      Processor Speed: 2.2 GHz
      Number of Processors: 1
      Total Number of Cores: 4
      L2 Cache (per Core): 256 KB
      L3 Cache: 6 MB
      Hyper-Threading Technology: Enabled
      Memory: 16 GB
    
    I am looking for an efficient, fast way to compute the pairwise cosine distances within these CPU/memory limits, e.g. with a PySpark DataFrame or a pandas batching technique, rather than processing the whole DataFrame at once (a chunked sketch of what I mean follows below).

    Any suggestions/approaches are welcome.
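
    To make the batching idea concrete: scikit-learn's pairwise_distances_chunked streams the distance matrix in row bands that fit a given memory budget. The working_memory value and the save-to-disk step below are placeholders, not a full solution:

    import numpy as np
    from sklearn.metrics import pairwise_distances_chunked

    # yields the cosine-distance matrix in row bands sized to fit in
    # working_memory (MiB), instead of materializing all 70000 x 70000 values
    chunks = pairwise_distances_chunked(df.values, metric="cosine", working_memory=512)
    for i, band in enumerate(chunks):
        # each band is (n_rows_in_band x 70000); persist it rather than keep it
        np.save(f"cos_dist_band_{i}.npy", band)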


    FYI - I am using Python 3.7.

    I am using Spark 2.4 and Python 3.7.

    # build spark session
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                        .master("local") \
                        .appName("cos_sim") \
                        .config("spark.some.config.option", "some-value") \
                        .getOrCreate()
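
    On a 16 GB machine it can also help to size the driver heap explicitly; spark.driver.memory only takes effect if set before the JVM starts, so it belongs in the builder (the 8g value is just an assumption to tune):

    # alternative builder with an explicit driver-memory setting
    spark = SparkSession.builder \
                        .master("local") \
                        .appName("cos_sim") \
                        .config("spark.driver.memory", "8g") \
                        .getOrCreate()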
    
    Convert your pandas df to a Spark df:

    # Pandas to Spark (the session created above is named `spark`)
    df = spark.createDataFrame(pand_df)
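
    For a 70000-row frame, enabling Apache Arrow (available in Spark 2.4, requires pyarrow) can speed this conversion up considerably:

    # optional: use Arrow for faster pandas <-> Spark conversion
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")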
    
    I generated some random data:

    import random
    import pandas as pd
    from pyspark.sql.functions import monotonically_increasing_id

    def generate_random_data(num_usrs=20, num_cols=10):
        cols = [str(i) + "_x" for i in range(num_cols)]
        usrsdata = [[random.random() for _ in range(num_cols)] for _ in range(num_usrs)]
        # return pd.DataFrame(usrsdata, columns=cols)    # pandas alternative
        return spark.createDataFrame(data=usrsdata, schema=cols)

    df = generate_random_data()
    df = df.withColumn("uid", monotonically_increasing_id())
    df.limit(5).toPandas()   # just for nice display of df (df not actually changed)
    

    Convert the df columns to a feature vector

    Normalize

    Compute the pairwise cosine similarities


    from pyspark.ml.feature import VectorAssembler
    # assemble only the value columns; keep the id column out of the features
    feature_cols = [c for c in df.columns if c != "uid"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    assembled = assembler.transform(df).select(['uid', 'features'])
    assembled.limit(2).toPandas()
    
    from pyspark.ml.feature import Normalizer
    # Normalizer defaults to p=2 (unit L2 norm), which is what makes the
    # matrix product below equal the cosine similarity
    normalizer = Normalizer(inputCol="features", outputCol="norm")
    data = normalizer.transform(assembled)
    data.limit(2).toPandas()
    
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
    # rows are unit vectors, so mat * mat^T is the cosine-similarity matrix
    mat = IndexedRowMatrix(data.select("uid", "norm").rdd
            .map(lambda row: IndexedRow(row.uid, row.norm.toArray()))).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
    dot.toLocalMatrix().toArray()[:2]  # displaying first 2 users only
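
    Since the question asks for cosine distances, subtract the similarities from 1. A minimal sketch, assuming the result is small enough to collect to the driver (as in this 20-user demo):

    # cosine distance = 1 - cosine similarity
    sim = dot.toLocalMatrix().toArray()
    cos_dist = 1.0 - sim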