Apache Spark: how to output a CoordinateMatrix in tabular format?

apache-spark, apache-spark-mllib

I need to produce an output table for a subset of the MovieLens ratings data. I have converted the DataFrame to a CoordinateMatrix:

from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix

mat = CoordinateMatrix(ratings.map( 
        lambda r: MatrixEntry(r.user, r.product, r.rating)))

However, I can't see how to print the output in tabular format. I can print the entries:

mat.entries.collect()

which outputs:

[MatrixEntry(1, 1, 5.0),
 MatrixEntry(5, 6, 2.0),
 MatrixEntry(6, 1, 4.0),
 MatrixEntry(7, 6, 4.0),
 MatrixEntry(8, 1, 4.0),
 MatrixEntry(8, 4, 3.0),
 MatrixEntry(9, 1, 5.0)]

However, I would like the output to look like this:

      1   2   3   4   5   6   7   8   9 
    ------------------------------------- ...
 1  | 5
 2  | 
 3  |
 4  | 
 5  |                     2
    ...

Update

The pandas equivalent is pivot_table, e.g.

import pandas as pd
import numpy as np
import os
import requests
import zipfile

np.set_printoptions(precision=4)

filename = 'ml-1m.zip'
if not os.path.exists(filename):
    r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip', stream=True)
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in r:
                f.write(chunk)           
    else:
        # Python 3 cannot raise a bare string; raise an exception instance
        raise RuntimeError('Could not save dataset')

# the context manager closes the archive even if extraction fails
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall('.')

ratingsNames = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_table("./ml-1m/ratings.dat", header=None, sep="::", names=ratingsNames, engine='python')

ratingsMatrix = ratings.pivot_table(columns=['movieId'], index =['userId'], values='rating', dropna = False)

ratingsMatrix = ratingsMatrix.fillna(0)

# we don't have space to print the full matrix, just show the first few cells
# (.ix was removed in pandas 1.0; .loc slices by label, end inclusive)
print(ratingsMatrix.loc[:9, :9])

which outputs:

movieId    1    2    3    4    5    6    7    8    9
userId                                              
1        5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
3        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
4        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
5        0.0  0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0
6        4.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
7        0.0  0.0  0.0  0.0  0.0  4.0  0.0  0.0  0.0
8        4.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  0.0
9        5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
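
One possible way to get the blank-cell layout asked for in the question is to collect a small subset of entries to the driver and format them in plain Python. The following is only a sketch under stated assumptions: the helper name format_sparse_table and its parameters are hypothetical (not part of Spark), the entries are hard-coded from the output shown in the question, and collecting is only reasonable when the subset is small enough to fit in driver memory.

```python
def format_sparse_table(entries, n_rows, n_cols, cell_width=4):
    """Render (row, col, value) triples as text, leaving absent cells blank.

    Hypothetical helper for illustration; row and column ids are assumed
    to be 1-based, as in the MovieLens data.
    """
    # Index the triples by (row, col) so each cell is a dictionary lookup.
    cells = {(i, j): v for i, j, v in entries}
    cols = range(1, n_cols + 1)
    label_pad = " " * 5  # same width as the "  1 |" row label
    header = label_pad + "".join(str(j).rjust(cell_width) for j in cols)
    rule = label_pad + "-" * (cell_width * n_cols)
    lines = [header, rule]
    for i in range(1, n_rows + 1):
        row = "".join(
            (format(cells[(i, j)], "g") if (i, j) in cells else "").rjust(cell_width)
            for j in cols
        )
        # Strip trailing blanks so empty rows end right after the bar.
        lines.append((str(i).rjust(3) + " |" + row).rstrip())
    return "\n".join(lines)


# With Spark, the triples could be taken from the matrix in the question, e.g.
#   entries = [(e.i, e.j, e.value) for e in mat.entries.collect()]
# Here they are hard-coded from the collected output shown above.
entries = [(1, 1, 5.0), (5, 6, 2.0), (6, 1, 4.0), (7, 6, 4.0),
           (8, 1, 4.0), (8, 4, 3.0), (9, 1, 5.0)]
table = format_sparse_table(entries, n_rows=9, n_cols=9)
print(table)
```

Rows and columns that have no entries (such as users 2 to 4) still appear, because the loops run over the full index ranges rather than over the entries themselves.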

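What is in ratings? I assume it doesn't contain empty user rows?

Are you just trying to print a certain number of rows?

Yes, just a subset of the rows and columns.

But rows 2:4 aren't in the original subset, so why insert empty rows?

I need to show users what a sparse matrix looks like. It is for educational purposes.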