How should I cache a large table in PySpark?

Tags: python, hadoop, apache-spark

I have defined the following function:
def hdfsToSchemaRDD(hdfsPath, tableName, tableHeader):
    lines = sc.textFile(hdfsPath)
    fields = [StructField(field_name, StringType(), True) for field_name in tableHeader.split()]
    schema = StructType(fields)
    columns = lines.map(lambda l: l.split(","))
    tempTable = columns.map(lambda c: tuple([ c[i] for i,v in enumerate(tableHeader.split()) ]))
    schemaTable = sqlContext.applySchema(tempTable, schema)
    schemaTable.registerTempTable(tableName)
    sqlContext.cacheTable(tableName)

def FeatureSummary(features):
    results = sqlContext.sql("""
        SELECT
            %s
            ,SUM(summable) as s1, SUM(othersummable) as s2
        FROM
            LargeTable
        GROUP BY
            %s
        ORDER BY
            %s
        """%(features,features,features))
    for row in results.map(lambda x: x).collect():
        print row
The intent is that, given a path, the dataset is read from HDFS once and kept in memory, so that I do not have to re-read it every time I need to run a query against it, which is typically quite expensive.
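One detail worth keeping in mind here: in Spark, marking a table as cached is lazy, so nothing is materialized until an action actually consumes the data. As a rough plain-Python analogy (this is an illustration of the lazy-then-memoized behavior, not the Spark API):

```python
# Plain-Python analogy for lazy caching: declaring a cache does no work by
# itself; the expensive read happens only on first use, and later uses
# are served from the cache.

read_count = {"n": 0}

def expensive_read():
    """Stands in for reading a large table from HDFS."""
    read_count["n"] += 1
    return list(range(5))

_cache = {}

def cached_table(name):
    # First access materializes the data into the cache; later accesses reuse it.
    if name not in _cache:
        _cache[name] = expensive_read()
    return _cache[name]

# "Declaring" the cache has read nothing yet ...
assert read_count["n"] == 0
# ... the first query triggers the read; the second is served from memory.
first = sum(cached_table("LargeTable"))
second = sum(cached_table("LargeTable"))
assert read_count["n"] == 1
```

The point of the analogy is that if no action runs between the cache declaration and the first query, the first query pays the full read cost.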
The problem I am running into is that if I load some large table LargeTable and then run:
FeatureSummary(attribute1)
FeatureSummary(attribute2)
FeatureSummary(attribute3)
LargeTable appears to be read in each time. In particular, given that the attributes have very few distinct levels, I would expect the queries in FeatureSummary to run much faster than the initial read of LargeTable. That is not the case.
Is there a way to check whether a table has been cached?
How can I modify the function hdfsToSchemaRDD so that the data from HDFS is actually cached?
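A sketch of how the function might be adjusted, under the assumption that `sqlContext.cacheTable` is lazy and needs an action (such as `count()`) to actually materialize the table. The helper names `header_fields` and `hdfsToCachedTable`, and the explicit `sc`/`sqlContext` parameters, are hypothetical additions, not part of the original code:

```python
def header_fields(tableHeader):
    """Split a whitespace-separated header string into column names."""
    return tableHeader.split()

def hdfsToCachedTable(sc, sqlContext, hdfsPath, tableName, tableHeader):
    # pyspark imports kept local so the sketch can be read without Spark installed
    from pyspark.sql.types import StructField, StructType, StringType

    fields = [StructField(name, StringType(), True)
              for name in header_fields(tableHeader)]
    schema = StructType(fields)

    ncols = len(fields)
    rows = (sc.textFile(hdfsPath)
              .map(lambda l: tuple(l.split(",")[:ncols])))

    table = sqlContext.applySchema(rows, schema)  # Spark 1.x API, as in the question
    table.registerTempTable(tableName)
    sqlContext.cacheTable(tableName)

    # cacheTable is lazy: run one action here so the table is materialized
    # in memory before the summary queries hit it.
    table.count()
    return table
```

On checking cache status: in Spark 2.x there is `spark.catalog.isCached(tableName)`; with the 1.x API used above, the Storage tab of the Spark web UI lists what has been cached. Both are suggestions based on the standard Spark APIs, not something the original post confirms.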