How should I cache a large table in PySpark?

Tags: python, hadoop, apache-spark

I have defined the following function:
def hdfsToSchemaRDD(hdfsPath, tableName, tableHeader):
    lines = sc.textFile(hdfsPath)
    fields = [StructField(field_name, StringType(), True) for field_name in tableHeader.split()]
    schema = StructType(fields)
    columns = lines.map(lambda l: l.split(","))
    tempTable = columns.map(lambda c: tuple([ c[i] for i,v in enumerate(tableHeader.split()) ]))
    schemaTable = sqlContext.applySchema(tempTable, schema)
    schemaTable.registerTempTable(tableName)
    sqlContext.cacheTable(tableName)

def FeatureSummary(features):
    results = sqlContext.sql("""
        SELECT
            %s
            ,SUM(summable) as s1, SUM(othersummable) as s2
        FROM
            LargeTable
        GROUP BY
            %s
        ORDER BY
            %s
        """%(features,features,features))
    for row in results.map(lambda x: x).collect():
        print row
The intent is that, given a path, the dataset is read from HDFS once and kept in memory, so that I do not have to re-read it every time I need to run a query against it, which is typically quite expensive.
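One detail worth keeping in mind here: in Spark, marking a table as cached is lazy, so nothing is materialized until an action actually consumes the data. As a rough plain-Python analogy (this is an illustration of the lazy-then-memoized behavior, not the Spark API):

```python
# Plain-Python analogy for lazy caching: declaring a cache does no work by
# itself; the expensive read happens only on first use, and later uses
# are served from the cache.

read_count = {"n": 0}

def expensive_read():
    """Stands in for reading a large table from HDFS."""
    read_count["n"] += 1
    return list(range(5))

_cache = {}

def cached_table(name):
    # First access materializes the data into the cache; later accesses reuse it.
    if name not in _cache:
        _cache[name] = expensive_read()
    return _cache[name]

# "Declaring" the cache has read nothing yet ...
assert read_count["n"] == 0
# ... the first query triggers the read; the second is served from memory.
first = sum(cached_table("LargeTable"))
second = sum(cached_table("LargeTable"))
assert read_count["n"] == 1
```

The point of the analogy is that if no action runs between the cache declaration and the first query, the first query pays the full read cost.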
The problem I am running into is that if I load some large table LargeTable and then run:
FeatureSummary(attribute1)
FeatureSummary(attribute2)
FeatureSummary(attribute3)
LargeTable appears to be read in each time. In particular, given that the attributes have very few distinct levels, I would expect the queries in FeatureSummary to run much faster than the initial read of LargeTable. That is not the case.
Is there a way to check whether a table has been cached?
How can I modify the function hdfsToSchemaRDD so that the data from HDFS is actually cached?
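A sketch of how the function might be adjusted, under the assumption that `sqlContext.cacheTable` is lazy and needs an action (such as `count()`) to actually materialize the table. The helper names `header_fields` and `hdfsToCachedTable`, and the explicit `sc`/`sqlContext` parameters, are hypothetical additions, not part of the original code:

```python
def header_fields(tableHeader):
    """Split a whitespace-separated header string into column names."""
    return tableHeader.split()

def hdfsToCachedTable(sc, sqlContext, hdfsPath, tableName, tableHeader):
    # pyspark imports kept local so the sketch can be read without Spark installed
    from pyspark.sql.types import StructField, StructType, StringType

    fields = [StructField(name, StringType(), True)
              for name in header_fields(tableHeader)]
    schema = StructType(fields)

    ncols = len(fields)
    rows = (sc.textFile(hdfsPath)
              .map(lambda l: tuple(l.split(",")[:ncols])))

    table = sqlContext.applySchema(rows, schema)  # Spark 1.x API, as in the question
    table.registerTempTable(tableName)
    sqlContext.cacheTable(tableName)

    # cacheTable is lazy: run one action here so the table is materialized
    # in memory before the summary queries hit it.
    table.count()
    return table
```

On checking cache status: in Spark 2.x there is `spark.catalog.isCached(tableName)`; with the 1.x API used above, the Storage tab of the Spark web UI lists what has been cached. Both are suggestions based on the standard Spark APIs, not something the original post confirms.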