PySpark: why can't I call show() after cache() in Spark SQL?


I created a DataFrame named df in pyspark using a HiveContext (not a SQLContext).

But I found that after calling df.cache(), I can no longer call df.show(). For example:

>>> df.show(2)
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|    bits|       dst_ip|dst_port|flow_direction|in_iface|ip_dscp|out_iface|    pkts|protocol|       src_ip|src_port|  tag|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|16062594|42.120.84.166|   11291|             1|       3|     36|        2|17606406|    pnni|42.120.84.115|   14166|10008|
|13914480|42.120.82.254|   13667|             0|       4|     32|        1|13953516|   ax.25| 42.120.86.49|   19810|10002|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
only showing top 2 rows


>>> 
>>> df.cache()
DataFrame[bits: bigint, dst_ip: string, dst_port: bigint, flow_direction: string, in_iface: bigint, ip_dscp: string, out_iface: bigint, pkts: bigint, protocol: string, src_ip: string, src_port: bigint, tag: string]


>>> df.show(2)
16/05/16 15:59:32 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 1, in <lambda>
IndexError: list index out of range
However, after calling df.unpersist(), df.show() works again.

I don't understand this. I thought df.cache() simply caches the RDD for later reuse. Why does df.show() stop working after the cache call?

Caching data in memory

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Spark SQL will then scan only the required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
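For illustration, a minimal PySpark sketch of that API (Spark 1.x style). The temp table name "flows" and the sqlContext variable are assumptions for the sketch, not taken from the question:

# Minimal sketch, Spark 1.x API; "flows" and sqlContext are illustrative names.
df.registerTempTable("flows")

# Cache the table by name ...
sqlContext.cacheTable("flows")
# ... or cache the DataFrame object directly.
df.cache()

# Drop the cached columnar data again.
sqlContext.uncacheTable("flows")
df.unpersist()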

Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands with SQL.
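For example, a hedged sketch using the two in-memory columnar caching options documented for Spark 1.6 (the values shown are just the documented defaults):

# Via the SQLContext/HiveContext API ...
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

# ... or via SQL.
sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=10000")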

From your reply (the link) I learned two things: 1. caching a DataFrame is lazy, and the data that actually gets cached is the data read after the cache call; 2. caching a table is not lazy. But that still doesn't answer my question... why does show() stop working?
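As a side note, a minimal sketch of the two behaviours described above, assuming the DataFrame is registered as a temp table named "flows" (an illustrative name). dataFrame.cache() is lazy, whereas the SQL CACHE TABLE statement is eager by default:

# Lazy: only marks df for caching; nothing is evaluated yet.
df.cache()
# The next action materializes the cache, so any error hidden in the
# underlying computation surfaces here rather than at cache() time.
df.count()

# Eager by default: the SQL statement materializes the cache immediately.
sqlContext.sql("CACHE TABLE flows")
# Explicitly lazy variant of the SQL statement.
sqlContext.sql("CACHE LAZY TABLE flows")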