PySpark: why can't I call show() on a DataFrame after caching it in Spark SQL?
I created a DataFrame named df in pyspark using HiveContext (rather than SQLContext). I found that after calling df.cache(), I can no longer call df.show(). For example:
>>> df.show(2)
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
| bits| dst_ip|dst_port|flow_direction|in_iface|ip_dscp|out_iface| pkts|protocol| src_ip|src_port| tag|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|16062594|42.120.84.166| 11291| 1| 3| 36| 2|17606406| pnni|42.120.84.115| 14166|10008|
|13914480|42.120.82.254| 13667| 0| 4| 32| 1|13953516| ax.25| 42.120.86.49| 19810|10002|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
only showing top 2 rows
>>>
>>> df.cache()
DataFrame[bits: bigint, dst_ip: string, dst_port: bigint, flow_direction: string, in_iface: bigint, ip_dscp: string, out_iface: bigint, pkts: bigint, protocol: string, src_ip: string, src_port: bigint, tag: string]
>>> df.show(2)
16/05/16 15:59:32 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 1, in <lambda>
IndexError: list index out of range
However, after calling df.unpersist(), df.show() works again.

I don't understand this. I thought df.cache() simply caches the underlying RDD for later use. Why does df.show() stop working after the cache call?
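A minimal sketch of what is likely happening, in plain Python with no Spark (the row values and the parsing lambda are illustrative, not your actual code): show(2) only pulls the first couple of rows through the pipeline, but building the in-memory cache forces whole partitions through every transformation, including any malformed row that a parsing lambda chokes on. That matches the traceback, where the failure is an IndexError inside a <stdin> lambda, not inside Spark itself.

```python
import itertools

raw_lines = [
    "16062594,42.120.84.166,11291",   # well-formed rows at the top
    "13914480,42.120.82.254,13667",
    "bad-row-with-no-commas",          # malformed row further down
]

# A parsing lambda like the one in the traceback: it indexes into the
# split result and raises IndexError on the malformed row.
parse = lambda line: line.split(",")[2]

# Lazy pipeline, like an uncached DataFrame: nothing is computed yet.
rows = map(parse, raw_lines)

# Taking only the first 2 rows never touches the bad row, so this
# succeeds -- just like df.show(2) before caching.
first_two = list(itertools.islice(rows, 2))
print(first_two)  # ['11291', '13667']

# Materializing everything (which is effectively what populating the
# cache does on first access) hits the bad row and raises the same
# IndexError: list index out of range.
try:
    list(map(parse, raw_lines))
except IndexError as exc:
    print("IndexError:", exc)
```

If this is the cause, the uncached df.show(2) only "works" because the first two rows happen to be clean; evaluating more rows (e.g. df.count() without any cache) should fail the same way.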
Caching Data In Memory

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Spark SQL will then scan only the required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.

Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
From your reply (link), I take away two things: 1. Caching a DataFrame is lazy — the data that actually gets cached is whatever you read after calling cache(). 2. Caching a table is not lazy. But this still doesn't answer my question... why doesn't show() work?
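Assuming the failure really is a malformed record hitting a parsing lambda, one way out (sketched here in plain Python; the row data and the safe_parse helper are hypothetical) is to guard the parse so bad records are dropped instead of raising. Then full materialization — which is what caching triggers — can complete:

```python
raw_lines = [
    "16062594,42.120.84.166,11291",
    "bad-row-with-no-commas",          # would crash a bare split()[2]
    "13914480,42.120.82.254,13667",
]

def safe_parse(line):
    """Return the third field, or None for rows with too few fields."""
    fields = line.split(",")
    return fields[2] if len(fields) > 2 else None

# Drop the rows that would otherwise fail during cache materialization.
ports = [p for p in map(safe_parse, raw_lines) if p is not None]
print(ports)  # ['11291', '13667']
```

The PySpark equivalent would be something along the lines of rdd.map(safe_parse).filter(lambda x: x is not None) before building the DataFrame, so that by the time cache() forces evaluation, every row is parseable.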