
Python: How to get results from Spark SQL using PySpark?

Tags: python, apache-spark, pyspark, apache-spark-sql

I'm new to Spark, and I'm writing my Spark code in Python.

I'm able to read the Parquet file and store the data in a DataFrame and a temp table,

but it doesn't print the result of the executed query. Please help me debug this.

Code:

import os
os.environ['SPARK_HOME']="/opt/apps/spark-2.0.1-bin-hadoop2.7/"
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext(master='local')
sqlCtx = SQLContext(sc)
df_tract_alpha = sqlCtx.read.parquet("tract_alpha.parquet")
print (df_tract_alpha.columns)
sqlCtx.registerDataFrameAsTable(df_tract_alpha, "table1")
nt = sqlCtx.sql("SELECT COUNT(*) AS pageCount FROM table1 WHERE pp_count>=500").collect()
n1 = nt[0].pageCount
print(n1)
This is the result:

Column< pageCount['pageCount'] >

i.e. the Column object is printed instead of the value.
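For context: indexing a DataFrame (or an uncollected query result) by column name yields an unevaluated Column expression, which is what the print above is showing; only an action such as collect(), first() or show() actually runs the query and returns values. A minimal sketch of the difference, reusing sqlCtx and table1 from the code above:

result_df = sqlCtx.sql("SELECT COUNT(*) AS pageCount FROM table1 WHERE pp_count>=500")

print(result_df['pageCount'])             # Column<pageCount> : an expression, not data
print(result_df.collect()[0].pageCount)   # runs the query and prints the actual count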
Here is the stack trace:

17/06/12 12:54:27 WARN BlockManager: Putting block broadcast_2 failed due to an exception
17/06/12 12:54:27 WARN BlockManager: Block broadcast_2 could not be removed as it was not found on disk or in memory
Traceback (most recent call last):
  File "/home/vn/scripts/g_s_pipe/test_code_here.py", line 66, in <module>
    nt = sqlContext.sql("SELECT count(*) as pageCount FROM table1 WHERE pp_count>=500").collect()
  File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 310, in collect
    port = self._jdf.collectToPython()
  File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.collectToPython.
: java.lang.reflect.InaccessibleObjectException: Unable to make field transient java.lang.Object[] java.util.ArrayList.elementData accessible: module java.base does not "opens java.util" to unnamed module @55deb90
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:335)
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:278)
    at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:175)
    at java.base/java.lang.reflect.Field.setAccessible(Field.java:169)
    at org.apache.spark.util.SizeEstimator$$anonfun$getClassInfo$3.apply(SizeEstimator.scala:336)
    at org.apache.spark.util.SizeEstimator$$anonfun$getClassInfo$3.apply(SizeEstimator.scala:330)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:330)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:222)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1234)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:103)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReader(ParquetFileFormat.scala:329)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:281)
    at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:112)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
    at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
    at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:55)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:547)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.base/java.lang.Thread.run(Thread.java:844)
Answer:

Collect the query result explicitly; collect() returns a list of Row objects:

nt = sqlCtx.sql("SELECT COUNT(*) AS pageCount FROM table1 WHERE pp_count>=500") \
           .collect()
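Each element of that list is a pyspark.sql.Row, so the count can be read by attribute, by column name, or by position; a short sketch reusing nt from the snippet above:

first_row = nt[0]             # a pyspark.sql.Row
print(first_row.pageCount)    # attribute access
print(first_row['pageCount']) # access by column name
print(first_row[0])           # positional access

# equivalently, skip the list handling altogether:
# sqlCtx.sql(...).first().pageCount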
Using a small sample Parquet file, inspected with parquet-tools:

$> parquet-tools head data.parquet/
a = 1
pp_count = 500

a = 2
pp_count = 750

a = 3
pp_count = 400

a = 4
pp_count = 600

a = 5
pp_count = 700
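For reproducibility, a Parquet file with this content could be created along these lines; the column names a and pp_count come from the listing above, the rest is an illustrative sketch reusing sqlCtx from the question's session, not part of the original answer:

from pyspark.sql import Row

data = [Row(a=1, pp_count=500), Row(a=2, pp_count=750), Row(a=3, pp_count=400),
        Row(a=4, pp_count=600), Row(a=5, pp_count=700)]

# Write the five sample rows out as a Parquet file
sqlCtx.createDataFrame(data).write.mode("overwrite").parquet("data.parquet")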
A complete run against that file (imports added so the snippet is self-contained):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local')
sqlContext = SQLContext(sc)

# Read the Parquet file into a DataFrame
df = sqlContext.read.parquet("data.parquet")
print("data columns : {} ".format(df.columns))

# Register the DataFrame as a temp table and run the aggregation
sqlContext.registerDataFrameAsTable(df, "table1")
results = sqlContext.sql("SELECT COUNT(*) AS pageCount FROM table1 WHERE pp_count>=500").collect()

df.show()
print("initial data count : {}".format(df.count()))

# collect() returned a list of Row objects; read the value from the first Row
page_count = results[0].pageCount
print("page count : {}".format(page_count))
Output:

data columns : ['a', 'pp_count']
+---+--------+
|  a|pp_count|
+---+--------+
|  1|     500|
|  2|     750|
|  3|     400|
|  4|     600|
|  5|     700|
+---+--------+

initial data count : 5
page count : 4
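As a side note, on Spark 2.x the same value can also be obtained through a SparkSession, or without SQL at all via the DataFrame API. A minimal sketch, assuming the same data.parquet file; this is an alternative, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("pageCount").getOrCreate()
df = spark.read.parquet("data.parquet")

# Same aggregation through SQL
df.createOrReplaceTempView("table1")
page_count = spark.sql("SELECT COUNT(*) AS pageCount FROM table1 WHERE pp_count >= 500").first()["pageCount"]

# Same aggregation through the DataFrame API
page_count_api = df.filter(F.col("pp_count") >= 500).count()

print("page count : {}".format(page_count))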