Python ValueError: Length of object (3) does not match with length of fields
I am manually creating a PySpark DataFrame, as follows:
from pyspark.sql.types import StructType, StructField, LongType, StringType

acdata = sc.parallelize([
    [('timestamp', 1506340019), ('pk', 111), ('product_pk', 123), ('country_id', 'FR'), ('channel', 'web')]
])

# Convert to tuple
acdata_converted = acdata.map(lambda x: (x[0][1], x[1][1], x[2][1]))

# Define schema
acschema = StructType([
    StructField("timestamp", LongType(), True),
    StructField("pk", LongType(), True),
    StructField("product_pk", LongType(), True),
    StructField("country_id", StringType(), True),
    StructField("channel", StringType(), True)
])

df = sqlContext.createDataFrame(acdata_converted, acschema)
But when I call df.head() and run spark-submit, I get the following error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000001/pyspark.zip/pyspark/sql/session.py", line 520, in prepare
File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/sql/types.py", line 1358, in _verify_type
"length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (3) does not match with length of fields (12)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
What does this mean, and how can I fix it?

You need to map all 5 fields so that they match the defined schema:
acdata_converted = acdata.map(lambda x: (x[0][1], x[1][1], x[2][1], x[3][1], x[4][1]))
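This positional indexing assumes every record lists its fields in the same order. If that is not guaranteed, a name-based lookup is a small sketch of a sturdier variant (the field list simply mirrors the schema above; it is not part of the original answer):

# Hypothetical variant: look values up by field name instead of position,
# so the order of the (key, value) pairs in each record no longer matters.
fields = ["timestamp", "pk", "product_pk", "country_id", "channel"]
acdata_converted = acdata.map(lambda kvs: tuple(dict(kvs)[f] for f in fields))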
I would do it like this:
acdata = sc.parallelize([
    {'timestamp': 1506340019, 'pk': 111, 'product_pk': 123, 'country_id': 'FR', 'channel': 'web'},
    {...}  # further records
])

# Define schema
acschema = StructType([
    StructField("timestamp", LongType(), True),
    StructField("pk", LongType(), True),
    StructField("product_pk", LongType(), True),
    StructField("country_id", StringType(), True),
    StructField("channel", StringType(), True)
])

df = sqlContext.createDataFrame(acdata, acschema)
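Either way, once each record carries all five fields, df.head() should return a complete row; for the single sample record above, the expected output would be roughly:

df.head()
# Row(timestamp=1506340019, pk=111, product_pk=123, country_id='FR', channel='web')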
Also consider whether you really need to parallelize the data at all; a DataFrame can be created from dictionaries directly. Do you only need the first row and the 5 columns? Either do that, or pass a list of dictionaries to sc.parallelize, as sketched below.
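As a sketch of that suggestion (assuming the data is small enough to sit in the driver), createDataFrame also accepts a plain Python list, so sc.parallelize can be skipped entirely:

# A local list of dicts; createDataFrame matches dict keys to schema field names.
rows = [{'timestamp': 1506340019, 'pk': 111, 'product_pk': 123, 'country_id': 'FR', 'channel': 'web'}]
df = sqlContext.createDataFrame(rows, acschema)  # acschema as defined above
df.head()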