Python pyspark - reading a malformed .gz file
I am reading a compressed .gz file with pyspark on EMR. The file is malformed (it is a JSON file with a varying number of columns per line), and I get the exception below. Can anyone give me pointers on how to read a malformed .gz file in pyspark?

Code:

Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 58, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 687, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 384, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 355, in _inferSchema
    first = rdd.first()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1376, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1328, in take
    totalParts = self.getNumPartitions()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2455, in getNumPartitions
    return self._prev_jrdd.partitions().size()
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 324, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o81.partitions. Trace:
com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$ComputationException: java.lang.ArrayIndexOutOfBoundsException: 16227
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:553)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:419)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$CustomConcurrentHashMap$ComputingImpl.get(CustomConcurrentHashMap.java:2041)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$StackTraceElements.forMember(StackTraceElements.java:53)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.Errors.formatSource(Errors.java:690)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.Errors.format(Errors.java:555)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.ProvisionException.getMessage(ProvisionException.java:59)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:391)
    at java.lang.Throwable.toString(Throwable.java:480)
    at java.lang.Throwable.<init>(Throwable.java:311)
    at java.lang.Exception.<init>(Exception.java:102)
    at java.lang.RuntimeException.<init>(RuntimeException.java:96)
    at py4j.Py4JException.<init>(Py4JException.java:56)
    at py4j.Py4JJavaException.<init>(Py4JJavaException.java:59)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:251)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 16227
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.asm.$ClassReader.readClass(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.asm.$ClassReader.accept(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.asm.$ClassReader.accept(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$LineNumbers.<init>(LineNumbers.java:62)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:36)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:33)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:549)
    ... 20 more
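One generic diagnostic worth running first (an assumption on my part, not something from this thread): confirm whether the .gz file is actually a valid gzip stream, since a corrupt archive and malformed JSON content fail in different ways. A minimal check with Python's standard gzip module:

```python
import gzip

def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses cleanly as gzip."""
    try:
        with gzip.open(path, "rb") as f:
            # Read to EOF in chunks; a corrupt or truncated stream raises here.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        # A bad magic number/header raises an OSError subclass;
        # EOFError signals a truncated stream.
        return False
```

If this returns False for a local copy of the file, the problem is the archive itself rather than the JSON inside it.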
Did you ever find a solution to this? I ran into it out of nowhere today, also in the context of a Spark job, with the same stack trace failing at the exact same array index, 16227.
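On the original question of reading malformed JSON lines: one common workaround (a sketch under my own assumptions, not confirmed by this thread) is to read the .gz file as plain text, which Spark decompresses transparently, and parse each line defensively so that bad lines are dropped instead of crashing the executors. The per-line helper is plain Python and can be passed to RDD.flatMap:

```python
import json

def parse_json_line(line):
    """Parse one line as a JSON object; return [obj], or [] if malformed.

    Returning a list (rather than raising) makes this helper suitable for
    RDD.flatMap: invalid lines simply contribute zero records.
    """
    try:
        obj = json.loads(line)
    except (ValueError, TypeError):
        return []
    # Keep only JSON objects; skip bare scalars and arrays.
    return [obj] if isinstance(obj, dict) else []

# On the cluster (hypothetical path, shown as comments only):
# rows = spark.sparkContext.textFile("s3://bucket/path/data.gz") \
#                          .flatMap(parse_json_line)
```

Rows with differing keys can then be normalized (or turned into a DataFrame with an explicit schema) before any schema inference runs.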