
Python PySpark: reading a malformed .gz file

Tags: python, apache-spark, pyspark, amazon-emr, gzip

I am reading a compressed .gz file with PySpark on EMR, but the file is malformed (it is a JSON file with a varying number of columns per line) and the read fails with the exception below. Can anyone give pointers on how to read a malformed .gz file in PySpark?

Code:
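The snippet itself is not shown, but judging from the call chain in the traceback below (toDF → createDataFrame → _inferSchema → rdd.first()), the failing read was presumably of roughly this shape; the S3 path and the per-line json.loads step are placeholders, not the asker's actual code:

    import json

    # Presumed shape of the failing read, inferred from the traceback below;
    # the path is a placeholder. Hadoop's TextInputFormat decompresses .gz
    # files transparently, so textFile() works on them directly.
    rdd = sc.textFile("s3://my-bucket/data/file.json.gz")
    df = rdd.map(json.loads).toDF()  # toDF() triggers schema inference via rdd.first()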

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 58, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 687, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 384, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 355, in _inferSchema
    first = rdd.first()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1376, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1328, in take
    totalParts = self.getNumPartitions()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2455, in getNumPartitions
    return self._prev_jrdd.partitions().size()
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 324, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o81.partitions. Trace:
com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$ComputationException: java.lang.ArrayIndexOutOfBoundsException: 16227
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:553)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:419)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$CustomConcurrentHashMap$ComputingImpl.get(CustomConcurrentHashMap.java:2041)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$StackTraceElements.forMember(StackTraceElements.java:53)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.Errors.formatSource(Errors.java:690)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.Errors.format(Errors.java:555)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.ProvisionException.getMessage(ProvisionException.java:59)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:391)
    at java.lang.Throwable.toString(Throwable.java:480)
    at java.lang.Throwable.<init>(Throwable.java:311)
    at java.lang.Exception.<init>(Exception.java:102)
    at java.lang.RuntimeException.<init>(RuntimeException.java:96)
    at py4j.Py4JException.<init>(Py4JException.java:56)
    at py4j.Py4JJavaException.<init>(Py4JJavaException.java:59)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:251)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 16227
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.asm.$ClassReader.readClass(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.asm.$ClassReader.accept(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.asm.$ClassReader.accept(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$LineNumbers.<init>(LineNumbers.java:62)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:36)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:33)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:549)
    ... 20 more
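For the malformed-JSON part of the question, one common defensive pattern is to parse each line explicitly and set aside the records that fail, rather than letting toDF()'s schema inference die on the first bad row. The sketch below only illustrates that pattern, with a placeholder path; note that the ArrayIndexOutOfBoundsException above is raised inside EMRFS's shaded Guice classes while they format the message of some other exception (see ProvisionException.getMessage in the trace), so the real root cause is masked here and may be unrelated to the file's contents.

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    path = "s3://my-bucket/data/*.gz"  # placeholder; .gz is decompressed transparently

    def parse(line):
        # Tag each line instead of letting one bad record kill the job.
        try:
            return ("ok", json.loads(line))
        except ValueError:
            return ("bad", line)

    tagged = sc.textFile(path).map(parse).cache()
    records = tagged.filter(lambda t: t[0] == "ok").map(lambda t: t[1])
    bad_lines = tagged.filter(lambda t: t[0] == "bad").map(lambda t: t[1])

    # Alternatively, Spark's JSON reader can keep malformed rows in a
    # corrupt-record column instead of failing:
    df = (spark.read
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .json(path))

Caching the tagged RDD avoids re-reading (and re-decompressing) the gzip files when both filters run; gzip is not a splittable format, so each file is scanned in full by a single task.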

Did you ever find a solution to this problem? It suddenly hit me today, also in the context of a Spark job, with the same stack trace failing at exactly the same array index, 16227.