How to set ORC stripe size in Spark

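I am trying to generate a dataset in Spark (2.3) and write it out in ORC file format. I am trying to set some properties for the ORC stripe size and compression size, following hints from a Stack Overflow post. But Spark does not honor these properties, and the stripes in the resulting ORC files are much smaller than the size I set.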

import org.apache.spark.SparkConf

val conf: SparkConf = new SparkConf().setAppName("App")
  .set("spark.sql.orc.impl", "native")
  .set("spark.sql.hive.convertMetastoreOrc", "true")
  // Requested values: 67108864 bytes = 64 MiB stripes, 262144 bytes = 256 KiB compression chunks
  .set("spark.sql.orc.stripe.size", "67108864")
  .set("spark.sql.orc.compress.size", "262144")
  .set("orc.stripe.size", "67108864")
  .set("orc.compress.size", "262144")

data.sortWithinPartitions("column")
  .write
  .option("orc.compress", "ZLIB")
  .mode("overwrite")
  .format("org.apache.spark.sql.execution.datasources.orc")
  .save(outputPath)

I also tried writing the data like this:

data.sortWithinPartitions("column")
  .write
  .option("orc.compress", "ZLIB")
  .option("orc.stripe.size", "67108864")
  .option("orc.compress.size", "262144")
  .mode("overwrite")
  .format("org.apache.spark.sql.execution.datasources.orc")
  .save(outputPath)

But no luck.

Relevant part of the ORC file dump:

File Version: 0.12 with ORC_135
Rows: 3174228
Compression: ZLIB
Compression size: 32768
...
Stripe: offset: 3 data: 6601333 rows: 30720 tail: 2296 index: 16641
Stripe: offset: 6620273 data: 6016778 rows: 25600 tail: 2279 index: 13595
Stripe: offset: 12652925 data: 6031290 rows: 25600 tail: 2284 index: 13891
Stripe: offset: 18700390 data: 6132228 rows: 25600 tail: 2283 index: 13805
Stripe: offset: 24848706 data: 6066176 rows: 25600 tail: 2267 index: 13855
Stripe: offset: 30931004 data: 6562819 rows: 30720 tail: 2308 index: 16851
Stripe: offset: 37512982 data: 6462380 rows: 30720 tail: 2304 index: 16994
Stripe: offset: 43994660 data: 6655346 rows: 30720 tail: 2291 index: 17031
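
To put numbers on the dump above, here is a minimal Python sketch that parses the stripe lines and compares each stripe's on-disk size with the requested 64 MiB. It assumes the on-disk size of a stripe is index + data + tail, which is consistent with the offsets in the dump (each offset plus that sum gives the next offset). The names `stripe_layout` and `REQUESTED_STRIPE_BYTES` are illustrative only.

```python
import re

# The stripe size requested in the question: 64 MiB = 67108864 bytes.
REQUESTED_STRIPE_BYTES = 64 * 1024 * 1024

# Stripe lines copied verbatim from the ORC file dump.
DUMP = """\
Stripe: offset: 3 data: 6601333 rows: 30720 tail: 2296 index: 16641
Stripe: offset: 6620273 data: 6016778 rows: 25600 tail: 2279 index: 13595
Stripe: offset: 12652925 data: 6031290 rows: 25600 tail: 2284 index: 13891
Stripe: offset: 18700390 data: 6132228 rows: 25600 tail: 2283 index: 13805
Stripe: offset: 24848706 data: 6066176 rows: 25600 tail: 2267 index: 13855
Stripe: offset: 30931004 data: 6562819 rows: 30720 tail: 2308 index: 16851
Stripe: offset: 37512982 data: 6462380 rows: 30720 tail: 2304 index: 16994
Stripe: offset: 43994660 data: 6655346 rows: 30720 tail: 2291 index: 17031
"""

STRIPE_RE = re.compile(
    r"Stripe: offset: (\d+) data: (\d+) rows: (\d+) tail: (\d+) index: (\d+)"
)


def stripe_layout(dump_text):
    """Return a list of (offset, size_in_bytes) pairs, one per stripe line.

    Assumption: a stripe occupies index + data + tail bytes on disk.
    """
    layout = []
    for match in STRIPE_RE.finditer(dump_text):
        offset, data, _rows, tail, index = map(int, match.groups())
        layout.append((offset, index + data + tail))
    return layout


layout = stripe_layout(DUMP)
for offset, size in layout:
    share = size / REQUESTED_STRIPE_BYTES
    print(f"offset {offset}: {size} bytes ({share:.1%} of requested)")
```

On these eight stripes the sizes range from 6032652 to 6674668 bytes, roughly 9 to 10 percent of the requested 64 MiB. The dump also reports a compression size of 32768 (32 KiB) rather than the requested 262144 (256 KiB).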

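I had the same problem. In my case it seemed to be related to the version used by Hortonworks HDP. A similar discussion recommends using HDP 2.6.3+ with Spark 2.2+, which takes advantage of newer Hive libraries.

Perhaps your Spark 2.3 is still configured to use the older Hive 1.2.1 libraries.

The following works with Spark 2.4.4: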

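from pyspark.sql import SparkSession

# Set the default ORC stripe size to 64 MiB when building the session
spark = (SparkSession
     .builder
     .config('hive.exec.orc.default.stripe.size', 64*1024*1024)
     .getOrCreate()
     )
df = ...
df.write.format('orc').save('output.orc')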