Python: how can I convert a Parquet file to an ORC file when the dataset is larger than RAM?
I plan to run some tests with the ORC file format, but I cannot create ORC files from my other files. I have a sequential/columnar data store in Parquet (and also HDF5 and CSV) files that I can manipulate easily with Spark and Pandas. I am working locally, on macOS with only 8 GB of RAM. The table has 80,000 columns and 7,000 rows; each column represents a measurement source and each row holds the measurements for the whole period. I am using PyCharm, and I have already tried the conversion with Spark DataFrame read/write and with Spark SQL/Hive, both locally and on HDFS. Session setup (shared by both attempts below):
spark = SparkSession \
.builder \
.master("local[*]") \
.appName("ConvertPqtoOrc") \
.config('spark.sql.debug.maxToStringFields', 100000) \
.config('spark.network.timeout', 10000000) \
.config('spark.executor.heartbeatInterval', 10000000) \
.config('spark.storage.blockManagerSlaveTimeoutMs', 10000000) \
.config('spark.executor.memory', '6g') \
.config('spark.executor.cores', '4') \
.config('spark.driver.memory', '6g') \
.config('spark.cores.max', '300') \
.config('spark.sql.orc.enabled', 'true') \
.config('spark.sql.hive.convertMetastoreOrc', 'true') \
.config('spark.sql.orc.filterPushdown', 'true') \
.enableHiveSupport() \
.getOrCreate()
sc = spark.sparkContext
Spark DataFrame read/write attempt:
df_spark = spark.read.parquet(pq_path)
df_spark.write.mode("overwrite").format("orc").save(ORC_path)
Spark SQL/Hive attempt:
spark.sql("DROP TABLE IF EXISTS tbl_pq")
spark.sql("DROP TABLE IF EXISTS tbl_orc")
spark.sql("CREATE TABLE IF NOT EXISTS tbl_pq USING PARQUET LOCATION '{}'".format(db_path + Pq_file))
spark.sql("CREATE TABLE IF NOT EXISTS tbl_orc ({0} double) STORED AS ORC LOCATION '{1}'".format(" double, ".join(map(str, myColumns)), db_path + Output_file))
spark.sql("INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl_pq")
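For reference, the column list fed to that CREATE TABLE can be built as a plain string. This tiny sketch (the helper name `orc_create_table_sql` is made up here) shows the kind of DDL the `join` above produces, with backticks added so column names that are not valid identifiers do not break the statement:

```python
def orc_create_table_sql(table, columns, location):
    """Build a Hive DDL statement declaring every column as double,
    backtick-quoting each name."""
    cols = ", ".join("`{}` double".format(c) for c in columns)
    return ("CREATE TABLE IF NOT EXISTS {} ({}) "
            "STORED AS ORC LOCATION '{}'".format(table, cols, location))
```

Note that with 80,000 columns this statement alone is several hundred kilobytes of SQL text, which is part of why the metastore path is heavy.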
Both approaches work with filtered datasets (up to about 20,000 columns by 7,000 rows). When I use the whole table, after a long run both cases end with the following warning and error:
WARN DAGScheduler: Broadcasting large task binary with size 7.3 MiB
ERROR Utils: Aborting task
java.lang.OutOfMemoryError: Java heap space