Python: how can I convert a Parquet file to an ORC file when the dataset is larger than RAM?
I plan to run some tests with the ORC file format, but I cannot create ORC files from my other files. I have a sequential/columnar data store in Parquet (and also HDF5 and CSV) files that I can manipulate easily with Spark and Pandas. I am working locally, on macOS with only 8 GB of RAM. The table has 80,000 columns and 7,000 rows; each column represents a measurement source and each row holds the measurements for the whole period. I am using PyCharm, and I have already tried the conversion with Spark DataFrame read/write and with Spark SQL/Hive, both locally and on HDFS. Session setup (shared by both attempts below):
spark = SparkSession \
.builder \
.master("local[*]") \
.appName("ConvertPqtoOrc") \
.config('spark.sql.debug.maxToStringFields', 100000) \
.config('spark.network.timeout', 10000000) \
.config('spark.executor.heartbeatInterval', 10000000) \
.config('spark.storage.blockManagerSlaveTimeoutMs', 10000000) \
.config('spark.executor.memory', '6g') \
.config('spark.executor.cores', '4') \
.config('spark.driver.memory', '6g') \
.config('spark.cores.max', '300') \
.config('spark.sql.orc.enabled', 'true') \
.config('spark.sql.hive.convertMetastoreOrc', 'true') \
.config('spark.sql.orc.filterPushdown', 'true') \
.enableHiveSupport() \
.getOrCreate()
sc = spark.sparkContext
Spark DataFrame read/write attempt:
df_spark = spark.read.parquet(pq_path)
df_spark.write.mode("overwrite").format("orc").save(ORC_path)
Spark SQL/Hive attempt:
spark.sql("DROP TABLE IF EXISTS tbl_pq")
spark.sql("DROP TABLE IF EXISTS tbl_orc")
spark.sql("CREATE TABLE IF NOT EXISTS tbl_pq USING PARQUET LOCATION '{}'".format(db_path + Pq_file))
spark.sql("CREATE TABLE IF NOT EXISTS tbl_orc ({0} double) STORED AS ORC LOCATION '{1}'".format(" double, ".join(map(str, myColumns)), db_path + Output_file))
spark.sql("INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl_pq")
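For reference, the column list fed to that CREATE TABLE can be built as a plain string. This tiny sketch (the helper name `orc_create_table_sql` is made up here) shows the kind of DDL the `join` above produces, with backticks added so column names that are not valid identifiers do not break the statement:

```python
def orc_create_table_sql(table, columns, location):
    """Build a Hive DDL statement declaring every column as double,
    backtick-quoting each name."""
    cols = ", ".join("`{}` double".format(c) for c in columns)
    return ("CREATE TABLE IF NOT EXISTS {} ({}) "
            "STORED AS ORC LOCATION '{}'".format(table, cols, location))
```

Note that with 80,000 columns this statement alone is several hundred kilobytes of SQL text, which is part of why the metastore path is heavy.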
Both approaches work with filtered datasets (up to about 20,000 columns by 7,000 rows). When I use the whole table, after a long run both cases end with the following warning and error:
WARN DAGScheduler: Broadcasting large task binary with size 7.3 MiB
ERROR Utils: Aborting task
java.lang.OutOfMemoryError: Java heap space