
Python: How to convert a Parquet file to an ORC file when the dataset is bigger than RAM?


I am planning to run some tests with the ORC file format, but I cannot create an ORC file from my other files.

I have a sequential/columnar data store in Parquet (and also HDF5 and CSV) files that I can easily manipulate with Spark and Pandas.

I am working locally on a macOS machine with only 8 GB of RAM. The table has 80,000 columns and 7,000 rows; each column represents a measurement source and each row holds the measurements over the whole time period.
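As a rough sanity check (back-of-the-envelope arithmetic, assuming every value is a double), the raw data alone is about 4.5 GB, already close to the 6 GB heap configured below:

n_cols, n_rows = 80_000, 7_000
raw_bytes = n_cols * n_rows * 8           # 8 bytes per double value
print(raw_bytes / 1e9)                    # ~4.48 GB before any JVM overhead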

I am using PyCharm and have already tried the conversion with Spark DataFrame read/write and with Spark SQL/Hive, both locally and on HDFS.

Code (SparkSession setup shared by both attempts below):

# Note: with master("local[*]") everything runs in a single driver JVM,
# so spark.driver.memory is what actually bounds the heap here; the
# executor settings below have little effect in local mode.
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("ConvertPqtoOrc") \
    .config('spark.sql.debug.maxToStringFields', 100000) \
    .config('spark.network.timeout', 10000000) \
    .config('spark.executor.heartbeatInterval', 10000000) \
    .config('spark.storage.blockManagerSlaveTimeoutMs', 10000000) \
    .config('spark.executor.memory', '6g') \
    .config('spark.executor.cores', '4') \
    .config('spark.driver.memory', '6g') \
    .config('spark.cores.max', '300') \
    .config('spark.sql.orc.enabled', 'true') \
    .config('spark.sql.hive.convertMetastoreOrc', 'true') \
    .config('spark.sql.orc.filterPushdown', 'true') \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext
Spark read/write attempt:

df_spark = spark.read.parquet(pq_path)
df_spark.write.mode("overwrite").format("orc").save(ORC_path)
spark.sql("DROP TABLE if exists tbl_pq")
spark.sql("DROP TABLE if exists tbl_orc")

spark.sql("CREATE TABLE IF NOT EXISTS tbl_pq USING PARQUET LOCATION '{}'".format(db_path + Pq_file))

spark.sql("CREATE TABLE IF NOT EXISTS tbl_orc ({0} double) STORED AS ORC LOCATION '{1}'".format(" double, ".join(map(str, myColumns)), db_path + Output_file))

spark.sql("INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl_porque")
Spark SQL/Hive attempt:

spark.sql("DROP TABLE if exists tbl_pq")
spark.sql("DROP TABLE if exists tbl_orc")

spark.sql("CREATE TABLE IF NOT EXISTS tbl_pq USING PARQUET LOCATION '{}'".format(db_path + Pq_file))

spark.sql("CREATE TABLE IF NOT EXISTS tbl_orc ({0} double) STORED AS ORC LOCATION '{1}'".format(" double, ".join(map(str, myColumns)), db_path + Output_file))

spark.sql("INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl_porque")
Both approaches work on filtered datasets (up to 20,000 columns / 7,000 rows), but when I use the whole table, both cases fail after a long run with the following warning/error:

WARN DAGScheduler: Broadcasting large task binary with size 7.3 MiB
ERROR Utils: Aborting task
java.lang.OutOfMemoryError: Java heap space
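An out-of-core alternative would be to skip Spark and convert the table in column batches with pyarrow, so only one batch is ever in memory at a time (a sketch, assuming pyarrow >= 4.0 for its ORC writer; pq_path, ORC_path and the batch size are reused/assumed from above):

import pyarrow.parquet as pq
import pyarrow.orc as orc

schema = pq.read_schema(pq_path)          # reads metadata only, no data
batch = 10000                             # assumed batch size; tune to fit in RAM
for i in range(0, len(schema.names), batch):
    cols = schema.names[i:i + batch]
    table = pq.read_table(pq_path, columns=cols)   # load just this column slice
    orc.write_table(table, "{}/part_{:04d}.orc".format(ORC_path, i // batch))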