Apache Spark: writing Parquet partitioned by the first letters of a column


I have done a lot of research on this topic. I have a dataset that is 3 TB in size. Here is the schema of the table:

root
 |-- user: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)

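Every day I get a list of users whose attributes I need. I would like to know whether I can write the schema above to Parquet, partitioned by the first two letters of the user. For example,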

Omkar | [a,b,c,d,e]
Mac   | [a,b,c,d,e]
Zee   | [a,b,c,d,e]
Kim   | [a,b,c,d,e]
Kelly | [a,b,c,d,e]

On the dataset above, can I do something like this:

spark.write.mode("overwrite").partitionBy("user".substr(0,2)).parquet("path/to/location")

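If I do this, I expect that far less data will be loaded into memory the next time I join against the users, because only the matching partitions need to be read.

Does anyone who has implemented something like this have any comments?

Thanks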

You can. Just replace your code with:

df
  .withColumn("prefix", $"user".substr(0, 2)) // add the prefix column
  .write.mode("overwrite")
  .partitionBy("prefix") // use it for partitioning
  .parquet("path/to/location")

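Yes, this should work, but when you load the data you also need to include the prefix, "user".substr(0,2), in your filter/join condition; otherwise partition pruning will not work. Spark cannot know that the user Omkar is in the partition Om.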
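
A minimal sketch of the read side, assuming a local SparkSession and a tiny stand-in for the real table. The object name, the `wanted` sample list and the local master setting are illustrative assumptions, not part of the question. It derives the same prefix for the daily user list and filters on the partition column with literal values, so the pruning does not depend on how the join is planned:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PrefixPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("prefix-partition-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    val out = "path/to/location"
    val attrs = Seq("a", "b", "c", "d", "e")

    // Small stand-in for the 3 TB table from the question.
    val df = Seq("Omkar", "Mac", "Zee", "Kim", "Kelly")
      .map(u => (u, attrs))
      .toDF("user", "attributes")

    // Write side: materialise the prefix as a column and partition on it.
    df.withColumn("prefix", $"user".substr(0, 2))
      .write.mode("overwrite")
      .partitionBy("prefix")
      .parquet(out)

    // Read side: the daily user list (hypothetical sample) gets the same prefix.
    val wanted = Seq("Omkar", "Kelly").toDF("user")
      .withColumn("prefix", $"user".substr(0, 2))

    // Collect the distinct prefixes on the driver and filter on the
    // partition column with literals, so only those directories are scanned.
    val prefixes = wanted.select("prefix").distinct().as[String].collect()

    val result = spark.read.parquet(out)
      .where(col("prefix").isin(prefixes: _*))
      .join(wanted, Seq("prefix", "user"))

    result.explain() // look for PartitionFilters in the file scan node
    result.show(false)

    spark.stop()
  }
}
```

Filtering with literal prefixes is a deliberate choice here: a join on `prefix` alone may still be pruned by dynamic partition pruning on Spark 3.0 and later, but that depends on the optimizer's decisions, whereas an explicit filter on the partition column is pruned statically.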