PySpark dataframe: split a key by limiting the number of rows
I have a dataframe with 3 columns, as below:
+-------+--------------------+-------------+
| id | reports | hash |
+-------+--------------------+-------------+
|abc | [[1,2,3], [4,5,6]] | 9q5 |
|def | [[1,2,3], [4,5,6]] | 9q5 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 |
|lmn | [[1,2,3], [4,5,6]] | abc |
|opq | [[1,2,3], [4,5,6]] | abc |
|rst | [[1,2,3], [4,5,6]] | abc |
+-------+--------------------+-------------+
Now my problem is that I need to limit the number of rows per hash.
I think I could transform the hash, e.g. 9q5 becomes 9q5_1 for the first 1k rows, 9q5_2 for the second 1k rows, and so on, for every value in the hash column.
There is a similar but different question where the dataframe itself gets split; I want to keep a single dataframe and only change the key values.
Any suggestion on how to achieve this? Thanks.

I found a solution. I use a Window function to create a new column with an incremental index for each value of the geohash column. Then I apply a udf that builds the new hash I need, 'geohash_X', from the original geohash and the index:
from pyspark.sql import Window
from pyspark.sql.functions import rank, udf

partition_size_limit = 10

# Build "geohash_N", where N is the zero-based bucket of the row's rank.
generate_indexed_geohash_udf = udf(
    lambda geohash, index: "{0}_{1}".format(geohash, int(index / partition_size_limit))
)

window = Window.partitionBy(df_split['geohash']).orderBy(df_split['id'])
df_split.select('*', rank().over(window).alias('index')) \
    .withColumn("indexed_geohash", generate_indexed_geohash_udf('geohash', 'index'))
The result is:
+-------+--------------------+-------------+-------------+-----------------+
| id | reports | hash | index | indexed_geohash |
+-------+--------------------+-------------+-------------+-----------------+
|abc | [[1,2,3], [4,5,6]] | 9q5 | 1 | 9q5_0 |
|def | [[1,2,3], [4,5,6]] | 9q5 | 2 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 3 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 4 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 5 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 6 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 7 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 8 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 9 | 9q5_0 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 10 | 9q5_1 |
|ghi | [[1,2,3], [4,5,6]] | 9q5 | 11 | 9q5_1 |
|lmn | [[1,2,3], [4,5,6]] | abc | 1 | abc_0 |
|opq | [[1,2,3], [4,5,6]] | abc | 2 | abc_0 |
|rst | [[1,2,3], [4,5,6]] | abc | 3 | abc_0 |
+-------+--------------------+-------------+-------------+-----------------+
Edit: Steven's answer also works perfectly:
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import rank

partition_size_limit = 10

window = Window.partitionBy(df_split['geohash']).orderBy(df_split['id'])
df_split.select('*', rank().over(window).alias('index')).withColumn(
    "indexed_geohash",
    F.concat_ws(
        "_",
        F.col("geohash"),
        F.floor(F.col("index") / F.lit(partition_size_limit)).cast("string"),
    ),
)
You don't need a udf. Keep it in Spark for better performance. Replace generate_indexed_geohash_udf with something like F.concat_ws("_", F.col("geohash"), F.floor(F.col("index") / F.lit(partition_size_limit)).cast("string"))
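The bucket arithmetic both versions rely on is just an integer division of the per-hash rank. A minimal pure-Python sketch of the same logic (hypothetical helper names, no Spark required), assuming the rows are ranked in id order within each hash group:

```python
from collections import defaultdict

def indexed_geohash(geohash, index, partition_size_limit=10):
    # Same arithmetic as the udf: integer-divide the 1-based rank by the limit.
    return "{0}_{1}".format(geohash, index // partition_size_limit)

def split_keys(rows, partition_size_limit=10):
    # rows: iterable of (id, geohash) pairs; sorting by id emulates orderBy('id').
    counters = defaultdict(int)
    out = []
    for row_id, geohash in sorted(rows):
        counters[geohash] += 1  # emulate rank() within the geohash partition
        out.append((row_id,
                    indexed_geohash(geohash, counters[geohash],
                                    partition_size_limit)))
    return out
```

One caveat on the Spark versions: rank() assigns equal ranks to ties in the orderBy column, so if id values can repeat, row_number() would guarantee a strictly increasing index instead.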