PySpark: split a DataFrame into batches
My requirement is to split the DataFrame into batches, each batch containing exactly two items, with the batch number (BATCH in the output) increasing by one per batch. Input:
col#1 col#2 DATE
A 1 202010
B 1.1 202010
C 1.2 202010
D 1.3 202001
E 1.4 202001
Expected output (BATCH increments every two rows):
col#1 col#2 DATE   BATCH
A     1    202010  1
B     1.1  202010  1
C     1.2  202010  2
D     1.3  202001  2
E     1.4  202001  3
I achieved this with the following approach:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName='rowId'):
    # Prepend a LongType row-id column to the existing schema.
    new_schema = StructType([StructField(colName, LongType(), True)] + df.schema.fields)
    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))
    return spark.createDataFrame(new_rdd, new_schema)

chunk_size = 2
final_new = dfZipWithIndex(input_df)
# ceil(rowId / chunk_size) pairs rows 1-2 into BATCH 1, rows 3-4 into BATCH 2, ...
# (BATCH must be added to final_new, which carries rowId, not to input_df)
temp_final = final_new.withColumn('BATCH', F.ceil(F.col('rowId') / chunk_size))
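For completeness, here is a minimal sketch of consuming the batches one at a time; num_batches and batch_df are illustrative names, not part of the original code:

num_batches = temp_final.agg(F.max('BATCH')).collect()[0][0]
for b in range(1, num_batches + 1):
    # Select the rows belonging to batch b and drop the helper columns.
    batch_df = temp_final.filter(F.col('BATCH') == b).drop('rowId', 'BATCH')
    batch_df.show()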
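An alternative sketch that stays in the DataFrame API, avoiding the RDD round-trip, is to build the row id with a window function (assuming the ordering given by monotonically_increasing_id is acceptable); note the global window pulls every row into a single partition, so it only suits small frames:

from pyspark.sql import Window

batched = (input_df
           .withColumn('mono_id', F.monotonically_increasing_id())
           .withColumn('rowId', F.row_number().over(Window.orderBy('mono_id')))
           .withColumn('BATCH', F.ceil(F.col('rowId') / chunk_size))
           .drop('mono_id'))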