PySpark: split a DataFrame into batches
My requirement is to split the DataFrame into batches, each batch containing exactly two items, with the batch number (BATCH in the output) increasing by one per batch. Input:
col#1 col#2 DATE
A 1 202010
B 1.1 202010
C 1.2 202010
D 1.3 202001
E 1.4 202001
Expected output (BATCH increments every two rows):
col#1 col#2 DATE   BATCH
A     1    202010  1
B     1.1  202010  1
C     1.2  202010  2
D     1.3  202001  2
E     1.4  202001  3
I achieved this with the following approach:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName='rowId'):
    # Prepend a LongType row-id column to the existing schema.
    new_schema = StructType([StructField(colName, LongType(), True)] + df.schema.fields)
    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))
    return spark.createDataFrame(new_rdd, new_schema)

chunk_size = 2
final_new = dfZipWithIndex(input_df)
# ceil(rowId / chunk_size) pairs rows 1-2 into BATCH 1, rows 3-4 into BATCH 2, ...
# (BATCH must be added to final_new, which carries rowId, not to input_df)
temp_final = final_new.withColumn('BATCH', F.ceil(F.col('rowId') / chunk_size))
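For completeness, here is a minimal sketch of consuming the batches one at a time; num_batches and batch_df are illustrative names, not part of the original code:

num_batches = temp_final.agg(F.max('BATCH')).collect()[0][0]
for b in range(1, num_batches + 1):
    # Select the rows belonging to batch b and drop the helper columns.
    batch_df = temp_final.filter(F.col('BATCH') == b).drop('rowId', 'BATCH')
    batch_df.show()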
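An alternative sketch that stays in the DataFrame API, avoiding the RDD round-trip, is to build the row id with a window function (assuming the ordering given by monotonically_increasing_id is acceptable); note the global window pulls every row into a single partition, so it only suits small frames:

from pyspark.sql import Window

batched = (input_df
           .withColumn('mono_id', F.monotonically_increasing_id())
           .withColumn('rowId', F.row_number().over(Window.orderBy('mono_id')))
           .withColumn('BATCH', F.ceil(F.col('rowId') / chunk_size))
           .drop('mono_id'))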