PySpark: identify renewed values in Spark


I have a Spark customer data frame, as shown below:

#SparkR code
customers <- data.frame(custID = c("001", "001", "001", "002", "002", "002", "002"),
  date = c("2017-02-01", "2017-03-01", "2017-04-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
  value = c('new', 'good', 'good', 'new', 'good', 'new', 'bad'))
customers <- createDataFrame(customers)
display(customers)

custID|  date     | value
--------------------------
001   | 2017-02-01| new
001   | 2017-03-01| good
001   | 2017-04-01| good
002   | 2017-01-01| new
002   | 2017-02-01| good
002   | 2017-03-01| new
002   | 2017-04-01| bad
In PySpark:

from pyspark.sql import SparkSession, Window, functions as f
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# df is equal to your customers dataframe
df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', header=True, sep='|').cache()

# rank each customer's 'new' records in date order
df_new = df.filter(df['value'] == 'new') \
    .withColumn('tag', f.rank().over(Window.partitionBy('custID').orderBy('date')))

# non-'new' records get a null tag, then stack the two parts back together
df = df_new.union(df.filter(df['value'] != 'new').withColumn('tag', f.lit(None).cast(IntegerType())))

# carry the latest non-null tag forward: collect the tags seen so far per
# customer (ordered by date) and keep the last one in the list
df = df.withColumn('tag', f.collect_list('tag').over(Window.partitionBy('custID').orderBy('date'))) \
    .withColumn('tag', f.udf(lambda x: x.pop(), IntegerType())('tag'))

df.show()
And the output:

+------+----------+-----+---+                                                   
|custID|      date|value|tag|
+------+----------+-----+---+
|   001|2017-02-01|  new|  1|
|   001|2017-03-01| good|  1|
|   001|2017-04-01| good|  1|
|   002|2017-01-01|  new|  1|
|   002|2017-02-01| good|  1|
|   002|2017-03-01|  new|  2|
|   002|2017-04-01|  bad|  2|
+------+----------+-----+---+

By the way, this is easy to do in pandas.
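For example, a minimal pandas sketch of the same tagging (a running count of 'new' rows per customer) could look like this; the DataFrame literal below just mirrors the sample data from the question:

import pandas as pd

# same sample data as the customers table above
customers = pd.DataFrame({
    'custID': ['001', '001', '001', '002', '002', '002', '002'],
    'date':   ['2017-02-01', '2017-03-01', '2017-04-01', '2017-01-01',
               '2017-02-01', '2017-03-01', '2017-04-01'],
    'value':  ['new', 'good', 'good', 'new', 'good', 'new', 'bad'],
})

# sort by customer and date, then count how many 'new' rows have been seen
# so far within each customer; that running count is the tag
customers = customers.sort_values(['custID', 'date'])
customers['tag'] = (customers['value'] == 'new').astype(int) \
    .groupby(customers['custID']).cumsum()
print(customers)

The cumulative sum of the "is new" indicator reproduces the tag column shown in the expected output above.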

Alternatively, this can be done with SparkR SQL (the full code is shown after the table below):

Filter out all the records with value 'new' and assign each of them a row_number() per customer, then left join back to the original data on custID and date and group by 1, 2, 3 to keep each row's latest tag.

Thanks, I can do this in R or pandas, but I have a very large data frame and need Spark.
+------+----------+-----+---+                                                   
|custID|      date|value|tag|
+------+----------+-----+---+
|   001|2017-02-01|  new|  1|
|   001|2017-03-01| good|  1|
|   001|2017-04-01| good|  1|
|   002|2017-01-01|  new|  1|
|   002|2017-02-01| good|  1|
|   002|2017-03-01|  new|  2|
|   002|2017-04-01|  bad|  2|
+------+----------+-----+---+
# register the customers DataFrame as the temp view "df" used below
createOrReplaceTempView(customers, "df")

# keep only the 'new' records
df_new <- sql("select * from df where value = 'new'")
createOrReplaceTempView(df_new, "df_new")

# number each customer's 'new' records in date order
df_new <- sql("select *, row_number() over (partition by custID order by date) as tag from df_new")
createOrReplaceTempView(df_new, "df_new")

# join every row to the 'new' rows on or before its date and keep the
# highest tag seen so far (max, so the latest 'new' wins)
df <- sql("select custID, date, value, max(tag) as tag from
           (select t1.*, t2.tag from df t1 left outer join df_new t2 on
            t1.custID = t2.custID and t1.date >= t2.date)
           group by 1, 2, 3")