Python pyspark dataframe foreach to fill a list

I am working with Spark 1.6.1 and Python 2.7, and I need to solve the following problem:
- Get a dataframe A with X rows
- For each row in A, depending on a field, create one or more rows of a new dataframe B
- Save the new dataframe B

The solution I have right now is to collect dataframe A, iterate over it, append the rows of B to a list, and then create dataframe B from that list.
With this solution I obviously lose all the benefits of working with dataframes. I would like to use foreach instead, but I can't find a way to do it. So far I have tried:
- Passing an empty list to the foreach function (foreach is simply ignored and nothing happens)
- Creating a global variable to use inside the foreach function (it complains that it can't find the list)
    def f(row, list):
        if row.one:
            list += [Row(type='one', field='ok')]
        else:
            list += [Row(type='one', field='ok')]
            list += [Row(type='two', field='nok')]

    list = []
    dfA.foreach(lambda x: f(x, list))
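As I mentioned, this does nothing; the function is never executed.
I also tried the same with the list defined at the beginning of the class.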
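A possible explanation for the list staying empty, offered as an assumption rather than a diagnosis of this exact setup: foreach runs the function on the executors, and each executor works on a serialized copy of whatever the function captures, so the appends land in that copy and never reach the list on the driver. The sketch below imitates this in plain Python with pickle; `run_on_executor` is a hypothetical stand-in, not a Spark API.

```python
import pickle

def append_row(row, collected):
    """Same shape as f(row, list) above: mutate the list that was passed in."""
    collected.append(row)

def run_on_executor(func, rows, captured):
    """Hypothetical stand-in for an executor: it receives a pickled copy
    of the captured list, so its appends never reach the driver."""
    executor_copy = pickle.loads(pickle.dumps(captured))
    for row in rows:
        func(row, executor_copy)
    return executor_copy

driver_list = []
executor_view = run_on_executor(append_row, ["row1", "row2"], driver_list)

print(driver_list)    # the driver-side list is untouched
print(executor_view)  # only the executor's private copy was filled
```

This is why side effects on driver-side Python objects are usually not a workable way to build a new dataset; a transformation that returns the new rows tends to fit better.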
--------- Edit 2:
What I am doing right now is:
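    import re
    from datetime import datetime
    from pyspark.sql import Row

    list = []
    for row in dfA.collect():
        # event type 1: dates found in the "new" part of the raw string
        string = re.search(a_regex, row['raw'])
        if string:
            dates = re.findall(date_regex, string.group())
            for date in dates:
                date_string = datetime.strptime(date, '%Y-%m-%d').date()
                list += [Row(event_type='1', event_date=date_string)]
        # event type 2: dates found in the "removed" part of the raw string
        b_string = re.search(b_regex, row['raw'])
        if b_string:
            dates = re.findall(date_regex, b_string.group())
            for date in dates:
                # assign to the same name that is passed to Row below
                date_string = datetime.strptime(date, '%Y-%m-%d').date()
                list += [Row(event_type='2', event_date=date_string)]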
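And then:
    dfB = self._sql_context.createDataFrame(list)
dfA is produced by another process and I cannot change it. I know this is a really clumsy way of using dataframes, but there is nothing I can do about that.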
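One way to keep this work distributed instead of collecting, sketched under assumptions: write "one input row becomes zero or more output rows" as a plain function and pass it to flatMap on the dataframe's underlying RDD. `expand_raw` and `build_df_b` are hypothetical names, the regexes are copied from Edit 3 of this question, and `sql_context` stands for whatever SQLContext is already in use. Note that createDataFrame infers the column types from the data here, so this sketch assumes at least one row is produced.

```python
import re
from datetime import datetime

A_REGEX = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
B_REGEX = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
DATE_REGEX = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'

def expand_raw(raw):
    """Return a list of (event_type, event_date) tuples for one raw string."""
    out = []
    for event_type, regex in (('1', A_REGEX), ('2', B_REGEX)):
        match = re.search(regex, raw)
        if match:
            for found in re.findall(DATE_REGEX, match.group()):
                out.append((event_type, datetime.strptime(found, '%Y-%m-%d').date()))
    return out

def build_df_b(df_a, sql_context):
    """Hypothetical wiring: runs expand_raw on the executors via flatMap."""
    rows = df_a.rdd.flatMap(lambda row: expand_raw(row['raw']))
    return sql_context.createDataFrame(rows, ['event_type', 'event_date'])

# The row logic can be checked locally, without a cluster:
sample = ('{"new":[{"start":"2018-03-10","end":"2018-03-16",'
          '"scheduled_by_system":null}],"removed":[]}')
print(expand_raw(sample))
```

With this shape, dfB would be obtained as `build_df_b(dfA, self._sql_context)`, and no list has to be filled on the driver.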
-------------------- Edit 3:
Sample of dfA.raw:
{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]}
{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}
{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
And the regexes:
    a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
    b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
    date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'
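To make the two-step matching concrete, here is a small plain-Python walk-through on two of the sample strings above: a_regex first isolates the "new" part, then date_regex pulls the start dates out of that part. The variable names `raw_with_new` and `raw_without_new` are only illustrative.

```python
import re

a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'

raw_with_new = ('{"new":[{"start":"2018-03-10","end":"2018-03-16",'
                '"scheduled_by_system":null}],"removed":[]}')
raw_without_new = ('{"new":[],"removed":[{"start":"2018-03-10",'
                   '"end":"2018-03-16","scheduled_by_system":null}]}')

# Step 1: isolate the "new" part (the lazy match stops at the first "]").
new_part = re.search(a_regex, raw_with_new).group()
print(new_part)

# Step 2: extract only the start dates from that part.
print(re.findall(date_regex, new_part))    # ['2018-03-10']

# An empty "new" list still matches, but yields no dates.
empty_part = re.search(a_regex, raw_without_new).group()
print(empty_part)                          # "new":[]
print(re.findall(date_regex, empty_part))  # []
```

Because the lazy match ends at the first closing bracket, this approach relies on the lists not containing nested arrays.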
To inspect the raw column and its schema:
    dfA.select('raw').show(2, False)
    df.select('raw').printSchema()
After selecting the required raw column, write a udf function that extracts the event dates from the raw string:
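    import re
    from datetime import datetime

    def searchUdf(regex, dateRegex, x):
        # collect every start date found in the part of x matched by regex
        list_return = []
        string = re.search(regex, x)
        if string:
            dates = re.findall(dateRegex, string.group())
            for date in dates:
                date_string = datetime.strptime(date, '%Y-%m-%d').date()
                list_return.append(date_string)
        return list_return

    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    # the udf returns an array of dates for each raw string
    udfFunctionCall = F.udf(searchUdf, T.ArrayType(T.DateType()))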
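The udf function parses the raw column string, takes regex and dateRegex as arguments, and returns the parsed dates as an ArrayType column.
Call the udf function once per event type, add event_type as a literal column next to the event_date column, and filter out the rows whose event_date array is empty: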
    df = df.select("raw")
    adf = df.select(F.lit(1).alias("event_type"), udfFunctionCall(F.lit(a_regex), F.lit(date_regex), df.raw).alias("event_date")) \
        .filter(F.size(F.col("event_date")) > 0)
    # event type 2 uses b_regex so that the "removed" dates are parsed
    bdf = df.select(F.lit(2).alias("event_type"), udfFunctionCall(F.lit(b_regex), F.lit(date_regex), df.raw).alias("event_date")) \
        .filter(F.size(F.col("event_date")) > 0)
The regexes are the ones used in the question:
    a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
    b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
    date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'
Now you have two dataframes, one for each event type. The last step is to merge them:
    adf.unionAll(bdf)
That's all; this should resolve the problem.
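If one row per date is wanted rather than one array per input row, the array column can be flattened afterwards. A minimal sketch, assuming `pyspark.sql.functions.explode` is available (it is part of the 1.6 functions module); `explode_dates` and `explode_rows` are hypothetical helper names, and the plain-Python `explode_rows` only models the reshaping so that it can be checked without a cluster.

```python
def explode_dates(df):
    """Turn each (event_type, [d1, d2, ...]) row into one row per date.
    Assumes df has the event_type and event_date columns built above."""
    from pyspark.sql import functions as F  # imported lazily; needs pyspark
    return df.select("event_type", F.explode("event_date").alias("event_date"))

def explode_rows(rows):
    """Plain-Python model of the same reshaping, for illustration only."""
    return [(event_type, d) for event_type, dates in rows for d in dates]

sample_rows = [(1, ["2018-03-10"]), (2, ["2017-02-04", "2017-02-11"])]
print(explode_rows(sample_rows))
```

Under these assumptions, the call would look like `explode_dates(adf.unionAll(bdf))`.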
With the following raw column:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|raw |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]} |
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]} |
|{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
you should get the following from adf.unionAll(bdf), with bdf built from b_regex:
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|event_type|event_date                                                                                                                                                                                                              |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1         |[2018-03-10]                                                                                                                                                                                                            |
|1         |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]            |
|2         |[2018-03-10]                                                                                                                                                                                                            |
|2         |[2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-01-28, 2017-09-16, 2017-09-02, 2017-09-30, 2017-10-07, 2017-09-23, 2017-12-16, 2017-12-23, 2018-01-06, 2017-12-09, 2017-12-02, 2018-02-10]|
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
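Comments:
- Please share what you have tried, along with sample input and expected output.
- Once it is done, the expected output is obviously the filled list.
- What is the condition for getting X rows from A? Can you post sample input and expected output? Please also explain "for each row in A, depending on a field, create one or more rows of a new dataframe B".
- Hi Ramesh, first let me paste what I have right now, so you can see what I am trying to do.
- Hey Ramesh, it looks good... I will try all the cases (new but no removed, removed but no new, and lots of new with lots of removed) to see how it works, and I will let you know (and accept the answer if everything goes well :)). Thanks a lot for the huge help.
- Hi Ramesh, I am sorry, but it is not working. For this input: {"new":[],"removed":[{"start":"2018-02-24","end":"2018-03-02"},{"start":"2018-03-03","end":"2018-03-09"},{"start":"2018-03-10","end":"2018-03-16"},{"start":"2018-03-17","end":"2018-03-23"},{"start":"2018-03-24","end":"2018-03-30"} I only get the first one. As far as I understand, the filter keeps the rows that have at least one entry, but looking at the code you sent, it would only get the first one... Am I missing something?
- Why is it not working? Can you explain? Are you saying it only gets the first row?
- Because I only get a df with one row that has event_type=2 and event_date=2018-02-24 (which is the first one).
- This not only helped, it also taught me a few tricks for the future :)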