Apache Spark PySpark - how to merge multiple JSON streams into a single DataFrame based on a condition?
I have the following JSON events coming from different streaming topics:
customer = [
  {
    "Customer_id": "103",
    "Customer_name": "Hari",
    "email_address": "hari@gmail.com"
  }
]
product = [
  {
    "Customer_id": "103",
    "product_id": " 205",
    "product_name": "Books",
    "product_Category": "Stationary"
  }
]
Sales = [
  {
    "customer_id": "103",
    "line": {
      "product_id": "205",
      "purchase_time": "2017-08-19 12:17:55-0400",
      "quantity": "2",
      "unit_price": "25000"
    },
    "shipping_address": "Chennai"
  }
]
Each of the events above arrives on a separate streaming topic. Here is what I tried for this use case:
sales_schema = StructType([
    StructField("Customer_id", StringType(), True),
    StructField("Customer_name", StringType(), True),
    StructField("email_address", StringType(), True),
    StructField("product_Category", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("purchase_time", StringType(), True),
    StructField("quantity", StringType(), True),
    StructField("unit_price", StringType(), True),
    StructField("shipping_address", StringType(), True),
])
cus_Topic = session.sparkContext.parallelize(customer)
sales_df = session.createDataFrame(cus_Topic, sales_schema)
product_topic = session.sparkContext.parallelize(product)
productDF = session.createDataFrame(product_topic)
sales_df.show()
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|Customer_id|Customer_name|   email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|        103|         Hari|  hari@gmail.com|            null|      null|        null|         null|    null|      null|            null|
|        104|        Umesh|Umesh3@gmail.com|            null|      null|        null|         null|    null|      null|            null|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
productDF.show()
+-----------+----------------+----------+------------+
|Customer_id|product_Category|product_id|product_name|
+-----------+----------------+----------+------------+
| 103| Stationary| 205| Books|
| 104| Electronics| 206| Mobile|
+-----------+----------------+----------+------------+
Now I want to merge these DataFrames on the customer id:
product_search_DF = sales_df.join(productDF, [sales_df.Customer_id==productDF.Customer_id], 'left_outer')
product_search_DF.show()
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
|Customer_id|Customer_name| email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|Customer_id|product_Category|product_id|product_name|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
| 104| Umesh|Umesh3@gmail.com| null| null| null| null| null| null| null| 104| Electronics| 206| Mobile|
| 103| Hari| hari@gmail.com| null| null| null| null| null| null| null| 103| Stationary| 205| Books|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
But this produces duplicate columns: Customer_id, product_Category, product_id and product_name each appear twice.
Also, ideally all of this data from the customer, product and sales topics should end up merged into a single DataFrame.
What is the correct way to achieve this?
Any help is appreciated.
Comments:

dsk: I'm afraid I don't fully understand your question. Are you looking for how to extract the JSON data into a DataFrame? Correct me if I'm wrong.

OP: Hi @dsk, I've updated my question. Ideally I need to join the different JSON streams into a single DataFrame. Hopefully the question is easier to understand now; can you help?

SD3: Two things here. 1) Why store customer details with the sales schema, with every other field filled with null? Just store customer with a customer schema containing the 3 customer fields, then do a left outer join and you won't get those duplicate columns. 2) If you have a specific reason to pad the customer schema with extra always-null fields, then in the end you can simply do a select on the joined DataFrame and pick only the fields you need from each one.

OP: Hi @SD3, I need to merge all three topics (customer, product, sales) into a single DataFrame, sales_df, so it holds the merged data about customer events and I can apply further transformations as needed. Please correct me if this approach is not right when receiving the DataFrames from live streaming. Thanks.