Apache Spark PySpark - how to merge multiple JSON streams into a single DataFrame based on a condition?
I have the following JSON events coming from different streaming topics:
customer = [
  {
    "Customer_id": "103",
    "Customer_name": "Hari",
    "email_address": "hari@gmail.com"
  }
]
product = [
  {
    "Customer_id": "103",
    "product_id": " 205",
    "product_name": "Books",
    "product_Category": "Stationary"
  }
]
Sales = [
  {
    "customer_id": "103",
    "line": {
      "product_id": "205",
      "purchase_time": "2017-08-19 12:17:55-0400",
      "quantity": "2",
      "unit_price": "25000"
    },
    "shipping_address": "Chennai"
  }
]
Each of the events above arrives on a separate streaming topic. Here is what I tried for this use case:
sales_schema = StructType([
    StructField("Customer_id", StringType(), True),
    StructField("Customer_name", StringType(), True),
    StructField("email_address", StringType(), True),
    StructField("product_Category", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("purchase_time", StringType(), True),
    StructField("quantity", StringType(), True),
    StructField("unit_price", StringType(), True),
    StructField("shipping_address", StringType(), True),
])
cus_Topic = session.sparkContext.parallelize(customer)
sales_df = session.createDataFrame(cus_Topic, sales_schema)
product_topic = session.sparkContext.parallelize(product)
productDF = session.createDataFrame(product_topic)
sales_df.show()
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|Customer_id|Customer_name|   email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|        103|         Hari|  hari@gmail.com|            null|      null|        null|         null|    null|      null|            null|
|        104|        Umesh|Umesh3@gmail.com|            null|      null|        null|         null|    null|      null|            null|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
productDF.show()
+-----------+----------------+----------+------------+
|Customer_id|product_Category|product_id|product_name|
+-----------+----------------+----------+------------+
| 103| Stationary| 205| Books|
| 104| Electronics| 206| Mobile|
+-----------+----------------+----------+------------+
Now I want to merge these DataFrames on the customer id:
product_search_DF = sales_df.join(productDF, [sales_df.Customer_id==productDF.Customer_id], 'left_outer')
product_search_DF.show()
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
|Customer_id|Customer_name| email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|Customer_id|product_Category|product_id|product_name|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
| 104| Umesh|Umesh3@gmail.com| null| null| null| null| null| null| null| 104| Electronics| 206| Mobile|
| 103| Hari| hari@gmail.com| null| null| null| null| null| null| null| 103| Stationary| 205| Books|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
But this produces duplicate columns: Customer_id, product_Category, product_id and product_name each appear twice.
Also, ideally all of this data from the customer, product and sales topics should end up merged into a single DataFrame.
What is the correct way to achieve this?
Any help is appreciated.
Comments:

dsk: I'm afraid I don't fully understand your question. Are you looking for how to extract the JSON data into a DataFrame? Correct me if I'm wrong.

OP: Hi @dsk, I've updated my question. Ideally I need to join the different JSON streams into a single DataFrame. Hopefully the question is easier to understand now; can you help?

SD3: Two things here. 1) Why store customer details with the sales schema, with every other field filled with null? Just store customer with a customer schema containing the 3 customer fields, then do a left outer join and you won't get those duplicate columns. 2) If you have a specific reason to pad the customer schema with extra always-null fields, then in the end you can simply do a select on the joined DataFrame and pick only the fields you need from each one.

OP: Hi @SD3, I need to merge all three topics (customer, product, sales) into a single DataFrame, sales_df, so it holds the merged data about customer events and I can apply further transformations as needed. Please correct me if this approach is not right when receiving the DataFrames from live streaming. Thanks.