
MongoDB: How can I merge two documents into one document by inserting sub-documents, in PySpark?

Tags: mongodb, apache-spark, pyspark, kafka-consumer-api, spark-structured-streaming

I have a big problem and I hope I can explain clearly what I am trying to do. I am building a streaming pipeline in PySpark (Spark Structured Streaming), and I want to update the same document whenever new data arrives from the Kafka scrape.

Here is an example of the JSON sent to localhost and visible in MongoDB Compass:

{
  _id: ObjectId("28276465847392747"),
  id: "reply",
  Company: "reply",
  Value: {
    Date: "20-05-2020",
    Last_Hour_Contract: "09.12.25",
    Last_Hour: "09.14.30",
    Price: 16.08,
    Quantity: 8000,
    Medium_Price: 8.98,
    Min_Price: 8.98,
    Max_Price: 20.33,
    News: {
      id_news: "Reply_20-05-20",
      title_news: "titolo news",
      text: "text",
      date: "20-05-2020",
      hour: "09:13:00",
      subject: "Reply"
    }
  }
}
{
  _id: ObjectId("28276465847392747"),
  id: "reply",
  Company: "reply",
  Value: {
    Date: "20-05-2020",
    Last_Hour_Contract: "09.12.25",
    Last_Hour: "09.14.30",
    Price: 17.78,
    Quantity: 9000,
    Medium_Price: 67.98,
    Min_Price: 8.98,
    Max_Price: 20.33,
    News: {
      id_news: "Reply_20-05-20",
      title_news: "title_news",
      text: "text",
      date: "20-05-2020",
      hour: "09:13:00",
      subject: "Reply"
    }
  }
}
What I want to achieve is to merge the various documents (based on Company_Name = "Name_Company") into a single document as new data arrives.

The JSON document I would like is set up as follows:

{
  _id: ObjectId("3333884747656565"),
  id: "reply",
  Date: "21-05-2020",
  Company: "Reply",
  Value: {
    Date: "20-05-2020",
    Last_Hour_Contract: "09.12.25",
    Last_Hour: "09.14.30",
    Price: 16.08,
    Quantity: 8000,
    Medium_Price: 8.98,
    Min_Price: 8.98,
    Max_Price: 20.33,
    News: {
      id_news: "Reply_20-05-20",
      title_news: "title news...",
      text: "text...",
      date: "20-05-2020",
      hour: "09:13:00",
      subject: "Reply"
    },
    Date: "21-05-2020",
    Last_Hour_Contract: "09.12.25",
    Last_Hour: "09.16.50",
    Price: 16.68,
    Quantity: 7000,
    Medium_Price: 8.98,
    Min_Price: 8.98,
    Max_Price: 20.33,
    News: {
      id_news: "Reply_20-05-20",
      title_news: "title news...",
      text: "text...",
      date: "21-05-2020",
      hour: "09:14:00",
      subject: "Reply"
    }
  }
}
I have also attached an image so you can understand better (I hope the two arrows are clear):

How can I achieve this with PySpark? Thank you.

Here is my code:

from pyspark.sql import functions as F
from pyspark.sql.functions import expr

def writeStreamer(sdf):
    # Deduplicate the selected columns and hand each micro-batch to foreachBatch.
    return sdf.select("id_Borsa", "NomeAzienda", "Valori_Di_Borsa") \
        .dropDuplicates(["id_Borsa", "NomeAzienda", "Valori_Di_Borsa"]) \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(foreach_batch_function) \
        .start()


def foreach_batch_function(sdf, epoch_id):
    # Append the micro-batch to the MongoDB collection.
    sdf.write \
        .format("mongo") \
        .mode("append") \
        .option("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/DataManagement.Data") \
        .option("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/DataManagement.Data") \
        .save()  # "com.mongodb.spark.sql.DefaultSource"


df_borsa = spark.readStream.format("kafka") \
          .option("kafka.bootstrap.servers", kafka_broker) \
          .option("startingOffsets", "latest") \
          .option("subscribe","Reply_borsa") \
          .load() \
          .selectExpr("CAST(value AS STRING)") 

df_news = spark.readStream.format("kafka") \
          .option("kafka.bootstrap.servers", kafka_broker) \
          .option("startingOffsets", "latest") \
          .option("subscribe","Reply_news") \
          .load() \
          .selectExpr("CAST(value AS STRING)") 


df_borsa = df_borsa.withColumn("Valori_Di_Borsa",F.struct(F.col("Data"),F.col("PrezzoUltimoContratto"),F.col("Var%"),F.col("VarAssoluta"),F.col("OraUltimoContratto"),F.col("QuantitaUltimo"),F.col("QuantitaAcquisto"),F.col("QuantitaVendita"),F.col("QuantitaTotale"),F.col("NumeroContratti"),F.col("MaxOggi"),F.col("MinOggi")))

df_news = df_news.withColumn("News",F.struct(F.col("id_News"),F.col("TitoloNews"),F.col("TestoNews"),F.col("DataNews"),F.col("OraNews")))

# Apply watermarks on event-time columns
dfWithWatermark = df_borsa.select("id_Borsa","NomeAzienda","StartTime","Valori_Di_Borsa").withWatermark("StartTime", "2 hours") # maximal delay

df1WithWatermark = df_news.select("SoggettoNews","EndTime").withWatermark("EndTime", "3 hours") # maximal delay

# Join with event-time constraints
sdf = dfWithWatermark.join(df1WithWatermark,expr(""" 
      SoggettoNews = NomeAzienda AND
      EndTime >= StartTime AND
      EndTime <= StartTime + interval 1 hours
      """),
       "leftOuter").withColumn("Valori_Di_Borsa",F.struct(F.col("Valori_Di_Borsa.*"),F.col("News"))) 


query = writeStreamer(sdf)

spark.streams.awaitAnyTermination()
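For context, the listing above only casts the Kafka value to a string, yet later refers to named columns such as Data and PrezzoUltimoContratto, so the JSON payload has to be parsed somewhere. Below is a minimal sketch of that missing step, assuming a simplified, hypothetical schema (the real field list is whatever the payload actually contains):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical, trimmed-down schema for the Kafka JSON payload; the real one
# would declare every field referenced later (Data, PrezzoUltimoContratto, ...).
borsa_schema = StructType([
    StructField("id_Borsa", StringType()),
    StructField("NomeAzienda", StringType()),
    StructField("Data", StringType()),
    StructField("PrezzoUltimoContratto", StringType()),
])

# Turn the CAST(value AS STRING) column into real, typed columns.
df_borsa = (df_borsa
            .withColumn("parsed", F.from_json(F.col("value"), borsa_schema))
            .select("parsed.*"))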
You just need to use the $group stage to group the documents on the Company field, and the $push operator to add each grouped document's Value object to a newly formed array field.

So in Mongo the implementation above would look like this:

db.collection.aggregate([{
  $group: {
    _id: "$Company",
    id: { $first: '$id' },
    Date: { $first: '$Date' },
    Value: { $push: '$Value' }
  }
}])
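Here $group collapses all documents that share the same Company into one output document: $push appends each document's Value sub-document to an array, while $first keeps the id and Date of the first document seen in each group.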
You can easily translate the above aggregation into a PySpark implementation.

You would need to do something like the following:

pipeline = "{'$group': {'_id': '$Company', 'id': {'$first': '$id'}, 'Date': {'$first': '$Date'}, 'Value': {'$push': '$Value'}}}"
df = spark.read.format("mongo").option("pipeline", pipeline).load()
df.show()

Note: I am not an expert in PySpark, but you should be able to adapt this easily to the implementation you need.
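As a rough sketch only (my assumption of what a DataFrame-API translation could look like, reusing the column names from the asker's listing; it is not a tested implementation), the same grouping could be applied to each micro-batch inside foreach_batch_function before the write:

from pyspark.sql import functions as F

def foreach_batch_merged(batch_df, epoch_id):
    # Hypothetical variant of foreach_batch_function: group the micro-batch by
    # company and collect every Valori_Di_Borsa struct into one array column,
    # mirroring the $group/$push aggregation above.
    merged = (batch_df
              .groupBy("NomeAzienda")
              .agg(F.first("id_Borsa").alias("id_Borsa"),
                   F.collect_list("Valori_Di_Borsa").alias("Valori_Di_Borsa")))
    merged.write \
        .format("mongo") \
        .mode("append") \
        .option("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/DataManagement.Data") \
        .save()

Note that this only merges rows within a single micro-batch; with mode("append") each batch still produces its own document, so merging across batches would additionally require some kind of upsert on the MongoDB side, which this sketch does not attempt.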

It works perfectly in MongoDB Compass! But not in PySpark yet... I tried inserting your pipeline into foreach_batch_function here, but as soon as I open MongoDB Compass it still returns two or more documents rather than the single one I want.

I wrote exactly what you suggested, but nothing happens once I look in MongoDB Compass... I have tried inserting your pipeline at different points in my code, but nothing.

I am even more unlucky than you: I don't even know PySpark. If you like, I can delete this answer and upvote the question so that you can get an answer from an expert? Besides, that would also be a gain for me. (I was only trying to guide you, though.)

OK! If you could do that, you would be doing me a big favour, thank you.