How do I use a ForeachWriter to insert rows into MongoDB with Spark Structured Streaming in Python?

mongodb apache-spark pyspark

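I'm using Spark streaming to read data from Kafka and insert it into MongoDB. I'm using pyspark 2.4.4. I'm trying to use a ForeachWriter, because using only a foreach function would mean a connection is established for every row.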

from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        print(epoch_id)
        pass

    def process(self, row):
        # Write row to connection. This method is NOT optional in Python.
        #self.coll=None -> used this to test whether I get an exception when it is there, but I'm not getting one
        self.coll.insert_one(row.asDict())
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        print(error)
        pass

df_w=df7\
        .writeStream\
        .foreach(ForeachWriter())\
        .trigger(processingTime='1 seconds') \
        .outputMode("update") \
        .option("truncate", "false")\
        .start()
df_w=df7\
        .writeStream\
        .foreach(ForeachWriter())\
        .trigger(processingTime='1 seconds') \
        .outputMode("update") \
        .option("truncate", "false")\
        .start()

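My problem is that nothing gets inserted into MongoDB, and I can't find a solution. If I comment it out, I get an error. But the process method is not executed. Does anyone have any ideas?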

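You set the collection to None in the first line of the process function, so the row is inserted into nothing.

Also, I don't know whether it is only here or also in your code, but you have the writeStream part twice.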

This may not be documented in the Spark docs. But if you look at the definition of foreach in pyspark, it has the following lines of code:

    # Check if the data should be processed
    should_process = True
    if open_exists:
        should_process = f.open(partition_id, epoch_id)

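So whenever we open a new connection, open must return True. The actual documentation uses `pass` in open, which makes the method return None; None is falsy, so `process()` is never called. (This answer is for anyone who runs into the same problem in the future.)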
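For reference, here is a minimal sketch of the question's writer with open returning True. It is only an illustration under some assumptions: the class name `MongoForeachWriter` and the `client_factory` argument are hypothetical (the factory is just a hook so the class can be tried without a MongoDB server), and pymongo is assumed to be installed on the executors.

```python
class MongoForeachWriter:
    """Sketch: one MongoDB connection per partition and epoch."""

    def __init__(self, uri, db_name, coll_name, client_factory=None):
        # client_factory is a hypothetical hook for testing; when it is
        # omitted, pymongo.MongoClient is imported lazily inside open().
        self.uri = uri
        self.db_name = db_name
        self.coll_name = coll_name
        self.client_factory = client_factory
        self.connection = None
        self.coll = None

    def open(self, partition_id, epoch_id):
        factory = self.client_factory
        if factory is None:
            from pymongo import MongoClient  # assumes pymongo is installed
            factory = MongoClient
        self.connection = factory(self.uri)
        self.coll = self.connection[self.db_name][self.coll_name]
        # A falsy return value (such as the None produced by `pass`)
        # makes pyspark skip process() for this partition.
        return True

    def process(self, row):
        # Called once per row, reusing the connection made in open().
        self.coll.insert_one(row.asDict())

    def close(self, error):
        # Release the connection and surface any error from process().
        if self.connection is not None:
            self.connection.close()
        if error is not None:
            print(error)
```

With the question's names, this would be passed as `df7.writeStream.foreach(MongoForeachWriter("192.168.0.100:27017", "test", "output"))` followed by a single `.start()`. Creating the client in open rather than in `__init__` keeps the connection out of the object that Spark serializes and ships to the executors.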

I'm really sorry, I copied the code I was testing with. I set it to None only to verify whether it was working, so if I insert into mongo it should raise an error, right? I'm not getting one. Could you take another look?
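The missing exception can probably be explained by the pyspark lines quoted in the answer above: if open returns None, process is never entered, so the `self.coll=None` test line never runs either. The sketch below is a simplified imitation of that open/process/close dispatch in plain Python, not the actual pyspark implementation; `run_partition` and both writer classes are hypothetical names used only for illustration.

```python
def run_partition(writer, rows, partition_id=0, epoch_id=0):
    """Simplified imitation: process() runs only if open() returned a truthy value."""
    should_process = writer.open(partition_id, epoch_id)
    error = None
    try:
        if should_process:
            for row in rows:
                writer.process(row)
    except Exception as ex:
        error = ex
    finally:
        writer.close(error)
    if error:
        raise error


class WriterWithPass:
    """open() ends with `pass`, so it implicitly returns None."""

    def __init__(self):
        self.seen = []

    def open(self, partition_id, epoch_id):
        pass

    def process(self, row):
        self.seen.append(row)

    def close(self, error):
        pass


class WriterReturningTrue(WriterWithPass):
    """Same writer, but open() explicitly returns True."""

    def open(self, partition_id, epoch_id):
        return True


broken = WriterWithPass()
fixed = WriterReturningTrue()
run_partition(broken, ["a", "b"])
run_partition(fixed, ["a", "b"])
print(broken.seen)  # []
print(fixed.seen)   # ['a', 'b']
```

The writer whose open ends in `pass` never sees a row, while the one returning True sees both, which matches the behaviour described in the question.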