Python: load a Spark SQL table into a database table


Is there a way to load a Spark SQL table into a database table as-is, the way we would in SQL:

insert into database_table select * from sparksql_table;

pg_hook = PostgresHook(postgres_conn_id="ingestion_db", schema="ingestiondb")

connection = pg_hook.get_conn()

cursor = connection.cursor()

spark = SparkSession \
    .builder \
    .appName("Spark csv schema inference") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
I can run this:

spark.sql("select * from MetadataTable").show()

But not this:


cursor.execute("select * from MetadataTable")

Open a spark shell with the PostgreSQL driver package:

spark-shell --packages org.postgresql:postgresql:42.1.1

import java.util.Properties

val url = "jdbc:postgresql://localhost:5432/dbname"

def getProperties: Properties = {
  val prop = new Properties
  prop.setProperty("user", "dbuser")
  prop.setProperty("password", "dbpassword")
  prop.setProperty("driver", "org.postgresql.Driver")
  prop
}

val df = spark.sql("""select * from table""")

df.write.mode("append").option("driver", "org.postgresql.Driver").jdbc(url, "tablename", getProperties)

After that you can check the table in the postgres database. Also, have a look at the different save modes Spark offers and pick the one that suits your case.

Here is the Python equivalent of @Ghost9's answer.

Initialize the Spark session with the postgres driver package as follows (check for the correct version):

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.5") \
    .getOrCreate()
Note that spark.jars.packages takes Maven coordinates, not a bare jar filename (use spark.jars if you have a local jar instead). Then you can write to the database over JDBC with the following function:

def connect_to_sql(
        spark, df, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    jdbc_url = "jdbc:postgresql://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

    df.write.jdbc(url=jdbc_url, table=data_table, mode="append", properties=connection_details)
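For reference, the JDBC URL string that connect_to_sql builds has this shape (the hostname, port, and database name below are placeholder values, not ones from your environment):

```python
# Placeholder connection details, for illustration only.
jdbc_hostname, jdbc_port, database = "localhost", 5432, "ingestiondb"

# Same format string as in connect_to_sql above.
jdbc_url = "jdbc:postgresql://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)
print(jdbc_url)  # jdbc:postgresql://localhost:5432/ingestiondb
```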
The available modes are:

    append: Append contents of this DataFrame to existing data.
    overwrite: Overwrite existing data.
    ignore: Silently ignore this operation if data already exists.
    error (default case): Throw an exception if data already exists.
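If you want to catch a typo in the mode name before kicking off a write, a minimal sketch (the check_mode helper is an assumption for illustration, not part of Spark's API):

```python
# Hypothetical helper: validate a save mode before passing it to df.write.jdbc.
# These are the modes listed above; "errorifexists" is an alias for "error"
# accepted by recent Spark versions.
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}

def check_mode(mode):
    """Return mode unchanged if it is a known save mode, else raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError("unknown save mode: {!r}".format(mode))
    return mode
```

For example, df.write.jdbc(url=jdbc_url, table=data_table, mode=check_mode("append"), properties=connection_details) fails fast on a misspelled mode instead of surfacing the error inside Spark.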