
Python SparkSQL JDBC (PySpark) to Postgres: Creating Tables and Using CTEs


I am working on a project porting a Python proof of concept (POC) to PySpark. The POC leans heavily on Postgres, specifically the PostGIS geospatial extension. Most of the work consists of Python issuing commands to Postgres before calling the data back for final processing.

Some of the queries passed to Postgres contain CREATE TABLE, INSERT, CREATE TEMP TABLE, and CTEs with WITH statements. I am trying to determine whether it is possible to pass these queries from Spark to Postgres over JDBC.

Can anyone confirm whether this functionality is available in Spark's JDBC, for Postgres or other databases? To be clear, I want to pass plain-English SQL queries to Postgres rather than use the available SparkSQL APIs, since they do not support all of the operations I need. I am using Spark 2.3.0, PostgreSQL 10.11, and Python 2.7.5 (yes, I know about Python 2's EOL; that is another story).

Here is what I have tried so far:

Using SparkSession.read

  • Create a Spark session for Postgres
  • Define the query (qry) to pass to the dbtable param
  • Pass qry to the dbtable param of the Postgres Spark session object

postgres.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://....") \
    .option("dbtable", qry) \
    .option("user", configs['user']) \
    .option("password", configs['password']) \
    .option("driver", "org.postgresql.Driver") \
    .option("ssl", "true") \
    .load()

This returns the following syntax error (the other SQL commands listed above produce the same type of error):

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-5711943099029736374.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-5711943099029736374.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 9, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o484.load.
: org.postgresql.util.PSQLException: ERROR: syntax error at or near "create"
  Position: 15

Using spark.sql

postgres.sql("""create table (name varchar(50), age int)""")

This returns the following parse exception:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-5711943099029736374.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-5711943099029736374.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 714, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 73, in deco
    raise ParseException(s.split(': ', 1)[1], stackTrace)
ParseException: u"\nno viable alternative at input 'create table ('(line 1, pos 13)\n\n== SQL ==\ncreate table (name varchar(50), age int)\n-------------^^^\n"

Wrapping the statement in parentheses and passing it to spark.sql produces a similar parse exception:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-5711943099029736374.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-5711943099029736374.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 714, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 73, in deco
    raise ParseException(s.split(': ', 1)[1], stackTrace)
ParseException: u"\nextraneous input 'create' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 1)\n\n== SQL ==\n(create table (name varchar(50), age int))\n-^^^\n"

My questions boil down to:

  • Is my approach missing some configuration or other necessary step?
  • Can Postgres somehow leverage the spark.sql() API?
  • Is what I am trying to achieve even possible?

I have scoured the internet for examples of issuing these kinds of SQL queries to PostgreSQL through SparkSQL, but have found no solutions. If there is a solution, I would love to see an example; otherwise, confirmation that it is not possible will suffice.

    Is what I am trying to achieve even possible?

    I think not. Spark is a data processing framework, so its API is developed mostly for read and write operations against data sources. In your case you have DDL statements, and Spark is not supposed to execute such operations.

    For example, the dbtable option in your first example must be a table name or a SELECT query.
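
    To illustrate (a minimal sketch with hypothetical table and column names): a read-only query can still be pushed down to Postgres by wrapping a SELECT in parentheses and giving it an alias, because Spark substitutes the dbtable value into its own SELECT as a subquery; DDL such as CREATE TABLE cannot appear there:

    # Hypothetical names; "spark" is a SparkSession. Spark treats the dbtable
    # value as a subquery, so a parenthesized, aliased SELECT works, but a
    # CREATE TABLE statement does not.
    qry = "(SELECT name, age FROM schema.tablename WHERE age > 21) AS subq"

    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://ip/database_name") \
        .option("dbtable", qry) \
        .option("driver", "org.postgresql.Driver") \
        .load()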

    If you need to run DDL, DCL, or TCL queries, you should do it some other way, for example through the psycopg2 module, as sketched below.
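
    A minimal sketch of that route, assuming hypothetical connection details and table names; psycopg2 sends statements straight to Postgres, so CREATE TABLE, temp tables, and CTEs all work:

    import psycopg2

    # Hypothetical connection parameters; psycopg2 executes SQL directly on
    # the server, so anything Postgres understands (DDL, DML, CTEs) is valid.
    conn = psycopg2.connect(host="host", dbname="database_name",
                            user="username", password="password")
    cur = conn.cursor()
    cur.execute("CREATE TABLE people (name varchar(50), age int)")
    cur.execute("CREATE TABLE adults_copy (name varchar(50), age int)")
    cur.execute("""
        WITH adults AS (SELECT name, age FROM people WHERE age >= 18)
        INSERT INTO adults_copy SELECT name, age FROM adults
    """)
    conn.commit()  # changes become visible to other sessions only after commit
    cur.close()
    conn.close()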

    Can Postgres somehow leverage the spark.sql() API?

    spark.sql is a method for executing SparkSQL code over tables or views registered in the SparkSession. It works with any supported data source, not only JDBC, but the query runs on the Spark side with SparkSQL syntax. For example:

    val spark = SparkSession
            ...
            .getOrCreate()
    
    spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://ip/database_name")
      .option("dbtable", "schema.tablename")
      .load()
      .createOrReplaceTempView("my_spark_table_over_postgresql_table")
    
    // and then you can operate with a view:
    val df = spark.sql("select * from my_spark_table_over_postgresql_table where ... ")
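
    A PySpark equivalent of that sketch (same placeholder names) would be:

    spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://ip/database_name") \
        .option("dbtable", "schema.tablename") \
        .load() \
        .createOrReplaceTempView("my_spark_table_over_postgresql_table")

    # and then you can operate with the view:
    df = spark.sql("select * from my_spark_table_over_postgresql_table where ... ")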
    

    Thanks for your answer; that was my gut feeling, but I wanted some confirmation, which you have just provided! Thanks for the clarification on SparkSession tables. As for your comment about psycopg2, I have used it before in non-distributed applications, but had not considered trying it in a Spark application. Do you have experience with this / can you say that it works? If you have any links I can reference for this integration, I would appreciate it.
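
    One plausible pattern for the integration the comment asks about (a sketch under assumed connection details, not something confirmed in this thread): run the DDL with psycopg2 on the Spark driver, then hand the finished table to Spark's JDBC reader.

    import psycopg2
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgis_poc").getOrCreate()

    # Step 1: driver-side DDL through psycopg2 (hypothetical details).
    # Note this runs on the driver only, not on the executors.
    conn = psycopg2.connect(host="host", dbname="database_name",
                            user="username", password="password")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS results (name varchar(50), age int)")
    conn.close()  # the with-block committed the transaction

    # Step 2: read the prepared table back through Spark's JDBC reader.
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://host/database_name") \
        .option("dbtable", "results") \
        .option("driver", "org.postgresql.Driver") \
        .load()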