How can Apache Spark efficiently select partial data from an RDBMS table in PySpark?
Suppose I have an employee table like this:
| employee_id | employee_name | department | created_at | updated_at |
|-------------|---------------|------------|---------------------|---------------------|
| 1 | Jessica | Finance | 2020-10-10 12:00:00 | 2020-10-10 12:00:00 |
| 2 | Michael | IT | 2020-10-10 15:00:00 | 2020-10-10 15:00:00 |
| 3 | Sheila | HR | 2020-10-11 17:00:00 | 2020-10-11 17:00:00 |
| ... | ... | ... | ... | ... |
| 1000 | Emily | IT | 2020-10-20 20:00:00 | 2020-10-20 20:00:00 |
Normally, I can batch the data in PySpark over a JDBC connection and write it to GCS like this:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.write.parquet("gs://{bucket_name}/{target_directory}/")
When I create df with .load() as in the code above, does the data stay on the database server, or does Spark download the entire table and move it into the Spark cluster (assuming the database and the Spark cluster sit on different servers)?

If I need specific data in a time range, say rows created after 2020-10-15 00:00:00, is the code below good enough? I ask because I find it very slow once the table grows beyond 25 GB:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.createOrReplaceTempView("get_specific_data")

get_specific_data = spark.sql('''
    SELECT employee_id, employee_name, department, created_at, updated_at
    FROM get_specific_data
    WHERE created_at > '2020-10-15 00:00:00'
''')

get_specific_data.write.parquet("gs://{bucket_name}/{target_directory}/")
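One alternative to filtering after the load is to build the WHERE clause into the dbtable subquery itself, so PostgreSQL filters the rows before anything reaches Spark. A minimal sketch of that idea as a helper; the function name pushdown_subquery and the table name employee are illustrative, not part of the original code:

```python
def pushdown_subquery(table, cutoff):
    # Wrap a filtered SELECT so it can be passed as the JDBC "dbtable"
    # option; the WHERE clause then runs inside PostgreSQL, and only
    # matching rows are shipped over the network to the Spark cluster.
    return ("(SELECT employee_id, employee_name, department, created_at, updated_at "
            "FROM {0} WHERE created_at > '{1}') t1").format(table, cutoff)

table_source = pushdown_subquery("employee", "2020-10-15 00:00:00")
```

The resulting string is what gets handed to `.option("dbtable", table_source)` in a JDBC read like the one above.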
My question is really this: if I already know which data I need to retrieve, filtered on the created_at column (or any other column, by ID, or anything else), how do I fetch that specific data efficiently in PySpark? Do I need Spark SQL, or some other tool? (This is for a daily batch job.)

If I put only the table name in table_source, Spark loads the entire table into the cluster. To select just the specific data I need, I can use something like this:
last_update = '2020-10-15 00:00:00'
table_source = "(SELECT employee_id, employee_name, department, created_at, updated_at " \
    "FROM get_specific_data WHERE created_at > '{0}') t1".format(last_update)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

# And then write the data to a Parquet file
df.write.parquet("gs://{bucket_name}/{target_directory}/")
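For a table in the 25 GB range, a single JDBC connection is often the bottleneck: by default Spark reads the whole result through one connection, no matter how large the cluster is. A sketch of a parallel read using Spark's built-in JDBC partitioning options; the bounds and partition count below are illustrative assumptions, not measured values:

```python
# Partitioning options split the read into numPartitions concurrent
# queries, each covering a slice of employee_id between the bounds.
partition_options = {
    "url": "jdbc:postgresql://{ip_address}/{database}",
    "dbtable": "employee",            # or a filtered subquery, as above
    "driver": "org.postgresql.Driver",
    "partitionColumn": "employee_id", # must be numeric, date, or timestamp
    "lowerBound": "1",                # rough minimum of employee_id
    "upperBound": "1000",             # rough maximum of employee_id
    "numPartitions": "8",             # 8 parallel queries / connections
}

# df = spark.read.format("jdbc") \
#     .options(**partition_options) \
#     .option("user", user_source) \
#     .option("password", password_source) \
#     .load()
```

Note that lowerBound/upperBound only control how the employee_id range is sliced; rows outside the bounds still land in the first or last partition, so no data is silently dropped.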