How can Apache Spark efficiently select partial data from an RDBMS table in PySpark?
Suppose I have an employee table like this:
| employee_id | employee_name | department | created_at | updated_at |
|-------------|---------------|------------|---------------------|---------------------|
| 1 | Jessica | Finance | 2020-10-10 12:00:00 | 2020-10-10 12:00:00 |
| 2 | Michael | IT | 2020-10-10 15:00:00 | 2020-10-10 15:00:00 |
| 3 | Sheila | HR | 2020-10-11 17:00:00 | 2020-10-11 17:00:00 |
| ... | ... | ... | ... | ... |
| 1000 | Emily | IT | 2020-10-20 20:00:00 | 2020-10-20 20:00:00 |
Normally, I can batch the data in PySpark over a JDBC connection and write it to GCS like this:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.write.parquet("gs://{bucket_name}/{target_directory}/")
When I create df with .load() as in the code above, does the data stay on the database server, or does Spark download the entire table and move it into the Spark cluster (assuming the database and the Spark cluster sit on different servers)?

If I need specific data in a time range, say rows created after 2020-10-15 00:00:00, is the code below good enough? I ask because I find it very slow once the table grows beyond 25 GB:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.createOrReplaceTempView("get_specific_data")

get_specific_data = spark.sql('''
    SELECT employee_id, employee_name, department, created_at, updated_at
    FROM get_specific_data
    WHERE created_at > '2020-10-15 00:00:00'
''')

get_specific_data.write.parquet("gs://{bucket_name}/{target_directory}/")
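One alternative to filtering after the load is to build the WHERE clause into the dbtable subquery itself, so PostgreSQL filters the rows before anything reaches Spark. A minimal sketch of that idea as a helper; the function name pushdown_subquery and the table name employee are illustrative, not part of the original code:

```python
def pushdown_subquery(table, cutoff):
    # Wrap a filtered SELECT so it can be passed as the JDBC "dbtable"
    # option; the WHERE clause then runs inside PostgreSQL, and only
    # matching rows are shipped over the network to the Spark cluster.
    return ("(SELECT employee_id, employee_name, department, created_at, updated_at "
            "FROM {0} WHERE created_at > '{1}') t1").format(table, cutoff)

table_source = pushdown_subquery("employee", "2020-10-15 00:00:00")
```

The resulting string is what gets handed to `.option("dbtable", table_source)` in a JDBC read like the one above.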
My question is really this: if I already know which data I need to retrieve, filtered on the created_at column (or any other column, by ID, or anything else), how do I fetch that specific data efficiently in PySpark? Do I need Spark SQL, or some other tool? (This is for a daily batch job.)

If I put only the table name in table_source, Spark loads the entire table into the cluster. To select just the specific data I need, I can use something like this:
last_update = '2020-10-15 00:00:00'
table_source = "(SELECT employee_id, employee_name, department, created_at, updated_at " \
    "FROM get_specific_data WHERE created_at > '{0}') t1".format(last_update)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

# And then write the data to a Parquet file
df.write.parquet("gs://{bucket_name}/{target_directory}/")
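For a table in the 25 GB range, a single JDBC connection is often the bottleneck: by default Spark reads the whole result through one connection, no matter how large the cluster is. A sketch of a parallel read using Spark's built-in JDBC partitioning options; the bounds and partition count below are illustrative assumptions, not measured values:

```python
# Partitioning options split the read into numPartitions concurrent
# queries, each covering a slice of employee_id between the bounds.
partition_options = {
    "url": "jdbc:postgresql://{ip_address}/{database}",
    "dbtable": "employee",            # or a filtered subquery, as above
    "driver": "org.postgresql.Driver",
    "partitionColumn": "employee_id", # must be numeric, date, or timestamp
    "lowerBound": "1",                # rough minimum of employee_id
    "upperBound": "1000",             # rough maximum of employee_id
    "numPartitions": "8",             # 8 parallel queries / connections
}

# df = spark.read.format("jdbc") \
#     .options(**partition_options) \
#     .option("user", user_source) \
#     .option("password", password_source) \
#     .load()
```

Note that lowerBound/upperBound only control how the employee_id range is sliced; rows outside the bounds still land in the first or last partition, so no data is silently dropped.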