Apache Spark: how to efficiently select partial data from an RDBMS table in PySpark


Let's say I have an employee table like this:

| employee_id | employee_name | department | created_at          | updated_at          |
|-------------|---------------|------------|---------------------|---------------------|
| 1           | Jessica       | Finance    | 2020-10-10 12:00:00 | 2020-10-10 12:00:00 |
| 2           | Michael       | IT         | 2020-10-10 15:00:00 | 2020-10-10 15:00:00 |
| 3           | Sheila        | HR         | 2020-10-11 17:00:00 | 2020-10-11 17:00:00 |
| ...         | ...           | ...        | ...                 | ...                 |
| 1000        | Emily         | IT         | 2020-10-20 20:00:00 | 2020-10-20 20:00:00 |
Usually, I can batch the data in PySpark over a JDBC connection and write it to GCS like this:

# table_source here is just the table name, e.g. the employee table
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.write.parquet("gs://{bucket_name}/{target_directory}/")

When I create df with .load() like the code above, does the data stay on the database server, or does Spark download all of the data from the table and move it to the Spark cluster (assuming the database and the Spark cluster sit on different servers)?

If I need to get only the data in a certain time range, say the rows created after 2020-10-15 00:00:00,

is the code below enough? I found it very slow once the data size goes past about 25 GB.

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.createOrReplaceTempView("get_specific_data")

get_specific_data = spark.sql('''
                        SELECT employee_id, employee_name, department, created_at, updated_at
                        FROM get_specific_data
                        WHERE created_at > '2020-10-15 00:00:00'
                        ''')

get_specific_data.write.parquet("gs://{bucket_name}/{target_directory}/")
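
Whether that WHERE clause actually runs inside PostgreSQL, or only inside Spark after the rows have been transferred, can be read off the physical plan (a minimal sketch, assuming the standard Spark JDBC source):

# Sketch: check whether the created_at filter is pushed down to the database.
# Look for "PushedFilters: [...]" in the JDBC scan node of the printed plan;
# an empty list means Spark filters the rows itself after pulling them over JDBC.
get_specific_data.explain()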


My question is more like: how do I efficiently fetch specific data in PySpark when I already know which rows I need, selected by created_at (or any other column, by id, or whatever)? Do I need Spark SQL for this, or some other tool? (This is for a daily batch job.)

If you only put the table name in table_source, it will load all of the data into the Spark cluster.

To select just the specific data you need, you can use something like this:

last_update = '2020-10-15 00:00:00'

# Wrap the SELECT in the dbtable option so the filter runs inside the database
# and only the matching rows are shipped to the Spark cluster.
# ("employee" stands for the source table in the database.)
table_source = "(SELECT employee_id, employee_name, department, created_at, updated_at " \
               "FROM employee WHERE created_at > '{0}') t1".format(last_update)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

# And then write the data to a parquet file
df.write.parquet("gs://{bucket_name}/{target_directory}/")
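
For completeness, two variants that can help here (a sketch only, assuming Spark 2.4 or later and the standard JDBC data source options; the table name "employee", the variable name query_source, and the 1-1000 employee_id bounds are taken from the sample table above, not from the original post): Spark can take the SELECT directly through the query option instead of a "(...) t1" subquery in dbtable, and partitionColumn / lowerBound / upperBound / numPartitions split a large filtered read into parallel range queries.

# Variant 1 (Spark >= 2.4): pass the SELECT through the "query" option
# instead of wrapping it as "(...) t1" in "dbtable".
query_source = "SELECT employee_id, employee_name, department, created_at, updated_at " \
               "FROM employee WHERE created_at > '{0}'".format(last_update)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("query", query_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

# Variant 2: keep the filtered subquery in "dbtable" but partition the read,
# so each executor issues its own employee_id range query against PostgreSQL.
# (The "query" option cannot be combined with partitionColumn, so this variant
# reuses table_source from above.)
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .option("partitionColumn", "employee_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000") \
    .option("numPartitions", "4") \
    .load()

df.write.parquet("gs://{bucket_name}/{target_directory}/")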