Best way to use PySpark with a SQL DB
My SQL DB has tables with millions of records, some of them with hundreds of millions. My main SELECT is about 4,000 lines of code, but it is structured roughly like this:
SELECT A.seq field1, field2, field3, field4,
(select field from tableX X... where A.seq = X.seq ...) field5,
(select field from tableY Y... where A.seq = Y.seq ...) field6,
(select field from tableN Z... where A.seq = Z.seq ...) field7,
field8, field9
FROM tableA A, tableB B, tableN N
WHERE A.seq = B.seq
AND A.req_seq = N.req_seq;
My idea is to do it like this:
# load the tables in the cluster separately
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("MyApp")
sc = SparkContext(master="local[*]", conf=conf)
sql = HiveContext(sc)

dataframeA = sql.read.format("jdbc") \
    .option("url", "db_url") \
    .option("driver", "myDriver") \
    .option("dbtable", tableA) \
    .option("user", "myuser") \
    .option("password", "mypass").load()

dataframeB = sql.read.format("jdbc") \
    .option("url", "db_url") \
    .option("driver", "myDriver") \
    .option("dbtable", tableB) \
    .option("user", "myuser") \
    .option("password", "mypass").load()

dataframeC = sql.read.format("jdbc") \
    .option("url", "db_url") \
    .option("driver", "myDriver") \
    .option("dbtable", tableC) \
    .option("user", "myuser") \
    .option("password", "mypass").load()

# then do the needed joins
df_aux = dataframeA.join(dataframeB, dataframeA.seq == dataframeB.seq)
df_res_aux = df_aux.join(dataframeC, df_aux.req_seq == dataframeC.req_seq)

# then with that dataframe calculate the subselect fields
def calculate_field5(seq):
    # load the table in the cluster as with the main tables
    # and query the dataframe,
    # or make the query to the DB and return the field
    return field

df_res = df_res_aux.withColumn('field5', calculate_field5(df_res_aux.seq))
# the same for the rest of fields
Is this a good approach?
Or should I do it another way?
Any advice will be very, very much appreciated.
If you want to hit MySql during execution, that is one way to do it.
But be warned: your job may take a very long time to run because of MySql query times. MySql is not a distributed database, so retrieving the data from it can take a lot of time.
My suggestions to you:
Try retrieving the data into HDFS first (if you have HDFS) with an import tool; there are examples of how to do this incrementally.
Try converting the stored data to ORC; see the examples.
These suggestions aim to cut the execution time spent on the database. Every time you request data directly from MySql, you are using MySql's resources to send the data to Spark. With the approach I suggest, you copy the DB into HDFS once and hand that data to Spark to process, so the database no longer contributes to execution time.
Why ORC? ORC is a good option for converting the data into a compact, columnar structure, which will speed up your data retrieval and searches.

Thank you for your answer! I will take a look at these technologies. So it is better to retrieve all the needed tables into the filesystem or memory first, and then apply the filters?