Best way to use PySpark with a SQL DB
My SQL DB has tables with millions of records, some of them with hundreds of millions. My main SELECT is about 4,000 lines of code, but it is structured roughly like this:
SELECT A.seq field1, field2, field3, field4,
(select field from tableX X... where A.seq = X.seq ...) field5,
(select field from tableY Y... where A.seq = Y.seq ...) field6,
(select field from tableN Z... where A.seq = Z.seq ...) field7,
field8, field9
FROM tableA A, tableB B, tableN N
WHERE A.seq = B.seq
AND A.req_seq = N.req_seq;
My idea is to do it like this:
# load the tables in the cluster separately
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("MyApp")
sc = SparkContext(master="local[*]", conf=conf)
sql = HiveContext(sc)

dataframeA = sql.read.format("jdbc") \
    .option("url", "db_url") \
    .option("driver", "myDriver") \
    .option("dbtable", tableA) \
    .option("user", "myuser") \
    .option("password", "mypass").load()

dataframeB = sql.read.format("jdbc") \
    .option("url", "db_url") \
    .option("driver", "myDriver") \
    .option("dbtable", tableB) \
    .option("user", "myuser") \
    .option("password", "mypass").load()

dataframeC = sql.read.format("jdbc") \
    .option("url", "db_url") \
    .option("driver", "myDriver") \
    .option("dbtable", tableC) \
    .option("user", "myuser") \
    .option("password", "mypass").load()

# then do the needed joins
df_aux = dataframeA.join(dataframeB, dataframeA.seq == dataframeB.seq)
df_res_aux = df_aux.join(dataframeC, df_aux.req_seq == dataframeC.req_seq)

# then with that dataframe calculate the subselect fields
def calculate_field5(seq):
    # load the table in the cluster as with the main tables
    # and query the dataframe,
    # or make the query to the DB and return the field
    return field

df_res = df_res_aux.withColumn('field5', calculate_field5(df_res_aux.seq))
# the same for the rest of fields
Is this a good approach?
Or should I do it another way?
Any advice will be very, very much appreciated.
If you want to hit MySql during execution, that is one way to do it.
But be warned: your job may take a very long time to run because of MySql query times. MySql is not a distributed database, so retrieving the data from it can take a lot of time.
My suggestions to you:
Try retrieving the data into HDFS first (if you have HDFS) with an import tool; there are examples of how to do this incrementally.
Try converting the stored data to ORC; see the examples.
These suggestions aim to cut the execution time spent on the database. Every time you request data directly from MySql, you are using MySql's resources to send the data to Spark. With the approach I suggest, you copy the DB into HDFS once and hand that data to Spark to process, so the database no longer contributes to execution time.
Why ORC? ORC is a good option for converting the data into a compact, columnar structure, which will speed up your data retrieval and searches.

Thank you for your answer! I will take a look at these technologies. So it is better to retrieve all the needed tables into the filesystem or memory first, and then apply the filters?