
Best way to use PySpark with a SQL DB


My SQL database has tables with millions of records, some of them with hundreds of millions. My main SELECT is about 4,000 lines of code, but its structure is roughly this:

SELECT A.seq field1, field2, field3, field4,
       (select field from tableX X... where A.seq = X.seq ...) field5,
       (select field from tableY Y... where A.seq = Y.seq ...) field6,
       (select field from tableN Z... where A.seq = Z.seq ...) field7,
       field8, field9
  FROM tableA A, tableB B, tableN N
 WHERE A.seq = B.seq
   AND A.req_seq = N.req_seq;
My idea is to do it like this:

# load the tables in the cluster separately

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("MyApp")
sc = SparkContext(master="local[*]", conf=conf)
sql = HiveContext(sc)

dataframeA = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", "tableA")\
    .option("user", "myuser")\
    .option("password", "mypass").load()

dataframeB = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", "tableB")\
    .option("user", "myuser")\
    .option("password", "mypass").load()

dataframeN = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", "tableN")\
    .option("user", "myuser")\
    .option("password", "mypass").load()

# then do the needed joins (same keys as the WHERE clause above)

df_aux = dataframeA.join(dataframeB, dataframeA.seq == dataframeB.seq)

df_res_aux = df_aux.join(dataframeN, df_aux.req_seq == dataframeN.req_seq)


# then with that dataframe calculate the subselect fields

def calculate_field5(seq):
    # load the lookup table (e.g. tableX) in the cluster as with the main tables
    # and query that dataframe,
    # or run the query against the DB and return the field
    return field

df_res = df_res_aux.withColumn('field5', calculate_field5(df_res_aux.seq))
# the same for the rest of fields
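To be clear, calculate_field5 is only pseudocode: a plain Python function cannot be applied through withColumn without registering it as a UDF. Each subselect field could probably be expressed as a join instead; something like this for field5, assuming tableX exposes seq and field columns as in the SQL above:

# rough sketch: express the field5 subselect as a left join on tableX
dataframeX = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", "tableX")\
    .option("user", "myuser")\
    .option("password", "mypass").load()

# keep only the columns needed for field5, renamed to avoid clashes
lookup_x = dataframeX.select(dataframeX.seq.alias("x_seq"),
                             dataframeX.field.alias("field5"))

# a left join keeps rows without a match (they get NULL, like the subselect)
df_res = df_res_aux.join(lookup_x,
                         df_res_aux.seq == lookup_x.x_seq,
                         "left_outer").drop("x_seq")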
Is it ok to do it like that? Or should I do it in a different way?

Any advice will be very, very much appreciated.

If you want to keep MySql in the execution, this is one way to do it.

But be aware that, because of MySql query times, your job may take a long time to run. MySql is not a distributed database, so a lot of that time can be spent just retrieving the data from MySql.

My suggestions to you:

Try to retrieve the data into HDFS first (if you have HDFS), and try to do it in an incremental way.
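If that initial pull is done from PySpark itself, one option (not necessarily what the answer had in mind) is a partitioned JDBC read, so the one-off copy does not run as a single huge query. A rough sketch, with placeholder bounds on seq:

# rough sketch: pull tableA from MySql in parallel chunks split on seq
dataframeA = (sql.read.format("jdbc")
              .option("url", "db_url")
              .option("driver", "myDriver")
              .option("dbtable", "tableA")
              .option("user", "myuser")
              .option("password", "mypass")
              .option("partitionColumn", "seq")   # numeric key to split on
              .option("lowerBound", "1")          # placeholder bounds
              .option("upperBound", "100000000")
              .option("numPartitions", "32")
              .load())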

Also try converting the stored data to ORC; see the example below.
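A minimal sketch of that conversion, assuming a Spark build with Hive/ORC support and reusing the dataframeA pulled above (the HDFS path is only an example):

# rough sketch: store the staged table on HDFS as ORC ...
dataframeA.write.mode("overwrite").orc("hdfs:///staging/tableA_orc")

# ... and from now on read the ORC copy instead of hitting MySql
tableA_orc = sql.read.orc("hdfs:///staging/tableA_orc")
tableA_orc.filter(tableA_orc.seq > 1000).show()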

The point of this advice is to take the database out of the execution time. Every time you request data directly from MySql, you are using MySql's resources to ship the data to Spark. Doing it the way I suggest, you copy the DB to HDFS once and hand that data to Spark, so the database no longer adds to the execution time.


Why use ORC? ORC is a good choice for converting the data into a compact, columnar structure, which will speed up your data retrieval and searches.

Thank you for your answer! I will look into these technologies. So it is better to retrieve all the needed tables into the file system or into memory first, and then apply the filters.
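In code, that could look roughly like this (a sketch reusing the ORC copy from the answer above; the storage level and filter are only examples):

from pyspark import StorageLevel

# rough sketch: cache the staged table in cluster memory (spilling to disk
# if it does not fit), then run the filters against the cached copy
tableA_orc = sql.read.orc("hdfs:///staging/tableA_orc")
tableA_orc.persist(StorageLevel.MEMORY_AND_DISK)
filtered = tableA_orc.filter(tableA_orc.seq > 1000)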