A better way to create tables in Hive from CSV files using PySpark

python apache-spark pyspark

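I have six CSV files in HDFS: three are in a directory named /user/data/ and the other three are in /user/docs/.

/user/data/ contains the CSV files for tab_team, tab_players, and tab_country.
/user/docs/ contains the CSV files for tab_team, tab_players, and tab_country.
Even though the names are the same, the data in these files is different.
Now I want to use these CSV files to create tables in Hive using pyspark.

I did the following: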
file_list = ['tab_team', 'tab_players', 'tab_country']

for team in file_list:
    df = sqlContext.read.load("/user/data/{}/*.csv".format(team), format='com.databricks.spark.csv', header='true', inferSchema='true')

    df.registerTempTable("my_temp_table")

    sqlContext.sql("create table {}.`data_{}` stored as ORC as select * from my_temp_table".format(db_name, team))


for team in file_list:
    df = sqlContext.read.load("/user/docs/{}/*.csv".format(team), format='com.databricks.spark.csv', header='true', inferSchema='true')

    df.registerTempTable("my_temp_table")

    sqlContext.sql("create table {}.`docs_{}` stored as ORC as select * from my_temp_table".format(db_name, team))

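This gives me what I want. But as you can see, most of the code is duplicated. I would like to reduce the repetition. How can I do that?
How about adding one more loop?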
file_list = ['tab_team', 'tab_players', 'tab_country']
file_path = ['data', 'docs']

for team in file_list:
    for path in file_path:
        df = sqlContext.read.load("/user/{}/{}/*.csv".format(path, team), format='com.databricks.spark.csv', header='true', inferSchema='true')

        df.registerTempTable("my_temp_table")

        sqlContext.sql("create table {}.`{}_{}` stored as ORC as select * from my_temp_table".format(db_name, path, team))
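
If the nesting grows (more directories, more file names), one way to keep it flat is to build the list of (path, table name) pairs first and then run a single loop over it. The following is only a sketch of that idea: the helper names `table_jobs` and `create_tables` and the database name `my_db` are made up for illustration and are not part of the original code, and the Spark calls are the same ones used above.

```python
from itertools import product


def table_jobs(prefixes, names, db_name):
    """Return (csv_glob, table_name) pairs for every prefix/name combination.

    Hypothetical helper: it only builds strings and does not touch Spark.
    """
    return [
        ("/user/{}/{}/*.csv".format(prefix, name),
         "{}.`{}_{}`".format(db_name, prefix, name))
        for prefix, name in product(prefixes, names)
    ]


def create_tables(sqlContext, jobs, temp_name="my_temp_table"):
    """Load each CSV glob and store it as an ORC table, one job at a time.

    Hypothetical helper: uses the same read/registerTempTable/sql calls
    as the loops above, with sqlContext passed in by the caller.
    """
    for csv_glob, table_name in jobs:
        df = sqlContext.read.load(csv_glob, format='com.databricks.spark.csv',
                                  header='true', inferSchema='true')
        df.registerTempTable(temp_name)
        sqlContext.sql("create table {} stored as ORC as select * from {}"
                       .format(table_name, temp_name))


# Usage in a pyspark session ('my_db' is a placeholder for your db_name):
# jobs = table_jobs(['data', 'docs'],
#                   ['tab_team', 'tab_players', 'tab_country'], 'my_db')
# create_tables(sqlContext, jobs)
```

Separating the name building from the Spark work also makes it easy to print the list of jobs and check the paths and table names before anything is created.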
