How do I loop through files to show all months using a for loop?

Tags: multithreading, for-loop, pyspark, databricks, pyspark-dataframes

Since it was mentioned that I should show my loop code, here is what I have right now. I have 9 files in total, and I am trying to combine them into one big DataFrame in which each month gets its own column. I tried to shorten the code with a loop, but it only shows the last month (2020-09). Please take a look at how I can improve my code! Thanks for the guidance.

```
df1=spark.read.format("parquet").load("dbfs:2020-01/")\
.withColumnRenamed("count", "2020-01").sort('name_id').drop('date')

df2=spark.read.format("parquet").load("dbfs:2020-02/")\
.withColumnRenamed("count", "2020-02").sort('name_id').drop('date')

df3=spark.read.format("parquet").load("dbfs:2020-03/")\
.withColumnRenamed("count", "2020-03").sort('name_id').drop('date')

df4=spark.read.format("parquet").load("dbfs:2020-04/")\
.withColumnRenamed("count", "2020-04").sort('name_id').drop('date')

df5=spark.read.format("parquet").load("dbfs:2020-05/")\
.withColumnRenamed("count", "2020-05").sort('name_id').drop('date')

df6=spark.read.format("parquet").load("dbfs:2020-06/")\
.withColumnRenamed("count", "2020-06").sort('name_id').drop('date')

df7=spark.read.format("parquet").load("dbfs:2020-07/")\
.withColumnRenamed("count", "2020-07").sort('name_id').drop('date')

df8=spark.read.format("parquet").load("dbfs:2020-08/")\
.withColumnRenamed("count", "2020-08").sort('name_id').drop('date')

df9=spark.read.format("parquet").load("dbfs:2020-09/")\
.withColumnRenamed("count", "2020-09").sort('name_id').drop('date')
```
Here is my attempt to loop through these files:

```
dfs=[]

for i in range(1,10):
  df=spark.read.format("parquet").load("dbfs:2020-01/")\
.withColumnRenamed("count", f"2020-0{i}").sort('name_id').drop('date')
dfs.append(df)
df=dfs[0]

for dfi in dfs[1:]:
  df=df.join(dfi, on= ['name_id','Partner','Metrics','Description'], how='left_outer')
```
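For comparison, here is a minimal sketch of how the loop could keep every month, assuming the files really live at paths of the form dbfs:2020-0{i}/ and all share the join keys name_id, Partner, Metrics and Description. It substitutes the month into the load path and keeps the append inside the loop body:

```
from functools import reduce

# Sketch only: assumes nine parquet folders named dbfs:2020-01/ ... dbfs:2020-09/.
dfs = []
for i in range(1, 10):
    month = f"2020-0{i}"
    df_month = (
        spark.read.format("parquet").load(f"dbfs:{month}/")
        .withColumnRenamed("count", month)
        .drop("date")  # sorting before a join is not needed, so sort() is omitted
    )
    dfs.append(df_month)  # append inside the loop so every month is collected

# Fold the nine monthly frames into one wide frame on the shared keys.
keys = ['name_id', 'Partner', 'Metrics', 'Description']
combined = reduce(lambda left, right: left.join(right, on=keys, how='left_outer'), dfs)
```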
Because my loop only produces the 2020-09 column, I ran into an error when calculating each month's percentage increase:

```
import pyspark.sql.functions as f
from pyspark.sql.window import Window

percentages=df\
 .withColumn('% Increase Feb-20', ((f.col('2020-02')-f.col('2020-01'))/f.col('2020-01')))\
 .withColumn('% Increase Mar-20', ((f.col('2020-03')-f.col('2020-02'))/f.col('2020-02')))\
 .withColumn('% Increase Apr-20', ((f.col('2020-04')-f.col('2020-03'))/f.col('2020-03')))\
 .withColumn('% Increase May-20', ((f.col('2020-05')-f.col('2020-04'))/f.col('2020-04')))\
 .withColumn('% Increase Jun-20', ((f.col('2020-06')-f.col('2020-05'))/f.col('2020-05')))\
 .withColumn('% Increase Jul-20', ((f.col('2020-07')-f.col('2020-06'))/f.col('2020-06')))\
 .withColumn('% Increase Aug-20', ((f.col('2020-08')-f.col('2020-07'))/f.col('2020-07')))\
 .withColumn('% Increase Sep-20', ((f.col('2020-09')-f.col('2020-08'))/f.col('2020-08')))

display(percentages)
```

The error I got is
```
cannot resolve '`2020-02`' given input columns: [Description, 2020-09, name_id, Metrics, Partner];
```
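Once all nine month columns exist, the repeated withColumn calls could also be generated in a loop. This is only a sketch and assumes the combined frame `df` has columns named exactly 2020-01 through 2020-09:

```
import pyspark.sql.functions as f

# Sketch only: assumes `df` already contains the columns "2020-01" ... "2020-09".
months = [f"2020-0{i}" for i in range(1, 10)]
labels = ['Feb-20', 'Mar-20', 'Apr-20', 'May-20', 'Jun-20', 'Jul-20', 'Aug-20', 'Sep-20']

percentages = df
for prev, curr, label in zip(months, months[1:], labels):
    # Month-over-month change: (current - previous) / previous
    percentages = percentages.withColumn(
        f'% Increase {label}',
        (f.col(curr) - f.col(prev)) / f.col(prev),
    )

display(percentages)
```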