Problem with a PySpark DataFrame
Using PySpark, I am running into a problem with a DataFrame that is not grouped the way I want. In the example below, I want the values in the ANALYTIC column to be distinct so that the trend values can be viewed by month. How can I achieve this?
***The 'df' DataFrame consists of the following:***
+-----------+----+-----+------+-------------------+
|CLIENT_NAME|YEAR|MONTH|ENGINE|TOTAL_UNIQUE_MEMBER|
+-----------+----+-----+------+-------------------+
| Paax |2019| 12| ERG2| 435911|
| Paax |2019| 11| ELE| 435911|
| Paax |2019| 11| PHA| 435911|
| Paax |2019| 12| ELE| 435911|
| Paax |2019| 12| EBM| 512518|
| Paax |2019| 12| PHA| 435911|
+-----------+----+-----+------+-------------------+
I take the values above and store them in a dictionary, then read them back out of the dictionary,
assign them to a list of tuples, and finally build a DataFrame with named columns from those tuples.
What I have tried:
import os
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/Users/ash2/Desktop/new_results/prodd"
mon_dict = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May', '06': 'Jun',
            '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}

def get_list_dirs(path):
    lst = os.listdir(path)
    if '.DS_Store' in lst:
        lst.remove('.DS_Store')
    return lst

for dirname in get_list_dirs(path):
    output_path = "/Users/ash2/Desktop" + os.sep + "out"
    # The part files contain rows and columns whose values are separated by a '|' delimiter
    all_filenames = glob.glob(path + os.sep + dirname + os.sep + '2019' + os.sep + '*'
                              + os.sep + 'uniqueMemberReport' + os.sep + 'part*')
    df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load(all_filenames)
    tup = []
    l = []
    df.persist()
    # collect() once instead of calling take(i) for every row
    for row in df.collect():
        d = row.asDict()
        client = d['CLIENT_NAME']
        month = d['MONTH']
        anlytic = d['ENGINE']
        count = d['TOTAL_UNIQUE_MEMBER']
        y_m = mon_dict[month] + ' - 2019'
        l.append(anlytic)
        l1 = list(dict.fromkeys(l))  # engines seen so far, de-duplicated
        if month == '12':
            tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', '', count))
        if month == '11' and anlytic in l1:
            tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', count, ''))
    df_text = spark.createDataFrame(tup, ['ANALYTIC', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
                                          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'])
    df_text.show()
My output:
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT| NOV| DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
| ERG2| | | | | | | | | | | |435911|
| ELE| | | | | | | | | | |435911| |
| PHA| | | | | | | | | | |435911| |
| ELE| | | | | | | | | | | |435911|
| EBM| | | | | | | | | | | |512518|
| PHA| | | | | | | | | | | |435911|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT| NOV| DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
| ERG2| | | | | | | | | | | |435911|
| ELE| | | | | | | | | | |435911|435911|
| PHA| | | | | | | | | | |435911|435911|
| EBM| | | | | | | | | | | |512518|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
Expected output:
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT| NOV| DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
| ERG2| | | | | | | | | | | |435911|
| ELE| | | | | | | | | | |435911| |
| PHA| | | | | | | | | | |435911| |
| ELE| | | | | | | | | | | |435911|
| EBM| | | | | | | | | | | |512518|
| PHA| | | | | | | | | | | |435911|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT| NOV| DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
| ERG2| | | | | | | | | | | |435911|
| ELE| | | | | | | | | | |435911|435911|
| PHA| | | | | | | | | | |435911|435911|
| EBM| | | | | | | | | | | |512518|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
I am not getting the expected output shown above. — I am not sure what this is supposed to do:

    l1 = list(dict.fromkeys(l))

but this block appends two rows to the list of tuples when both conditions are met for the same analytic:

    if month == '12':
        tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', '', count))
    if month == '11' and anlytic in l1:
        tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', count, ''))

Try appending the DEC and NOV values in the same tuple. Using namedtuples might also help:
    if month == '12':
        # something like (anlytic,) + ('',) * 11 + (count,)
        tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', '', count))
    if month == '11' and anlytic in l1:
        # reuse the DEC value already stored for this analytic
        tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', count, tup[analytic_index][-1]))
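A minimal sketch of that merge in plain Python, using the six sample rows from the question (the helper name `merge_nov_dec` and the dict-based grouping are my assumptions, not the answerer's exact code):

```python
# Sketch: merge the NOV and DEC counts into one 13-element tuple per engine
# (ANALYTIC + 12 month columns), instead of appending a row per source record.
rows = [
    ('Paax', '2019', '12', 'ERG2', '435911'),
    ('Paax', '2019', '11', 'ELE',  '435911'),
    ('Paax', '2019', '11', 'PHA',  '435911'),
    ('Paax', '2019', '12', 'ELE',  '435911'),
    ('Paax', '2019', '12', 'EBM',  '512518'),
    ('Paax', '2019', '12', 'PHA',  '435911'),
]

def merge_nov_dec(rows):
    # engine -> {month: count}; dicts preserve insertion order
    per_engine = {}
    for _client, _year, month, engine, count in rows:
        per_engine.setdefault(engine, {})[month] = count
    tup = []
    for engine, months in per_engine.items():
        # JAN..OCT stay empty; NOV and DEC come from the collected counts
        tup.append((engine,) + ('',) * 10
                   + (months.get('11', ''), months.get('12', '')))
    return tup

for t in merge_nov_dec(rows):
    print(t)
```

The resulting list of tuples can then be passed to `spark.createDataFrame` with the same 13 column names as in the question, giving one row per engine.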
You do not need to build the DataFrame by hand; try pivoting instead:

    from pyspark.sql import functions as F
    df.groupBy('ENGINE').pivot('MONTH').agg(F.max('TOTAL_UNIQUE_MEMBER')).show()
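Since the snippet above needs a running Spark session, here is a hedged pandas equivalent of the same pivot on the sample rows (pandas is already imported in the question; the counts are assumed to be integers for aggregation):

```python
import pandas as pd

# Same six rows as the question's df
pdf = pd.DataFrame({
    'ENGINE': ['ERG2', 'ELE', 'PHA', 'ELE', 'EBM', 'PHA'],
    'MONTH':  ['12', '11', '11', '12', '12', '12'],
    'TOTAL_UNIQUE_MEMBER': [435911, 435911, 435911, 435911, 512518, 435911],
})

# One row per engine, one column per month -- the shape the question asks for
wide = pdf.pivot_table(index='ENGINE', columns='MONTH',
                       values='TOTAL_UNIQUE_MEMBER', aggfunc='max')
print(wide)
```

Engine/month combinations with no data (e.g. EBM in month 11) come out as NaN, which plays the role of the empty strings in the hand-built tuples.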
Welcome to SO! Can you provide a reproducible sample of your data? Just a few rows would be enough. — I have edited the post and added data.
Do you want to compute the distinct count of TOTAL_UNIQUE_MEMBER grouped by MONTH and ENGINE? — Yes, I need TOTAL_UNIQUE_MEMBER grouped by month and engine.
Isn't that just df.groupby(['month', 'engine'])['total_unique_member'].nunique()?
Is there an efficient solution or another way to get my expected output? For example:

    if month == '12':
        deccount = count
    if month == '11':
        ncount = count
    tup.append((client, analytic, '', '', '', '', '', '', '', '', ncount, deccount))

Is there a better way to get the expected output?
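The groupby/nunique suggestion in the comments is pandas syntax; as a sketch on the sample rows (the lower-case column names follow the commenter's snippet, not the original df):

```python
import pandas as pd

pdf = pd.DataFrame({
    'month': ['12', '11', '11', '12', '12', '12'],
    'engine': ['ERG2', 'ELE', 'PHA', 'ELE', 'EBM', 'PHA'],
    'total_unique_member': [435911, 435911, 435911, 435911, 512518, 435911],
})

# Number of distinct member counts per (month, engine) pair
distinct_counts = pdf.groupby(['month', 'engine'])['total_unique_member'].nunique()
print(distinct_counts)
```

On this sample every (month, engine) pair carries a single count, so each group's distinct count is 1; the result is a Series indexed by (month, engine), not the wide monthly table the question asks for, which is why the pivot approach is still needed for the final shape.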