Problem with a PySpark DataFrame in Python


Using PySpark, I am facing a problem where my DataFrame is not being grouped the way I want. In the example below, I want the values in the ANALYTIC column to be distinct so that I can see the trend values by month. How can I achieve this?

 ***The 'df' DataFrame consists of the following:***


+-----------+----+-----+------+-------------------+
|CLIENT_NAME|YEAR|MONTH|ENGINE|TOTAL_UNIQUE_MEMBER|
+-----------+----+-----+------+-------------------+
|   Paax    |2019|   12|  ERG2|             435911|
|   Paax    |2019|   11|   ELE|             435911|
|   Paax    |2019|   11|   PHA|             435911|
|   Paax    |2019|   12|   ELE|             435911|
|   Paax    |2019|   12|   EBM|             512518|
|   Paax    |2019|   12|   PHA|             435911|
+-----------+----+-----+------+-------------------+


I take the values above, store them in a dictionary, read them back out of the dictionary into a list of tuples, and finally build a DataFrame with those tuples as rows under named columns.

What I have tried:

import os
import glob
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/Users/ash2/Desktop/new_results/prodd"
mon_dict = {'01':'Jan','02':'Feb','03':'Mar','04':'Apr','05':'May','06':'Jun','07':'Jul','08':'Aug','09':'Sep','10':'Oct','11':'Nov','12':'Dec'}

def get_list_dirs(path):
    lst = os.listdir(path)
    if '.DS_Store' in lst:
        lst.remove('.DS_Store')
    return lst

for client_dir in get_list_dirs(path):
    output_path = "/Users/ash2/Desktop" + os.sep + "out"
    # the part files hold '|'-delimited rows with a header line
    all_filenames = glob.glob(path + os.sep + client_dir + os.sep + '2019' + os.sep + '*' + os.sep + 'uniqueMemberReport' + os.sep + 'part*')
    df = spark.read.format("csv").option("header", "true").option('delimiter', '|').load(all_filenames)
    tup = []
    l = []
    #df.show()
    df.persist()
    # collect the rows once instead of calling df.take(i) on every iteration
    for row in df.collect():
        d = row.asDict()
        client = d['CLIENT_NAME']
        month = d['MONTH']
        anlytic = d['ENGINE']
        count = d['TOTAL_UNIQUE_MEMBER']
        y_m = mon_dict[month] + ' - 2019'
        l.append(anlytic)
        l1 = list(dict.fromkeys(l))  # de-duplicate while preserving order
        if month == '12':
            tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', '', count))
        if month == '11' and anlytic in l1:
            tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', count, ''))

    #tup.append(('','','','','','','','','','','','',''))
    df_text = spark.createDataFrame(tup, ['ANALYTIC','JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'])
    df_text.show()
My output:

+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|   NOV|   DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|    ERG2|   |   |   |   |   |   |   |   |   |   |      |435911|
|     ELE|   |   |   |   |   |   |   |   |   |   |435911|      |
|     PHA|   |   |   |   |   |   |   |   |   |   |435911|      |
|     ELE|   |   |   |   |   |   |   |   |   |   |      |435911|
|     EBM|   |   |   |   |   |   |   |   |   |   |      |512518|
|     PHA|   |   |   |   |   |   |   |   |   |   |      |435911|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|   NOV|   DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|    ERG2|   |   |   |   |   |   |   |   |   |   |      |435911|
|     ELE|   |   |   |   |   |   |   |   |   |   |435911|435911|
|     PHA|   |   |   |   |   |   |   |   |   |   |435911|435911|
|     EBM|   |   |   |   |   |   |   |   |   |   |      |512518|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
Expected output:

+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|   NOV|   DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|    ERG2|   |   |   |   |   |   |   |   |   |   |      |435911|
|     ELE|   |   |   |   |   |   |   |   |   |   |435911|      |
|     PHA|   |   |   |   |   |   |   |   |   |   |435911|      |
|     ELE|   |   |   |   |   |   |   |   |   |   |      |435911|
|     EBM|   |   |   |   |   |   |   |   |   |   |      |512518|
|     PHA|   |   |   |   |   |   |   |   |   |   |      |435911|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|ANALYTIC|JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|   NOV|   DEC|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+
|    ERG2|   |   |   |   |   |   |   |   |   |   |      |435911|
|     ELE|   |   |   |   |   |   |   |   |   |   |435911|435911|
|     PHA|   |   |   |   |   |   |   |   |   |   |435911|435911|
|     EBM|   |   |   |   |   |   |   |   |   |   |      |512518|
+--------+---+---+---+---+---+---+---+---+---+---+------+------+

I am not getting the expected output shown above.

Not sure what `l1 = list(dict.fromkeys(l))` is supposed to do, but this block appends two rows to the list of tuples whenever both conditions are met for the same analytic:

if month == '12':
    tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', '', count))
if month == '11' and anlytic in l1:
    tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', count, ''))
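For reference, `list(dict.fromkeys(l))` simply removes duplicates from a list while preserving first-seen order, since a dict keeps only the first occurrence of each key:

```python
# dict.fromkeys keeps the first occurrence of each key, in insertion order
engines = ['ERG2', 'ELE', 'PHA', 'ELE', 'EBM', 'PHA']
deduped = list(dict.fromkeys(engines))
print(deduped)  # → ['ERG2', 'ELE', 'PHA', 'EBM']
```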
Try appending the DEC and NOV values in the same tuple. Using namedtuples might also help:

if month == '12':
    # (anlytic,) + ('',) * 11 + (count,) builds this 13-field row more compactly
    tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', '', count))
if month == '11' and anlytic in l1:
    # reuse the DEC count already stored for this analytic; analytic_index
    # stands for the position of that earlier row in tup
    tup.append((anlytic, '', '', '', '', '', '', '', '', '', '', count, tup[analytic_index][-1]))
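One way to get a single row per analytic is to merge the NOV and DEC counts in a dict keyed by analytic before building the rows. A minimal sketch with the question's sample values (`records` is a hypothetical stand-in for the parsed rows):

```python
# Merge NOV/DEC counts into one 13-field row per analytic.
records = [
    ('ERG2', '12', '435911'),
    ('ELE',  '11', '435911'),
    ('PHA',  '11', '435911'),
    ('ELE',  '12', '435911'),
    ('EBM',  '12', '512518'),
    ('PHA',  '12', '435911'),
]

merged = {}  # analytic -> {month: count}
for analytic, month, count in records:
    merged.setdefault(analytic, {})[month] = count

# One row per analytic: ANALYTIC, JAN..OCT blank, then NOV and DEC
tup = [
    (analytic,) + ('',) * 10 + (months.get('11', ''), months.get('12', ''))
    for analytic, months in merged.items()
]
```

The resulting `tup` can then be passed to `spark.createDataFrame` with the same column list as in the question.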
To create the DataFrame you need, try the following:

from pyspark.sql import functions as F

df.groupby('ENGINE').pivot('MONTH').agg(F.max('TOTAL_UNIQUE_MEMBER')).show()
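The same reshape can be checked locally with pandas, where `pivot_table` plays the role of Spark's `groupBy(...).pivot(...)` (a sketch using the question's sample data):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ENGINE': ['ERG2', 'ELE', 'PHA', 'ELE', 'EBM', 'PHA'],
    'MONTH':  [12, 11, 11, 12, 12, 12],
    'TOTAL_UNIQUE_MEMBER': [435911, 435911, 435911, 435911, 512518, 435911],
})

# One row per engine, one column per month, max count in each cell
pivoted = df.pivot_table(index='ENGINE', columns='MONTH',
                         values='TOTAL_UNIQUE_MEMBER', aggfunc='max')
print(pivoted)
```

Months with no data for an engine come out as NaN rather than empty strings, but the shape matches the expected output: one row per distinct engine.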

Welcome to SO! Could you provide a reproducible data sample? Just a few rows from your data would do.

I have edited the post and added the data.

Do you want a distinct count of TOTAL_UNIQUE_MEMBER grouped by MONTH and ENGINE?

Yes, I need TOTAL_UNIQUE_MEMBER grouped by month and engine.

Isn't it `df.groupby(['month', 'engine'])['total_unique_member'].nunique()`?

Is there any efficient solution or another approach to get my expected output? Something like: `if month == '12': dec_count = count`, `if month == '11': nov_count = count`, `tup.append((client, analytic, '', '', '', '', '', '', '', '', nov_count, dec_count))`. Is there a better way to get the expected output?
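The `nunique` one-liner from the comments can be tried on the sample data with pandas (column names lower-cased here to match the comment):

```python
import pandas as pd

df = pd.DataFrame({
    'engine': ['ERG2', 'ELE', 'PHA', 'ELE', 'EBM', 'PHA'],
    'month':  [12, 11, 11, 12, 12, 12],
    'total_unique_member': [435911, 435911, 435911, 435911, 512518, 435911],
})

# Number of distinct member counts per (month, engine) group
distinct_counts = df.groupby(['month', 'engine'])['total_unique_member'].nunique()
print(distinct_counts)
```

This yields one entry per (month, engine) pair, which answers the "distinct count per group" reading of the question; the pivot shown in the answer above is what produces the month-per-column layout.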