Python function to flag the first 30% of transactions in a dataframe

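I need to create a function or loop over a dataframe containing 7000 rows of transaction data, so that I can find the transactions that occurred first, grouped into percentage brackets.

The data has already been sorted in ascending order by the date column using pySpark, as follows:

sorted_df = df.orderBy(asc("date"))

I now need a function that finds the first 30% of rows in the dataframe and flags them in a new column; in this case that is the first 2100 rows (7000 * 0.3). I would then like to extend the function to add further flags for the rows in the 40%, 50%, 60% transaction brackets. The next part of the problem is applying this to a set of different months in the data, because the df above has been filtered down to a single month to keep things simple. I am stuck here because I am new to writing functions, and I would like to use this as a learning opportunity.
Many thanks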

Is something like this what you are looking for?

def flag_dataframe(df):
    df = df.reset_index(drop=True)  # renumber rows 0..n-1 so the row position follows the sorted order
    df.insert(len(df.columns), "Flag", None)  # create the Flag column
    flags = [30, 40, 50, 60, 70, 80, 90, 100]  # the flag percentages
    for i, row in df.iterrows():  # i is the row position after the reset above
        for flag in flags:
            if i / len(df) * 100 < flag:  # first bracket this row falls into (strict <, so exactly 30% of rows get "30%")
                df.loc[i, "Flag"] = f"{flag}%"  # set the flag for this row (same column name as created above)
                break  # stop here so a larger bracket does not overwrite the flag
    return df
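You can improve this by removing the flags list from the function body and passing it in as a parameter with your own values. In this example I simply used the flags you listed in your question.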
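As a rough sketch of that suggestion, the brackets could be passed in as a parameter and the row loop replaced with boolean masks. The name `flag_rows` and the `flags` parameter are my own, not part of the answer above, and the sketch assumes the data is already a pandas dataframe sorted by date (for a PySpark dataframe, something like `sorted_df.toPandas()` first).

```python
import pandas as pd

def flag_rows(df, flags=(30, 40, 50, 60, 70, 80, 90, 100)):
    """Label each row with the smallest percentage bracket its position falls into.

    Hypothetical helper: `flags` holds the bracket percentages to use.
    """
    out = df.reset_index(drop=True)  # positions 0..n-1 in the existing sort order
    n = len(out)
    out["Flag"] = None
    # Apply the largest bracket first so that smaller brackets overwrite it.
    for flag in sorted(flags, reverse=True):
        # Integer comparison avoids floating-point rounding at bracket edges.
        mask = out.index * 100 < flag * n
        out.loc[mask, "Flag"] = f"{flag}%"
    return out
```

With the default brackets every row gets a flag. With a shorter list such as `flags=(30, 40, 50, 60)`, rows past the last bracket keep `None`.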

As for your follow-up question about applying this separately to the records of each month:

def flag_dataframe_by_month(df):
    df = df.reset_index(drop=True)  # renumber rows 0..n-1; drop=True stops pandas from keeping the old index as an extra "index" column
    df.insert(len(df.columns), "Flag", None)  # create the Flag column
    flags = [30, 40, 50, 60, 70, 80, 90, 100]  # the flag percentages
    months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
    for month in months:
        month_df = df[df["Month"] == month].copy()  # rows of this month; they keep their row labels from the full dataframe
        month_df.insert(len(month_df.columns), "month_rec_index", list(range(len(month_df))))  # position within the month; not part of the result
        for i, row in month_df.iterrows():  # i is the row label in the full dataframe
            for flag in flags:
                if row["month_rec_index"] / len(month_df) * 100 < flag:  # first bracket this row falls into
                    df.loc[i, "Flag"] = f"{flag}%"  # set the flag for this row in the full dataframe
                    break  # stop here so a larger bracket does not overwrite the flag
    return df
Usage is the same. If your month names are different, or your months are stored as numbers, just edit the entries in the months list.
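If you would rather not maintain the month list by hand, one possible alternative is to let pandas group by whatever values the month column holds. This is only a sketch: `flag_rows_by_month` and the `month_col` parameter are names I made up, and it assumes a pandas dataframe that is already sorted by date with a column identifying the month.

```python
import pandas as pd

def flag_rows_by_month(df, month_col="Month", flags=(30, 40, 50, 60, 70, 80, 90, 100)):
    """Flag rows by percentage bracket, computed separately within each month.

    Hypothetical helper: `month_col` names the column that identifies the month.
    """
    out = df.reset_index(drop=True)
    groups = out.groupby(month_col)
    position = groups.cumcount()                 # 0-based position inside each month
    size = groups[month_col].transform("size")   # number of rows in that row's month
    out["Flag"] = None
    # Apply the largest bracket first so that smaller brackets overwrite it.
    for flag in sorted(flags, reverse=True):
        mask = position * 100 < flag * size      # integer comparison, no rounding issues
        out.loc[mask, "Flag"] = f"{flag}%"
    return out
```

Because the groups come from the data itself, this works the same way whether months are stored as names, numbers, or year-month strings.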


I also edited a few lines of the original answer, because they were producing warnings.

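What do you mean by "flag"?

Sorry, I have revised the question slightly. I mean that I want the function to create a new column and essentially flag the first 30% of rows, then flag the others at 40, 50, 60% and so on. Thank you.

Dear Wessel, thank you so much, this is exactly what I needed, and now I can apply it myself next time. Many thanks.

Dear Wessel, sorry, I hope you don't mind me asking one more question, but I think you can help me easily. Suppose I now have a huge dataframe with many months of data, and I want to perform the same operation for each month in it. How would I do that? For the question above I simplified the problem by filtering the initial dataframe down to a single month of data.

I don't mind, but I think it would be best to also edit your question, to follow the Stack Overflow rules.

Thank you very much Wessel, I will certainly revise the question, but I previously got into trouble for asking two questions at once and my question was closed.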