Grouping nearby column values in a Python DataFrame

python python-3.x pandas dataframe

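I have a dataframe with some number of columns, say n, and some number of rows, say m. I want to group its rows by the values of one column ('x'), but not by exact matches on 'x': I need rows whose 'x' values are close to each other to end up in the same group. For example, my dataframe looks like this: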

      y    yh     x    xw       w   Nxt
0   2987  3129   347  2092  1735.0   501
1   2715  2847   501  1725  1224.0   492
2   2419  2716   490  2196  1704.0   492
3   2310  2373   492   794   302.0   886
4   2309  2370   886  1012   126.0   492
5   2198  2261   497   791   299.0   886
6   2197  2258   886  1010   124.0   492
7   1663  2180   375  1092   600.0  1323

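In the dataframe above, when the 'x' values of several rows differ by no more than 20, I need to group those rows into a new dataframe; the remaining rows can be set aside. Here, the rows with index 1, 2, 3 and 5 can form one group and the rows with index 4 and 6 another, because the differences between their 'x' values are within 20. My expected output is three dataframes: df1 holding one group of rows, df2 holding the other group, and df3 holding the remaining rows, as follows: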

df1：

df2：

df3：

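I tried groupby with apply and groupby with transform, but without success. It would be a great help if someone could show me how to get this expected output. Thanks in advance.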

Here is my implementation of the problem, as I understand it:

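import pandas as pd

# df is the dataframe from the question.
# Map each distinct 'x' value to the index labels of the rows that hold it.
group = df.groupby("x").groups

def neighbour(temp):
    """Return the distinct sets of values in temp that lie within 20 of some value."""
    temp_final = []
    for i in range(len(temp)):
        # Collect every value within 20 of temp[i], including temp[i] itself.
        t = [temp[j] for j in range(len(temp)) if abs(temp[i] - temp[j]) <= 20]
        temp_final.append(sorted(t))

    # Drop duplicate neighbourhoods (the original post passed the empty list
    # `final` here instead of `temp_final`, so nothing was ever returned).
    unique = set(frozenset(sublist) for sublist in temp_final)
    return [sorted(s) for s in unique]

# `val` was not defined in the original post; it is assumed to be the result
# of calling neighbour() on the distinct 'x' values.
val = neighbour(list(group.keys()))

dataframes = {}
for i in range(len(val)):
    key_name = "dataframe_" + str(i)
    # Gather the index labels of every row whose 'x' is in this neighbourhood.
    index = [label for item in val[i] for label in group[item]]
    # DataFrame.append was removed in pandas 2.0; select the rows by label instead.
    dataframes[key_name] = df.loc[index]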

The resulting output is the dataframes dictionary printed further below.

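To group the values in column 'x' that are within 20 of each other, you can use shift to create a column named 'group' that marks, once the rows are sorted by 'x', every position where the gap between two consecutive rows is more than 20: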

df = df.sort_values('x')
# mark with 1 every row whose 'x' is more than 20 above the previous row's 'x'
df.loc[(df.x.shift() < df.x - 20),'group'] = 1
# cumsum numbers the marked rows, ffill copies each number down to the rows
# that follow, and fillna(0) labels the rows before the first jump
df['group'] = df['group'].cumsum().ffill().fillna(0)
# if the original index order matters, add df = df.sort_index() here; the code after is the same
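The same labelling can also be written with diff instead of shift. The sketch below is an alternative form, not the answer's own code: it rebuilds the sample data from the question so that it runs on its own, and it produces integer group labels rather than floats.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "y":   [2987, 2715, 2419, 2310, 2309, 2198, 2197, 1663],
    "yh":  [3129, 2847, 2716, 2373, 2370, 2261, 2258, 2180],
    "x":   [347, 501, 490, 492, 886, 497, 886, 375],
    "xw":  [2092, 1725, 2196, 794, 1012, 791, 1010, 1092],
    "w":   [1735.0, 1224.0, 1704.0, 302.0, 126.0, 299.0, 124.0, 600.0],
    "Nxt": [501, 492, 492, 886, 492, 886, 492, 1323],
})

df = df.sort_values("x")
# diff() is the gap to the previous row; a gap above 20 starts a new group,
# and the running count of such gaps serves as the group label.
df["group"] = (df["x"].diff() > 20).cumsum()

print(df[["x", "group"]])
```

Note that this compares consecutive sorted values only, so a long chain of rows each within 20 of the next ends up in one group even if its two ends are further apart than 20.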

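Now you can create a list with one dataframe for each group that contains more than one row. You need to use groupby on 'group' and filter the groups whose length is greater than 1. Finally, add all the groups of length 1 as a single dataframe: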
list_df = [df_g for name_g, df_g in df.groupby('group').filter(lambda x: len(x)>1).groupby('group')] +\
            [df.groupby('group').filter(lambda x: len(x)==1)]

Each element of the list is then one of the dataframes you want, for example:
print (list_df [0])
      y    yh    x    xw       w  Nxt  group
2  2419  2716  490  2196  1704.0  492    2.0
3  2310  2373  492   794   302.0  886    2.0
5  2198  2261  497   791   299.0  886    2.0
1  2715  2847  501  1725  1224.0  492    2.0

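or, for the remaining rows, list_df [-1].
I know you wanted a name for each dataframe, but I think they are easier to access if they are in a list.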
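If named frames are still preferred, a dictionary keyed by name is one option. The following is only a sketch under assumptions: the names frames, df1, df2 and rest are made up for illustration, only the 'x' column of the sample data is used to keep it short, and the group label is computed with diff rather than with the shift-based code above.

```python
import pandas as pd

# Only the 'x' column from the question, to keep the sketch short.
df = pd.DataFrame({"x": [347, 501, 490, 492, 886, 497, 886, 375]})
df = df.sort_values("x")
df["group"] = (df["x"].diff() > 20).cumsum()

# For every row, the number of rows in the group it belongs to.
size = df.groupby("group")["x"].transform("size")

# One named frame per group with more than one row ...
frames = {
    f"df{i}": g
    for i, (_, g) in enumerate(df[size > 1].groupby("group"), start=1)
}
# ... plus a single frame collecting the rows that have no neighbour.
frames["rest"] = df[size == 1]

for name, frame in frames.items():
    print(name, sorted(frame.index))
```

A dictionary keeps the frames addressable by name while still allowing iteration, which avoids creating a variable number of separate variables.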
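The difference between rows 4 and 7 is also less than 20, should those rows be grouped separately as well?
The x values of rows 0 and 7 are within 20 of each other, so should they be in another dataframe? Same for rows 4 and 6? My question is, how do you decide that df1 holds only the group of x values around 500?
@user2699, yes, those rows can also be grouped separately into another dataframe, like df3...
How do you know the range within which they can be grouped together?
@Ben.T, yes, rows 4 and 6 should be another dataframe, df3... For rows 0 and 7, I have edited the values; that was a typo, sorry.
Awesome... this is exactly what I was expecting. Thank you very much for understanding my problem and helping me find the right solution.
Thanks for your solution and effort... but I am more interested in getting this from some built-in functions rather than the traditional loop approach...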
dataframes

{'dataframe_0':      Nxt       w      x      xw       y      yh
5  886.0   299.0  497.0   791.0  2198.0  2261.0
2  492.0  1704.0  490.0  2196.0  2419.0  2716.0
3  886.0   302.0  492.0   794.0  2310.0  2373.0
1  492.0  1224.0  501.0  1725.0  2715.0  2847.0, 'dataframe_1':       Nxt       w      x      xw       y      yh
0   501.0  1735.0  357.0  2092.0  2987.0  3129.0
7  1323.0   600.0  375.0  1092.0  1663.0  2180.0, 'dataframe_2':      Nxt      w      x      xw       y      yh
4  492.0  126.0  886.0  1012.0  2309.0  2370.0
6  492.0  124.0  886.0  1010.0  2197.0  2258.0}

After sorting by 'x' and filling in the 'group' column, the dataframe looks like this:

      y    yh    x    xw       w   Nxt  group
0  2987  3129  347  2092  1735.0   501    0.0
7  1663  2180  375  1092   600.0  1323    1.0
2  2419  2716  490  2196  1704.0   492    2.0
3  2310  2373  492   794   302.0   886    2.0
5  2198  2261  497   791   299.0   886    2.0
1  2715  2847  501  1725  1224.0   492    2.0
4  2309  2370  886  1012   126.0   492    3.0
6  2197  2258  886  1010   124.0   492    3.0

print (list_df [-1])
      y    yh    x    xw       w   Nxt  group
0  2987  3129  347  2092  1735.0   501    0.0
7  1663  2180  375  1092   600.0  1323    1.0