Natural sorting of a Pandas DataFrame

python python-2.7 sorting pandas

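I have a pandas DataFrame with an index that I want to sort naturally. natsort doesn't seem to work. Sorting the index before building the DataFrame doesn't seem to help, because the manipulations I do to the DataFrame seem to mess up the ordering in the process. Any thoughts on how I can sort the index naturally?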

from natsort import natsorted
import pandas as pd

# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted 
c = natsorted(a)

# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted Incorrectly
df2 = df.sort()
# Natsort doesn't seem to work
df3 = natsorted(df)

print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)

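If you want to sort the df, just sort the index or the data and assign it directly to the index of the df, rather than trying to pass the df as an argument, as that yields an empty list: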

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

Note that df.index = natsorted(df.index) also works.

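If you pass the df as an argument, it yields an empty list, in this case because the df is empty (it has no columns). Otherwise it would return the sorted column labels, which is not what you want: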

In [10]:

natsorted(df)
Out[10]:
[]

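EDIT

If you want to sort the index so that the data is reordered along with the index, use reindex, as in df.reindex(index=natsorted(df.index)).

Note that you have to assign the result of reindex to either a new df or to itself; it does not accept the inplace param.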

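Now that pandas supports key in both sort_values and sort_index, you should refer to the answer that uses the key parameter and send all upvotes there, as it is now the correct answer. I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.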

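The accepted answer answers the question being asked. I'd also like to add how to use natsort on columns in a DataFrame, since that will be the next question asked.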

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted, index_natsorted, order_by_index

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df
Out[4]: 
         a   b
0hr     a5  b1
128hr   a1  b1
72hr   a10  b2
48hr    a2  b2
96hr   a12  b1

As the accepted answer shows, sorting by the index is fairly trivial:

In [5]: df.reindex(index=natsorted(df.index))
Out[5]: 
         a   b
0hr     a5  b1
48hr    a2  b2
72hr   a10  b2
96hr   a12  b1
128hr   a1  b1

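If you want to sort on a column in the same manner, you need to sort the index by the order in which the desired column was reordered. natsort provides the convenience functions index_natsorted and order_by_index to do just that.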

In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

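If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given is the primary sorting column, then the secondary, then the tertiary, and so on.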

In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]: 
         a   b
0hr     a5  b1
96hr   a12  b1
128hr   a1  b1
48hr    a2  b2
72hr   a10  b2

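Here is an alternate method using Categorical objects that I have been told by the pandas devs is the "proper" way to do this. It requires (as far as I can see) pandas >= 0.16.0. Currently it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex, which will allow this method to be used on an index.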

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df.a = df.a.astype('category')

In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)

In [6]: df.b = df.b.astype('category')

In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)

In [9]: df.sort('a')
Out[9]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [10]: df.sort('b')
Out[10]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

In [11]: df.sort(['b', 'a'])
Out[11]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

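The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

I leave it to the user to decide whether this is better than the reindex method, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).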

Full disclosure: I am the natsort author.

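Using sort_values for pandas >= 1.1.0

With the new key argument in DataFrame.sort_values, available since pandas 1.1.0, we can sort a column directly, without setting it as the index, by passing natsort_keygen() as the key.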
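Hi, natsort developer here. natsort does not currently support handling whole DataFrame objects. What is the expected output of passing in a DataFrame object?

I believe this misses the point. I realize I can naturally sort a and use that as the index, but because of the operations I perform on the DataFrame, my actual code messes up the ordering of the DataFrame index. I need the index and the associated data together in the DataFrame.

So what are you asking here, do you want to sort the index after your data manipulation? You could just use reindex and call natsorted on the index: df.reindex(index=natsorted(df.index))

@EdChum Yes, that sounds like exactly what they want. I think ultimately this is the correct answer.

@SethMMorton sorry, reindex is one of the few functions that doesn't accept the inplace param, so yes, you have to assign it to itself.

@SethMMorton I guess I would expect df3.index to come out naturally sorted, with the data sorted as well so that it stays consistent with the index values.

It would be nice if pd.sort had a key option, but it doesn't. There is a workaround that lets you pass a key generated by natsort_keygen. I just made a formal request to the pandas devs to add key to the sort methods.

My request above turned out to be a dupe. pandas now has a key argument for sort_values, and that should now be the accepted answer.

This proposed solution is a "best effort" solution. Wouldn't key=natsort_keygen() be less effort?

Agreed, updated my answer accordingly. Thanks for writing the beautiful package :) @SethMMorton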

# Example for the sort_values answer (pandas >= 1.1.0)
import pandas as pd

df = pd.DataFrame({
    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    "value": [10, 20, 30, 40, 50]
})

    time  value
0    0hr     10
1  128hr     20
2   72hr     30
3   48hr     40
4   96hr     50

from natsort import natsort_keygen

df.sort_values(
    by="time",
    key=natsort_keygen()
)

    time  value
0    0hr     10
3   48hr     40
2   72hr     30
4   96hr     50
1  128hr     20