Python crosstab: changing the order of columns named as formatted dates (mmm yy)
I have been looking for a way to order the columns of a pandas crosstab, without success. Specifically, I need the formatted-date columns (mmm yy) sorted by their underlying date value, not alphabetically by the three-letter month name (mmm).

Details of my setup:
Python 3.3
pandas 0.12.0
f_dtflt is a DataFrame
f_dtflt.COLLECTION_DATE has dtype datetime64[ns]

My crosstab statement is:
pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%b %y")), margins=True)

The output is:
COLLECTION_DATE    Apr 13  Aug 13  Dec 12  Feb 13  Jan 13  Jul 13  Jun 13
EW_REGIONCOLLSITE
EAST                 1964    2092    2280    2272    2757    2113    1902
WEST                 2579    2011    1003    2351    2216    1506    1823
All                  4543    4103    3283    4623    4973    3619    3725

COLLECTION_DATE    Mar 13  May 13  Nov 12  Oct 12  Sep 13     All
EW_REGIONCOLLSITE
EAST                 1682    1981    2108     825     975   22951
WEST                 2770    3014     407      42     888   20610
All                  4452    4995    2515     867    1863   43561
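I would like the columns sorted in ascending date order: Oct 12, Nov 12, ... Jan 13, ... Sep 13.
I realise I could format the dates as yy-mm (e.g. 13-01), but these labels will be used in a report, and that is a compromise I would rather not make.
I am new to Python and pandas, so please connect the dots in your answer to help a novice. Many thanks!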
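
The behaviour described in the question can be reproduced on a tiny, invented data set. This is only a sketch: the `frame` variable and its four rows are assumptions made for illustration (they are not the asker's data), it uses the `.dt.strftime` accessor from current pandas rather than the `apply(lambda ...)` form shown for pandas 0.12, and it assumes an English locale for `%b`.

```python
import pandas as pd

# Hypothetical miniature of f_dtflt; the rows are invented for illustration.
frame = pd.DataFrame({
    "EW_REGIONCOLLSITE": ["EAST", "WEST", "EAST", "WEST"],
    "COLLECTION_DATE": pd.to_datetime(
        ["2012-10-03", "2013-01-15", "2013-04-20", "2013-04-21"]
    ),
})

# Format each date as "mmm yy" text, as in the question.
labels = frame["COLLECTION_DATE"].dt.strftime("%b %y")
table = pd.crosstab(frame["EW_REGIONCOLLSITE"], labels)

# The labels are plain strings, so the columns are ordered alphabetically:
# April comes before January, which comes before October.
print(list(table.columns))
```

Once the dates have been turned into text, crosstab has no way of knowing they were dates, which is why the fix has to happen either before the crosstab (a sortable label) or after it (relabelling the columns).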
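Method 1: edit based on the first part of @Andy's answer. There is a problem at step 3.
I tried to implement Andy's suggestion; here is more information about that effort.
1) I ran a line to see what the dates look like. It creates values such as "2012-10" for COLLECTION_DATE (as shown by pretty-printing).
2) When that statement is fed into the crosstab, the month values change to numbers such as 513, 514, etc. (the actual values stored in the field?). Here is the output: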
col_0               513   514   515   516   517   518   519   520   521   522
EW_REGIONCOLLSITE
EAST                825  2108  2280  2757  2272  1682  1964  1981  1902  2113
WEST                 42   407  1003  2216  2351  2770  2579  3014  1823  1506
All                 867  2515  3283  4973  4623  4452  4543  4995  3725  3619

col_0               523   524    All
EW_REGIONCOLLSITE
EAST               2092   975  22951
WEST               2011   888  20610
All                4103  1863  43561
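3) When I run the following code, it throws an error saying that an 'int' object has no attribute 'strftime':
table1.columns = table1.columns.map(lambda x: x.strftime("%b %y"))
I made several attempts at this; here are some of my notes: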
# This runs and creates an array of strings: '513' etc.
pd.to_datetime(table1.columns.map(str), unit='M')
# The last entry in table1.columns is "All" and needs to be removed. Hence [:-1] slice.
# This also runs but seems to give years in 1630's.
pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M')
# This does not run because it says object is immutable
table1.columns[:-1]=pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M')
# This also runs but the output is weird. It seems to give an array of both dates and -1
table1.columns.reindex(pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M'))
# Does not run: DatetimeIndex() must be called with a collection of some kind, '513' was passed
table1.columns = table1.columns.map(lambda x: pd.DatetimeIndex(str(x)).strftime("%b %y"))
# Does not run: DatetimeIndex object is not callable
table1.rename(columns=pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M'))
4) This was useful for labelling the columns in the crosstab:
table1.columns.name = 'COLLECTION_DATE'
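
One plausible reading of the integer labels in step 2, not confirmed anywhere in the post, is that they are monthly period ordinals counted from January 1970: 513 months after 1970-01 is October 2012, which matches the first month in the data. Under that assumption, the sketch below turns the integers back into "mmm yy" text. The helper name `month_label` and the sample list are hypothetical, and an English locale is assumed for `%b`.

```python
import numbers

import pandas as pd


def month_label(value):
    """Format a monthly period ordinal as 'Oct 12'; pass anything else through."""
    if isinstance(value, numbers.Integral):
        # Assumption: the integer counts months since 1970-01 (ordinal 0).
        return pd.Period(ordinal=int(value), freq="M").strftime("%b %y")
    return value  # e.g. the "All" margin label


# A few of the column labels from step 2, plus the margin column.
columns = [513, 514, 515, 524, "All"]
relabelled = [month_label(c) for c in columns]
print(relabelled)  # ['Oct 12', 'Nov 12', 'Dec 12', 'Sep 13', 'All']
```

This also explains the error in step 3: the column labels are plain integers at that point, so they have no `strftime` method until they are wrapped in a `Period` again.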
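Method 2: @Andy gave me a second suggestion. I played around with it but could not get it to work. A large part of the problem is my lack of familiarity with Python, pandas and numpy. I made notes for myself as I went along. Here they are: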
# Working with a new concept
# This creates row titles of 12 10, 12 11, etc.
table1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%y %m")), margins=True)
# This throws an error that yb is not defined
table1.columns.map(lambda yb: "%s %s" % (y, b) for y, b in yb.split())
# Tried to simplify and see what happens. Runs and creates an array of lists such as [['12', '10'], ['12', '11']...]
table1.columns.map(lambda x: x.split())
# Trying a different approach. This creates a numpy array of datetimes.
tempholder=table1.columns[:-1].map(lambda x: datetime.datetime(year=int(x[0:2]), month=int(x[3:]), day=1))
# Noted that f_dtflt['COLLECTION_DATE'] was a dtype of datetime64[ns] but tempholder was dtype object. So had issue.
# Convert to datetime64
# Get error: Out of bounds nanosecond timestamp: 12-10-01 00:00:00
tempholder=pd.to_datetime(tempholder)
# Tempholder is an array of datetimes from the datetime module. I used the pandas date function above.
# Need to change that and use python datetime module function.
# Does not work: 'numpy.ndarray' object has no attribute 'apply'...
# this is a pandas function which does not work on a numpy array.
tempholder.apply(lambda x: x.strftime('%b %y'))
# This works for numpy array but I can't tell what it contains.
# print(tempholder) gives <map object at 0x0000000026C04F28>
# tempholder gives Out[169]: <builtins.map at 0x26c04f28>
tempholder=map(lambda x: x.strftime('%b %y'), tempholder)
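
Two of the stumbling blocks in these notes have fairly simple explanations. The "Out of bounds nanosecond timestamp: 12-10-01" error suggests the two-digit year was read literally as the year 12, which a nanosecond timestamp cannot represent; parsing with an explicit `"%y %m"` format avoids that. Also, in Python 3 `map()` returns a lazy iterator, which is why printing it shows `<map object ...>` rather than the values. The sketch below shows one way around both, using only the standard library. The helper name `to_report_label` and the sample labels are hypothetical, and an English locale is assumed for `%b`.

```python
from datetime import datetime


def to_report_label(label):
    """Convert a sortable 'yy mm' label to 'mmm yy'; leave other labels alone."""
    try:
        parsed = datetime.strptime(label, "%y %m")  # '12 10' -> 2012-10-01
    except ValueError:
        return label  # e.g. the "All" margin column
    return parsed.strftime("%b %y")


sortable = ["12 10", "12 11", "13 01", "All"]

# map() is lazy in Python 3, so wrap it in list() to see the values.
lazy = map(to_report_label, sortable)
report_labels = list(lazy)
print(report_labels)  # ['Oct 12', 'Nov 12', 'Jan 13', 'All']
```

The resulting list could then be assigned to `table1.columns`, since it has one entry per column and keeps the chronological order that the `"%y %m"` labels produced.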
If you have the year and month as strings (and in the correct order), you can reverse them:
In [1]: df = pd.DataFrame([['a', 'b']], columns=['12 Mar', '12 Jun'])
In [2]: df.columns.map(lambda yb: ' '.join(reversed(yb.split())))
Out[2]: array(['Mar 12', 'Jun 12'], dtype=object)
In [3]: df.columns = df.columns.map(lambda yb: ' '.join(reversed(yb.split())))
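
Putting the string idea together end to end might look like the sketch below: build the crosstab on a sortable `"%y %m"` label, then relabel the columns afterwards. This is an illustration under stated assumptions rather than the answerer's tested code: the `frame` data is invented, the parsing uses `pd.to_datetime` with an explicit format instead of splitting and reversing the string, and `%b` is assumed to produce English month names.

```python
import pandas as pd

# Hypothetical miniature of f_dtflt; the rows are invented for illustration.
frame = pd.DataFrame({
    "EW_REGIONCOLLSITE": ["EAST", "WEST", "EAST", "WEST"],
    "COLLECTION_DATE": pd.to_datetime(
        ["2012-10-03", "2013-01-15", "2013-04-20", "2013-04-21"]
    ),
})

# Step 1: a label that sorts chronologically as text ("12 10", "13 01", ...).
sortable = frame["COLLECTION_DATE"].dt.strftime("%y %m")
table = pd.crosstab(frame["EW_REGIONCOLLSITE"], sortable, margins=True)

# Step 2: relabel every column except the trailing "All" margin.
month_cols = table.columns[:-1]
pretty = pd.to_datetime(month_cols, format="%y %m").strftime("%b %y")
table.columns = list(pretty) + ["All"]

print(table)
```

Assigning to `table.columns` only renames the columns; it does not reorder them, so the chronological order from step 1 is preserved.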
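I had suggested doing this with Periods and then cleaning the columns up into the desired format afterwards, but that seems to change the PeriodIndex to int (possibly a bug?).

I approached the problem from a slightly different angle and wrote a function that can be used as a general way of ordering the columns of a crosstab in pandas. It probably works for pivot tables too, but I have not tested that or looked into the details. I suppose it could also be used to order row labels, but I have not tried.

This creates a crosstab with column labels such as "12 10_Oct 12" and "12 11_Nov 12". The label effectively forces the crosstab's alphabetical sort to work in my favour: the alphabetically sortable part of the label is joined by "_" to the label I actually want to use.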
table_1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%y %m_%b %y")), margins=True)
Output:
COLLECTION_DATE    12 10_Oct 12  12 11_Nov 12  12 12_Dec 12  13 01_Jan 13
EW_REGIONCOLLSITE
EAST                        825          2108          2280          2757
WEST                         42           407          1003          2216
All                         867          2515          3283          4973

COLLECTION_DATE    13 02_Feb 13  13 03_Mar 13  13 04_Apr 13  13 05_May 13
EW_REGIONCOLLSITE
EAST                       2272          1682          1964          1981
WEST                       2351          2770          2579          3014
All                        4623          4452          4543          4995

COLLECTION_DATE    13 06_Jun 13  13 07_Jul 13  13 08_Aug 13  13 09_Sep 13
EW_REGIONCOLLSITE
EAST                       1902          2113          2092           975
WEST                       1823          1506          2011           888
All                        3725          3619          4103          1863

COLLECTION_DATE      All
EW_REGIONCOLLSITE
EAST               22951
WEST               20610
All                43561
The function and the call:
def clean_label(label_list, margins=False):
    ''' This function takes the column index list from a crosstab (or pivot table?) in pandas and removes the
    part of the label before and including the "_". This allows the user to order the columns manually by creating
    an alphabetical index followed by "_" and then the label that they would like to use. For example, a label such as
    ['a_Positive', 'b_Negative'] will be converted to ['Positive', 'Negative']. Another example would be to order dates
    in a table from ['12 10_Oct 12', '12 11_Nov 12'] to ['Oct 12', 'Nov 12']
    margins = False if the crosstab was created without margins and therefore does not have an "All" at the end of the list
    margins = True if the crosstab was created with margins and therefore has an "All" at the end of the list
    '''
    # Note: the default is the boolean False. The string 'False' would be
    # truthy and would make the function behave as if margins=True.
    corrected_list = []
    # If one creates margins in pivot/crosstab, will get the last column of "All"
    # "All" has no "_" in it, so it has to be excluded from the split below or it will throw an error.
    if margins:
        convert_list = label_list[:-1]
    else:
        convert_list = label_list
    for label in convert_list:
        # Split on the first "_" only: sort key before it, display label after it
        sort_key, display_label = label.split('_', 1)
        corrected_list.append(display_label)
    if margins:
        corrected_list.append('Total')  # Renames "All" to "Total"
    return corrected_list

# Change the labels on the crosstab table
table_1.columns = clean_label(table_1.columns, margins=True)
# Change name of columns
table_1.columns.name = 'Month of Collection'
# Change name of rows
table_1.index.name = 'Region'
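
To see the sort-prefix idea run from start to finish without the original data, here is a compact sketch. It is a variation on the function above, not a copy of it: the helper `strip_sort_prefix` is a hypothetical name, it keeps "All" unchanged instead of renaming it to "Total", and it needs no `margins` flag because a label without "_" is returned as it is. The `frame` data is invented, and an English locale is assumed for `%b`.

```python
import pandas as pd


def strip_sort_prefix(labels):
    """Drop everything up to and including the first '_' in each label."""
    # "12 10_Oct 12" -> "Oct 12"; a label with no "_" (such as "All") is unchanged.
    return [str(label).split("_", 1)[-1] for label in labels]


# Hypothetical miniature of f_dtflt; the rows are invented for illustration.
frame = pd.DataFrame({
    "EW_REGIONCOLLSITE": ["EAST", "WEST", "EAST", "WEST"],
    "COLLECTION_DATE": pd.to_datetime(
        ["2012-10-03", "2013-01-15", "2013-04-20", "2013-04-21"]
    ),
})

# Sortable key and display label joined by "_", as in the answer.
keyed = frame["COLLECTION_DATE"].dt.strftime("%y %m_%b %y")
table_1 = pd.crosstab(frame["EW_REGIONCOLLSITE"], keyed, margins=True)

table_1.columns = strip_sort_prefix(table_1.columns)
table_1.columns.name = "Month of Collection"
table_1.index.name = "Region"
print(table_1)
```

The same helper would work for any labels of the form "key_label", for example ['a_Positive', 'b_Negative'], provided the display part is what follows the first underscore.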
Output (final table):
Month of Collection  Oct 12  Nov 12  Dec 12  Jan 13  Feb 13  Mar 13  Apr 13
Region
EAST                    825    2108    2280    2757    2272    1682    1964
WEST                     42     407    1003    2216    2351    2770    2579
All                     867    2515    3283    4973    4623    4452    4543

Month of Collection  May 13  Jun 13  Jul 13  Aug 13  Sep 13   Total
Region
EAST                   1981    1902    2113    2092     975   22951
WEST                   3014    1823    1506    2011     888   20610
All                    4995    3725    3619    4103    1863   43561

Why don't you sort your COLLECTION_DATE column first and then perform the crosstab?
@EdChum, yes, I tried that. Crosstab forces its own sort order. I could not find a crosstab parameter that lets me control the sort order or turn it off.
Andy, your answer got me very close. I have added an edit to my original post showing where I am getting hung up. This is my first time posting, so I hope this is the right way to do it.
@StacyL.Gardner I can't see your edit! :( Where does it go wrong?
@StacyL.Gardner Apparently my Period solution doesn't work (it looks like crosstab loses the period information, a bug!) :( How about the string approach: do year-month first (so it sorts correctly), then switch it round using the df.columns = ... line at the end of my answer?
I have been playing with both of your suggestions today, but could not get either of them to work. I have included my notes in the edited post. Thanks for your help.
@StacyL.Gardner Oh, sorry that the string one raised an error, I should have checked it. I have now corrected and actually tested the syntax above! Sorry it took so long!
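
On the point raised in the comments that crosstab forces its own sort order: in current pandas, one option that was not discussed in the thread is to make the labels an ordered categorical, so that the column order follows the category order rather than the alphabet. The sketch below is an assumption-laden illustration, not something from the original answers: the `frame` data is invented, it relies on pandas features that may not exist in 0.12, it assumes an English locale for `%b`, and it leaves out `margins=True`, which is not exercised here.

```python
import pandas as pd

# Hypothetical miniature of f_dtflt; the rows are invented for illustration.
frame = pd.DataFrame({
    "EW_REGIONCOLLSITE": ["EAST", "WEST", "EAST", "WEST"],
    "COLLECTION_DATE": pd.to_datetime(
        ["2012-10-03", "2013-01-15", "2013-04-20", "2013-04-21"]
    ),
})

dates = frame["COLLECTION_DATE"]
labels = dates.dt.strftime("%b %y")

# Unique labels in chronological order: sort the dates first, then format.
chronological = dates.sort_values().dt.strftime("%b %y").unique()

# An ordered categorical carries its own sort order into the crosstab.
month_type = pd.CategoricalDtype(categories=chronological, ordered=True)
ordered_labels = labels.astype(month_type)

table = pd.crosstab(frame["EW_REGIONCOLLSITE"], ordered_labels)
print(list(table.columns))
```

The appeal of this route is that the labels shown in the report are the final ones from the start, so no relabelling step is needed afterwards.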