Python: Can someone explain this pandas groupby behavior?

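I am working through an introductory pandas course on Coursera. The first line of the code below is a small part of a DataFrame containing weather data with maximum and minimum temperatures, labeled TMAX and TMIN respectively. I don't understand why groupby behaves this way: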

from io import StringIO
import pandas as pd

# The JSON is wrapped in StringIO because newer pandas versions deprecate passing a literal JSON string.
df = pd.read_json(StringIO('{"Data_Value":{"10073":3.3,"10079":-4.4,"17153":15.0,"17155":-1.1,"18049":5.6,"18066":-3.9,"18232":-1.7,"18261":5.6,"1860":15.0,"1906":-1.7,"19769":-3.3,"19772":12.8,"2035":-1.7,"2073":14.4,"24805":-5.6,"24863":4.4,"2812":-5.0,"3058":12.8,"31715":15.6,"31718":-4.4,"32266":15.0,"32274":-5.0,"35479":-3.9,"35771":12.2,"35785":-3.9,"39454":2.8,"39468":-2.8,"39565":-2.2,"39569":14.4,"41309":-3.9,"41334":3.3,"49030":15.0,"49074":-3.9,"49823":15.0,"49827":-3.9,"55067":-2.8,"55102":6.7,"55424":15.0,"55428":-4.4,"60994":13.3,"60995":0.0},"Date":{"10073":1104537600000,"10079":1104537600000,"17153":1104537600000,"17155":1104537600000,"18049":1104537600000,"18066":1104537600000,"18232":1104537600000,"18261":1104537600000,"1860":1104537600000,"1906":1104537600000,"19769":1104537600000,"19772":1104537600000,"2035":1104537600000,"2073":1104537600000,"24805":1104537600000,"24863":1104537600000,"2812":1104537600000,"3058":1104537600000,"31715":1104537600000,"31718":1104537600000,"32266":1104537600000,"32274":1104537600000,"35479":1104537600000,"35771":1104537600000,"35785":1104537600000,"39454":1104537600000,"39468":1104537600000,"39565":1104537600000,"39569":1104537600000,"41309":1104537600000,"41334":1104537600000,"49030":1104537600000,"49074":1104537600000,"49823":1104537600000,"49827":1104537600000,"55067":1104537600000,"55102":1104537600000,"55424":1104537600000,"55428":1104537600000,"60994":1104537600000,"60995":1104537600000},"Element":{"10073":"TMAX","10079":"TMIN","17153":"TMAX","17155":"TMIN","18049":"TMAX","18066":"TMIN","18232":"TMIN","18261":"TMAX","1860":"TMAX","1906":"TMIN","19769":"TMIN","19772":"TMAX","2035":"TMIN","2073":"TMAX","24805":"TMIN","24863":"TMAX","2812":"TMIN","3058":"TMAX","31715":"TMAX","31718":"TMIN","32266":"TMAX","32274":"TMIN","35479":"TMIN","35771":"TMAX","35785":"TMIN","39454":"TMAX","39468":"TMIN","39565":"TMIN","39569":"TMAX","41309":"TMIN","41334":"TMAX","49030":"TMAX","49074":"TMIN","49823":"TMAX","49827":"TMIN","55067":"TMIN","55102":"TMAX","55424":"TMAX","55428":"TMIN","60994":"TMAX","60995":"TMIN"},"ID":{"10073":"USW00014833","10079":"USW00014833","17153":"USC00207320","17155":"USC00207320","18049":"USW00014853","18066":"USW00014853","18232":"USC00205050","18261":"USC00205050","1860":"USC00202308","1906":"USC00205822","19769":"USC00205450","19772":"USC00205450","2035":"USC00202308","2073":"USC00203712","24805":"USW00094889","24863":"USW00094889","2812":"USC00203712","3058":"USC00205822","31715":"USC00205451","31718":"USC00205451","32266":"USC00208202","32274":"USC00208202","35479":"USC00201502","35771":"USC00200230","35785":"USC00200230","39454":"USC00205563","39468":"USC00205563","39565":"USC00200842","39569":"USC00200842","41309":"USC00208080","41334":"USC00208080","49030":"USC00207312","49074":"USC00207312","49823":"USC00200228","49827":"USC00200228","55067":"USC00200032","55102":"USC00200032","55424":"USC00207308","55428":"USC00207308","60994":"USW00004848","60995":"USW00004848"}}'))
dfmax1 = df.groupby(["Date"]).max()
print(dfmax1)
Result:

            Data_Value Element           ID
Date                                       
2005-01-01        15.6    TMIN  USW00094889
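Looking up the maximum temperature returns an element labeled TMIN. That element is not in the original DataFrame, although the maximum temperature value itself is correct.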

Applying a mask that keeps only the TMAX values fixes the problem:

dfmax2 = (df[df['Element']=='TMAX']).groupby(["Date"]).max()

In addition, trying to get the minimum temperature with the min function returns an element labeled TMAX. Can someone explain why this happens?

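The first detail worth mentioning is that read_json, by default, attempts to convert date-like columns to a datetime type. One of the cases in which a column is treated as date-like is when its name is simply "date" (case-insensitive), which applies to the Date column here.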
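
A minimal sketch of that conversion, using two rows copied from the question's data. The variable names (`raw`, `converted`, `untouched`) are only illustrative, and the exact datetime resolution shown in the dtype may differ between pandas versions:

```python
from io import StringIO

import pandas as pd

# Two rows taken from the question's data; Date holds epoch milliseconds.
raw = (
    '{"Data_Value":{"31715":15.6,"31718":-4.4},'
    '"Date":{"31715":1104537600000,"31718":1104537600000}}'
)

# Default behaviour: the column named "Date" is treated as date-like.
converted = pd.read_json(StringIO(raw))

# With convert_dates=False the raw integers are kept as they are.
untouched = pd.read_json(StringIO(raw), convert_dates=False)

print(converted.dtypes)
print(untouched.dtypes)
```

Comparing the two printed dtype listings shows that only the default call turns Date into a datetime column.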

You can confirm this by running df.info() and checking that the output contains:

Date  ...  datetime64[ns]
Note also that every row has the same date, 2005-01-01, so df.groupby puts all rows into a single group.
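
One way to check the single-group claim is to ask the groupby object how many groups it found. The sketch below rebuilds four rows of the question's data by hand, so the small frame is an assumption standing in for the full 41-row DataFrame:

```python
import pandas as pd

# Four rows copied from the question; every row carries the same date.
df = pd.DataFrame(
    {
        "Data_Value": [3.3, -4.4, 15.6, -4.4],
        "Date": pd.to_datetime(["2005-01-01"] * 4),
        "Element": ["TMAX", "TMIN", "TMAX", "TMIN"],
    },
    index=[10073, 10079, 31715, 31718],
)

groups = df.groupby("Date")
print(groups.ngroups)  # number of distinct dates
print(groups.size())   # rows per date
```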

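Another detail is that, in this case, max is applied to each column separately and returns the maximum value of each column. The values in the resulting row therefore do not have to come from the same source row.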

Since you have only one group, you can check this by running:

df.Data_Value.max()
df.Element.max()
df.ID.max()
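as separate commands; you will get the same values. Note also that TMIN is the last value of the Element column in sort order (strings are compared lexicographically, and "TMIN" sorts after "TMAX" because "I" comes after "A"), so it is the maximum value in this column.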
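
The column-by-column behaviour can be reproduced on a handful of rows. This is a sketch on six rows copied from the question, not the full dataset; the names `combined` and `row` are illustrative:

```python
import pandas as pd

# Six rows copied from the question, all on the same date.
df = pd.DataFrame(
    {
        "Data_Value": [3.3, -4.4, 15.6, -4.4, 4.4, -5.6],
        "Date": pd.to_datetime(["2005-01-01"] * 6),
        "Element": ["TMAX", "TMIN", "TMAX", "TMIN", "TMAX", "TMIN"],
        "ID": [
            "USW00014833", "USW00014833",
            "USC00205451", "USC00205451",
            "USW00094889", "USW00094889",
        ],
    },
    index=[10073, 10079, 31715, 31718, 24863, 24805],
)

# groupby(...).max() takes the maximum of every column independently.
combined = df.groupby("Date").max()
row = combined.iloc[0]

# Each value matches the plain per-column maximum.
print(row["Data_Value"], df["Data_Value"].max())
print(row["Element"], df["Element"].max())
print(row["ID"], df["ID"].max())

# String comparison is lexicographic, so "TMIN" ranks above "TMAX".
print("TMIN" > "TMAX")
```

The resulting row mixes the temperature from one station with the label and ID from others, which is the mismatch the question describes.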

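I also disagree that TMIN is not present in the DataFrame. Print df (it contains only 41 rows) and you will see that the Element column contains only "TMIN" and "TMAX" values.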

To get the whole row, including its original index, that holds the maximum temperature for each date, you can run:

df.reset_index().groupby(["Date"])\
    .apply(lambda grp: grp.loc[grp.Data_Value.idxmax()])\
    .set_index('index', drop=True)
The result is:

18     Data_Value       Date Element           ID
index                                            
31715        15.6 2005-01-01    TMAX  USC00205451
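Note: the leading 18 is the new index (after reset_index) of the row found to hold the maximum. It is printed only when the grouping contains a single group; if there are more groups, it is not shown.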
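
As an alternative to the apply-based version, a commonly used idiom is to look up the row labels with idxmax (or idxmin) and pass them to df.loc. The sketch below is a suggestion rather than part of the original answer, and it again uses six rows copied from the question; `hottest` and `coldest` are illustrative names:

```python
import pandas as pd

# Six rows copied from the question, all on the same date.
df = pd.DataFrame(
    {
        "Data_Value": [3.3, -4.4, 15.6, -4.4, 4.4, -5.6],
        "Date": pd.to_datetime(["2005-01-01"] * 6),
        "Element": ["TMAX", "TMIN", "TMAX", "TMIN", "TMAX", "TMIN"],
        "ID": [
            "USW00014833", "USW00014833",
            "USC00205451", "USC00205451",
            "USW00094889", "USW00094889",
        ],
    },
    index=[10073, 10079, 31715, 31718, 24863, 24805],
)

# idxmax/idxmin return the index label of the extreme row in each group;
# df.loc then fetches those rows whole, so the columns stay consistent.
hottest = df.loc[df.groupby("Date")["Data_Value"].idxmax()]
coldest = df.loc[df.groupby("Date")["Data_Value"].idxmin()]

print(hottest)
print(coldest)
```

Because whole rows are selected, the Element and ID values always belong to the same observation as the temperature, and the original index is preserved without a reset_index/set_index round trip.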

Please choose a more expressive title.