Pandas: expand sparse data, groupby and datetime into a new DataFrame?


datetime pandas indexing


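I've been reading some really good posts here over the past few days; unfortunately it is now my turn to ask, because I have the following problem:

I read a large DataFrame (df) from a CSV, with c. 20 columns and all kinds of variables, including float, object, string, integer and datetime. The datetime column was not recognized, so I first converted the corresponding object column (let's call it "pub") and normalized it into another column (since I only need daily granularity for further processing):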

df.pub = pd.to_datetime(df.pub, format='%d/%m/%Y  %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()
df.set_index(['pub']) # indexing in df remained accurate

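This all works fine. I then ran several groupby operations (= countifs) on other columns, keyed on "pub_day". Again, these all work fine and give the correct aggregated numbers, e.g.: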

df['counted_if'] = df['some_no'].groupby(df['pub_day']).transform('sum')

My "pub" and "pub_day" columns are not continuous: some days are missing from my CSV entirely, and some days occur multiple times.

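Now the problem: what I want to do next is write the correctly computed groupby results as new columns into a new DataFrame df2 in continuous form, which means adding rows for the days missing from "pub_day" and dropping the rows in which a given day appears a second time. FYI: when I add a new column for a groupby operation in the first df, the groupby values remain correct and are simply repeated when a day occurs multiple times in "pub_day".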
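The repeated values described above come from transform('sum'), which returns one value per input row, whereas a plain sum() returns one value per distinct day. A minimal sketch of the difference, using a hypothetical three-row miniature of the data (the values are taken from the sample further down, the frame itself is an assumption):

```python
import pandas as pd

# Hypothetical miniature of the question's data: two rows share a day.
df = pd.DataFrame({
    "pub_day": pd.to_datetime(["2002-02-02", "2002-02-02", "2002-02-01"]),
    "some_no": [20, 4, 3],
})

# transform('sum') keeps the original shape: one value per input row,
# so the daily total is repeated for every row of that day.
per_row = df.groupby("pub_day")["some_no"].transform("sum")

# sum() collapses each group: one value per distinct day, indexed by pub_day.
per_day = df.groupby("pub_day")["some_no"].sum()

print(per_row.tolist())
print(per_day)
```

If only the per-day totals are needed in df2, aggregating with sum() avoids the separate de-duplication step.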

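I've tried a lot of things and read a lot about reindex (including fill_value), set_index and so on, but I still can't figure it out.

So, how do I: (1) export the column ['counted_if'] into a second DataFrame? (2) set the day-level datetime column "pub_day" as the index of df2? (3) drop the duplicate entries in this 1-column/1-index df2? (4) manipulate the index so that all days are included, even the empty ones, so that I end up with a discrete time series with one entry per day?

Honestly, I know how to do each of steps (1)-(4) on its own, but somehow they only seem to work when tested separately... My combined code is messy, runs to many lines and raises indexing errors... Is there a quick 5-10 line solution?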

Update: here is a broader picture of the code: --> sample of the df data (some numbers):

--> what it should look like in df2:

            ['counted_if']
02/02/2002        24
01/02/2002        3
31/01/2002        5
30/01/2002        0 (or NaN or whatever..)
29/01/2002        0
28/01/2002        0
27/01/2002        0
26/01/2002        6
  .....

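A seemingly promising but unsuccessful attempt:

Hope this clarifies things. I also tried many different combinations. A solution would be highly appreciated.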

I've prepared a solution with test data so that it can be verified more easily.

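Can you add a data sample?

Hi, sure, I'll try to extract some lines; please give me about 10-15 minutes.

I'll give you an example. Thanks. But I need a sample of your data, 5-6 rows, and the solution has to be verifiable. Maybe you can edit your question: add the input (5-6 rows of sample data), the desired output (derived from that input), what you have tried, and what the errors were.

Sure, no worries! :)

Actually, for anyone who might be interested, here is a small add-on question: suppose we have several groupby-based counted_if columns (all integers); in your corresponding line, instead of print s, do we simply use s=output.colums['counted_if1','counted_if2','counted_if3','counted_if4']?

This gives a KeyError for 'pub_day' on the next line: df2=df1.drop_duplicates(subset=['pub_day'],keep='first')

I mean, for example, how can I make it end up looking like this:

            ['counted_if']  ['20']
2002-01-26        6           3
2002-01-27        0           0
2002-01-28        0           0
2002-01-29        0           0
2002-01-30        0           0
2002-01-31        5          26
2002-02-01        3           8
2002-02-02       24          34

One question: are the dates in the pub and pub_day columns the same?

Yes, but "pub_day" is normalized to the daily level as described above, so the hours and minutes in "pub_day" all become 00. In "pub" they hold the specific hours and minutes.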
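To make the normalization mentioned in the last comment concrete, here is a small sketch comparing .dt.normalize() with .dt.date. The two timestamps are taken from the sample data; the variable names are only illustrative.

```python
import pandas as pd

pub = pd.Series(pd.to_datetime(["02/02/2002 13:03", "31/01/2002 09:03"],
                               format="%d/%m/%Y %H:%M"))

# normalize() keeps a datetime64 dtype and sets the time part to 00:00.
normalized = pub.dt.normalize()

# dt.date returns plain Python datetime.date objects in an object-dtype Series.
as_dates = pub.dt.date

print(normalized.dtype, normalized.iloc[0])
print(as_dates.dtype, as_dates.iloc[0])
```

Keeping the datetime64 dtype matters later on, because the continuous index built with pd.date_range is a DatetimeIndex, and the reindex step matches labels most reliably when both sides hold the same kind of value.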

import pandas as pd
import io

temp=u"""1;2;some_no;18;pub;20;pub_day;counted_if
ab;xy;20;abc;02/02/2002 13:03;2;02/02/2002;24
de;it;4;aso;02/02/2002 11:08;32;02/02/2002;24
hi;as;3;asd;01/02/2002 17:30;8;01/02/2002;3
zu;lu;4;akr;31/01/2002 11:03;12;31/01/2002;5
da;fu;1;lts;31/01/2002 09:03;14;31/01/2002;5
la;di;6;unu;26/01/2002 08:07;3;26/01/2002;6"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print(df)
    1   2  some_no   18               pub  20     pub_day  counted_if
0  ab  xy       20  abc  02/02/2002 13:03   2  02/02/2002          24
1  de  it        4  aso  02/02/2002 11:08  32  02/02/2002          24
2  hi  as        3  asd  01/02/2002 17:30   8  01/02/2002           3
3  zu  lu        4  akr  31/01/2002 11:03  12  31/01/2002           5
4  da  fu        1  lts  31/01/2002 09:03  14  31/01/2002           5
5  la  di        6  unu  26/01/2002 08:07   3  26/01/2002           6

df['pub'] = pd.to_datetime(df.pub, format='%d/%m/%Y %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()
# added inplace=True so that the new index is actually stored in df
df.set_index('pub', inplace=True)
# cleaner groupby syntax
df['counted_if'] = df.groupby('pub_day')['some_no'].transform('sum')
print(df)
                      1   2  some_no   18  20    pub_day  counted_if
pub                                                                 
2002-02-02 13:03:00  ab  xy       20  abc   2 2002-02-02          24
2002-02-02 11:08:00  de  it        4  aso  32 2002-02-02          24
2002-02-01 17:30:00  hi  as        3  asd   8 2002-02-01           3
2002-01-31 11:03:00  zu  lu        4  akr  12 2002-01-31           5
2002-01-31 09:03:00  da  fu        1  lts  14 2002-01-31           5
2002-01-26 08:07:00  la  di        6  unu   3 2002-01-26           6

# the intermediate copy df2=df is not necessary
df2 = df.drop_duplicates(subset=['pub_day'], keep='first')

# simpler: select the needed columns directly
df2 = df2[['counted_if', 'pub_day']]
print(df2)
                     counted_if    pub_day
pub                                       
2002-02-02 13:03:00          24 2002-02-02
2002-02-01 17:30:00           3 2002-02-01
2002-01-31 11:03:00           5 2002-01-31
2002-01-26 08:07:00           6 2002-01-26

# the selection above drops all 20 columns except counted_if and pub_day,
# so only 2 columns remain here: pub_day and counted_if

# reset the current index (pub) first, then set the column pub_day as the new index
df2 = df2.reset_index().set_index('pub_day')
# pub_day is now the index, so use df2.index, not df2.pub_day
idx = pd.date_range(df2.index.min(), df2.index.max())
print(idx)
DatetimeIndex(['2002-01-26', '2002-01-27', '2002-01-28', '2002-01-29',
               '2002-01-30', '2002-01-31', '2002-02-01', '2002-02-02'],
              dtype='datetime64[ns]', freq='D')

# the series is the column counted_if
s = df2.counted_if
print(s)
pub_day
2002-02-02    24
2002-02-01     3
2002-01-31     5
2002-01-26     6
Name: counted_if, dtype: int64

# the index is already a DatetimeIndex, so this conversion can be omitted
# s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
2002-01-26     6
2002-01-27     0
2002-01-28     0
2002-01-29     0
2002-01-30     0
2002-01-31     5
2002-02-01     3
2002-02-02    24
Freq: D, Name: counted_if, dtype: int64
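For the single-column case, the steps above can probably be condensed. The following is a sketch rather than the answer's own method: it swaps transform('sum') plus drop_duplicates for a plain groupby sum, and swaps the explicit date_range/reindex for Series.asfreq. The trimmed CSV keeps only the some_no and pub columns of the sample data, and the names csv_text and daily are hypothetical.

```python
import io
import pandas as pd

# Trimmed version of the sample data: only the columns needed here.
csv_text = """some_no;pub
20;02/02/2002 13:03
4;02/02/2002 11:08
3;01/02/2002 17:30
4;31/01/2002 11:03
1;31/01/2002 09:03
6;26/01/2002 08:07"""

df = pd.read_csv(io.StringIO(csv_text), sep=";")
df["pub"] = pd.to_datetime(df["pub"], format="%d/%m/%Y %H:%M")

daily = (
    df.groupby(df["pub"].dt.normalize())["some_no"]
      .sum()                       # one total per distinct day
      .asfreq("D", fill_value=0)   # insert the missing days with 0
      .rename("counted_if")
)
daily.index.name = "pub_day"
print(daily)
```

Because sum() already yields one row per day, no duplicates have to be dropped, and asfreq("D") fills every calendar day between the first and last date.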

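# Edit following the comments: aggregate several columns, starting again from the raw CSV data
df = pd.read_csv(io.StringIO(temp), sep=";")
print(df)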
    1   2  some_no   18               pub  20     pub_day  counted_if
0  ab  xy       20  abc  02/02/2002 13:03   2  02/02/2002          24
1  de  it        4  aso  02/02/2002 11:08  32  02/02/2002          24
2  hi  as        3  asd  01/02/2002 17:30   8  01/02/2002           3
3  zu  lu        4  akr  31/01/2002 11:03  12  31/01/2002           5
4  da  fu        1  lts  31/01/2002 09:03  14  31/01/2002           5
5  la  di        6  unu  26/01/2002 08:07   3  26/01/2002           6

df['pub'] = pd.to_datetime(df.pub, format='%d/%m/%Y %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()

df.set_index('pub', inplace=True)

# select pub_day (for grouping) plus the columns to aggregate (counted_if, 20, ...)
# note: counted_if from the CSV is already a per-day total, so summing it again
# doubles the days that have two rows (48 instead of 24, 10 instead of 5)
df1 = df[['pub_day', 'counted_if','20']].groupby('pub_day').transform('sum').reset_index()
print(df1)
                  pub  counted_if  20
0 2002-02-02 13:03:00          48  34
1 2002-02-02 11:08:00          48  34
2 2002-02-01 17:30:00           3   8
3 2002-01-31 11:03:00          10  26
4 2002-01-31 09:03:00          10  26
5 2002-01-26 08:07:00           6   3

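# the dates in pub_day and pub are the same, so rebuild pub_day from pub
# (dt.normalize() keeps the datetime64 dtype used by the reindex below;
# dt.date would yield plain Python date objects instead)
df1['pub_day'] = df1['pub'].dt.normalize()
print(df1)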
                  pub  counted_if  20     pub_day
0 2002-02-02 13:03:00          48  34  2002-02-02
1 2002-02-02 11:08:00          48  34  2002-02-02
2 2002-02-01 17:30:00           3   8  2002-02-01
3 2002-01-31 11:03:00          10  26  2002-01-31
4 2002-01-31 09:03:00          10  26  2002-01-31
5 2002-01-26 08:07:00           6   3  2002-01-26

df2 = df1.drop_duplicates(subset='pub_day', keep='first')
print(df2)
                  pub  counted_if  20     pub_day
0 2002-02-02 13:03:00          48  34  2002-02-02
2 2002-02-01 17:30:00           3   8  2002-02-01
3 2002-01-31 11:03:00          10  26  2002-01-31
5 2002-01-26 08:07:00           6   3  2002-01-26

# keep the aggregated columns (counted_if, 20, ...) and pub_day, which becomes the new index
df2 = df2[['counted_if','pub_day', '20']]
print(df2)
   counted_if     pub_day  20
0          48  2002-02-02  34
2           3  2002-02-01   8
3          10  2002-01-31  26
5           6  2002-01-26   3

df2 = df2.reset_index(drop=True).set_index('pub_day')

idx = pd.date_range(df2.index.min(), df2.index.max())
# print(idx)

df2 = df2.reindex(idx, fill_value=0)
print(df2)
            counted_if  20
2002-01-26           6   3
2002-01-27           0   0
2002-01-28           0   0
2002-01-29           0   0
2002-01-30           0   0
2002-01-31          10  26
2002-02-01           3   8
2002-02-02          48  34
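The 48 and 10 above arise because counted_if in the CSV is already a per-day total and is summed a second time. A possible variant that reproduces the table requested in the comments (24 and 34 for 2002-02-02) is to aggregate the raw columns instead. This is only a sketch under that assumption: it sums some_no and 20 directly, the trimmed CSV and the name csv_text are hypothetical, and the final descending sort is added merely to mirror the newest-first layout in the question.

```python
import io
import pandas as pd

# Trimmed version of the sample data: the raw columns to aggregate plus pub.
csv_text = """some_no;pub;20
20;02/02/2002 13:03;2
4;02/02/2002 11:08;32
3;01/02/2002 17:30;8
4;31/01/2002 11:03;12
1;31/01/2002 09:03;14
6;26/01/2002 08:07;3"""

df = pd.read_csv(io.StringIO(csv_text), sep=";")
df["pub_day"] = pd.to_datetime(df["pub"], format="%d/%m/%Y %H:%M").dt.normalize()

# One row per day, summing the raw columns (not the pre-aggregated counted_if).
df2 = (
    df.groupby("pub_day")[["some_no", "20"]]
      .sum()
      .rename(columns={"some_no": "counted_if"})
)

# Continuous daily index; days without data get 0 in every column.
idx = pd.date_range(df2.index.min(), df2.index.max(), freq="D")
df2 = df2.reindex(idx, fill_value=0).sort_index(ascending=False)
print(df2)
```

Further columns can be added to the list passed after groupby, which also covers the follow-up question about several counted_if columns.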