Python: selecting a DataFrame slice by date string

python pandas

I have a DataFrame that is loaded like this:

        import datetime
        import pandas as pd

        # Merge the Date and Time columns into a single Date_Time index
        minData = pd.read_csv(
                currentSymbol["fullpath"],
                header = None,
                names = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Split Factor', 'Earnings', 'Dividends'],
                parse_dates = [["Date", "Time"]],
                date_parser = lambda x : datetime.datetime.strptime(x, '%Y%m%d %H%M'),
                index_col = "Date_Time",
                sep=' ')
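
As far as I know, recent pandas releases deprecate both date_parser and the nested-list form of parse_dates used above. A minimal sketch of an equivalent load, assuming the same space-separated layout: read Date and Time as strings, then combine them with pd.to_datetime. The two-row in-memory sample stands in for the real file and is purely illustrative.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same space-separated layout as the file above.
sample = io.StringIO(
    "19980102 0930 8.70630 8.70630 8.70630 8.70630 420.73 4 0 0\n"
    "19980102 0935 8.82514 8.82514 8.82514 8.82514 420.73 4 0 0\n"
)

names = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume',
         'Split Factor', 'Earnings', 'Dividends']

# Read Date and Time as strings so leading zeros (e.g. "0930") survive.
raw = pd.read_csv(sample, header=None, names=names, sep=' ',
                  dtype={'Date': str, 'Time': str})

# Combine the two columns and parse them with an explicit format.
raw['Date_Time'] = pd.to_datetime(raw['Date'] + ' ' + raw['Time'],
                                  format='%Y%m%d %H%M')

# Drop the raw string columns and index by the parsed timestamps.
minData = raw.drop(columns=['Date', 'Time']).set_index('Date_Time')
```

Parsing with an explicit format also avoids calling strptime once per row, which tends to matter for a file with more than a million rows.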

The data looks like this:

>>> minData.index
<class 'pandas.tseries.index.DatetimeIndex'>
[1998-01-02 09:30:00, ..., 2013-12-09 16:00:00]
Length: 1373036, Freq: None, Timezone: None
>>> 

>>> minData.head(5)
                        Open     High      Low    Close   Volume  \
Date_Time                                                          
1998-01-02 09:30:00  8.70630  8.70630  8.70630  8.70630   420.73   
1998-01-02 09:35:00  8.82514  8.82514  8.82514  8.82514   420.73   
1998-01-02 09:42:00  8.79424  8.79424  8.79424  8.79424   420.73   
1998-01-02 09:43:00  8.76572  8.76572  8.76572  8.76572  1262.19   
1998-01-02 09:44:00  8.76572  8.76572  8.76572  8.76572   420.73   

                     Split Factor  Earnings  Dividends  Active  
Date_Time                                                       
1998-01-02 09:30:00             4         0          0     NaN  
1998-01-02 09:35:00             4         0          0     NaN  
1998-01-02 09:42:00             4         0          0     NaN  
1998-01-02 09:43:00             4         0          0     NaN  
1998-01-02 09:44:00             4         0          0     NaN  

[5 rows x 9 columns]

>>> minData["2004-12-20"]
                        Open     High      Low    Close     Volume  \
Date_Time                                                            
2004-12-20 09:30:00  35.8574  35.9373  35.8025  35.9273  154112.00   
2004-12-20 09:31:00  35.8924  35.9174  35.8824  35.8874   17021.50   
2004-12-20 09:32:00  35.8874  35.8924  35.8824  35.8824   17079.50   
2004-12-20 09:33:00  35.8874  35.9423  35.8724  35.9373   32491.50   
2004-12-20 09:34:00  35.9373  36.0023  35.9174  36.0023   40096.40   
2004-12-20 09:35:00  35.9923  36.2071  35.9923  36.1471   67088.90   
...

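I have a date like this (read from a different file), held in the Timestamp ts used below.

I want to set the "Active" column to True for every minute of that day.

With a literal date string I can do it like this: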

minData.loc['2004-12-20',"Active"] = True

I can do the same thing for my Timestamp date with this crazy code:

minData.loc[str(ts.year) + "-" + str(ts.month) + "-" + str(ts.day),"Active"] = True

Yes, that builds a string from the Timestamp object.

I know there must be a better way to do this.
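
One less ad hoc way to build that string is Timestamp.strftime. The sketch below assumes ts is a pandas Timestamp; the five-row frame and its Close column are made up so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level frame spanning midnight (illustrative data only).
idx = pd.date_range('2004-12-19 23:58:00', periods=5, freq='min')
df = pd.DataFrame({'Close': np.arange(5.0), 'Active': False}, index=idx)

# Stand-in for the date read from the other file.
ts = pd.Timestamp('2004-12-20')

# Format the Timestamp as a day-resolution string in one call.
day = ts.strftime('%Y-%m-%d')

# A day-resolution string used with .loc selects every row of that day.
df.loc[day, 'Active'] = True
```

The result is the same as the string concatenation above, but the format is stated once and the month and day are zero-padded.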

Actually, I would do it like this:

In [20]: df = DataFrame(np.random.randn(10,1),index=date_range('20130101 23:55:00',periods=10,freq='T'))

In [21]: df['Active'] = False

In [22]: df
Out[22]: 
                            0 Active
2013-01-01 23:55:00  0.273194  False
2013-01-01 23:56:00  2.869795  False
2013-01-01 23:57:00  0.980566  False
2013-01-01 23:58:00  0.176711  False
2013-01-01 23:59:00 -0.354976  False
2013-01-02 00:00:00  0.258194  False
2013-01-02 00:01:00 -1.765781  False
2013-01-02 00:02:00  0.106163  False
2013-01-02 00:03:00 -1.169214  False
2013-01-02 00:04:00  0.224484  False

[10 rows x 2 columns]


In [28]: df['Active'] = False

As @Andy Hayden points out, normalize sets the times to 0 (midnight), so you can compare directly against a Timestamp whose time is 0:

In [34]: df.loc[df.index.normalize() == Timestamp('20130102'),'Active'] = True

In [35]: df
Out[35]: 
                            0 Active
2013-01-01 23:55:00  0.273194  False
2013-01-01 23:56:00  2.869795  False
2013-01-01 23:57:00  0.980566  False
2013-01-01 23:58:00  0.176711  False
2013-01-01 23:59:00 -0.354976  False
2013-01-02 00:00:00  0.258194   True
2013-01-02 00:01:00 -1.765781   True
2013-01-02 00:02:00  0.106163   True
2013-01-02 00:03:00 -1.169214   True
2013-01-02 00:04:00  0.224484   True

[10 rows x 2 columns]
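
The same comparison also works when the date lives in a variable, which is the situation in the question. A minimal sketch, assuming ts is a pandas Timestamp that may carry a time of day; the frame and its value column are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level frame spanning midnight (illustrative data only).
idx = pd.date_range('2013-01-01 23:58:00', periods=5, freq='min')
df = pd.DataFrame({'value': np.arange(5.0), 'Active': False}, index=idx)

# Stand-in for a Timestamp read elsewhere; note the non-zero time of day.
ts = pd.Timestamp('2013-01-02 15:30:00')

# Normalize both sides so only the calendar date is compared.
mask = df.index.normalize() == ts.normalize()

# Boolean mask plus column label: no string formatting needed.
df.loc[mask, 'Active'] = True
```

Normalizing ts as well keeps the comparison correct even if the Timestamp was not created at midnight.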

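For really fine-grained control, do this (you can use indexer_at_time if you only want to use times as the indexer). You can always use and clauses to do more complex indexing.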
In [29]: df.loc[df.index.indexer_between_time('20130101 23:59:00','20130102 00:03:00'),'Active'] = True

In [30]: df
Out[30]: 
                            0 Active
2013-01-01 23:55:00  0.273194  False
2013-01-01 23:56:00  2.869795  False
2013-01-01 23:57:00  0.980566  False
2013-01-01 23:58:00  0.176711  False
2013-01-01 23:59:00 -0.354976   True
2013-01-02 00:00:00  0.258194   True
2013-01-02 00:01:00 -1.765781   True
2013-01-02 00:02:00  0.106163   True
2013-01-02 00:03:00 -1.169214   True
2013-01-02 00:04:00  0.224484  False

[10 rows x 2 columns]
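
To my knowledge, in current pandas indexer_between_time takes times of day rather than full datetimes and returns integer positions, so the call may need adapting. The sketch below shows two hedged variants on an illustrative frame: a datetime window built from two comparisons joined with &, and a time-of-day window whose positions are mapped back to index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same index as the example above (values are made up).
idx = pd.date_range('2013-01-01 23:55:00', periods=10, freq='min')
df = pd.DataFrame({'value': np.arange(10.0)}, index=idx)

# Variant 1: an explicit datetime window, two comparisons joined with &.
df['Active'] = False
window = ((df.index >= pd.Timestamp('2013-01-01 23:59:00')) &
          (df.index <= pd.Timestamp('2013-01-02 00:03:00')))
df.loc[window, 'Active'] = True
by_mask = df['Active'].tolist()

# Variant 2: a time-of-day window. A start later than the end wraps past midnight.
df['Active'] = False
positions = df.index.indexer_between_time('23:59:00', '00:03:00')
# positions are integer locations, so convert them to labels before using .loc.
df.loc[df.index[positions], 'Active'] = True
by_time = df['Active'].tolist()
```

The two variants agree here only because the frame covers a single midnight. On multi-day data the time-of-day window matches 23:59 to 00:03 on every day, while the datetime window matches one specific interval.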

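Awesome, thank you @Jeff! I had been reading about normalize but couldn't work out how to use it in this case. I hadn't read anything about the indexer_between_time method before. I'll look into it. Thanks again!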