Python 2.7 DataFrame filter

python-2.7 pandas dataframe

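I have a pandas DataFrame df with the columns expiration, strike, call/put, bid and ask. The index is a datetime. I want to filter the rows where ask=0 and bid=0; those rows are held in the DataFrame df4. I then want to check whether df contains other rows that have the same expiration and call/put values, the same datetime index entry, non-zero bid and ask values, and a strike that lies a defined step above and below the strike of the row with bid=ask=0. If so, some operation should be performed (interpolating bid and ask with respect to strike).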

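I came up with the following code, but it raises KeyError: 'the label [xy] is not in the [index]', which is apparently due to a datetime formatting problem. The script iterates over the rows of the DataFrame.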

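Here are my questions: a) How do I code this so that it works correctly? b) Since my real-world data sample is fairly large, about 2 GB, is there a way to fully vectorize it?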

Here is the code; I hope it at least explains what I am trying to do:

# constructing a sample dataframe
import pandas
import numpy.random as rd
dates = pandas.date_range('1/1/2000', periods=8)
df = pandas.DataFrame(rd.randn(8, 5), index=dates, columns=['call/put', 'expiration',  'strike', 'ask', 'bid'])
df.iloc[2,4]=0
df.iloc[2,3]=0
df.iloc[3,4]=0
df.iloc[3,3]=0
df.iloc[2,2]=0.5
df=df.append(df.iloc[2:3])
df.iloc[8:9,3:5]=1
df.iloc[8:9,2:3]=0.6
df=df.append(df.iloc[8:9])
df.iloc[9,2]=0.4

#filtering for rows with bid=ask=0
df4=df[(df["ask"]==0) & (df["bid"]==0)]

#checking for rows that can be used for bid and ask interpolation
stepsize=0.1
counter=0
for index, row in df4.iterrows():
 print index
 df_upperbound = df.loc[index]
 df_upperbound = df_upperbound[(df_upperbound['call/put']== df4['call/put']) &  (df_upperbound['expiration']== df4['expiration']) & (df_upperbound['strike']== df4['strike']+stepsize)]
 df_lowerbound = df.loc[index]
 df_lowerbound = df_lowerbound[ (df_lowerbound['call/put']== df4['call/put']) & (df_lowerbound['expiration']== df4['expiration']) & (df_lowerbound['strike']== df4['strike']-stepsize)]
 if len(df_upperbound)>0 and len(df_lowerbound)>0:
    is_upperbound = df_upperbound.ask!=0 and df_upperbound.bid!=0
    is_lowerbound = df_lowerbound.ask!=0 and df_lowerbound.bid!=0  
    if is_upperbound and is_lowerbound:
        counter+=1

This is only a workaround, but using strftime() seems to work well:

In[8] stepsize=0.1
      counter=0
      for index,row in df4.iterrows():
          print df[index.strftime('%Y-%m-%d')]


Out[9]             call/put  expiration  strike  ask  bid
       2000-01-03  0.181998   -2.371192     0.5    0    0
       2000-01-03  0.181998   -2.371192     0.6    1    1
       2000-01-03  0.181998   -2.371192     0.4    1    1
                   call/put  expiration    strike  ask  bid
       2000-01-04  0.030905    1.142885 -1.268263    0    0
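
For question a), another option that avoids the label lookup entirely is to select the rows with a boolean mask on the index. The following is only a sketch under stated assumptions, not a tested fix for the original KeyError: it is written against current pandas, where DataFrame.append no longer exists, so pd.concat builds the sample data; the function names make_sample and count_interpolatable are hypothetical. It also compares each candidate row against the scalar values of the current row instead of against whole df4 columns, and reduces the boolean results with .any() rather than applying `and` to a Series.

```python
import numpy as np
import pandas as pd


def make_sample(seed=0):
    # Same layout as the question's sample frame, with a fixed seed (assumption).
    rng = np.random.RandomState(seed)
    dates = pd.date_range('1/1/2000', periods=8)
    cols = ['call/put', 'expiration', 'strike', 'ask', 'bid']
    df = pd.DataFrame(rng.randn(8, 5), index=dates, columns=cols)
    df.iloc[[2, 3], [3, 4]] = 0.0      # third and fourth rows: bid = ask = 0
    df.iloc[2, 2] = 0.5                # strike of the third row
    upper = df.iloc[2:3].copy()        # same timestamp, strike one step higher
    upper[['ask', 'bid']] = 1.0
    upper['strike'] = 0.6
    lower = upper.copy()               # same timestamp, strike one step lower
    lower['strike'] = 0.4
    return pd.concat([df, upper, lower])


def count_interpolatable(df, stepsize=0.1):
    zero = df[(df['ask'] == 0) & (df['bid'] == 0)]
    quoted = df[(df['ask'] != 0) & (df['bid'] != 0)]
    counter = 0
    for ts, row in zero.iterrows():
        # Boolean mask instead of df.loc[ts]: no label lookup is involved.
        same = quoted[(quoted['call/put'] == row['call/put'])
                      & (quoted['expiration'] == row['expiration'])
                      & (quoted.index == ts)]
        # np.isclose guards against floating-point noise in strike +/- stepsize.
        has_upper = np.isclose(same['strike'], row['strike'] + stepsize).any()
        has_lower = np.isclose(same['strike'], row['strike'] - stepsize).any()
        if has_upper and has_lower:
            counter += 1
    return counter


print(count_interpolatable(make_sample()))
```

For the sample frame this should print 1, the value the question names as the intended result.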

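Post some sample data that lets people run the code, and post the desired output for that subset of the data.

A sample DataFrame named df is constructed under "# constructing a sample dataframe". The whole snippet can be run by copying and pasting it. In df there are two rows with bid=ask=0, and for one of them rows with a higher and a lower strike and ask=bid>0 can be found. The intended output of my snippet is counter=1. Instead of counting, I will eventually use the lower and higher strikes to interpolate bid and ask, replacing bid=ask=0 with actual numbers.

Right, but you should seed the random data and show the expected output, otherwise the result will be different on every run.

Don't worry about the random numbers. In my snippet, everything that has to be non-random is overwritten with fixed values. Whenever the code is run, the third and fourth rows of df are set to bid=ask=0, and the third row has rows that can be used for interpolation (the last two rows of df), so the counter ends up at 1.

It works on my side too. But why does comparing a string with a datetime work, while comparing two identical datetime objects raises an error? Could this be a pandas bug?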
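
For question b), and for the seeding point raised in the comments, one possible vectorized formulation replaces the row loop with two merges. This is a rough sketch, not a benchmarked solution for a 2 GB file. Assumptions: strikes lie on a regular grid with spacing stepsize, so each strike can be mapped to an integer grid key; the random data comes from a seeded RandomState; pd.concat stands in for DataFrame.append; and the names make_sample, interpolate_zero_quotes, ts and k are hypothetical. Because the two neighbours sit one step above and one step below, linear interpolation reduces to the midpoint of their quotes.

```python
import numpy as np
import pandas as pd


def make_sample(seed=0):
    # Seeded version of the question's sample frame (assumption: seed 0).
    rng = np.random.RandomState(seed)
    dates = pd.date_range('1/1/2000', periods=8)
    cols = ['call/put', 'expiration', 'strike', 'ask', 'bid']
    df = pd.DataFrame(rng.randn(8, 5), index=dates, columns=cols)
    df.iloc[[2, 3], [3, 4]] = 0.0
    df.iloc[2, 2] = 0.5
    upper = df.iloc[2:3].copy()
    upper[['ask', 'bid']] = 1.0
    upper['strike'] = 0.6
    lower = upper.copy()
    lower['strike'] = 0.4
    return pd.concat([df, upper, lower])


def interpolate_zero_quotes(df, stepsize=0.1):
    work = df.copy()
    work['ts'] = work.index                      # timestamp as a merge key
    work = work.reset_index(drop=True)
    # Integer grid key, so "one step up/down" becomes k + 1 / k - 1.
    work['k'] = np.rint(work['strike'] / stepsize).astype('int64')
    keys = ['ts', 'call/put', 'expiration', 'k']
    cols = keys + ['strike', 'ask', 'bid']

    zero = work[(work['ask'] == 0) & (work['bid'] == 0)]
    quoted = work[(work['ask'] != 0) & (work['bid'] != 0)]
    # Shift the key so each quoted row lines up with the strike it neighbours.
    upper = quoted.assign(k=quoted['k'] - 1)[cols]
    lower = quoted.assign(k=quoted['k'] + 1)[cols]

    out = (zero.merge(upper, on=keys, suffixes=('', '_up'))
               .merge(lower, on=keys, suffixes=('', '_lo')))
    # Drop matches that only arose from rounding off-grid strikes.
    ok = (np.isclose(out['strike_up'] - out['strike'], stepsize)
          & np.isclose(out['strike'] - out['strike_lo'], stepsize))
    out = out[ok].copy()
    out['ask'] = (out['ask_up'] + out['ask_lo']) / 2.0
    out['bid'] = (out['bid_up'] + out['bid_lo']) / 2.0
    return out[['ts', 'call/put', 'expiration', 'strike', 'ask', 'bid']]


print(interpolate_zero_quotes(make_sample()))
```

The inner merges keep only zero-quote rows that have both neighbours, so the number of rows returned plays the role of the counter in the loop version. Writing the interpolated values back into df is left out here, since the duplicated timestamps in the index mean the rows would have to be matched on the key columns rather than on the index alone.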