Python pandas DataFrame.apply() assigns values to the wrong DataFrame columns


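My code uses DataFrame.apply() to call a function. The function returns multiple values as a pandas.Series. However, DataFrame.apply() assigns the values to the wrong columns.

The code below tries to return dte, mark and iv. The values are printed just before the return statement to verify them.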

import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, GoodFriday
from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm


# dff = Daily Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
tradingMinutesDay = 450                                 # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400                            # trading minutes per day * 252 trading days per year
USFedCal = get_calendar('USFederalHolidayCalendar')     # Load US Federal holiday calendar
USFedCal.rules.pop(7)                                   # Remove Veteran's Day
USFedCal.rules.pop(6)                                   # Remove Columbus Day
tradingCal = HolidayCalendarFactory('TradingCalendar', USFedCal, GoodFriday)    # Add Good Friday
cal = tradingCal()


def newtonRap(row):
    # Initialize variables
    dte, mark, iv = 0.0, 0.0, 0.0
    if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
       row['TimeStamp'] == row['Expiry']:
        iv, vega = 0.0, 0.0         # Set iv and vega to zero if option contract is invalid or expired
    else:
        # dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
        #   minus USFederalHolidays minus constant of 1 for the TimeStamp date
        dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
                    len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
        mark = (row['Bid'] + row['Ask']) / 2
        cp = 1 if row['OptType'] == 'C' else -1
        S = row['RootPrice']
        K = row['Strike']
        T = (dte * tradingMinutesDay) / tradingMinutesAnnum
        iv = sqrt(2 * pi / T) * mark / S        # Initialize IV (Brenner and Subrahmanyam 1988)
        vega = 0.0                              # Initialize vega
        for i in range(1, 100):
            d1 = (log(S / K) + T * (rf + iv ** 2 / 2)) / (iv * sqrt(T))
            d2 = d1 - iv * sqrt(T)
            vega = S * norm.pdf(d1) * sqrt(T)
            model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
            iv -= (model - mark) / vega
            if abs(model - mark) < 1.0e-5:
                break
        if isnan(iv) or isnan(vega):
            iv, vega = 0.0, 0.0
    print('DTE', dte, 'Mark', mark, 'newtRaphIV', iv)
    return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})


if __name__ == "__main__":
    # sample  data
    col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
    df = pd.DataFrame({'Ask': {0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998},
                       'Bid': {0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996},
                       'Expiry': {0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00')},
                       'IV': {0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842},
                       'Last': {0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15},
                       'OpenInt': {0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0},
                       'OpraSymbol': {0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000'},
                       'OptType': {0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P'},
                       'RootPrice': {0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95},
                       'RootSymbol': {0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL'},
                       'Strike': {0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0},
                       'TimeStamp': {0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00')},
                       'Volume': {0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0}})
    df = df[col_order]


    df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
    print(df[['DTE', 'Mark', 'newtRaphIV']])
This is not the behavior I expected. What is going on?

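df.apply(newtonRap, axis=1) is a DataFrame with columns ['DTE', 'Mark', 'IV'], but the order of those columns is not guaranteed (the reason is explained below). To fix the column order, you can either fix the order of the index of the Series returned by newtonRap: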

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])
or fix the column order after df.apply returns:

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]
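The first option is better, because df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']] creates two intermediate DataFrames, namely df.apply(newtonRap, axis=1) and df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']], whereas the first option builds the correct DataFrame from the start.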
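The two options can be compared on a toy frame. This is only a minimal sketch: `mid_and_spread`, the two-row `quotes` frame and the column names 'Mid', 'Spread' and 'Width' are hypothetical stand-ins, not code from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the option quotes.
quotes = pd.DataFrame({'Bid': [3.6, 2.34], 'Ask': [3.7, 2.42]})


def mid_and_spread(row):
    # Hypothetical stand-in for newtonRap: returns two values per row.
    mid = (row['Bid'] + row['Ask']) / 2
    spread = row['Ask'] - row['Bid']
    # An explicit index pins down the order of the resulting columns.
    return pd.Series((mid, spread), index=['Mid', 'Spread'])


# Option 1: the applied result already has its columns in the intended order.
option1 = quotes.copy()
option1[['Mark', 'Width']] = quotes.apply(mid_and_spread, axis=1)

# Option 2: select the columns in the intended order after apply returns,
# which builds one more intermediate DataFrame.
option2 = quotes.copy()
option2[['Mark', 'Width']] = quotes.apply(mid_and_spread, axis=1)[['Mid', 'Spread']]

print(option1.equals(option2))
```

Both routes should end with the same frame; the difference is only in how many temporary DataFrames are built along the way.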


DataFrame assignment aligns on the index, but not on the columns:

Note that an assignment of the form

df[['C','E','D']] = other_df
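aligns on the index, not on the column names. So it does not matter what the column names of df.apply(newtonRap, axis=1) are. For example, it would not help to change

return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

by renaming the 'IV' key to 'newtRaphIV', so that the column names of df.apply(newtonRap, axis=1) match those of df[['DTE', 'Mark', 'newtRaphIV']]. If that appears to work, it is only by dumb luck: the column order returned by df.apply(newtonRap, axis=1) happened to match the desired order. To confirm this claim, consider the example: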

import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(3,2)), columns=list('AB'))
new = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('CDE'), index=[2,1,0])
#    C  D  E
# 2  0  1  2
# 1  3  4  5
# 0  6  7  8

df[['C','E','D']] = new
#    A  B  C  E  D
# 0  7  9  6  7  8
# 1  4  9  3  4  5
# 2  8  2  0  1  2
Note that the indexes of new and df were aligned, but there was no alignment based on column labels.
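Since list assignment pairs columns by position, one way to stop depending on column order altogether is to join the new columns instead, which matches on labels. This is an alternative sketch rather than the approach taken in this answer, and the small frames below are made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]})
# Columns are in the order C, D, E and the rows are in reverse index order.
new = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('CDE'), index=[2, 1, 0])

# Positional pairing: 'E' receives new['D'] and 'D' receives new['E'].
by_position = df.copy()
by_position[['C', 'E', 'D']] = new

# Label-based pairing: join matches rows on the index and keeps column names.
by_label = df.join(new[['C', 'E', 'D']])

print(by_position)
print(by_label)
```

For the question's case, the analogous step would presumably be to rename 'IV' to 'newtRaphIV' with DataFrame.rename(columns=...) on the applied result and then join it onto df.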


Fixing the order of the columns of the DataFrame returned by apply:

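Note that dict keys are unordered in Python versions before 3.7; from 3.7 on, dicts preserve insertion order. In other words, on those older versions the keys may come out in any order when you iterate over the dict. In fact, in CPython 3.3 to 3.5, where string hashing is randomized per process, dict.keys() may return the same keys in a different order on each run of the same code.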

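Because the order of the dict keys cannot be relied on,

pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})
is a Series whose index order cannot be relied on, so df.apply(newtonRap, axis=1) is a DataFrame whose columns may appear in an unexpected order.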
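The mismatch can be reproduced deterministically. The sketch below assumes Python 3.7+ and a recent pandas, where a Series built from a dict keeps the dict's insertion order, so the keys are deliberately written out of order to imitate an unexpected ordering. `toy_row` and its constant values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Bid': [1.0, 2.0], 'Ask': [1.2, 2.2]})


def toy_row(row):
    # Hypothetical stand-in for newtonRap with easily recognisable values.
    dte = 12.0
    mark = (row['Bid'] + row['Ask']) / 2
    iv = 0.3
    # Keys are written in the "wrong" order on purpose.
    return pd.Series({'DTE': dte, 'IV': iv, 'Mark': mark})


result = df.apply(toy_row, axis=1)
print(list(result.columns))

# Columns are paired by position, so 'Mark' receives the IV values
# and 'newtRaphIV' receives the Mark values.
df[['DTE', 'Mark', 'newtRaphIV']] = result
print(df)
```

Printing the frame should show 0.3 under 'Mark' and the bid/ask midpoints under 'newtRaphIV', which is the kind of swap described in the question.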

However, if you use

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])
then the order of the Series index is fixed. As a result, df.apply(newtonRap, axis=1) has a fixed column order, and

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)

works as desired.

unutbu, amazing answer. Your solution works perfectly. There is so much to learn. Thank you very much for your answer.