Python Pandas dataframe.apply() misapplies values to dataframe columns
My code calls a function using dataframe.apply(). The function returns multiple values using pandas.Series. However, dataframe.apply() applies the values to the wrong columns.
The code below attempts to return dte, mark, and iv. These values are printed just before the return statement so they can be verified.
import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, GoodFriday
from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm
# dff = Daily Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
tradingMinutesDay = 450 # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400 # trading minutes per day * 252 trading days per year
USFedCal = get_calendar('USFederalHolidayCalendar') # Load US Federal holiday calendar
USFedCal.rules.pop(7) # Remove Veteran's Day
USFedCal.rules.pop(6) # Remove Columbus Day
tradingCal = HolidayCalendarFactory('TradingCalendar', USFedCal, GoodFriday) # Add Good Friday
cal = tradingCal()
def newtonRap(row):
    # Initialize variables
    dte, mark, iv = 0.0, 0.0, 0.0
    if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
            row['TimeStamp'] == row['Expiry']:
        iv, vega = 0.0, 0.0  # Set iv and vega to zero if option contract is invalid or expired
    else:
        # dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
        # minus USFederalHolidays minus constant of 1 for the TimeStamp date
        dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
                    len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
        mark = (row['Bid'] + row['Ask']) / 2
        cp = 1 if row['OptType'] == 'C' else -1
        S = row['RootPrice']
        K = row['Strike']
        T = (dte * tradingMinutesDay) / tradingMinutesAnnum
        iv = sqrt(2 * pi / T) * mark / S  # Initialize IV (Brenner and Subrahmanyam 1988)
        vega = 0.0  # Initialize vega
        for i in range(1, 100):
            d1 = (log(S / K) + T * (rf + iv ** 2 / 2)) / (iv * sqrt(T))
            d2 = d1 - iv * sqrt(T)
            vega = S * norm.pdf(d1) * sqrt(T)
            model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
            iv -= (model - mark) / vega
            if abs(model - mark) < 1.0e-5:
                break
        if isnan(iv) or isnan(vega):
            iv, vega = 0.0, 0.0
    print 'DTE', dte, 'Mark', mark, 'newtRaphIV', iv
    return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})
if __name__ == "__main__":
    # sample data
    col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
    df = pd.DataFrame({'Ask': {0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998},
                       'Bid': {0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996},
                       'Expiry': {0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00')},
                       'IV': {0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842},
                       'Last': {0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15},
                       'OpenInt': {0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0},
                       'OpraSymbol': {0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000'},
                       'OptType': {0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P'},
                       'RootPrice': {0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95},
                       'RootSymbol': {0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL'},
                       'Strike': {0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0},
                       'TimeStamp': {0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00')},
                       'Volume': {0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0}})
    df = df[col_order]
    df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
    print df[['DTE', 'Mark', 'newtRaphIV']]
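This is not the behavior I expected. What is happening?

df.apply(newtonRap, axis=1) is a DataFrame with columns ['DTE', 'Mark', 'IV'], but the order of those columns is not guaranteed (see below for the reason). To fix the order of the DataFrame columns, you can either fix the order of the index of the Series returned by newtonRap:
return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])
or fix the column order after df.apply returns:
df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]
The first option is better, because df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']] creates two intermediate DataFrames, namely df.apply(newtonRap, axis=1) and df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']], whereas the first option creates the correct DataFrame from the start.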
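The two options can be sketched on a toy DataFrame. This is only an illustration: the data and the row function mid_and_spread are hypothetical and are not part of the original question.

```python
import pandas as pd

# Toy data standing in for the option quotes (hypothetical values).
quotes = pd.DataFrame({'Bid': [3.6, 2.34], 'Ask': [3.7, 2.42]})

def mid_and_spread(row):
    # Hypothetical helper that returns two values per row.
    mid = (row['Bid'] + row['Ask']) / 2
    spread = row['Ask'] - row['Bid']
    # Option 1: an explicit index pins the order of the returned labels.
    return pd.Series((mid, spread), index=['Mid', 'Spread'])

result = quotes.apply(mid_and_spread, axis=1)
# The target names may differ from the returned labels; position decides.
quotes[['Mark', 'Width']] = result

# Option 2: select the columns in the desired order after apply returns.
quotes[['Mark2', 'Width2']] = quotes.apply(mid_and_spread, axis=1)[['Mid', 'Spread']]
```

Both assignments should fill the new columns with the same values; option 2 simply builds one extra intermediate DataFrame along the way.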
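DataFrame assignment aligns on the index but not on the columns:
Note that assignments of the form
df[['C','E','D']] = other_df
align based on the index, not on the column names. So it does not matter what the column names of df.apply(newtonRap, axis=1) are. For example, it would not help to change
return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})
to
return pd.Series({'DTE': dte, 'Mark': mark, 'newtRaphIV': iv})
so that the column names of df.apply(newtonRap, axis=1) match those of df[['DTE','Mark','newtRaphIV']]. If that did work, it would be dumb luck: the column order returned by df.apply(newtonRap, axis=1) would just happen to match the desired order. To confirm this claim, consider the example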
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(3,2)), columns=list('AB'))
new = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('CDE'), index=[2,1,0])
#    C  D  E
# 2  0  1  2
# 1  3  4  5
# 0  6  7  8
df[['C','E','D']] = new
# Columns A and B hold random values, so they differ from run to run.
#    A  B  C  E  D
# 0  7  9  6  7  8
# 1  4  9  3  4  5
# 2  8  2  0  1  2
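Notice that the indices of new and df were aligned, but there was no alignment based on column labels: column E received the values of new['D'], and column D received the values of new['E'].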
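As a deterministic sketch of the same behavior (the frames left and right are made up for illustration), the snippet below contrasts the positional assignment with one way to get label-based placement: selecting the right-hand columns by name in the same order as the target list.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'A': [10, 20, 30]})
right = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     columns=list('CDE'), index=[2, 1, 0])

# Positional on columns: 'E' receives right['D'] and 'D' receives right['E'].
positional = left.copy()
positional[['C', 'E', 'D']] = right

# Selecting by name first keeps each label with its own data.
by_label = left.copy()
by_label[['C', 'E', 'D']] = right[['C', 'E', 'D']]
```

In both cases the rows are matched on the index, so row 0 of the result takes its values from the row labelled 0 in right, even though that row is stored last.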
Fixing the order of the DataFrame columns returned by apply:
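Note that dict keys were unordered in the Python versions current when this was written. In other words, when iterated over, dict keys could appear in any order. In fact, with the hash randomization that Python 3 enabled by default from version 3.3, dict.keys() could return the same keys in a different order each time the same code was run. (Since Python 3.7, dicts preserve insertion order, so this concern applies to older interpreters.)
Since the order of the dict keys is indeterminate, pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv}) is a Series whose index order is indeterminate, and therefore df.apply(newtonRap, axis=1) is a DataFrame whose columns appear in an indeterminate order.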
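A short sketch of the point about index order, using made-up numbers: when a Series is built from a dict alone, its index order is derived from the dict and so depends on the Python and pandas versions in use, whereas an explicit index states the order outright.

```python
import pandas as pd

values = {'DTE': 12.0, 'Mark': 3.65, 'IV': 0.35}  # made-up numbers

# Index order is derived from the dict, so it depends on the
# Python and pandas versions in use.
from_dict = pd.Series(values)

# Explicit index with a tuple: the order is exactly as written.
from_tuple = pd.Series((12.0, 3.65, 0.35), index=['DTE', 'Mark', 'IV'])

# Explicit index with a dict: values are matched to the labels by key.
reordered = pd.Series(values, index=['IV', 'DTE', 'Mark'])
```

With the tuple form, the values must be listed in the same order as the labels, since they are paired by position.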
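However, if you use
return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])
then the order of the Series index is fixed. Thus df.apply(newtonRap, axis=1) has a fixed column order, and
df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
will work as desired.

unutbu, amazing answer. Your solution works perfectly. So much to learn. Thank you.
Thank you very much for your answer.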