Python: mapping to the same DataFrame column in pandas
I have a DataFrame like the one below:
Token,Time,Path,Duration,Response
1, 142830,NaN , IOC, NEW
1,142832,0,NaN,NEW_CONFIRM
1,142836,1234,NaN,TRADED
2, 142830,NaN , IOC, NEW
2,142832,0,NaN,NEW_CONFIRM
2,142836,1234,NaN,NOT_TRADED
3, 142830,NaN , GTC, NEW
3,142832,0,NaN,NEW_CONFIRM
3,142836,1234,NaN,NOT_TRADED
My goal is to get all tokens whose Duration is IOC:
orders = df.loc[df.Duration == 'IOC', 'Token'].unique()
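To make the filter above reproducible, here is a minimal sketch that loads the sample rows from a string. The regex separator `r'\s*,\s*'` is my own assumption, added only because the sample has stray spaces around some values (`' IOC'`, `'NaN '`); with those spaces left in place the comparison `df.Duration == 'IOC'` would not match.

```python
import io
import pandas as pd

# Sample rows copied from the question, stray spaces included.
raw = """Token,Time,Path,Duration,Response
1, 142830,NaN , IOC, NEW
1,142832,0,NaN,NEW_CONFIRM
1,142836,1234,NaN,TRADED
2, 142830,NaN , IOC, NEW
2,142832,0,NaN,NEW_CONFIRM
2,142836,1234,NaN,NOT_TRADED
3, 142830,NaN , GTC, NEW
3,142832,0,NaN,NEW_CONFIRM
3,142836,1234,NaN,NOT_TRADED
"""

# A regex separator swallows the whitespace around each comma,
# so 'NaN ' is read as a missing value and ' IOC' as 'IOC'.
df = pd.read_csv(io.StringIO(raw), sep=r'\s*,\s*', engine='python')

# Tokens that have at least one row with Duration == 'IOC'.
orders = df.loc[df.Duration == 'IOC', 'Token'].unique()
print(orders)
```

Token 3 is left out because its Duration is GTC.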
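Then, for each of those tokens, get the Path value from the row whose Response is the confirm (this is the tricky part),
and return something like this: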
Token,Time,Path,Duration,Response
1, 142830,0,IOC,TRADED
2, 142830,0,IOC,NOT_TRADED
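Every token always has exactly one confirm row (if it does not, ignore that token). Every token ends up either traded or not traded. In the end I want to count how many tokens were traded and not traded on path 0, and likewise for path 1, path 2, and so on. The Path value is only usable on the row whose response type is the confirm; any other Path value may be wrong (1234 is a garbage value).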
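The final grouping step described above could look roughly like this. It is only a sketch: the `result` frame is typed in by hand from the expected output, and the name `counts` is mine.

```python
import pandas as pd

# One row per IOC token, as in the expected output of the question.
result = pd.DataFrame({
    'Token': [1, 2],
    'Time': [142830, 142830],
    'Path': [0, 0],
    'Duration': ['IOC', 'IOC'],
    'Response': ['TRADED', 'NOT_TRADED'],
})

# Count tokens per (Path, Response) pair, then pivot Response into columns.
counts = result.groupby(['Path', 'Response']).size().unstack(fill_value=0)
print(counts)
```

Each row of `counts` is one path, with one column per final response, so paths 1, 2 and so on would simply show up as further rows.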
Adding a new example:
>>> df
OrderID TimeStamp ErrorCode Duration ResponseType \
0 3000000 1488948188555841641 NaN IOC NaN
1 3000000 1488948188556444675 0 NaN NEW_ORDER_CONFIRM
2 3000000 1488948188556448153 2 NaN TRADE_CONFIRM
3 3000001 1488948658787676012 NaN IOC NaN
4 3000001 1488948658787811582 1 NaN NEW_ORDER_CONFIRM
5 3000001 1488948658787824862 2 NaN TRADE_CONFIRM
6 3000002 1488949064945887091 NaN IOC NaN
7 3000003 1488949109654115659 NaN IOC NaN
8 3000003 1488949109654294973 1 NaN NEW_ORDER_CONFIRM
9 3000003 1488949109654299930 16388 NaN CANCEL_ORDER_CONFIRM
The function I wrote based on @jezrael's solution:
>>> def f(x):
...     #print (x)
...     #check if NEW_CONFIRM and IOC in group
...     if ((x.ResponseType == 'NEW_ORDER_CONFIRM').any() and (x.Duration == 'IOC').any()):
...         #filter data - output scalar
...         a = x.loc[x.Duration == 'IOC', ['TimeStamp','Duration']]
...         print(a)
...         a1 = str(a['TimeStamp'].item())
...         a2 = a['Duration'].item()
...         b = x.loc[x.Response == 'NEW_ORDER_CONFIRM', 'ErrorCode'].item()
...         c = x.loc[x.Response.str.isin(['TRADE_CONFIRM', 'CANCEL_ORDER_CONFIRM']), 'ResponseType'].item()
...         #return series with index for align data
...         return pd.Series([a1, a2, b, c], index=df.columns[1:])
...
>>> df2 = df.groupby('ErrorCode').apply(f).dropna(how='all').reset_index()
>>> df2
Empty DataFrame
Columns: [index]
Index: []
Expected output:
OrderID, TimeStamp,ErrorCode,Duration,ResponseType
3000000,1488948188555841641,0,IOC,TRADE_CONFIRM
3000001,1488948658787676012,1,IOC,TRADE_CONFIRM
3000003,1488949109654115659,1,IOC,CANCEL_ORDER_CONFIRM
I think you need:
import pandas as pd

def f(x):
    # check that the group contains both NEW_CONFIRM and IOC
    if ((x.Response == 'NEW_CONFIRM').any() and (x.Duration == 'IOC').any()):
        # filter data - output scalars
        a = x.loc[x.Duration == 'IOC', ['Time','Duration']]
        a1 = str(a['Time'].item())
        a2 = a['Duration'].item()
        b = x.loc[x.Response == 'NEW_CONFIRM', 'Path'].item()
        c = x.loc[x.Response.str.contains('TRADED'), 'Response'].item()
        # return a Series whose values follow the column order Time, Path, Duration, Response
        return pd.Series([a1, b, a2, c], index=df.columns[1:])

# apply the function and remove NaN rows
df = df.groupby('Token').apply(f).dropna(how='all').reset_index()
print (df)
   Token    Time Path Duration    Response
0      1  142830    0      IOC      TRADED
1      2  142830    0      IOC  NOT_TRADED
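The same table can probably also be assembled without `groupby.apply`, by selecting the three kinds of rows separately and joining them on Token. This is an alternative sketch, not the approach from the answer; the names `ioc`, `conf`, `final` and `out` are mine, and the sample frame is rebuilt by hand with the stray spaces removed.

```python
import numpy as np
import pandas as pd

# Sample data from the question, cleaned of stray spaces.
df = pd.DataFrame({
    'Token':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Time':     [142830, 142832, 142836] * 3,
    'Path':     [np.nan, 0, 1234] * 3,
    'Duration': ['IOC', np.nan, np.nan, 'IOC', np.nan, np.nan, 'GTC', np.nan, np.nan],
    'Response': ['NEW', 'NEW_CONFIRM', 'TRADED',
                 'NEW', 'NEW_CONFIRM', 'NOT_TRADED',
                 'NEW', 'NEW_CONFIRM', 'NOT_TRADED'],
})

# Row that carries the IOC duration and the order time.
ioc = df.loc[df.Duration == 'IOC', ['Token', 'Time', 'Duration']]
# Row that carries the trustworthy Path value.
conf = df.loc[df.Response == 'NEW_CONFIRM', ['Token', 'Path']]
# Row that carries the final outcome.
final = df.loc[df.Response.isin(['TRADED', 'NOT_TRADED']), ['Token', 'Response']]

# Inner joins drop tokens that lack any of the three rows.
out = ioc.merge(conf, on='Token').merge(final, on='Token')
out = out[['Token', 'Time', 'Path', 'Duration', 'Response']]
print(out)
```

One difference in behavior: if a token had several matching rows, the merges would produce several output rows for it, whereas `.item()` raises an error.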
EDIT:
There were only some typos in the column names:
def f(x):
    # check that the group contains both NEW_ORDER_CONFIRM and IOC
    if ((x.ResponseType == 'NEW_ORDER_CONFIRM').any() and (x.Duration == 'IOC').any()):
        # filter data - output scalars
        a = x.loc[x.Duration == 'IOC', ['TimeStamp','Duration']]
        a1 = str(a['TimeStamp'].item())
        a2 = a['Duration'].item()
        b = x.loc[x.ResponseType == 'NEW_ORDER_CONFIRM', 'ErrorCode'].item()
        c = x.loc[x.ResponseType.isin(['TRADE_CONFIRM', 'CANCEL_ORDER_CONFIRM']), 'ResponseType'].item()
        # return a Series whose values follow the column order TimeStamp, ErrorCode, Duration, ResponseType
        return pd.Series([a1, b, a2, c], index=df.columns[1:])

# apply the function and remove NaN rows
df = df.groupby('OrderID').apply(f).dropna(how='all').reset_index()
print (df)
   OrderID            TimeStamp ErrorCode Duration          ResponseType
0  3000000  1488948188555841641       0.0      IOC         TRADE_CONFIRM
1  3000001  1488948658787676012       1.0      IOC         TRADE_CONFIRM
2  3000003  1488949109654115659       1.0      IOC  CANCEL_ORDER_CONFIRM
EDIT:
To test the length of each selection in your data, use:
def f(x):
    # check that the group contains both NEW_ORDER_CONFIRM and IOC
    if ((x.ResponseType == 'NEW_ORDER_CONFIRM').any() and (x.Duration == 'IOC').any()):
        # filter data
        a = x.loc[x.Duration == 'IOC', ['TimeStamp','Duration']]
        b = x.loc[x.ResponseType == 'NEW_ORDER_CONFIRM', 'ErrorCode']
        c = x.loc[x.ResponseType.isin(['TRADE_CONFIRM', 'CANCEL_ORDER_CONFIRM']), 'ResponseType']
        # print every selection with more than one row, before .item() fails on it
        if len(a) > 1:
            print (a['TimeStamp'])
        if len(b) > 1:
            print (b)
        if len(c) > 1:
            print (c)
        # output scalars
        a1 = str(a['TimeStamp'].item())
        a2 = a['Duration'].item()
        # return a Series whose values follow the column order TimeStamp, ErrorCode, Duration, ResponseType
        return pd.Series([a1, b.item(), a2, c.item()], index=df.columns[1:])

# apply the function and remove NaN rows
df = df.groupby('OrderID').apply(f).dropna(how='all').reset_index()
print (df)
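The lengths are worth checking because `Series.item()` raises a `ValueError` unless the Series holds exactly one element. If skipping such groups is acceptable, a small helper can return NaN instead of raising. This is a sketch only, and `single_or_nan` is a hypothetical name of mine, not a pandas function.

```python
import numpy as np
import pandas as pd

def single_or_nan(s):
    """Return the only element of s, or NaN when s does not hold exactly one."""
    return s.iloc[0] if len(s) == 1 else np.nan

one = pd.Series(['TRADE_CONFIRM'])
two = pd.Series(['TRADE_CONFIRM', 'TRADE_CONFIRM'])

print(single_or_nan(one))   # the single value
print(single_or_nan(two))   # nan, because the selection is ambiguous

# .item() on the two-element Series raises instead of returning a value.
try:
    two.item()
    raised = False
except ValueError:
    raised = True
print(raised)
```

Groups for which the helper returns NaN in every position would then be removed by the existing `dropna(how='all')` call.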
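If we have "yes" and "no" instead of TRADED and NOT_TRADED, do we need to add one more variable?
I am writing you an email, give me a moment. But it seems you need to change
x.Response.str.contains('TRADED')
to
x.Response.isin(['yes','no'])
I am only on my phone right now, and I noticed there is a problem with the output. The Path and Duration column names have also changed (groups have to be created by Path, and the sample data has only one path, 0), so would it be possible to create new sample data with the desired output?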
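To illustrate the difference between the two selectors mentioned here: `str.contains` does a case-sensitive substring match, while `isin` compares whole values. The Series below are made-up examples, not data from the question.

```python
import pandas as pd

resp = pd.Series(['NEW', 'NEW_CONFIRM', 'TRADED', 'NOT_TRADED'])

# Substring match: hits both TRADED and NOT_TRADED.
by_substring = resp.str.contains('TRADED')

# The match is case-sensitive by default, so lowercase finds nothing here.
by_lowercase = resp.str.contains('traded')

# Exact-value match for a yes/no style outcome column.
yes_no = pd.Series(['NEW', 'NEW_CONFIRM', 'yes'])
by_value = yes_no.isin(['yes', 'no'])

print(by_substring.tolist())
print(by_lowercase.tolist())
print(by_value.tolist())
```

With outcome values that share no common substring, `isin` is the simpler choice, and no extra variable is needed in the function.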