Python "Lengths must match to compare" (selecting on two criteria)
I'm generating values for each user like so:
loDf = locDfs[user] # locDfs is a copy of locationDf elsewhere in the code... sorry for all the variable names.
loDf.reset_index(inplace=True)
loDf = pd.crosstab([loDf.date, loDf.uid], loDf.location_id)
loDf.reset_index(inplace=True)
loDf.set_index('date', inplace=True)
loDf.drop('uid', axis=1, inplace=True)
# join the location crosstab columns with the app crosstab columns per user
userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
# convert from just "1" at each location change event followed by zeros, to "1" continuing until next location change
userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
userLocAppDfs[user]['uid'].fillna(user, inplace=True)
This takes the location data, converts the location ids into columns, and combines them with the other data in the timeseries.
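The pivot step above can be sketched with a toy frame (the dates, uid, and location_id values below are made up for illustration):

```python
import pandas as pd

# toy location events: one row per (timestamp, user, location) observation
loDf = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:15", "2020-01-01 00:30"]),
    "uid": ["u1", "u1", "u1"],
    "location_id": ["home", "work", "home"],
})

# crosstab pivots the location_id values into columns, counting occurrences,
# so each row gets a 1 in the column of the location seen at that time
ct = pd.crosstab([loDf.date, loDf.uid], loDf.location_id)
print(ct)
```

Note that the resulting columns are exactly the distinct location_id strings that survived into loDf, which is what the later membership test iterates over.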
This basically amounts to reshaping the data. Then I need to normalize, and to do that I need to look at the values per column:
for loc in locationDf[(locationDf['location_id'] in loDf.columns.values) & (locationDf['uid'] == user)].location_id.unique():
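The failing piece is the `in` test itself. A minimal sketch (with made-up values) reproduces the error and shows the isin form that works instead:

```python
import numpy as np
import pandas as pd

locations = pd.Series(["a", "b", "c", "a"])  # stands in for locationDf['location_id']
cols = np.array(["a", "b"])                  # stands in for loDf.columns.values

# `series in array` makes numpy evaluate `(cols == locations).any()`; that
# elementwise comparison requires equal lengths, so pandas raises ValueError
try:
    locations in cols
except ValueError as err:
    print(err)  # message includes "Lengths must match to compare"

# isin does elementwise membership and accepts operands of any lengths
print(locations.isin(cols).tolist())  # [True, True, False, True]
```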
Here is the full function:
def normalize(inputMetricDf, inputLocationDf):
    '''
    normalize, resample, and combine data into a single data source
    '''
    metricDf = inputMetricDf.copy()
    locationDf = inputLocationDf.copy()

    appDf = metricDf[['date', 'uid', 'app_id', 'metric']].copy()
    locDf = locationDf[['date', 'uid', 'location_id']]
    locDf.set_index('date', inplace=True)

    # convert location data to "15 minute interval" rows
    locDfs = {}
    for user, user_loc_dc in locDf.groupby('uid'):
        locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

    aDf = appDf.copy()
    aDf.set_index('date', inplace=True)

    userLocAppDfs = {}
    user = ''
    for uid, a2_df in aDf.groupby('uid'):
        user = uid
        # per user, convert app data to 15m interval
        userDf = a2_df.resample('15T').agg('max')

        # assign metric for each app to an app column for each app, per user
        userDf.reset_index(inplace=True)
        userDf = pd.crosstab(index=userDf['date'], columns=userDf['app_id'],
                             values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')
        userDf['uid'] = user
        userDf.reset_index(inplace=True)
        userDf.set_index('date', inplace=True)

        # reapply 15m intervals now that we have new data per app
        userLocAppDfs[user] = userDf.resample('15T').agg('max')

        # assign location data to location columns per location; creates a "1" at the
        # 15m interval of the location change event in the location column created
        loDf = locDfs[user]
        loDf.reset_index(inplace=True)
        loDf = pd.crosstab([loDf.date, loDf.uid], loDf.location_id)
        loDf.reset_index(inplace=True)
        loDf.set_index('date', inplace=True)
        loDf.drop('uid', axis=1, inplace=True)

        # join the location crosstab columns with the app crosstab columns per user
        userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
        # convert from just "1" at each location change event followed by zeros,
        # to "1" continuing until next location change
        userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
        userLocAppDfs[user]['uid'].fillna(user, inplace=True)

        for loc in locationDf[(locationDf['location_id'] in loDf.columns.values) & (locationDf['uid'] == user)].location_id.unique():
            # fill location NaNs
            userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)

        # fill app NaNs
        for app in a2_df['app_id'].unique():
            userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)
            userLocAppDfs[user][app].fillna(value=0, inplace=True)

        df = userLocAppDfs[user].copy()

        # ensure actual normality
        alpha = 0.05
        for app in aDf['app_id'].unique():
            _, p = normaltest(userLocAppDfs[user][app])
            if p > alpha:
                raise DataNotNormal(args=(user, app))
        # for loc in userLocAppDfs[user]:
            # could also test location data

    return df
But this produces an error:
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 346, in run_http_function
result = _function_handler.invoke_user_function(flask.request)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 223, in invoke_user_function
loop.run_until_complete(future)
File "/opt/python3.7/lib/python3.7/asyncio/base_events.py", line 573, in run_until_complete
return future.result()
File "/user_code/main.py", line 31, in default_model
train, endog, exog, _, _, rawDf = preprocess(ledger, apps)
File "/user_code/Wrangling.py", line 67, in preprocess
rawDf = normalize(appDf, locDf)
File "/user_code/Wrangling.py", line 185, in normalize
for loc in locationDf[(locationDf['location_id'] in loDf.columns.values) & (locationDf['uid'] == user)].location_id.unique():
File "/env/local/lib/python3.7/site-packages/pandas/core/ops.py", line 1745, in wrapper
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare
Before I noticed that locations in locationsDf could be lost due to the reshaping, I was just doing:
for loc in locationDf[locationDf['uid'] == user].location_id.unique():
That worked in every other case. But if two locations fall in the same 15T window, and one of them appears only there and gets dropped because of the window, it gives me an error. So I needed the extra condition.
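The dropping behavior described above can be sketched as follows (with made-up event data): when two location events land in the same 15-minute bin, resample-with-max keeps only one of them, so a location that appears nowhere else never becomes a crosstab column.

```python
import pandas as pd

# two location events for one user inside the same 15-minute window
ev = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01 00:01", "2020-01-01 00:05"]),
    "uid": ["u1", "u1"],
    "location_id": ["work", "cafe"],
}).set_index("date")

# max over strings keeps only the lexicographically larger value per bin,
# so "cafe" disappears from the resampled data entirely
binned = ev.resample("15T").agg("max")
print(binned["location_id"].tolist())  # ['work']
```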
locationDf['location_id'] is just a string, the same as the crosstab column names.

Why does this give me the error?
Answer:

Change your condition to use isin:
更新
con1 = locationDf['location_id'].isin(loDf.columns.values)
con2 = locationDf['uid'].isin(pd.Series(user))
locationDf.loc[con1 & con2, 'location_id'].unique()
Comments:

I'm pushing this into my app now and I expect it will work.. but why do both conditions have to use isin, when one of them is a scalar? (robertotomás)

@robertotomás I'm not sure whether user is of type list or string.

It's a list; I wrapped it in brackets, as shown at the bottom of the traceback, so you need the extra protection here: (locationDf['uid'].isin(pd.Series(user))) (robertotomás)

@robertotomás:

locationDf.loc[(locationDf['location_id'].isin(loDf.columns.values))
               & (locationDf['uid'].isin(user)), 'location_id'].unique()
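On the scalar-vs-list point raised in the comments: Series.isin only accepts a list-like argument, so a bare string must be wrapped; plain equality also works for a true scalar. A small sketch with made-up uid values:

```python
import pandas as pd

uids = pd.Series(["u1", "u2", "u1"])  # stands in for locationDf['uid']

# a bare string is rejected: isin only accepts list-like objects
try:
    uids.isin("u1")
except TypeError as err:
    print(err)

# wrapping the scalar works, whether as a list or a Series
print(uids.isin(["u1"]).tolist())             # [True, False, True]
print(uids.isin(pd.Series(["u1"])).tolist())  # [True, False, True]

# and for a genuine scalar, plain equality is also fine
print((uids == "u1").tolist())                # [True, False, True]
```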