Python pandas: preparing data for negative sampling
I am building a recommender system, and this question is about preparing the training data for it. Take Netflix as an example: a user has interacted with a large number of movies on Netflix; when we (Netflix) recommend movie X, will they be interested? I have a list of each user's interaction history with items. The interaction types (rating types) include "view", "share", "bookmark", and so on. Each record has the following fields:
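user_id, item_id, rating_type, timestamp
From the data above, I am creating training data that looks like this:
user_id, prior_item_id, item_id, target
In other words: given that user_id was exposed to prior_item_id, will they like item_id when we recommend it? (target=1 if yes, otherwise target=0)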
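To make the target format concrete, here is a minimal sketch that attaches to each interaction the items the same user saw before it. The column names follow the description above; the toy rows, the helper name `add_prior_items`, and the window size of 3 are assumptions made up for illustration only.

```python
import pandas as pd


def add_prior_items(log, num_prior=3):
    """Attach to each row the (up to) num_prior items the same user saw earlier."""
    # Sort so that each user's rows are contiguous and in chronological order.
    log = log.sort_values(["user_id", "timestamp"]).reset_index(drop=True)
    priors = []
    for _, items in log.groupby("user_id", sort=False)["item_id"]:
        seen = items.tolist()
        for pos in range(len(seen)):
            # Window of items strictly before the current position.
            priors.append(seen[max(0, pos - num_prior):pos])
    log["prior_item_ids"] = pd.Series(priors, index=log.index)
    return log


# Hypothetical interaction log, not real data.
toy_log = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u1"],
    "item_id": ["a", "b", "c", "a", "d"],
    "rating_type": ["view", "view", "share", "view", "bookmark"],
    "timestamp": [1, 2, 3, 4, 5],
})
with_priors = add_prior_items(toy_log)
```

Each row of `with_priors` then carries everything needed for one training example except the target label.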
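I create the data as described below; the code is given further down.
It takes a very long time to run, and I would like to know whether there is a better strategy, or a better way to implement my strategy.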
For each positive rating:
* I create one positive training example by looking up the ratings that came before it:
    * user_id, item_id (of the positive rating), prior_ids, target=1
* I also create 4 negative training examples:
    * I randomly select 4 negative ratings that happened before the positive rating.
    * I make sure each one is truly negative by checking that the user did not give the item a positive rating (share/bookmark) afterwards, i.e. the item is not among the next 10 positive ratings.
    * For each negative rating, I find its prior ratings.
    * This gives 4 rows of the following form:
        * user_id, item_id (of the negative rating), prior_ids, target=0
If a user has no positive rating at all, I build a single negative training example.
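The steps above can be sketched on plain Python lists for a single user, which makes the logic easier to test in isolation. This is only an illustration under stated assumptions: the function name `build_samples` and its parameters are hypothetical, rating types 20 and 90 are treated as negative (as in the implementation further down), negatives are drawn without replacement, and for a user with no positive rating the most recent interaction is used as the single negative example.

```python
import random

NEGATIVE_TYPES = {20, 90}  # assumption: these rating types count as negative


def build_samples(user_id, history, num_prior=10, num_negative=4,
                  lookahead=10, rng=random):
    """history: chronological list of (item_id, rating_type) for one user."""

    def priors(pos):
        # Items seen in the num_prior interactions strictly before pos.
        return [item for item, _ in history[max(0, pos - num_prior):pos]]

    samples = []
    positive_positions = [i for i, (_, rating) in enumerate(history)
                          if rating not in NEGATIVE_TYPES]
    for rank, pos in enumerate(positive_positions):
        samples.append((user_id, history[pos][0], priors(pos), 1))
        # Items rated positively in the next `lookahead` positive ratings
        # (the current one included) must not be used as negatives.
        upcoming = {history[p][0]
                    for p in positive_positions[rank:rank + lookahead]}
        candidates = [i for i in range(pos)
                      if history[i][1] in NEGATIVE_TYPES
                      and history[i][0] not in upcoming]
        for neg in rng.sample(candidates, min(num_negative, len(candidates))):
            samples.append((user_id, history[neg][0], priors(neg), 0))
    if not positive_positions and history:
        # No positive rating at all: emit one negative example.
        last = len(history) - 1
        samples.append((user_id, history[last][0], priors(last), 0))
    return samples


example = build_samples("u1", [("a", 20), ("b", 90), ("c", 30)])
```

Working on lists rather than DataFrame rows avoids per-row pandas overhead, which may matter for run time, although that has not been measured here.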
Here is my implementation; it takes a very long time to run:
import random

import pandas as pd


class Ranking(object):

    def __init__(self):
        # Number of prior items kept as context for each training row.
        self.num_prior = 10

    def prepare_rating_data(self, file_path):
        self.data = pd.read_csv(
            file_path, dtype={'review_meta_id': object, 'user_id': object}
        ).sort_values('timestamp')
        df = self.data
        df.dropna(subset=['review_meta_id', 'user_id'], inplace=True)

        num_prior = self.num_prior
        results = []

        # Oldest interaction first, so that rows before `index` are the ones
        # that happened earlier (the "prior" ratings described above).
        for user_id, group in df.sort_values(
                ['user_id', 'timestamp'], ascending=[True, True]
        ).groupby('user_id'):
            group = group.reset_index()
            positive = None

            for index, row in group.iterrows():
                # Rating types 20 and 90 are negative; anything else is positive.
                if row.rating_type not in [20, 90]:
                    positive = row
                    low = max(0, index - num_prior)
                    priors = group.drop_duplicates(subset=['user_id', 'review_meta_id'])[low:index]
                    result_positive_dict = {
                        'user_id': user_id,
                        'review_meta_id': positive.review_meta_id,
                        'prior_ids': ','.join(priors.review_meta_id),
                        'target': 1
                    }
                    results.append(result_positive_dict)

                    # The next 10 positive ratings, starting from the current one.
                    positives = group[(group.index >= index) & (~group.rating_type.isin([20, 90]))][:10]

                    num_negative = 4
                    for i in range(num_negative):
                        # Pick a random row at or before the positive rating.
                        index_sample = random.sample(range(index + 1), 1)[0]
                        sample = group.iloc[index_sample]
                        low = max(0, index_sample - num_prior)

                        # Resample (up to try_count times) while the pick is
                        # positive or is rated positively later on.
                        try_count = 5
                        for _ in range(try_count):
                            # `.values` is needed: `in` on a Series tests the index.
                            if sample.rating_type not in [20, 90] or sample.review_meta_id in positives.review_meta_id.values:
                                index_sample = random.sample(range(index + 1), 1)[0]
                                sample = group.iloc[index_sample]
                                low = max(0, index_sample - num_prior)

                        priors = group.drop_duplicates(subset=['user_id', 'review_meta_id'])[low:index_sample]
                        negative = sample
                        result_negative_dict = {
                            'user_id': user_id,
                            'review_meta_id': negative.review_meta_id,
                            'prior_ids': ','.join(priors.review_meta_id),
                            'target': 0
                        }
                        results.append(result_negative_dict)

            # The user has no positive rating: build one negative example.
            if positive is None:
                group = group.drop_duplicates(
                    subset=['user_id', 'review_meta_id'])
                n = min(len(group), num_prior + 1)
                group = group.sample(n)
                result_negative_dict = {
                    'user_id': user_id,
                    'review_meta_id': group.tail(1)['review_meta_id'].iloc[0],
                    'prior_ids': ','.join(group.review_meta_id[:-1]),
                    'target': 0
                }
                results.append(result_negative_dict)

        df_result = pd.DataFrame(results, columns=['review_meta_id', 'prior_ids', 'target', 'user_id'])
        df_result = self.apply_prior_ids_pad(df_result)
        return df_result

    def apply_prior_ids_pad(self, df):
        def pad(x):
            x = x.strip()
            # ''.split(',') returns [''], so drop empty strings explicitly.
            result = [p for p in x.split(',') if p]
            # Right-pad with '0' up to num_prior entries.
            result = result + ['0'] * (self.num_prior - len(result))
            return result

        df['prior_ids'] = df['prior_ids'].apply(pad)
        return df
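The padding step can be checked on its own. The standalone helper below is a sketch with a hypothetical name (`pad_prior_ids`), not part of the class above. One detail worth noting is that `''.split(',')` returns `['']` rather than an empty list, so an `or []` fallback never triggers and empty strings have to be filtered out explicitly.

```python
def pad_prior_ids(prior_ids, num_prior=10, pad_value="0"):
    """Split a comma-separated id string and right-pad it to num_prior entries."""
    # Filter out empty strings, since "".split(",") yields [""].
    ids = [p for p in prior_ids.strip().split(",") if p]
    # A negative repeat count yields an empty list, so longer inputs are left as-is.
    return ids + [pad_value] * (num_prior - len(ids))


padded = pad_prior_ids("a,b", num_prior=4)   # ['a', 'b', '0', '0']
empty = pad_prior_ids("", num_prior=3)       # ['0', '0', '0']
```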
I have put the code and data in a git repo.