Python pandas: preparing data for negative sampling


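I am building a recommender system, and this question is about preparing the training data for it.

Take Netflix as an example: a user has been exposed to many movies on Netflix. When we (Netflix) recommend movie X, will he be interested?

I have a list of each user's interaction history with items. Interaction types (rating types) include "view", "share", "bookmark", and so on.

user_id, item_id, rating_type, timestamp

From the data above, I am creating training data that looks like this: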

user_id, prior_item_id, item_id, target

Given that user_id was exposed to prior_item_id, when we recommend item_id, will he like it? (target=1 if so, else target=0)
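As a minimal sketch of the labeling step, the snippet below sorts interactions chronologically per user and flags each row as positive or negative. It assumes the column names used in my implementation further down (`review_meta_id` for the item id) and that rating types 20 and 90 are the negative ones; the toy rows, the positive code 10, and the helper name `label_interactions` are made up for illustration.

```python
import pandas as pd

# Assumption taken from the code comment "20, 90 = negative"
NEGATIVE_TYPES = [20, 90]

# Hypothetical toy interactions; rating_type 10 is a made-up positive code
interactions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "review_meta_id": ["a", "b", "c", "a"],
    "rating_type": [20, 10, 90, 20],
    "timestamp": [3, 1, 2, 5],
})


def label_interactions(df):
    """Sort oldest-first within each user and add an is_positive flag."""
    out = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)
    out["is_positive"] = ~out["rating_type"].isin(NEGATIVE_TYPES)
    return out


labeled = label_interactions(interactions)
print(labeled)
```

Computing the flag once for the whole frame avoids re-checking `rating_type` inside a per-row loop.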

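I am building the data as described below, and the code is included as well. It takes a very long time to run, so I would like to know whether there is a better strategy, or a better way to implement my strategy.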

For each positive rating:

  I make one positive training example
    by finding the prior ratings.
    * user_id, item_id (of the positive rating), prior_ids, target=1

  I make 4 negative training examples as well.
    I randomly select 4 negative ratings that happened before the positive rating.
    I make sure each one is truly negative by ensuring the user didn't give it a positive rating (share/bookmark) afterwards (the item is not among the next 10 positive ratings).
    For each negative rating, I find its prior ratings.

    That gives 4 of the following:
    * user_id, item_id (of the negative rating), prior_ids, target=0

  If the user has no positive rating, I build one negative training example.
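To make the strategy above concrete independently of pandas, here is a rough sketch for a single user, working on plain lists in chronological order. The function name `build_examples` and its parameters are hypothetical. Two choices are assumptions rather than part of the description: negatives are drawn without replacement, and for a user with no positive rating the most recent interaction is used as the single negative.

```python
import random


def build_examples(items, is_positive, num_prior=10, num_negative=4,
                   lookahead=10, rng=None):
    """items / is_positive: one user's interactions, oldest first.

    Returns a list of (item_id, prior_ids, target) tuples.
    """
    rng = rng or random.Random()
    pos_idx = [i for i, p in enumerate(is_positive) if p]
    examples = []

    def priors(i):
        # Up to num_prior distinct items seen strictly before position i,
        # returned oldest first.
        seen = []
        for it in reversed(items[:i]):
            if it not in seen:
                seen.append(it)
            if len(seen) == num_prior:
                break
        return seen[::-1]

    for k, i in enumerate(pos_idx):
        examples.append((items[i], priors(i), 1))
        # Items in the next `lookahead` positive ratings, starting at this one
        future_pos = {items[j] for j in pos_idx[k:k + lookahead]}
        # Earlier negative ratings whose item is not rated positively later
        candidates = [j for j in range(i)
                      if not is_positive[j] and items[j] not in future_pos]
        for j in rng.sample(candidates, min(num_negative, len(candidates))):
            examples.append((items[j], priors(j), 0))

    if not pos_idx and items:
        # Assumption: use the latest interaction as the single negative
        j = len(items) - 1
        examples.append((items[j], priors(j), 0))
    return examples


demo = build_examples(["a", "b", "c", "d"], [False, False, True, False])
no_positive = build_examples(["x", "y"], [False, False])
```

Because the per-user work happens on plain Python lists, nothing like `drop_duplicates` or `iloc` has to be called inside the inner loop.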
Here is my implementation, which takes a long time to run:

import random

import pandas as pd


class Ranking(object):
    def __init__(self):
        self.num_prior = 10

    def prepare_rating_data(self, file_path):

        self.data = pd.read_csv(file_path, dtype={'review_meta_id': object, 'user_id': object}).sort_values('timestamp')
        df = self.data
        df.dropna(subset=['review_meta_id', 'user_id'], inplace=True)

        num_prior = self.num_prior
        results = []
        # Sort oldest-first within each user so that rows before `index`
        # are earlier interactions
        for user_id, group in df.sort_values(
            ['user_id', 'timestamp'], ascending=[True, True]
        ).groupby('user_id'):
          group = group.reset_index()
          positive = None

          for index, row in group.iterrows():
              # Rating types 20 and 90 are negative; anything else is positive
              if row.rating_type not in [20, 90]:
                positive = row

                low = max(0, index - num_prior)
                priors = group.drop_duplicates(subset=['user_id', 'review_meta_id'])[low:index]

                result_positive_dict = {
                  'user_id': user_id,
                  'review_meta_id': positive.review_meta_id,
                  'prior_ids': ','.join(priors.review_meta_id),
                  'target': 1
                }
                results.append(result_positive_dict)
                # Up to 10 positive ratings from this row onward
                # (rating types 20 and 90 are negative)
                positives = group[(group.index>=index) & (~group.rating_type.isin([20, 90]))][:10]
                num_negative = 4

                for i in range(num_negative):
                  index_sample = random.sample(range(index+1), 1)[0]
                  sample = group.iloc[index_sample]

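                  # Resample up to try_count times if the draw is not a negative
                  # rating, or if its item is among the upcoming positives.
                  # `.values` is needed: `in` on a Series checks the index labels.
                  try_count = 5
                  for _ in range(try_count):
                      if sample.rating_type not in [20, 90] or sample.review_meta_id in positives.review_meta_id.values:
                        index_sample = random.sample(range(index+1), 1)[0]
                        sample = group.iloc[index_sample]

                  low = max(0, index_sample - num_prior)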
                  priors = group.drop_duplicates(subset=['user_id', 'review_meta_id'])[low:index_sample]

                  negative = sample
                  result_negative_dict = {
                    'user_id': user_id,
                    'review_meta_id': negative.review_meta_id,
                    'prior_ids': ','.join(priors.review_meta_id),
                    'target': 0
                  }

                  results.append(result_negative_dict)

          if positive is None:
            group = group.drop_duplicates(
              subset=['user_id', 'review_meta_id'])
            n = min(len(group), num_prior + 1)
            group = group.sample(n)

            result_negative_dict = {
                'user_id': user_id,
                'review_meta_id': group.tail(1)['review_meta_id'].iloc[0],
                'prior_ids': ','.join(group.review_meta_id[:-1]),
                'target': 0
              }

            results.append(result_negative_dict)

        df_result = pd.DataFrame(results, columns=['review_meta_id', 'prior_ids', 'target', 'user_id'])

        # Pad prior_ids to a fixed length (modifies df_result in place)
        return self.apply_prior_ids_pad(df_result)

    def apply_prior_ids_pad(self, df):
      def pad(x):
        x = x.strip()
        # ''.split(',') returns [''], so treat an empty string as no priors
        result = x.split(',') if x else []
        result = result + ['0'] * (self.num_prior - len(result))

        return result
      df['prior_ids'] = df['prior_ids'].apply(pad)

      return df
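The padding step can also be written as a small standalone function, which makes it easier to check in isolation. This is only a sketch: the name `pad_prior_ids` and the `pad_value` parameter are hypothetical, and truncating to the most recent `num_prior` ids is an extra assumption that `apply_prior_ids_pad` does not make.

```python
def pad_prior_ids(prior_ids, num_prior=10, pad_value="0"):
    """Turn 'a,b,c' into a list of exactly num_prior ids, right-padded."""
    # Drop empty pieces so that '' becomes [] rather than ['']
    ids = [p for p in prior_ids.strip().split(",") if p]
    # Assumption: keep only the last num_prior ids if there are more
    ids = ids[-num_prior:]
    return ids + [pad_value] * (num_prior - len(ids))


padded_empty = pad_prior_ids("", num_prior=3)
padded_short = pad_prior_ids("a,b", num_prior=3)
padded_long = pad_prior_ids("a,b,c,d", num_prior=3)
```

It could be applied with `df['prior_ids'].apply(pad_prior_ids)`, in the same way as the nested `pad` function.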
I have put the code and data in a git repo.