Python: classifying DataFrame rows based on probability (pandas)

I have two dataframes. The first relates to users and looks like this:

user_id    city_id
  0           a
  1           a
  2           b
  3           a
  4           c
.. and so on
The second gives, for each city, the fraction of it that belongs to each district, as follows:

 city_id     district_id    probability
    a             a1           0.01
    a             a2           0.02
    a             a3           0.02
    a             a4           0.56
    a             a5           0.39
    b             b1           0.63
    b             b2           0.07
    b             b3           0.30
 and so on.. 
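I need to assign users to districts according to the probabilities for the city they belong to. So, for example, roughly 56% of the users living in city a should come from district a4, and so on. Basically, the final df will have rows with user_id, city_id and district_id.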

My first idea was to give each user a random number and compare it against the probabilities.
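A minimal sketch of that first idea, assuming the random number is compared against the cumulative probabilities of the user's city. The sample data, the helper name `pick_by_threshold` and the column `u` are made up for illustration, and the probabilities of each city are assumed to sum to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the two tables in the question.
users = pd.DataFrame({"user_id": [0, 1, 2, 3, 4],
                      "city_id": ["a", "a", "b", "a", "b"]})
city_info = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})

def pick_by_threshold(city, u, city_info):
    """Return the first district whose cumulative probability exceeds u."""
    rows = city_info[city_info["city_id"] == city]
    cum = rows["probability"].cumsum().to_numpy()
    idx = np.searchsorted(cum, u, side="right")
    idx = min(idx, len(cum) - 1)  # guard against rounding when cum[-1] < 1
    return rows["district_id"].iloc[idx]

rng = np.random.default_rng(0)
users["u"] = rng.random(len(users))  # one uniform number in [0, 1) per user
users["district_id"] = [
    pick_by_threshold(city, u, city_info)
    for city, u in zip(users["city_id"], users["u"])
]
```

This gives each user an independent draw, so the share of users landing in a4 is about 56% only on average; small cities can deviate noticeably.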

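My second idea was to group the rows by city_id, look up the second table and select by probability (assigning a value to a third column). For city a, this would mean selecting 56% of the rows in the group and giving them the district value a4, and so on.
But I'm not sure whether that is the best approach mathematically.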
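For comparison, here is a rough sketch of that second idea, where each city's users are split into fixed quotas instead of being drawn independently. The helper names `allocate_counts` and `assign_exact` and the sample data are hypothetical. The rounding rule (largest remainder) is one possible choice, not something the question specifies, and the code assumes each city's probabilities sum to 1.

```python
import numpy as np
import pandas as pd

def allocate_counts(n, probs):
    """Split n items into integer counts proportional to probs."""
    raw = np.asarray(probs, dtype=float) * n
    counts = np.floor(raw).astype(int)
    remainder = n - counts.sum()
    # Give the leftover items to the largest fractional parts.
    order = np.argsort(-(raw - counts), kind="stable")
    counts[order[:remainder]] += 1
    return counts

def assign_exact(users, city_info, seed=None):
    """Assign districts so each city matches its quotas as closely as possible."""
    rng = np.random.default_rng(seed)
    out = users.copy()
    out["district_id"] = None
    for city, rows in city_info.groupby("city_id"):
        mask = out["city_id"] == city
        n = int(mask.sum())
        if n == 0:
            continue  # no users in this city
        counts = allocate_counts(n, rows["probability"].to_numpy())
        labels = np.repeat(rows["district_id"].to_numpy(), counts)
        # Shuffle so that which user gets which district is random.
        out.loc[mask, "district_id"] = rng.permutation(labels)
    return out

# Hypothetical sample data: 100 users in city a, 10 in city b.
users = pd.DataFrame({"user_id": range(110),
                      "city_id": ["a"] * 100 + ["b"] * 10})
city_info = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
result = assign_exact(users, city_info, seed=0)
```

Whether this or independent sampling is "better" depends on the goal: quotas reproduce the proportions as closely as integers allow, while independent draws treat the percentages as per-user probabilities.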

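I suggest the following:

import numpy as np

for city in city_info.city_id.unique():
    # probabilities of this city's districts, in table order
    probs = city_info.loc[city_info.city_id == city, 'probability'].values
    in_city = users.city_id == city
    n_citizens = in_city.sum()
    n_districts = len(probs)
    # draw a district number per user; add one because range is 0-based
    district = np.random.choice(range(n_districts), n_citizens, p=probs) + 1
    # build labels such as 'a4'; assumes districts are named city + number
    users.loc[in_city, 'district_id'] = [city + str(d) for d in district]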

If df1 and df2 are your two dataframes:

import numpy as np
def get_district(city):
    dlist = list(df2.loc[df2['city_id']==city, 'district_id']) # list of districts in this city
    p = list(df2.loc[df2['city_id']==city, 'probability']) # corresponding probabilities
    return np.random.choice(dlist, p=p) # weighted random choice from the list

and apply it with:

df1['district_id'] = df1.city_id.apply(get_district)

Following @JoeCondron's helpful comment, another approach is:

def get_city_district(city, df1, df2):
    in_city = df1.city_id == city  # boolean key, built once and reused
    n = in_city.sum()              # number of users in this city
    d = df2[df2['city_id'] == city]
    ds, p = list(d['district_id']), list(d['probability'])
    # draw all districts for this city in a single call
    df1.loc[in_city, 'district_id'] = np.random.choice(ds, size=n, p=p)
    return df1

def f(df1, df2):
    df1['district_id'] = None
    for i in set(df1.city_id):
        df1 = get_city_district(i, df1, df2)

    return df1

This was much faster when I tested it, but I only tested with a handful of cities.

This calls get_district for every row of the user dataframe, which slices df2 unnecessarily each time. We only need to get the weights for each unique city once. Also, you generate the same boolean key twice.
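One way to act on that comment, sketched under some assumptions: look up each city's districts and weights once, then draw all of that city's users in a single call. The function name `assign_districts`, the `seed` parameter and the sample data are hypothetical, and cities without known probabilities are simply left as None.

```python
import numpy as np
import pandas as pd

def assign_districts(users, city_info, seed=None):
    """Draw a district for every user, looking up each city's weights once."""
    rng = np.random.default_rng(seed)
    # Build the lookup once: city_id -> (district labels, probabilities).
    weights = {
        city: (rows["district_id"].to_numpy(), rows["probability"].to_numpy())
        for city, rows in city_info.groupby("city_id")
    }
    out = users.copy()
    out["district_id"] = None
    # groups maps each city to the index labels of its users.
    for city, idx in out.groupby("city_id").groups.items():
        if city not in weights:
            continue  # no probabilities known for this city; leave as None
        districts, p = weights[city]
        out.loc[idx, "district_id"] = rng.choice(districts, size=len(idx), p=p)
    return out

# Hypothetical sample data; city c has no entry in city_info.
users = pd.DataFrame({"user_id": [0, 1, 2, 3, 4],
                      "city_id": ["a", "a", "b", "a", "c"]})
city_info = pd.DataFrame({
    "city_id": ["a"] * 5 + ["b"] * 3,
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
out = assign_districts(users, city_info, seed=0)
```

The number of slicing operations here grows with the number of cities, not with the number of users, which is the point the comment makes.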