Python: classifying DataFrame rows based on probability (pandas)
I have two dataframes. The first relates to users and looks like this:
user_id city_id
0 a
1 a
2 b
3 a
4 c
.. and so on
The second gives the percentage of each city that belongs to each district, like this:
city_id district_id probability
a a1 0.01
a a2 0.02
a a3 0.02
a a4 0.56
a a5 0.39
b b1 0.63
b b2 0.07
b b3 0.30
and so on..
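I need to assign users to districts according to the probabilities for the city they belong to. So, for example, roughly 56% of the users living in city a should end up in district a4, and so on. Basically, the final df will have rows relating user_id, city_id and district_id.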
My first thought was to give each user a random number and compare it with the probabilities.
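One way to read this first idea is inverse-CDF sampling: draw a uniform number per user and pick the first district whose cumulative probability exceeds it. The sketch below is only an illustration under assumptions: the sample frames, the helper name `pick_district`, and the temporary column `u` are hypothetical, and city c is left out because its probabilities are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the tables in the question.
users = pd.DataFrame({"user_id": range(6), "city_id": list("aabaab")})
city_info = pd.DataFrame({
    "city_id": list("aaaaabbb"),
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
users["u"] = rng.random(len(users))  # one uniform number in [0, 1) per user


def pick_district(u, city):
    """Return the first district whose cumulative probability exceeds u."""
    rows = city_info[city_info.city_id == city]
    cum = rows.probability.cumsum().to_numpy()
    pos = np.searchsorted(cum, u, side="right")
    pos = min(pos, len(rows) - 1)  # guard against round-off in the last cumulative sum
    return rows.district_id.iloc[pos]


users["district_id"] = [
    pick_district(u, city) for u, city in zip(users.u, users.city_id)
]
```

For city a the cumulative probabilities are roughly 0.01, 0.03, 0.05, 0.61 and 1.0, so any random number between 0.05 and 0.61 maps to a4, which is where the 56% comes from.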
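My second idea was to group the rows by city_id, look up the second table and select by probability (assigning a value to a third column). For city a, this basically means I would select 56% of the rows in the group and give them the district value a4, and so on.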
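Taken literally, this second idea is a quota allocation rather than independent random draws: each district receives a fixed share of the city's rows, and only which rows it gets is random. A minimal sketch, assuming the probabilities of each city sum to 1; the helper `quota_assign` and the sample frames are hypothetical, and leftover rows are handed out by largest fractional remainder, which is one choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100 users in each of two cities.
users = pd.DataFrame({"user_id": range(200), "city_id": ["a"] * 100 + ["b"] * 100})
city_info = pd.DataFrame({
    "city_id": list("aaaaabbb"),
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})


def quota_assign(n, districts, probs, rng):
    """Label n rows so district shares match probs as closely as integers allow."""
    exact = np.asarray(probs, dtype=float) * n
    counts = np.floor(exact).astype(int)
    # Give the rows lost to rounding to the largest fractional remainders.
    leftover = n - counts.sum()
    order = np.argsort(exact - counts)[::-1]
    counts[order[:leftover]] += 1
    labels = np.repeat(np.asarray(districts, dtype=object), counts)
    rng.shuffle(labels)  # randomise which rows receive which district
    return labels


rng = np.random.default_rng(0)
users["district_id"] = None
# .groups maps each city to the index labels of its rows.
for city, idx in users.groupby("city_id").groups.items():
    rows = city_info[city_info.city_id == city]
    users.loc[idx, "district_id"] = quota_assign(
        len(idx), rows.district_id, rows.probability, rng
    )
```

With quotas the shares are exact up to rounding, whereas independent draws only match the probabilities on average. Which of the two is wanted depends on whether the table describes fixed proportions or per-user chances.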
But I'm not sure whether that is mathematically the best approach.
I would suggest the following:
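import numpy as np
import pandas as pd

for city in city_info.city_id.unique():
    probs = city_info[city_info.city_id == city]  # district rows for this city
    in_city = users.city_id == city  # boolean mask of the users in this city
    n_citizens = in_city.sum()
    n_districts = len(probs)
    # draw a district number per user; adding one as range is base 0
    district = np.random.choice(range(n_districts), n_citizens, p=probs.probability) + 1
    # build IDs such as 'a4'; index the Series like the selected rows so the assignment aligns
    users.loc[in_city, 'district_id'] = city + pd.Series(district, index=users.index[in_city]).astype(str)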
If df1 and df2 are your two dataframes:
import numpy as np

def get_district(city):
    dlist = list(df2.loc[df2['city_id'] == city, 'district_id'])  # get list of districts
    p = list(df2.loc[df2['city_id'] == city, 'probability'])  # get corresponding probabilities
    return np.random.choice(dlist, p=p)  # weighted random choice from the list
and apply it with:
df1['district_id'] = df1.city_id.apply(get_district)
Following @JoeCondron's helpful comment, another approach is:
def get_city_district(city, df1, df2):
    in_city = df1.city_id == city  # boolean key for this city's users, built once
    d = df2[df2['city_id'] == city]  # district rows for this city
    ds, p = list(d['district_id']), list(d['probability'])
    # draw the districts for all of this city's users in a single call
    df1.loc[in_city, 'district_id'] = np.random.choice(ds, size=in_city.sum(), p=p)
    return df1

def f(df1, df2):
    df1['district_id'] = None
    for i in set(df1.city_id):
        df1 = get_city_district(i, df1, df2)
    return df1
This was much faster when I tested it, though only with a handful of cities.
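A quick sanity check for any of these approaches is to compare the observed district shares with the probability table. The sketch below is an illustration only: it assumes a hypothetical df1 with 10,000 users, all in city a, and uses the probabilities from the question; the seed is there just to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 users in city "a" and the probabilities from the question.
df1 = pd.DataFrame({"user_id": range(10_000), "city_id": "a"})
df2 = pd.DataFrame({
    "city_id": "a",
    "district_id": ["a1", "a2", "a3", "a4", "a5"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39],
})

np.random.seed(0)  # repeatability only

# One vectorised draw covering every user in the city.
d = df2[df2.city_id == "a"]
df1["district_id"] = np.random.choice(
    d.district_id.to_numpy(), size=len(df1), p=d.probability.to_numpy()
)

# Put the expected and observed shares side by side, aligned on district_id.
observed = df1.district_id.value_counts(normalize=True)
expected = d.set_index("district_id").probability
comparison = pd.DataFrame({"expected": expected, "observed": observed}).fillna(0.0)
print(comparison)
```

With this many rows the observed shares should sit close to the expected ones. They will not match exactly, because every user is drawn independently.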
This will call get_district for every row of the dataframe, which in turn slices df2 unnecessarily each time. We only need to get the weights for each unique city once. Also, the same boolean key is generated twice.
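The point about fetching the weights only once per city could be sketched as follows. This is one possible arrangement, not the commenter's code: the function name `assign_districts`, the `rng` parameter and the sample frames are hypothetical, and cities without an entry in df2 are simply left unassigned, which is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd


def assign_districts(df1, df2, rng=None):
    """Return a copy of df1 with a district_id drawn per user (assumed helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Slice df2 exactly once per city by grouping it up front.
    weights = {
        city: (g.district_id.to_numpy(), g.probability.to_numpy())
        for city, g in df2.groupby("city_id")
    }
    out = df1.copy()
    out["district_id"] = None
    # .groups maps each city to its row labels, so no boolean key is rebuilt.
    for city, idx in out.groupby("city_id").groups.items():
        if city not in weights:
            continue  # no probabilities known for this city: leave it unassigned
        districts, p = weights[city]
        out.loc[idx, "district_id"] = rng.choice(districts, size=len(idx), p=p)
    return out


# Hypothetical sample data; city "c" has no rows in df2.
df1 = pd.DataFrame({"user_id": range(5), "city_id": ["a", "a", "b", "a", "c"]})
df2 = pd.DataFrame({
    "city_id": list("aaaaabbb"),
    "district_id": ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3"],
    "probability": [0.01, 0.02, 0.02, 0.56, 0.39, 0.63, 0.07, 0.30],
})
result = assign_districts(df1, df2, rng=np.random.default_rng(0))
```

Working on a copy keeps the caller's df1 untouched, and passing a seeded generator makes a run reproducible when that matters.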