Python-Pandas data processing: calculating the Gini coefficient

python python-3.x pandas

I am working with a dataset of the following shape:

tconst  GreaterEuropean British WestEuropean    Italian French  Jewish  Germanic    Nordic  Asian   GreaterEastAsian    Japanese    Hispanic    GreaterAfrican  Africans    EastAsian   Muslim  IndianSubContinent  total_ethnicities
0   tt0000001   3   1   2   0   1   0   0   1   0   0   0   0   0   0   0   0   0   8
1   tt0000002   2   0   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0   6
2   tt0000003   4   0   3   0   3   1   0   0   0   0   0   0   0   0   0   0   0   11
3   tt0000004   2   0   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0   6
4   tt0000005   3   2   1   0   0   0   1   0   0   0   0   0   0   0   0   0   0   7

This is IMDB data. After processing it, I created these columns, which give the number of actors of each ethnicity in a movie (tconst).

I want to create another column, df["diversity"], holding a diversity score (the "Gini index").

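For example: suppose a movie has 10 actors: 3 Asian, 3 British, 3 African American and 1 French. We divide each count by the total, which gives 3/10, 3/10, 3/10 and 1/10. The score is then 1 minus the sum of the squared shares, 1 - [(3/10)^2 + (3/10)^2 + (3/10)^2 + (1/10)^2], and that score should be stored for each movie in the diversity column.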
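The arithmetic in this example can be checked with a few lines of plain Python. This is only a minimal sketch; the `counts` dictionary and its keys are illustrative names, not columns of the dataset.

```python
# Hypothetical cast of 10 actors, matching the example above
counts = {"Asian": 3, "British": 3, "AfricanAmerican": 3, "French": 1}

total = sum(counts.values())                      # 10 actors in total
shares = [c / total for c in counts.values()]     # share of each group
diversity = 1 - sum(s ** 2 for s in shares)       # 1 minus the sum of squared shares

print(round(diversity, 2))
```

The squared shares add up to 0.28, so the score for this movie comes out at 0.72.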

I have tried some simple operations, but they did not get me there.

Edit:

In the first row the total is 8:

3 GreaterEuropean
1 British
2 WestEuropean
1 French
1 Nordic

So the score would be:

1 - [(3/8)^2 + (1/8)^2 + (2/8)^2 + (1/8)^2 + (1/8)^2] = 0.75

You can use NumPy vectorization here, i.e.

import numpy as np

# Assumes tconst is the index, e.g. df = df.set_index('tconst')
one = df.drop(columns=['total_ethnicities']).values
# Counts for every column except total_ethnicities
two = df['total_ethnicities'].values[:, None].astype(float)
# Totals as a float column vector, so the division broadcasts row by row
df['diversity'] = 1 - np.sum((one / two) ** 2, axis=1)
# Divide the counts by the totals, square, sum along each row, then subtract from 1
df['diversity']

tconst
tt0000001    0.750000
tt0000002    0.666667
tt0000003    0.710744
tt0000004    0.666667
tt0000005    0.693878
Name: diversity, dtype: float64
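The same calculation can also be written with pandas methods only. The sketch below rebuilds the first three sample rows (columns that are zero in all three rows are left out for brevity, which does not change the result) and assumes tconst is the index, as the output above suggests.

```python
import pandas as pd

# First three rows of the sample data, all-zero columns omitted
df = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002", "tt0000003"],
        "GreaterEuropean": [3, 2, 4],
        "British": [1, 0, 0],
        "WestEuropean": [2, 2, 3],
        "French": [1, 2, 3],
        "Jewish": [0, 0, 1],
        "Nordic": [1, 0, 0],
        "total_ethnicities": [8, 6, 11],
    }
).set_index("tconst")

counts = df.drop(columns=["total_ethnicities"])
# Divide each row by its total, square the shares, sum across the columns
shares = counts.div(df["total_ethnicities"], axis=0)
df["diversity"] = 1 - (shares ** 2).sum(axis=1)

print(df["diversity"].round(6))
```

Using `div(..., axis=0)` aligns the totals with the rows by index label, so no manual reshaping of the totals is needed.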

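Apart from that, I always try to keep raw data separate from derived data, so I would keep the total_ethnicities column in a separate Series and only combine the two when it is needed to report results.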

If you do want this result as an extra column in df, you can do it like this:

df = df.join(result, on='tconst')
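To make the join concrete, here is a small sketch of the whole flow. It assumes that tconst is a regular column, that the totals equal the row sums of the counts (true for the sample rows), and that `result` is a Series named "diversity" indexed by tconst; the names `counts`, `totals` and `report` are illustrative.

```python
import pandas as pd

# Hypothetical raw counts, with tconst as a regular column
df = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002"],
        "GreaterEuropean": [3, 2],
        "British": [1, 0],
        "WestEuropean": [2, 2],
        "French": [1, 2],
        "Nordic": [1, 0],
    }
)

counts = df.set_index("tconst")
totals = counts.sum(axis=1)                       # kept apart from the raw data
shares = counts.div(totals, axis=0)
result = (1 - (shares ** 2).sum(axis=1)).rename("diversity")

# join needs a named Series; on="tconst" matches df's column to result's index
report = df.join(result, on="tconst")
print(report[["tconst", "diversity"]])
```

Because the join produces a new frame, the raw counts in `df` stay untouched until the combined view is actually needed.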

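The best approach is to compare every column against one given column, since the Gini coefficient describes how distributions differ. You would compute a Gini coefficient for each distribution, e.g. Italian, French, Jewish. Then, based on the comparison with the given column, you could even group those ethnicities into clusters with similar distributions.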

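Suppose df2 is your DataFrame. The Gini index formula, as implemented in the loop below, is:

G = (n + 1 - 2 * sum((n + 1 - x) * y) / sum(y)) / n

where n is the number of rows, x is the column being compared and y is the reference column.

Get the position of the reference column (place_y):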

import numpy as np

place_y = df2.columns.get_loc("price_doc")
n = df2.shape[0]            # number of rows
y = df2.iloc[:, place_y]    # reference column

gini = []
for i in range(df2.shape[1]):
    gini.append((n + 1 - 2 * (np.sum((n + 1 - df2.iloc[:, i]) * y) / np.sum(y))) / n)

Then select the columns that fall under a threshold, say 0.2, i.e. the most similar distributions:

np.where(np.array(np.abs(gini))<.2)[0]
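For comparison, the textbook rank-based Gini coefficient has the same shape as the expression in the loop above, with the rank i of each observation (after sorting in ascending order) in place of a column value. The sketch below implements that standard form, not the column-comparison variant of this answer; the function name `gini_coefficient` is illustrative.

```python
import numpy as np

def gini_coefficient(values):
    """Rank-based Gini coefficient of non-negative values with a positive sum."""
    y = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = y.size
    ranks = np.arange(1, n + 1)                   # i = 1..n
    return (n + 1 - 2 * np.sum((n + 1 - ranks) * y) / np.sum(y)) / n

print(gini_coefficient([1, 1, 1, 1]))   # perfectly equal values
print(gini_coefficient([0, 0, 0, 1]))   # everything concentrated in one value
```

Four equal values give 0.0, and concentrating everything in one of four values gives 0.75, i.e. (n - 1) / n, the maximum for n observations.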

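Can we see the expected output for the data above?

@Dark I edited it, I hope it is clear now.

Thanks. If I assign the last line to df["diversity"], it sets NaN in all values.

I got this error: ZeroDivisionError: long division or modulo by zero

Are you using Python 2?

Yes, I am using Python 2.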