Python: Using pandas and numpy to parametrize Stack Overflow's number of users and reputation


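I noticed that the number of Stack Overflow users and their reputation follow an interesting distribution. I created a pandas DataFrame to see whether I could create a parametric fit: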

import pandas as pd
import numpy as np
soDF = pd.read_excel('scores.xls')
print(soDF)

Which returns the following:

    total_rep    users
0           1  4364226
1         200   269110
2         500   158824
3        1000    90368
4        2000    48609
5        3000    32604
6        5000    18921
7       10000     8618
8       25000     2802
9       50000     1000
10     100000      334


If I plot this, I get the following chart:

The distribution seems to follow a power law. So, to visualize it better, I added the following:

soDF['log_total_rep'] = soDF['total_rep'].apply(np.log10)
soDF['log_users'] = soDF['users'].apply(np.log10)
soDF.plot(x='log_total_rep', y='log_users')

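Which produced the following:

Is there an easy way to get pandas to find the best fit for this data? Although the fit looks linear, a polynomial fit may be better, since I am now working on a logarithmic scale.

NumPy has many functions for fitting. For a polynomial fit, we use numpy.polyfit().

Initialize the dataset: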

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = [k.split() for k in '''0           1  4364226
1         200   269110
2         500   158824
3        1000    90368
4        2000    48609
5        3000    32604
6        5000    18921
7       10000     8618
8       25000     2802
9       50000     1000
10     100000      334'''.split('\n')]

soDF = pd.DataFrame(data, columns=('index', 'total_rep', 'users'))

soDF['total_rep'] = pd.to_numeric(soDF['total_rep'])
soDF['users'] = pd.to_numeric(soDF['users'])

soDF['log_total_rep'] = soDF['total_rep'].apply(np.log10)
soDF['log_users'] = soDF['users'].apply(np.log10)
soDF.plot(x='log_total_rep', y='log_users')

Fit a second-degree polynomial:

coefficients = np.polyfit(soDF['log_total_rep'] , soDF['log_users'], 2)

print("Coefficients: ", coefficients)

Next, let's plot the original data plus the fit:

polynomial = np.poly1d(coefficients)
xp = np.linspace(-2, 6, 100)

plt.plot(soDF['log_total_rep'], soDF['log_users'], '.', xp, polynomial(xp), '-')
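Whether the quadratic term is worth adding can be checked numerically. The following is a minimal sketch, not part of the original answer, that compares the residual sum of squares of a degree-1 and a degree-2 fit on the same log-log data. The helper name residual_sum_of_squares is hypothetical.

```python
import numpy as np

# Same table as in the question: reputation threshold and user count.
total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000], dtype=float)
users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                  8618, 2802, 1000, 334], dtype=float)

log_rep = np.log10(total_rep)
log_users = np.log10(users)


def residual_sum_of_squares(degree):
    """Fit a polynomial of the given degree and return its squared error."""
    coefficients = np.polyfit(log_rep, log_users, degree)
    fitted = np.polyval(coefficients, log_rep)
    return float(np.sum((log_users - fitted) ** 2))


rss_linear = residual_sum_of_squares(1)
rss_quadratic = residual_sum_of_squares(2)

print("RSS, degree 1:", rss_linear)
print("RSS, degree 2:", rss_quadratic)
```

A higher-degree polynomial can never have a larger residual on the points it was fitted to, so a lower number for degree 2 does not by itself mean the quadratic model is the better description of the data.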

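python, pandas, and scipy, oh my!

The scientific Python ecosystem has several complementary libraries. No single library does everything, by design. pandas provides tools for manipulating table-like data and time series. However, it deliberately doesn't include the type of functionality you're looking for.

To fit a statistical distribution, you would typically use another package, such as scipy.stats.

However, in this case we don't have the "raw" data (i.e. a long list of reputation scores). Instead, we have something similar to a histogram. Therefore, we need to do the fit at a slightly lower level than scipy.stats.powerlaw.fit.

Standalone example

For the moment, let's drop pandas entirely. There's no benefit to using it here, and we'd quickly wind up converting the dataframe to other data structures. pandas is great; it's just overkill in this situation.

As a quick standalone example that reproduces your plot: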

import matplotlib.pyplot as plt

total_rep = [1, 200, 500, 1000, 2000, 3000, 5000, 10000,
             25000, 50000, 100000]
num_users = [4364226, 269110, 158824, 90368, 48609, 32604, 
             18921, 8618, 2802, 1000, 334]

fig, ax = plt.subplots()
ax.loglog(total_rep, num_users)
ax.set(xlabel='Total Reputation', ylabel='Number of Users',
       title='Log-Log Plot of Stackoverflow Reputation')
plt.show()

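What does this data represent?

Next, we need to know what we're working with. What we've plotted is similar to a histogram, in that it shows raw counts of the number of users at a given reputation level. However, note the little "+" beside each bin of the reputation table. This means that, for example, 2802 users have a reputation score of 25000 or greater.

Our data is basically an estimate of the complementary cumulative distribution function (CCDF), in the same sense that a histogram is an estimate of the probability distribution function (PDF). We just need to normalize it by the total number of users in our sample to get an estimate of the CCDF. In this case, we can simply divide by the first element of num_users. Reputation can never be less than 1, so 1 on the x-axis corresponds to a probability of 1 by definition. (In other cases, we'd need to estimate this number.) As an example: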

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', lw=2, marker='o',
          clip_on=False, zorder=10)
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')
plt.show()

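You might wonder why we're converting things to a "normalized" version. The simplest answer is that it's more useful. It lets us say something that isn't directly tied to our sample size. Tomorrow, the total number of Stack Overflow users (and the number at each reputation level) will be different. However, the probability that any given user has a particular reputation won't have changed significantly. If we want to predict the reputation of Jon Skeet (the highest-rep user) when the site hits 5 million registered users, it's much easier to use probabilities than raw counts.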
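As a rough illustration of that point, the normalized CCDF can be rescaled to any site size. This is only a sketch: the name future_total_users is hypothetical, the 5 million figure is the one mentioned above, and the calculation assumes the CCDF itself stays unchanged as the site grows.

```python
import numpy as np

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

# Normalize the counts so the first bin (reputation >= 1) has probability 1.
ccdf = num_users.astype(float) / num_users.max()

# Hypothetical future site size (the 5 million figure from the text).
future_total_users = 5000000

# Assumption: the CCDF does not change as the user base grows.
expected_counts = ccdf * future_total_users

for rep, count in zip(total_rep, expected_counts):
    print("{:>7}+ reputation: about {:,.0f} users".format(int(rep), count))
```

The same rescaling cannot be done with the raw counts alone, because they only describe the site at the size it had when the table was collected.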

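Naive fit of a power-law distribution

Next, let's fit a power-law distribution to the CCDF. Again, if we had the "raw" data in the form of a long list of reputation scores, it would be best to use a statistical package to handle this, in particular scipy.stats.powerlaw.fit.

However, we don't have the raw data. The CCDF of a power-law distribution takes the form CCDF = x**(-a + 1). Therefore, we'll fit a line in log-space, and we can get the a parameter of the distribution from a = 1 - slope.

For the moment, let's use np.polyfit to fit the line. We'll need to handle the conversion to and from log-space ourselves: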

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

# Fit a line in log-space
logx = np.log(total_rep)
logy = np.log(ccdf)
params = np.polyfit(logx, logy, 1)
est = np.exp(np.polyval(params, logx))

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

ax.plot(total_rep, est, color='salmon', label='Fit', ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')

plt.show()

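This fit has an immediate problem. Our estimate states that there's a greater than 1 probability that a user has a reputation of 1 or more. That's impossible.

The problem is that we let polyfit choose the best-fit y-intercept for our line. If we take a look at params in the code above, it's the second number: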

In [11]: params
Out[11]: array([-0.81938338,  1.15955974])

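By definition, the fitted CCDF should equal 1 at a reputation of 1. Instead, the best-fit intercept is about 1.16 in log-space, which corresponds to a probability of exp(1.16), or roughly 3.19. We need to fix that number and allow only the slope to vary in the linear fit.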
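To make these numbers concrete, here is a small sketch that refits the line and converts the result back out of log-space. It uses only the relations stated above (a = 1 - slope, and the intercept being in natural-log space). The variable names a and prob_at_rep_1 are illustrative.

```python
import numpy as np

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()

# Unconstrained straight-line fit in natural-log space.
slope, intercept = np.polyfit(np.log(total_rep), np.log(ccdf), 1)

# CCDF = x**(-a + 1)  =>  log(CCDF) = (1 - a) * log(x), so a = 1 - slope.
a = 1 - slope

# The intercept lives in log-space; exponentiate it to get the fitted
# "probability" at a reputation of 1 (where log(x) = 0).
prob_at_rep_1 = np.exp(intercept)

print("slope:", slope)
print("a:", a)
print("fitted CCDF at reputation 1:", prob_at_rep_1)
```

A fitted value above 1 at a reputation of 1 is exactly the impossibility described above.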

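Fixing the y-intercept in the fit

First off, note that log(1) --> 0. Therefore, we actually want to force the y-intercept in log-space to be 0 instead of 1.

It's easiest to do this by using np.linalg.lstsq instead of np.polyfit. At any rate, you'd do something similar to: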

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

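# Fit a line with a y-intercept of 0 in log-space (a probability of 1)
logx = np.log(total_rep)
logy = np.log(ccdf)
# lstsq returns (solution, residuals, rank, singular values); the solution
# is a length-1 array, so pull the scalar slope out of it
slope = np.linalg.lstsq(logx[:, np.newaxis], logy, rcond=None)[0][0]

params = [slope, 0]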
est = np.exp(np.polyval(params, logx))

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

ax.plot(total_rep, est, color='salmon', label='Fit', ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')

plt.show()

Hmm... Now we have a new problem. Our new line doesn't fit our data very well. This is a common problem with power-law distributions.
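To put a number on how much worse the constrained line is, here is a minimal sketch that compares the squared error of both lines in log-space. For a line through the origin, the least-squares slope has the closed form sum(x*y) / sum(x**2), which is used here in place of np.linalg.lstsq. The helper rss is hypothetical.

```python
import numpy as np

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()

logx = np.log(total_rep)
logy = np.log(ccdf)

# Unconstrained line: slope and intercept are both free.
free_slope, free_intercept = np.polyfit(logx, logy, 1)

# Line forced through the origin of log-space: minimizing
# sum((logy - m * logx)**2) over m gives m = sum(x*y) / sum(x**2).
fixed_slope = np.sum(logx * logy) / np.sum(logx ** 2)


def rss(slope, intercept):
    """Residual sum of squares of a line in log-space."""
    return float(np.sum((logy - (slope * logx + intercept)) ** 2))


rss_free = rss(free_slope, free_intercept)
rss_fixed = rss(fixed_slope, 0.0)

print("fixed-intercept slope:", fixed_slope)
print("RSS, free intercept:  ", rss_free)
print("RSS, fixed intercept: ", rss_fixed)
```

The constrained fit necessarily has the larger residual, since it is a special case of the unconstrained one.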

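Using only the "tail" in the fit

In real life, observed distributions almost never follow a power law exactly. However, their "long tails" often do. You can see this quite clearly in this dataset. If we exclude the first two data points (low reputation / high probability), we get a very different line, and it is a much better fit to the remaining data.

The fact that only the tail of the distribution follows a power law explains why we weren't able to fit our data very well when we fixed the y-intercept.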
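A tail-only fit along these lines could look like the sketch below. The cutoff n_skip is an assumption (set to 2 here to match the "first two data points" mentioned above); the fitted values depend on that choice, so they need not reproduce the long-tail parameters quoted elsewhere in this answer.

```python
import numpy as np

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()

# Assumption: how many low-reputation points to leave out of the fit.
n_skip = 2

# Fit a line in log-space using only the tail of the distribution.
tail_slope, tail_intercept = np.polyfit(np.log(total_rep[n_skip:]),
                                        np.log(ccdf[n_skip:]), 1)

# For comparison: the naive fit over all points.
naive_slope, naive_intercept = np.polyfit(np.log(total_rep), np.log(ccdf), 1)

print("tail fit:  slope={:.4f}, intercept={:.4f}".format(tail_slope,
                                                         tail_intercept))
print("naive fit: slope={:.4f}, intercept={:.4f}".format(naive_slope,
                                                         naive_intercept))
```

The tail-only line is noticeably steeper than the naive one, which is what the log-log plot suggests by eye.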

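There are many different modified power-law models. The following plots the observations, the top 5 users, and the three sets of fit parameters (naive, fixed-intercept, and long-tail) on one set of axes: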

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

top_5_rep = [832131, 632105, 618926, 596889, 576697]
top_5_ccdf = np.array([1, 2, 3, 4, 5], dtype=float) / num_users.max()

ccdf = num_users.astype(float) / num_users.max()

# Previous fits
naive_params = [-0.81938338,  1.15955974]
fixed_intercept_params = [-0.68845134, 0]
long_tail_params = [-1.26172528, 5.24883471]

fits = [naive_params, fixed_intercept_params, long_tail_params]
fit_names = ['Naive Fit', 'Fixed Intercept Fit', 'Long Tail Fit']


fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

# Plot reputation of top 5 users
ax.loglog(top_5_rep, top_5_ccdf, ls='', marker='o', color='darkred',
          zorder=10, label='Top 5 Users')

# Plot different fits
for params, name in zip(fits, fit_names):
    x = [1, 1e7]
    est = np.exp(np.polyval(params, np.log(x)))
    ax.loglog(x, est, label=name, ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X',
       ylim=[1e-7, 1])
ax.legend()

plt.show()

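Inverting each fitted line gives the reputation it predicts for the top user, whose CCDF value is 1 / 4364226:

import numpy as np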

# Jon Skeet's actual reputation
skeet_prob = 1.0 / 4364226
true_rep = 832131

# Previous fits
naive_params = [-0.81938338,  1.15955974]
fixed_intercept_params = [-0.68845134, 0]
long_tail_params = [-1.26172528, 5.24883471]

fits = [naive_params, fixed_intercept_params, long_tail_params]
fit_names = ['Naive Fit', 'Fixed Intercept Fit', 'Long Tail Fit']

for params, name in zip(fits, fit_names):
    inv_params = [1 / params[0], -params[1]/params[0]]
    est = np.exp(np.polyval(inv_params, np.log(skeet_prob)))

    print('{}:'.format(name))
    print('    Pred. Rep.: {}'.format(est))
    print('')

print('True Reputation: {}'.format(true_rep))

Naive Fit:
    Pred. Rep.: 522562573.099

Fixed Intercept Fit:
    Pred. Rep.: 4412664023.88

Long Tail Fit:
    Pred. Rep.: 11728612.2783

True Reputation: 832131

    total_rep    users
0           1  4364226
1         200   269110
2         500   158824
3        1000    90368
4        2000    48609
5        3000    32604
6        5000    18921
7       10000     8618
8       25000     2802
9       50000     1000
10     100000      334
11     193000      100
12     261000       50
13     441000       10
14     578000        5
15     833000        1

soDF['log_total_rep'] = soDF['total_rep'].apply(np.log10)
soDF['log_users']     = soDF['users'].apply(np.log10)
coefficients = np.polyfit(soDF['log_total_rep'] , soDF['log_users'], 6)
polynomial = np.poly1d(coefficients)
print(polynomial)

          6           5          4          3          2
-0.00258 x + 0.04187 x - 0.2541 x + 0.6774 x - 0.7697 x - 0.2513 x + 6.64

xp = np.linspace(0, 6, 100)
plt.figure(figsize=(18,6))
plt.title('Stackoverflow Reputation', fontsize =15)
plt.xlabel('Log reputation', fontsize =15)
plt.ylabel('Log number of users with reputation greater than X', fontsize = 15)
plt.plot(soDF['log_total_rep'], soDF['log_users'],'o', label ='Data')
plt.plot(xp, polynomial(xp), color='red', label='Fit', ls='--')
plt.legend(loc='upper right', fontsize = 15)

total_users = 4407194
def predicted_rank(total_rep):
    parametric_rank_position   = 10**polynomial(np.log10(total_rep))
    parametric_rank_percentile = parametric_rank_position/total_users
    print("Position is " + str(int(parametric_rank_position)) + ", and rank is top " +  "{:.4%}".format(parametric_rank_percentile))

predicted_rank(165671)
Position is 133, and rank is top 0.0030%

predicted_rank(374507)
Position is 18, and rank is top 0.0004%

predicted_rank(579042)
Position is 4, and rank is top 0.0001%

predicted_rank(1242)
Position is 75961, and rank is top 1.7236%