Python pandas: finding unique entries in daily census data

python pandas dataframe

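I have a month of census data that looks like the sample below, and I want to find out how many unique inmates there were during the month. The data is captured daily, so the same inmate appears multiple times.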

_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33

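My current approach is to group the rows by day and then append to a new DataFrame any rows that are not already in it. My question is how to account for two different people who have identical information. Wouldn't only one of them be added to the new DataFrame, because a matching row already exists? I am trying to work out how many people in total were in the jail during this period.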

_id is incremental. For example, here is some data from the second day:

2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39

Link to the dataset here:

You can use df.drop_duplicates(), which returns a DataFrame containing only the unique rows, and then count the entries.

Something like this should work:

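import pandas as pd

# _id becomes the index, so drop_duplicates() does not compare it
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)

# every remaining column, including Date, is compared
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)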

Result:

>> 11845
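
Note that drop_duplicates() compares every column, so while Date is still in the frame the same inmate recorded on two different days is not treated as a duplicate. The following is a minimal sketch, assuming the column names from the sample in the question and a small made-up inline sample in place of the real file, that restricts the comparison with the subset parameter:

```python
import io
import pandas as pd

# Made-up rows in the question's format: the same three people on two days
# (hypothetical data; the _id values are illustrative only).
sample = io.StringIO(
    "_id,Date,Gender,Race,Age at Booking,Current Age\n"
    "1,2016-06-01,M,W,32,33\n"
    "2,2016-06-01,M,B,25,27\n"
    "3,2016-06-01,M,W,31,33\n"
    "4,2016-06-02,M,W,32,33\n"
    "5,2016-06-02,M,B,25,27\n"
    "6,2016-06-02,M,W,31,33\n"
)
df = pd.read_csv(sample, index_col=0)

# Columns assumed to describe a person (everything except _id and Date).
person_cols = ["Gender", "Race", "Age at Booking", "Current Age"]

with_date = len(df.drop_duplicates())                       # Date is compared too
without_date = len(df.drop_duplicates(subset=person_cols))  # Date is ignored

print(with_date, without_date)  # 6 3
```

Whether ignoring Date is appropriate depends on what should count as one inmate, so treat this as an illustration of the parameter rather than a corrected count.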

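The problem with this approach (and this data) is that there may be many distinct inmates who share the same age, gender, and race, and they would be filtered out.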
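
To make that limitation concrete, here is a small sketch on made-up rows (hypothetical data, same column names as the question). Two different inmates share identical attributes on the same day, so drop_duplicates() collapses them into one, while a per-day count keeps the multiplicity. Taking the largest same-day count for each combination is only a suggested lower bound on the number of distinct people, an assumption rather than an exact answer.

```python
import io
import pandas as pd

# Hypothetical sample: rows 1 and 2 are two different people who happen to
# have identical attributes on the same day.
sample = io.StringIO(
    "_id,Date,Gender,Race,Age at Booking,Current Age\n"
    "1,2016-06-01,M,B,25,27\n"
    "2,2016-06-01,M,B,25,27\n"
    "3,2016-06-01,M,W,31,33\n"
    "4,2016-06-02,M,B,25,27\n"
    "5,2016-06-02,M,W,31,33\n"
)
df = pd.read_csv(sample)
person_cols = ["Gender", "Race", "Age at Booking", "Current Age"]

# Dropping duplicates on the person columns merges rows 1 and 2.
collapsed = len(df.drop_duplicates(subset=person_cols))

# Count how many rows each combination has on each day, then keep the
# largest same-day count per combination as a lower bound.
per_day = df.groupby(person_cols + ["Date"]).size()
lower_bound = int(per_day.groupby(level=person_cols).max().sum())

print(collapsed, lower_bound)  # 2 3
```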

I think the trick here is to group on as many columns as possible and check how those (small) groups change over the month:

import pandas as pd

inmates = pd.read_csv('inmates.csv')

# group by everything except _id and count the number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()

# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)

# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()

# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]

# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)

# sum total column
diffed['total'].sum()  # 3393
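
To see what the group-and-diff steps compute, here is the same idea on a tiny made-up sample (hypothetical rows, not taken from the dataset). This sketch swaps in size() for count() and clip(lower=0) for the lambda, which should be equivalent for this purpose, and it keeps the same assumption as above: every day-to-day increase in a group is a new arrival.

```python
import io
import pandas as pd

# Hypothetical three-day sample with three attribute combinations.
sample = io.StringIO(
    "_id,Date,Gender,Race,Age at Booking,Current Age\n"
    "1,2016-06-01,M,B,25,27\n"
    "2,2016-06-01,M,B,25,27\n"
    "3,2016-06-01,M,W,31,33\n"
    "4,2016-06-02,M,B,25,27\n"
    "5,2016-06-02,M,W,31,33\n"
    "6,2016-06-02,F,W,40,40\n"
    "7,2016-06-03,M,B,25,27\n"
    "8,2016-06-03,M,B,25,27\n"
    "9,2016-06-03,M,W,31,33\n"
)
df = pd.read_csv(sample)
person_cols = ["Gender", "Race", "Age at Booking", "Current Age"]

# Rows: one per day. Columns: one per attribute combination.
counts = (
    df.groupby(person_cols + ["Date"])
    .size()
    .unstack("Date", fill_value=0)
    .T
)

# Day-to-day change per combination; the first day keeps its raw counts.
diffed = counts.diff()
diffed.iloc[0] = counts.iloc[0]

# Keep only increases (assumed arrivals) and add them up per day.
arrivals = diffed.clip(lower=0).sum(axis=1)
total = int(arrivals.sum())

print(arrivals.tolist(), total)  # [3.0, 1.0, 1.0] 5
```

On day three the (M, B, 25, 27) group goes from one back to two, and that increase is counted as a new person even though it could be someone returning, which is exactly the caveat mentioned in the code comments above.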

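I have about 10 months to go through, each with roughly 50,000 records. If I had to, I would do it by hand, but I am looking for a smarter way.

Updated. Try the code above and see whether it is closer to what you want.

Is the id unique, or just incremental?

What constitutes a unique inmate? Is it the combination of Gender, Race, Age at Booking, and Current Age? What if an inmate has a birthday and therefore has two distinct Current Age values within the month? Is there a way to tell whether that is one inmate or two?

It is the combination of 'Gender', 'Race', 'Age at Booking', and 'Current Age'. That is all the information provided. One tricky case is when someone leaves on the same day that someone with the same information comes in. I have added a link to the dataset for more information.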