Python: count of unique values currently present in the data


I am trying to return the count of unique values in df as a cumulative count for each row. My goal is to incorporate a function that determines how many values are currently present at any point in time.

import pandas as pd

df = pd.DataFrame({          
    'A' : ['8:06:00','11:00:00','11:30:00','12:00:00','13:00:00','13:30:00','14:00:00','17:00:00'],
    'B' : ['ABC','ABC','DEF','XYZ','ABC','LMN','DEF','ABC'],          
    'C' : [1,2,1,1,3,1,2,4],            
    })

          A    B  C
0   8:06:00  ABC  1
1  11:00:00  ABC  2
2  11:30:00  DEF  1
3  12:00:00  XYZ  1
4  13:00:00  ABC  3
5  13:30:00  LMN  1
6  14:00:00  DEF  2
7  17:00:00  ABC  4
So there are 4 unique values in col['B']. That is what I am measuring with

df1 = df['B'].nunique()

However, I want to incorporate a function that iterates through the data to determine whether any particular value occurs again. If it does not, the count should decrease. If it is the first occurrence of the value, the count should increase. If the value has already appeared and appears again, the count should stay the same. This would show how many values are current at any point in time.

Using @jpp's code, we produce the following:

cum_maxer = pd.Series(pd.factorize(df['B'])[0] + 1).cummax()
df['res'] = cum_maxer - df['B'].duplicated().cumsum()

print(df)
Output:

'res'

0  1
1  1
2  2
3  3
4  2
5  3
6  2
7  1
Basically, if a value appears for the first time, I want it added to the cumulative count. If the value is finished (does not occur again later), the count should decrease. If the value has already appeared and appears again, the count should stay the same.

Details and expected output for each row:

Row 1: ABC occurs for the first time and occurs again later. Count = +1

Row 2: ABC occurs again, so there is no increase. It also occurs later, so there is no decrease. Count = no change

Row 3: DEF occurs for the first time and occurs again later. Count = +1

Row 4: XYZ occurs for the first time but does not occur again later. At this point three values have occurred, so the count is 3. Once XYZ is finished, the count drops on the next row.

Row 5: as above, XYZ is finished, so only ABC and DEF are currently live. ABC also occurs again, so the count is 2.

Row 6: LMN occurs for the first time, so the count increases. That means ABC, DEF and LMN are current at this point in time. As with row 4, LMN does not occur again, so once LMN is finished the count decreases on the next row. Count is 3.

Row 7: DEF and ABC are currently live, so the count is 2. Since DEF does not occur again, the count decreases on the next row.

Row 8: ABC is the only value currently live, so the count is 1.
You can use pd.factorize to assign an integer identifier to each unique value, and then take a running count (cumulative maximum) of the result:

df['id'] = pd.factorize(df['B'])[0] + 1
df['count'] = df['id'].cummax()

print(df)

          A    B  C  id  count
0   8:06:00  ABC  1   1      1
1  11:00:00  DEF  1   2      2
2  12:00:00  XYZ  1   3      3
3  13:00:00  ABC  2   1      3
4  13:30:00  LMN  1   4      4
5  14:00:00  DEF  2   2      4
6  17:00:00  ABC  3   1      4

Update

For the desired output, you can compute the cummax as before and subtract the cumulative count of duplicates:

cum_maxer = pd.Series(pd.factorize(df['B'])[0] + 1).cummax()
df['res'] = cum_maxer - df['B'].duplicated().cumsum()

print(df)

          A    B  C  res
0   8:06:00  ABC  1    1
1  11:00:00  DEF  1    2
2  12:00:00  XYZ  1    3
3  13:00:00  ABC  2    2
4  13:30:00  LMN  1    3
5  14:00:00  DEF  2    2
6  17:00:00  ABC  3    1

You can also use np.unique:

import numpy as np

u = np.unique(df.B, return_index=True)
df['id'] = df.B.map(dict(zip(*u))) + 1

0    1
1    2
2    3
3    1
4    2
5    1

Edited question

For your edited question, here is a solution. First, use cumcount on the reversed dataframe to look into the future. That way, u gives the number of future occurrences of the current value of B. Then zip B with u and apply your logic, using S_n = S_{n-1} + new_value - dec, where new_value is True if the current val has not been seen before, and dec is True if the previous row held the last occurrence of its value (i.e. u == 0 at that point).
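The u column is not computed in the snippet below; a minimal sketch consistent with the description above (cum-counting over the reversed frame, with index alignment restoring the original row order) would be the line below. It reproduces the u column shown in the result table further down.

# Not shown in the original answer: remaining future occurrences of each row's value of B
df['u'] = df[::-1].groupby('B').cumcount()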

ids = [1]                    # the first row's value is always new, so the count starts at 1
seen = set([df.iloc[0].B])   # values observed so far
dec = False                  # did the previous row hold the last occurrence of its value?
for val, u in zip(df.B[1:], df.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)  # +1 if val is new, -1 if the previous value ended
    seen.add(val)
    dec = u == 0

df['S'] = ids

    A           B   C   u   S   expected
0   8:06:00     ABC 1   3   1          1
1   11:00:00    ABC 2   2   1          1
2   11:30:00    DEF 1   1   2          2
3   12:00:00    XYZ 1   0   3          3
4   13:00:00    ABC 3   1   2          2
5   13:30:00    LMN 1   0   3          3
6   14:00:00    DEF 2   0   2          2
7   17:00:00    ABC 4   0   1          1
where u is the number of remaining future occurrences of each row's value, as computed in the sketch above.


Timings / results

Updated answer with a faster approach

I wish I had noticed @RafaelC's groupby.cumcount() technique before giving my answer below; it gave me the idea for a much faster method. As @RafaelC noted, you don't need to carry the full collection of observations as you work through the rows; it is enough to know whether the current symbol occurs earlier or later. In fact, as you pointed out in your update, all you really need to know is whether the symbol on the current row occurs for the first time (add 1 to the count) and whether the symbol on the previous row occurred for the last time (subtract 1 from the count). With that in mind, you can use the following fairly simple and streamlined code:

import numpy as np, pandas as pd

df = pd.DataFrame({          
    'A' : ['8:06:00','11:00:00','11:30:00','12:00:00','13:00:00','13:30:00','14:00:00','17:00:00'],
    'B' : ['ABC','ABC','DEF','XYZ','ABC','LMN','DEF','ABC'],          
    'C' : [1,2,1,1,3,1,2,4],            
})

groups = df.groupby('B')['B']
# flag the first and last appearance of each symbol
first_appearance = (groups.cumcount() == 0).astype(int)
last_appearance = (groups.cumcount(False) == 0).astype(int)
# delay effect of last_appearance by one step
last_appearance = np.concatenate(([0], last_appearance.values[:-1]))
df['res'] = (first_appearance - last_appearance).cumsum()
print(df)
#           A    B  C  res
# 0   8:06:00  ABC  1    1
# 1  11:00:00  ABC  2    1
# 2  11:30:00  DEF  1    2
# 3  12:00:00  XYZ  1    3
# 4  13:00:00  ABC  3    2
# 5  13:30:00  LMN  1    3
# 6  14:00:00  DEF  2    2
# 7  17:00:00  ABC  4    1
Calling this matthias2 and re-running @RafaelC's benchmark gives the following:

%timeit matthias1(df)
10 loops, best of 3: 109 ms per loop
%timeit raf(df)
1 loops, best of 3: 230 ms per loop
%timeit matthias2(df)
100 loops, best of 3: 7 ms per loop
Original answer (relatively slow)

How about the code below? The idea is to use two cumulative sets: one showing all the items seen from the start of the list up to the current point, and another showing all the items still to be seen from the current point to the end. The latter set can be built the same way as the first, simply by reversing the list, building the cumulative sets, and then reversing the result again.

Pandas does not have a general-purpose cumulative function that can do this. You could probably manage it via pd.Series.expanding, but that re-accumulates large chunks of the series at every step, giving a slow O(n^2) time dependence. So I use numpy accumulate ufuncs to build the sets, as shown below. This should run fairly efficiently and is quite clear.
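For reference, here is a naive quadratic version of the same seen / to-be-seen idea (illustrative only; not part of the original answer). It rescans the prefix and suffix of the column at every row, which is exactly the cost the accumulate-based code below avoids:

# Naive O(n^2) baseline: for each row, intersect the set of values seen so far
# with the set of values still to come (df as defined in the question above).
naive = [
    len(set(df['B'].iloc[:i + 1]) & set(df['B'].iloc[i:]))
    for i in range(len(df))
]
print(naive)  # [1, 1, 2, 3, 2, 3, 2, 1]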

import numpy as np, pandas as pd

df = pd.DataFrame({          
    'A' : ['8:06:00','11:00:00','11:30:00','12:00:00','13:00:00','13:30:00','14:00:00','17:00:00'],
    'B' : ['ABC','ABC','DEF','XYZ','ABC','LMN','DEF','ABC'],          
    'C' : [1,2,1,1,3,1,2,4],            
})

# convert individual values to sets to make the next steps easier
valsets = df['B'].apply(lambda x: {x})

# define numpy ufuncs to get union of sets and size of intersection of sets
# note that union_sets.accumulate() will give a "cumulative union" of sets
union_sets = np.frompyfunc(lambda x, y: x | y, 2, 1)
intersect_count = np.frompyfunc(lambda x, y: len(x & y), 2, 1)

# create numpy vectors showing how many unique values have been seen up to 
# each point, and how many will be seen from there to the end
seen = union_sets.accumulate(valsets, dtype=object)
to_be_seen = union_sets.accumulate(valsets[::-1], dtype=object)[::-1]

# count how many are in both the have-been-seen and to-be-seen sets
df['res'] = intersect_count(seen, to_be_seen)

# add intermediate vectors for illustration
df['seen'] = seen
df['to_be_seen'] = to_be_seen

print(df)
          A    B  C res                  seen            to_be_seen
0   8:06:00  ABC  1   1                 {ABC}  {XYZ, ABC, DEF, LMN}
1  11:00:00  ABC  2   1                 {ABC}  {XYZ, ABC, LMN, DEF}
2  11:30:00  DEF  1   2            {ABC, DEF}  {XYZ, ABC, DEF, LMN}
3  12:00:00  XYZ  1   3       {XYZ, ABC, DEF}  {XYZ, ABC, LMN, DEF}
4  13:00:00  ABC  3   2       {XYZ, ABC, DEF}       {ABC, DEF, LMN}
5  13:30:00  LMN  1   3  {XYZ, ABC, LMN, DEF}       {ABC, LMN, DEF}
6  14:00:00  DEF  2   2  {XYZ, ABC, DEF, LMN}            {ABC, DEF}
7  17:00:00  ABC  4   1  {XYZ, ABC, LMN, DEF}                 {ABC}

Note that I stored the intermediate vectors in the dataframe so you can see how the algorithm works; that would not be needed in production code.

Why isn't row 3 equal to 3 1 in your expected output? I'm not following - do you mean that when XYZ no longer occurs, it drops back to 2? Is this some kind of cumulative count, a count grouped by each B value, or some kind of index? It is a bit odd.
%timeit matt(df)
168 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit raf(df)
64.2 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)