Python: How to efficiently compute a rolling unique count in a time series?

Tags: python, pandas, time-series, distinct-values, rolling-computation

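I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people who visited the building in the last 365 days (i.e. a rolling unique count with a 365-day window).

pandas does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are many unique visitors and/or a large window. (The actual data is larger than this example.)

Is there a better way to calculate this than what I have done below? I am not sure why the fast method I wrote, windowed_nunique (under "Speed test 3"), is off by 1.

Thanks for any help!

Related links:

  • Source Jupyter notebook:
  • Related pandas issue: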
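To make the target quantity concrete, here is a minimal brute-force sketch in plain Python. The function name `rolling_nunique_bruteforce` and the toy data are hypothetical, and it assumes that dates are integer day numbers sorted in ascending order and that a window of `window` days covers the current day plus the `window - 1` days before it. It is only meant as a slow reference definition, not as a solution to the performance problem.

```python
def rolling_nunique_bruteforce(dates, pids, window):
    """For each record i, count the distinct ids among records j <= i
    whose date falls within the last `window` days (current day included).

    Assumes `dates` are integer day numbers sorted in ascending order.
    Quadratic in the number of records, so only useful as a reference.
    """
    counts = []
    for i, today in enumerate(dates):
        seen = {pids[j] for j in range(i + 1) if today - dates[j] < window}
        counts.append(len(seen))
    return counts


# Toy data (hypothetical): six visits by three people over five days.
dates = [0, 0, 1, 2, 3, 4]
pids = [7, 8, 7, 9, 7, 9]
print(rolling_nunique_bruteforce(dates, pids, window=3))  # [1, 2, 2, 3, 2, 2]
```

The output has one entry per input record, which is the shape the question asks for.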
Initialization

In [1]:

# Import libraries.
import pandas as pd
import numba
import numpy as np

In [2]:

# Create data of people visiting a building.

np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)

df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)

print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns
Created data of people visiting a building:

|   | Date       | PersonId | 
|---|------------|----------| 
| 0 | 2010-01-01 | 76       | 
| 1 | 2010-01-01 | 63       | 
| 2 | 2010-01-01 | 89       | 
| 3 | 2010-01-01 | 81       | 
| 4 | 2010-01-01 | 7        |

Speed reference

In [3]:

%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()

3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed test 1

In [4]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())

2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values
# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
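
As a quick sanity check on the integer cast above, the same "days since epoch" value can be computed with the standard library alone. The helper name `days_since_epoch` is hypothetical; the check uses 2010-01-19, which appears with DateEpoch 14628 in the mismatch table of this question.

```python
from datetime import date


def days_since_epoch(d):
    """Number of whole days between 1970-01-01 and the date `d`."""
    return (d - date(1970, 1, 1)).days


print(days_since_epoch(date(2010, 1, 19)))  # 14628
print(days_since_epoch(date(2010, 1, 20)))  # 14629
```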
Speed test 2

In [6]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))

In [7]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)

430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:

# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)

Speed test 3

In [9]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert dates.shape == pids.shape

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        while (date - date_min) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [11]:

%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:

# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))

In [13]:

# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns
Where reference ('ref') calculation of number of unique people doesn't match 'test':

|    | Date       | PersonId | DateEpoch | ref  | test | 
|----|------------|----------|-----------|------|------| 
| 78 | 2010-01-19 | 99       | 14628     | 56.0 | 55   | 
| 79 | 2010-01-19 | 96       | 14628     | 56.0 | 55   | 
| 80 | 2010-01-19 | 88       | 14628     | 56.0 | 55   | 
| 81 | 2010-01-20 | 94       | 14629     | 56.0 | 55   | 
| 82 | 2010-01-20 | 48       | 14629     | 57.0 | 56   | 

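If you only need the number of unique people who entered the building in the last 365 days, you can first restrict the data to that period with .loc: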

df = df.loc[df['date'] > '2016-09-28',:]
With groupby, you get one row per unique person who came in, and if you count, you also get the number of times each of them came in:

df = df.groupby('PersonID').count()
This seems to work for your question, but maybe I have misunderstood it.
Have a nice day.

Very close to your time in "Speed test 2", but as a one-liner, resampled over a year:

 df.resample('AS',on='Date')['PersonId'].expanding(0).apply(lambda x: np.unique(x).shape[0])
Timing result:

1 loop, best of 3: 483 ms per loop

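I had 2 errors in the fast method windowed_nunique, now corrected in windowed_nunique_corrected below:

  • The array that records the count of each person ID within the window, pid_cts, was too small.
  • Because the leading and trailing edges of the window include integer days, date_min should be updated when (date - date_min + 1) > window.

Related links:

  • Source Jupyter notebook updated with the solution: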
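
The two fixes listed above can be illustrated with a small pure-Python sketch. The function name `windowed_nunique_py` and its `inclusive` flag are hypothetical and exist only to contrast the two eviction conditions; the sketch assumes integer day numbers sorted in ascending order and non-negative integer IDs, and it is not the Numba implementation from this answer.

```python
def windowed_nunique_py(dates, pids, window, inclusive=True):
    """Pure-Python sliding-window unique count (illustrative sketch).

    inclusive=True evicts while (date - date_min + 1) > window, so a window
    of 3 covers day0, day1, day2. inclusive=False uses the original
    condition (date - date_min) > window, which keeps one extra day.
    """
    assert len(dates) == len(pids)
    assert window >= 1
    assert min(pids) >= 0

    # One counter slot per possible id 0..max(pids), hence the "+ 1".
    pid_cts = [0] * (max(pids) + 1)
    offset = 1 if inclusive else 0
    idx_min = 0  # index of the oldest record still inside the window
    uct = 0      # number of ids with a non-zero count
    ucts = []

    for date, pid in zip(dates, pids):
        # Count went from 0 to 1: a new unique id entered the window.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # Drop records that have fallen out of the window.
        while (date - dates[idx_min] + offset) > window:
            old = pids[idx_min]
            pid_cts[old] -= 1
            if pid_cts[old] == 0:
                uct -= 1
            idx_min += 1

        ucts.append(uct)
    return ucts


# Four different visitors on four consecutive days, 3-day window.
dates = [0, 1, 2, 3]
pids = [0, 1, 2, 3]
print(windowed_nunique_py(dates, pids, window=3, inclusive=True))   # [1, 2, 3, 3]
print(windowed_nunique_py(dates, pids, window=3, inclusive=False))  # [1, 2, 3, 4]
```

With the inclusive condition, day 0 leaves the window when day 3 arrives, so the count stays at 3; with the original condition, four days are counted in a 3-day window.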
    In [14]:

    # Define a custom function and implement a just-in-time compiler.
    @numba.jit(nopython=True)
    def windowed_nunique_corrected(dates, pids, window):
        r"""Track number of unique persons in window,
        reading through arrays only once.
    
        Args:
            dates (numpy.ndarray): Array of dates as number of days since epoch.
            pids (numpy.ndarray): Array of integer person identifiers.
                Required: min(pids) >= 0
            window (int): Width of window in units of difference of `dates`.
                Required: window >= 1
    
        Returns:
            ucts (numpy.ndarray): Array of unique counts.
    
        Raises:
            AssertionError: Raised if not...
                * len(dates) == len(pids)
                * min(pids) >= 0
                * window >= 1
    
        Notes:
            * Matches `pandas.core.window.Rolling`
                with a time series alias offset.
    
        """
    
        # Check arguments.
        assert len(dates) == len(pids)
        assert np.min(pids) >= 0
        assert window >= 1
    
        # Initialize counters.
        idx_min = 0
        idx_max = dates.shape[0]
        date_min = dates[idx_min]
        pid_min = pids[idx_min]
        pid_max = np.max(pids) + 1
        pid_cts = np.zeros(pid_max, dtype=np.int64)
        pid_cts[pid_min] = 1
        uct = 1
        ucts = np.zeros(idx_max, dtype=np.int64)
        ucts[idx_min] = uct
        idx = 1
    
        # For each (date, person)...
        while idx < idx_max:
    
            # Lookup date, person.
            date = dates[idx]
            pid = pids[idx]
    
            # If person count went from 0 to 1, increment unique person count.
            pid_cts[pid] += 1
            if pid_cts[pid] == 1:
                uct += 1
    
            # For past dates outside of window...
            # Note: If window=3, it includes day0,day1,day2.
            while (date - date_min + 1) > window:
    
                # If person count went from 1 to 0, decrement unique person count.
                pid_cts[pid_min] -= 1
                if pid_cts[pid_min] == 0:
                    uct -= 1
                idx_min += 1
                date_min = dates[idx_min]
                pid_min = pids[idx_min]
    
            # Record unique person count.
            ucts[idx] = uct
            idx += 1
    
        return ucts
    
    In [16]:

    %%timeit
    windowed_nunique_corrected(
        dates=df['DateEpoch'].values,
        pids=df['PersonId'].values,
        window=window)
    
    98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [17]:

    # Check accuracy of results.
    test = windowed_nunique_corrected(
        dates=df['DateEpoch'].values,
        pids=df['PersonId'].values,
        window=window)
    assert all(ref == test)
    

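Comments:

Sorry if this is a dumb comment, but wouldn't a rolling count of unique IDs over 365 just be something like df.rolling(365)['PersonId'].apply(lambda x: len(set(x)))?

@WoodyPride Thanks, that is what I did in "Speed test 2", but with a just-in-time compiler (see the function nunique). The calculation is correct but inefficient, because set operates on every element in the window each time the window calculation is performed. It is more efficient to keep a running tally of each element, as in "Speed test 3" (about 4000x more efficient on the example data, comparing "Speed test 2" with "Speed test 3"). However, my implementation windowed_nunique is off by 1, and I was wondering whether someone could help find the problem.

Got it! I guess I didn't read the question closely enough.

Thanks, but I am looking for an efficient rolling unique count. The output must have the same len as the input (in the example, len(df) == len(ref) == 9181) and be faster than "Speed test 2".

@samuelharold, what do you mean by rolling unique count? What is your rolling period within a year?

@djk47463 An example of a rolling unique count (similar to the function nunique defined in "Speed test 2" above): df.rolling(window='365D', on='Date').apply(lambda arr: len(set(arr))). The challenge is making it efficient (compare "Speed test 2" with "Speed test 3"). I nearly succeeded, but my solution windowed_nunique is off by 1, and I was wondering whether someone could find my error.

@djk47463 windowed_nunique is off by 1 at record 78 (from Out[13] in the example), and the corresponding date is '2010-01-19', before any extra leap days.

This is close to "Speed test 2" in speed, but np.unique is operating on every element in the window. Keeping a running count of each element, as in "Speed test 3", is more efficient. (See my comment to Woody Pride.) However, my implementation of the running count, windowed_nunique, is off by 1. Any other ideas? Thanks.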
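
The "running tally" idea discussed in these comments can also be sketched with the standard library, which removes the need for non-negative integer IDs. The function name `rolling_nunique_counter` and the toy data are hypothetical; the sketch assumes integer day numbers sorted in ascending order and a window that covers `window` whole days including the current one. Each record is appended once and evicted at most once, so the work is roughly linear in the number of records, whereas building a set for every window touches every element of every window.

```python
from collections import Counter, deque


def rolling_nunique_counter(dates, ids, window):
    """Rolling unique count with a running tally; ids may be any hashable values."""
    in_window = deque()  # (date, id) pairs currently inside the window
    tally = Counter()    # id -> number of records for that id inside the window
    out = []
    for date, pid in zip(dates, ids):
        in_window.append((date, pid))
        tally[pid] += 1

        # Evict records that are no longer within the last `window` days.
        while date - in_window[0][0] + 1 > window:
            _, old = in_window.popleft()
            tally[old] -= 1
            if tally[old] == 0:
                del tally[old]

        # The number of keys left in the tally is the unique count.
        out.append(len(tally))
    return out


# Toy data (hypothetical) with string ids.
dates = [0, 0, 1, 2, 3, 4]
ids = ["a", "b", "a", "c", "a", "c"]
print(rolling_nunique_counter(dates, ids, window=3))  # [1, 2, 2, 3, 2, 2]
```

In pure Python this will not match the speed of a compiled loop, but it shows the same bookkeeping without a fixed-size counter array.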