Python: efficiently choosing r outcomes out of n possibilities in pandas

python, pandas, discrete-mathematics, apriori

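I have 50 years of data. I need to select a combination of 30 years from it such that the corresponding values reach a specific threshold, but the number of possible combinations, 50C30, is 47129212243960. How can I compute this efficiently?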
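To get a feel for the scale, the count can be checked with the standard library. This is only a sketch: `math.comb` needs Python 3.8 or later, and the rate of one million checks per second is a made-up figure chosen purely for illustration.

```python
import math

n, r = 50, 30

# Number of ways to choose r items out of n: n! / (r! * (n - r)!)
total = math.comb(n, r)
print(total)

# Hypothetical throughput, assumed only to get a sense of scale.
checks_per_second = 1_000_000
seconds_per_year = 365 * 24 * 3600
years_needed = total / checks_per_second / seconds_per_year
print(f"about {years_needed:.1f} years at {checks_per_second:,} checks per second")
```

At that assumed rate a full enumeration would take roughly a year and a half, which is why stopping at the first matching combination matters.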

          Prs_100      
  Yrs                                                 
  2012  425.189729  
  2013  256.382494  
  2014  363.309507  
  2015  578.728535  
  2016  309.311562  
  2017  476.388839  
  2018  441.479570  
  2019  342.267756  
  2020  388.133403  
  2021  405.007245  
  2022  316.108551  
  2023  392.193322  
  2024  296.545395  
  2025  467.388190  
  2026  644.588971  
  2027  301.086631  
  2028  478.492618  
  2029  435.868944  
  2030  467.464995  
  2031  323.465049  
  2032  391.201598  
  2033  548.911349  
  2034  381.252838  
  2035  451.175339  
  2036  281.921215  
  2037  403.840004  
  2038  460.514250  
  2039  409.134409  
  2040  312.182576 
  2041  320.246886  
  2042  290.163454  
  2043  381.432168  
  2044  259.228592  
  2045  393.841815  
  2046  342.999972  
  2047  337.491898  
  2048  486.139010  
  2049  318.278012  
  2050  385.919542  
  2051  309.472316  
  2052  307.756455  
  2053  338.596315  
  2054  322.508536  
  2055  385.428138  
  2056  339.379743  
  2057  420.428529  
  2058  417.143175 
  2059  361.643381  
  2060  459.861622  
  2061  374.359335

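I only need one combination of 30 years whose Prs_100 mean reaches a certain threshold; once I have it, I can stop computing further results. While searching, I came across an approach based on the apriori algorithm, but I could not work out how to compute the support values in it.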

I tried Python's combinations method:

 list(combinations(dftest.index,30))

but it does not work in this case.
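The likely reason is that `list(...)` tries to hold every one of the 47129212243960 tuples in memory at once. `itertools.combinations` itself is lazy, so it can be consumed one combination at a time. A minimal sketch, assuming the index holds the years 2012 to 2061 (`years` here is a stand-in for `dftest.index`):

```python
from itertools import combinations, islice

# Stand-in for dftest.index (assumption: the index holds the years 2012..2061).
years = range(2012, 2062)

# combinations() is lazy: nothing is generated until it is iterated over.
combos = combinations(years, 30)

# Look at the first three combinations only, without building the full list.
first_three = list(islice(combos, 3))
for combo in first_three:
    print(combo[0], "...", combo[-3:])
```

Iterating like this keeps memory use flat, although the number of combinations to walk through is still the same.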

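Expected result: suppose I find a set of 30 years whose Prs_100 mean is greater than 460. I would then save those 30 years as the output, which is the result I want. How can I do this?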

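You can use numpy's:

My previous answer was incorrect, so I am going to try again. Rereading your question, it looks like you want a set of 30 years whose Prs_100 values have a mean greater than 460.

The code below can do that, but when I ran it I started to run into trouble once the mean threshold went above 415.

After running it, you get a list of years, years_list, and a list of values, Prs_100_list, that meet the criterion (mean > 460 in the question; the code below tests 391 < mean < 400, the range requested in the comments).

Here is my code; I hope it is in the area you are looking for.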

import numpy as np
import pandas as pd
from itertools import combinations

# array of values to work with, corresponding to the years 2012 - 2061
prs_100 = np.array([
       425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
       476.388839, 441.47957 , 342.267756, 388.133403, 405.007245,
       316.108551, 392.193322, 296.545395, 467.38819 , 644.588971,
       301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
       391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
       403.840004, 460.51425 , 409.134409, 312.182576, 320.246886,
       290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
       337.491898, 486.13901 , 318.278012, 385.919542, 309.472316,
       307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
       420.428529, 417.143175, 361.643381, 459.861622, 374.359335])

# build dataframe with prs_100 as index and years as values, so that  years can be returned easily.
df = pd.DataFrame(list(range(2012, 2062)), index=prs_100, columns=['years'])

df.index.name = 'Prs_100'

# set combination parameters
r =  30
n = len(prs_100)

Prs_100_list = []
years_list = []
count = 0    

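for idx in combinations(range(n), r):
    idx = list(idx)  # positions of the chosen years
    mean = prs_100[idx].mean()  # compute the mean once per combination
    if 391 < mean < 400:
        Prs_100_list.append(prs_100[idx].tolist())
        years_list.append(df['years'].iloc[idx].tolist())
        # stop once enough matching combinations have been collected
        count += 1
        if count > 100:
            break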
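If a single matching set is enough, drawing random subsets is an alternative to walking the combinations in order. The sketch below is not the answer's method: it swaps in random sampling, uses only the standard library, and the function name `find_subset`, the seed, and the `max_tries` limit are all illustrative assumptions. `random.sample` draws without replacement, so no year can appear twice.

```python
import random
from statistics import mean

prs_100 = [
    425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
    476.388839, 441.479570, 342.267756, 388.133403, 405.007245,
    316.108551, 392.193322, 296.545395, 467.388190, 644.588971,
    301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
    391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
    403.840004, 460.514250, 409.134409, 312.182576, 320.246886,
    290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
    337.491898, 486.139010, 318.278012, 385.919542, 309.472316,
    307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
    420.428529, 417.143175, 361.643381, 459.861622, 374.359335,
]
years = list(range(2012, 2062))  # 2012..2061, one year per value


def find_subset(values, labels, r, low, high, max_tries=10_000, seed=0):
    """Draw random r-subsets until one has a mean strictly between low and high."""
    rng = random.Random(seed)
    positions = range(len(values))
    for _ in range(max_tries):
        picked = sorted(rng.sample(positions, r))  # r distinct positions
        m = mean(values[i] for i in picked)
        if low < m < high:
            return [labels[i] for i in picked], m
    return None, None  # nothing found within max_tries


chosen_years, chosen_mean = find_subset(prs_100, years, r=30, low=391, high=400)
print(chosen_years)
print(chosen_mean)
```

This works well when the target range sits close to the overall mean of the data, as 391 to 400 does here. For a range near the extremes, random draws would rarely hit it and a sorted or greedy construction would be a better starting point.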


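I think your expected result is a bit unclear. Can you take the n largest values? In this case n is 50, which is the total number of years.

Here is a question that may help you understand the problem more clearly, and this one too.

One approach is to call np.random.choice repeatedly until the subset exceeds the threshold, although it may never reach it.

The simplest solution is to use the 30 largest years, right?

Actually, the Prs_100 values do not depend on the year, so the right combination could turn up among the first 5 combinations or among the last 5. Calling np.random.choice a few times may or may not converge to a result.

The problem with this approach is that a random choice can generate the same number twice, which renders the whole subset useless.

@Bing you can use replace=False. The mean of the 30 highest values is 435.88, so you will not find 30 years with a mean greater than 460.

If you sort the numpy array in descending order you will get results quickly: np.sort(prs_100)[::-1]

OK, this works, but it still takes a lot of time on the GPU. Is there any way to write it in an optimized form so that it runs faster on the GPU?

I think that for the given dataset a mean of 460 is impossible; the highest mean is 435. With a threshold of 415, I can find values quickly if I sort the array in descending order. Above 418 I cannot get a result. I think that is just how the data is: you are looking for statistical outliers in a set of 47 trillion combinations.

Actually, there was a problem with the data I posted. I need a mean in the range 391-400 for the given data, and coefficients of variation (15, 25 and 35):

 for p in combinations(prs_100, r):
     if np.mean(p) > 391 and np.mean(p) < 400:
         Prs_100_list = p
         years_list = df.loc[list(p), 'years'].values.tolist()
         break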
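Two points from the comments can be checked before any search is run: which means are achievable at all, and what the coefficient of variation of a candidate subset is. The sketch below uses only the standard library. The helper names are illustrative, and defining the coefficient of variation as the population standard deviation divided by the mean, in percent, is an assumption, since the comments do not say which definition is intended.

```python
from statistics import mean, pstdev

prs_100 = [
    425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
    476.388839, 441.479570, 342.267756, 388.133403, 405.007245,
    316.108551, 392.193322, 296.545395, 467.388190, 644.588971,
    301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
    391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
    403.840004, 460.514250, 409.134409, 312.182576, 320.246886,
    290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
    337.491898, 486.139010, 318.278012, 385.919542, 309.472316,
    307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
    420.428529, 417.143175, 361.643381, 459.861622, 374.359335,
]
r = 30

ordered = sorted(prs_100, reverse=True)
max_mean = mean(ordered[:r])   # mean of the r largest values
min_mean = mean(ordered[-r:])  # mean of the r smallest values
print(f"achievable means lie between {min_mean:.2f} and {max_mean:.2f}")


def is_feasible(threshold):
    """A mean above `threshold` is possible only if the best subset exceeds it."""
    return max_mean > threshold


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean (assumed definition)."""
    return 100 * pstdev(values) / mean(values)


print(is_feasible(460), is_feasible(415))
print(f"CV of the {r} largest values: {coefficient_of_variation(ordered[:r]):.1f}%")
```

No subset of 30 values can have a higher mean than the 30 largest, so a threshold above `max_mean` can be rejected without enumerating anything.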
