Python: creating combinations and maintaining counts of combinations in a large dataset

python pandas

I have a DataFrame that looks like this:

memberid    created    firstencodedid    questionid
123       <some date>    <some ID>          4fc
123       <some date>    <some ID>          daf
123       <some date>    <some ID>          f82
123       <some date>    <some ID>          cfd
123       <some date>    <some ID>          730

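So, as a first step, I tried to generate all the question pairs. My RAM (16 GB) clearly cannot hold that much data, so I thought of writing the question pairs to a file with the following code: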

import csv
import itertools
import time

start_time = time.time()

def generate_combination_of_questions(dataframe):
    # Build every pair of questionid values within each memberid group.
    return [
        pair
        for _, questions in dataframe.groupby('memberid')
        for pair in itertools.combinations(questions.questionid, 2)
    ]

# Text mode with newline='' is what the csv module expects on Python 3.
with open('file_name', 'w', newline='') as f:
    writer = csv.writer(f)
    for memberid in IncorrectQuestions['memberid'].unique():
        for pair in generate_combination_of_questions(IncorrectQuestions[IncorrectQuestions['memberid'] == memberid]):
            writer.writerow(pair)

print("--- %s seconds ---" % (time.time() - start_time))

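This code works, but it produced a 210 GB file and then I ran out of disk space. The count of each pair was supposed to be computed once the file had been written successfully, but it never got that far.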

I tried another approach, building an OrderedCounter with the following code:

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    # Counter that remembers the order in which pairs were first seen.
    pass

# generate_combination_of_questions is the function defined above.
q1AndQ2Occurrences = OrderedCounter()
for memberid in IncorrectQuestions['memberid'].unique():
    subset_IncorrectQuestions = IncorrectQuestions[IncorrectQuestions['memberid'] == memberid]
    q1AndQ2Occurrences = q1AndQ2Occurrences + OrderedCounter(generate_combination_of_questions(subset_IncorrectQuestions))

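This approach is very slow, and I am fairly sure I will run out of memory at some point.

Given this huge dataset, what is the best way to create these question pairs and maintain a count for each pair?

Any help would be much appreciated.

TIA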

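EDIT

I don't want to keep the whole dataset in memory, but I do want the count of every combination across memberid values. Some combinations will repeat across memberid values, and I want to add those counts together.

@Boud's solution tells me the number of combinations for each memberid, but not which combination has which count.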

Why create a huge amount of data just to count it, instead of applying a combinatorics function?

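from scipy.special import comb  # scipy.misc.comb is no longer available in current SciPy releases

# Number of questions per member, then "n choose 2" for each member.
N = df.groupby('memberid').questionid.count()
N.apply(lambda x: comb(x, 2))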
Out[10]: 
          questionid
memberid            
123             10.0
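
The same per-member count can be obtained without SciPy, since the number of pairs among n questions is n(n-1)/2. The following is only a minimal sketch on the small example frame used in this thread; the names pairs_per_member and total_pairs are illustrative and not from the original posts. Summing the per-member values gives a rough idea of how many rows a full pair dump would contain before anything is written to disk.

```python
import pandas as pd

# Small example frame (same shape as the one used in the thread).
df = pd.DataFrame({
    "memberid": [123, 123, 123, 456, 456],
    "questionid": ["4fc", "daf", "f82", "cfd", "730"],
})

# Number of questions answered by each member.
n = df.groupby("memberid")["questionid"].count()

# "n choose 2" in pure integer arithmetic: n * (n - 1) / 2.
pairs_per_member = n * (n - 1) // 2

# Total number of pairs a full enumeration would have to produce.
total_pairs = int(pairs_per_member.sum())
print(pairs_per_member)
print(total_pairs)
```

This only sizes the problem; it does not say which pairs occur or how often.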

I agree with @Boud in questioning the need to store the lists in memory. But if you must, consider a DataFrame like this:

from itertools import combinations

import pandas as pd

d = {'memberid': [123,123,123,456,456], 'questionid': ['4fc', 'daf', 'f82', 'cfd', '730']}
df = pd.DataFrame(d)


    memberid    questionid
0   123         4fc
1   123         daf
2   123         f82
3   456         cfd
4   456         730

You can do

df.groupby('memberid').apply(lambda x: list(combinations(x['questionid'], 2)))

which gives you

memberid
123    [(4fc, daf), (4fc, f82), (daf, f82)]
456                            [(cfd, 730)]

EDIT:

You can get the number of combinations for each memberid like this:

df.groupby('memberid').apply(lambda x: len(list(combinations(x['questionid'], 2))))

It returns

memberid
123        3
456        1
dtype: int64
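
The question's edit asks for something slightly different: how often each individual pair occurs across all memberid values. One possible way to get that without materialising every pair is to feed the pairs of one member at a time into a single collections.Counter. This is a hedged sketch, not the answerer's method. It assumes that pair order does not matter and that a question repeated within one member should count once, which is why each member's questions are de-duplicated and sorted first. The example data and the name pair_counts are made up for illustration.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Illustrative data: both members share the questions 4fc and f82.
df = pd.DataFrame({
    "memberid": [123, 123, 123, 456, 456, 456],
    "questionid": ["4fc", "daf", "f82", "f82", "4fc", "730"],
})

pair_counts = Counter()
for _, questions in df.groupby("memberid")["questionid"]:
    # Assumption: (a, b) and (b, a) are the same pair, and duplicates
    # within one member count once, so de-duplicate and sort first.
    unique_sorted = sorted(set(questions))
    # combinations() is lazy; Counter.update consumes it pair by pair,
    # so no per-member list of pairs is ever built.
    pair_counts.update(combinations(unique_sorted, 2))

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Memory use here grows with the number of distinct pairs rather than with the total number of pairs generated, which may or may not be enough for a dataset of the size described in the question.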

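Or just generate the counts and/or pairs on demand. I'm struggling to think of a reason why anyone would want to hold the whole lot in memory at once…

@Boud: Thanks for your answer. But this code doesn't tell me what the question pairs are. It gives the number of combinations for each memberid. I want to know the count of each combination. Also, some combinations will appear across various memberid values, and I want to add those counts up as well. Maybe I am headed in the wrong direction, in which case I'd like to have my arm twisted and be shown the right direction.

@JonClements OK, seeing the latest comment, we're in XY problem territory. I'll leave my answer up anyway, because I think it is worth a reminder that, in the general case, combinatorial results should be computed from the math, not from the data.

I have added more clarification to the question. It may help.

I doubt that memory can hold the output of the whole command, as I expect there will be hundreds of millions of question pairs. But I can write a generator on top of it and store the results for each individual memberid in a flat file. Then I can group those pairs and add up the counts of the same pair across memberid values. Once I have implemented it, I'll accept your answer. Thank you very much for your help.

So I ran the following code:

IncorrectQuestions[IncorrectQuestions['memberid']==123].groupby('memberid').apply(lambda x: len(list(itertools.combinations(x['questionid'], 2))))

and it gave the following result:

memberid
123    561
dtype: int64

Unfortunately, it still doesn't give the combinations and counts I expected. What I expected is:

memberid    pair          count
123         [4fc, daf]    1
123         [4fc, f82]    1
123         [daf, f82]    1
456         [4fc, f82]    1
456         [daf, f82]    1

I'm surprised that the count using len doesn't work. It works here and gives the output I printed in my answer.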
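
The plan described in the comments (a generator over the pairs, a flat file, then grouping identical pairs across memberid values) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code anyone in the thread ran: iter_pairs, the pairs.csv file in a temporary directory, the example data, and the tiny chunksize are all hypothetical choices. The flat file can still become very large, since it holds one row per generated pair.

```python
import csv
import os
import tempfile
from collections import Counter
from itertools import combinations

import pandas as pd


def iter_pairs(frame):
    """Lazily yield (memberid, q1, q2), one pair at a time."""
    for memberid, questions in frame.groupby("memberid")["questionid"]:
        # Assumption: pair order is irrelevant, so sort for a canonical form.
        for q1, q2 in combinations(sorted(set(questions)), 2):
            yield memberid, q1, q2


# Illustrative data: both members answered the same three questions.
df = pd.DataFrame({
    "memberid": [123, 123, 123, 456, 456, 456],
    "questionid": ["4fc", "daf", "f82", "4fc", "daf", "f82"],
})

# Step 1: stream the pairs to a flat file without building a list.
path = os.path.join(tempfile.mkdtemp(), "pairs.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["memberid", "q1", "q2"])
    writer.writerows(iter_pairs(df))

# Step 2: read the file back in chunks and add up per-pair counts.
# chunksize=2 is deliberately tiny here; a real run would use a far larger value.
totals = Counter()
for chunk in pd.read_csv(path, chunksize=2, dtype=str):
    sizes = chunk.groupby(["q1", "q2"]).size()
    # Updating a Counter from a mapping adds the counts to the running totals.
    totals.update(sizes.to_dict())

print(totals)
```

If the distinct pairs fit in memory, counting them directly while generating avoids the intermediate file entirely; the file-based variant mainly helps when the pairs need to be kept per memberid for later inspection.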