Python Pandas: the most efficient way to build a dictionary from DataFrame columns

My DataFrame looks like this:

import pandas as pd
import numpy as np
import random

labels = ["c1","c2","c3"]
c1 = ["one","one","one","two","two","three","three","three","three"]
c2 = [random.random() for i in range(len(c1))]
c3 = ["alpha","beta","gamma","alpha","gamma","alpha","beta","gamma","zeta"]
DF = pd.DataFrame(np.array([c1,c2,c3])).T
DF.columns = labels
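
A side note on the construction above: stacking the three lists with np.array coerces every value to a string, so c2 ends up holding text rather than floats. A minimal sketch of an alternative, assuming you want c2 to stay numeric, is to build the frame from a dict of columns. The name DF_numeric is made up for illustration and is not part of the original post.

```python
import random

import numpy as np
import pandas as pd

c1 = ["one", "one", "one", "two", "two", "three", "three", "three", "three"]
c2 = [random.random() for _ in range(len(c1))]
c3 = ["alpha", "beta", "gamma", "alpha", "gamma", "alpha", "beta", "gamma", "zeta"]

# The question's construction: the mixed lists become one string array,
# so every column, including c2, holds str values.
DF = pd.DataFrame(np.array([c1, c2, c3])).T
DF.columns = ["c1", "c2", "c3"]

# Hypothetical alternative: a dict of columns keeps one dtype per column.
DF_numeric = pd.DataFrame({"c1": c1, "c2": c2, "c3": c3})

print(isinstance(DF["c2"].iloc[0], str))  # True
print(DF_numeric["c2"].dtype)             # float64
```

Either frame works with the dictionary-building approaches discussed here; the only difference is whether the inner values come out as strings or floats.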

Printed, DF looks like this:

      c1               c2     c3
0    one   0.440958516531  alpha
1    one   0.476439953723   beta
2    one   0.254235673552  gamma
3    two   0.882724336464  alpha
4    two    0.79817899139  gamma
5  three   0.677464637887  alpha
6  three   0.292927670096   beta
7  three  0.0971956881825  gamma
8  three   0.993934915508   zeta

The only way I can think of to build the dictionary is:

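D_greek_value = {}

# One full pass over the DataFrame for every distinct value in c3
for greek in set(DF["c3"]):
    D_c1_c2 = {}
    for i in range(DF.shape[0]):
        row = DF.iloc[i, :]
        # Access by column label; integer keys such as row[2] on a
        # label-indexed Series are deprecated in recent pandas
        if row["c3"] == greek:
            D_c1_c2[row["c1"]] = row["c2"]
    D_greek_value[greek] = D_c1_c2
D_greek_value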

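The resulting dictionary looks like this:

{'alpha': {'one': '0.67919712421',
  'three': '0.67171020684',
  'two': '0.571150669821'},
 'beta': {'one': '0.895090207979', 'three': '0.489490074662'},
 'gamma': {'one': '0.964777504708',
  'three': '0.134397632659',
  'two': '0.10302290374'},
 'zeta': {'three': '0.0204226923557'}}

I don't want to assume that c1 comes in blocks (with all the "one" rows together every time). I'm running this on a CSV of a few hundred MB, and I feel like I'm doing it wrong. If you have any ideas, please help.

You are iterating over the DataFrame many times, once for each unique Greek letter. It is better to iterate only once.

Since you need a dictionary of dictionaries, you can use collections.defaultdict with dict as the default constructor for the nested dicts:

from collections import defaultdict

result = defaultdict(dict)
# itertuples yields (index, c1, c2, c3) for each row
for dx, num_word, val, greek in DF.itertuples():
    result[greek][num_word] = val

Or use a regular dict and call setdefault to create the nested dicts:

result = {}
for dx, num_word, val, greek in DF.itertuples():
    result.setdefault(greek, {})[num_word] = val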
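
To check that a single pass gives the same answer as the nested loop from the question, here is a small self-contained sketch. The sample frame and the helper names nested_loop and single_pass are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Small sample frame in the question's shape (values are illustrative).
DF = pd.DataFrame({
    "c1": ["one", "one", "two", "three", "three"],
    "c2": [0.1, 0.2, 0.3, 0.4, 0.5],
    "c3": ["alpha", "beta", "alpha", "alpha", "zeta"],
})

def nested_loop(frame):
    """The question's approach: one full scan of the frame per c3 value."""
    out = {}
    for greek in set(frame["c3"]):
        inner = {}
        for i in range(frame.shape[0]):
            row = frame.iloc[i]
            if row["c3"] == greek:
                inner[row["c1"]] = row["c2"]
        out[greek] = inner
    return out

def single_pass(frame):
    """One scan in total: each row goes straight into its nested dict."""
    out = defaultdict(dict)
    # index=False drops the row index, name=None yields plain tuples.
    for num_word, val, greek in frame.itertuples(index=False, name=None):
        out[greek][num_word] = val
    return dict(out)

assert nested_loop(DF) == single_pass(DF)
print(single_pass(DF))
```

The nested loop does one scan per distinct c3 value, while the single pass touches each row once, which is where the saving on a large CSV would come from.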

IIUC, you can take advantage of groupby to handle most of the work:

>>> result = df.groupby("c3")[["c1","c2"]].apply(lambda x: dict(x.values)).to_dict()
>>> pprint.pprint(result)
{'alpha': {'one': 0.440958516531,
           'three': 0.677464637887,
           'two': 0.8827243364640001},
 'beta': {'one': 0.47643995372299996, 'three': 0.29292767009599996},
 'gamma': {'one': 0.254235673552,
           'three': 0.0971956881825,
           'two': 0.79817899139},
 'zeta': {'three': 0.993934915508}}
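
The same grouping can also be written without apply, by iterating over the groups directly. This is only a sketch of an equivalent formulation, not part of the original answer, and the sample frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "c1": ["one", "one", "two", "three", "three"],
    "c2": [0.1, 0.2, 0.3, 0.4, 0.5],
    "c3": ["alpha", "beta", "alpha", "alpha", "zeta"],
})

# Iterating over a groupby yields (key, sub-frame) pairs; zip pairs up the
# c1 and c2 values of each sub-frame to form the inner dictionary.
result = {
    greek: dict(zip(sub["c1"], sub["c2"]))
    for greek, sub in df.groupby("c3")
}
print(result)
```

If a (c3, c1) pair occurs more than once, the last row wins, which matches the behavior of the loop-based versions.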

Some explanation: first we group by c3 and select the c1 and c2 columns. This gives us the groups that we want to turn into dictionaries:

>>> grouped = df.groupby("c3")[["c1", "c2"]]
>>> grouped.apply(lambda x: print(x,"\n","--")) # just for display purposes
      c1                   c2
0    one    0.679926178687387
3    two  0.11495090934413166
5  three   0.7458197179794177 
 --
      c1                   c2
0    one    0.679926178687387
3    two  0.11495090934413166
5  three   0.7458197179794177 
 --
      c1                   c2
1    one  0.12943266757277916
6  three  0.28944292691097817 
 --
      c1                   c2
2    one  0.36642834809341274
4    two   0.5690944224514624
7  three   0.7018221838129789 
 --
      c1                  c2
8  three  0.7195852795555373 
 --
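
Note that the first group is printed twice in the output above; as far as I know, older pandas versions called the applied function an extra time on the first group. A hedged alternative for inspecting the groups is to loop over the groupby object, which visits each group once. The sample frame and the list seen are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "c1": ["one", "one", "two", "three", "three"],
    "c2": [0.1, 0.2, 0.3, 0.4, 0.5],
    "c3": ["alpha", "beta", "alpha", "alpha", "zeta"],
})

seen = []
# Each iteration yields the group key and the matching sub-frame,
# restricted to the selected columns c1 and c2.
for greek, sub in df.groupby("c3")[["c1", "c2"]]:
    print(greek)
    print(sub, "\n--")
    seen.append(greek)
```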

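Given any one of these sub-frames, say the next-to-last one, we need a way to turn it into a dictionary. For example:

>>> d3
      c1        c2
2    one  0.366428
4    two  0.569094
7  three  0.701822

If we try dict or to_dict, we don't get what we want, because the index and column labels get in the way:

>>> dict(d3)
{'c1': 2      one
4      two
7    three
Name: c1, dtype: object, 'c2': 2    0.366428
4    0.569094
7    0.701822
Name: c2, dtype: float64}
>>> d3.to_dict()
{'c1': {2: 'one', 4: 'two', 7: 'three'}, 'c2': {2: 0.36642834809341279, 4: 0.56909442245146236, 7: 0.70182218381297889}}

But we can get around this by using .values to drop down to the underlying data, and then passing that to dict: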

>>> d3.values
array([['one', 0.3664283480934128],
       ['two', 0.5690944224514624],
       ['three', 0.7018221838129789]], dtype=object)
>>> dict(d3.values)
{'three': 0.7018221838129789, 'one': 0.3664283480934128, 'two': 0.5690944224514624}
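
Why this works: dict accepts any iterable of two-item sequences, and each row of a two-column array is exactly such a pair. A minimal sketch with made-up values:

```python
import numpy as np

pairs = np.array(
    [["one", 0.25], ["two", 0.5], ["three", 0.75]],
    dtype=object,
)

# Each row is a length-2 sequence, so dict() reads it as (key, value).
as_dict = dict(pairs)
print(as_dict)

# The same idea with plain Python lists, for comparison.
assert as_dict == dict([["one", 0.25], ["two", 0.5], ["three", 0.75]])
```

This only works for exactly two columns; rows with three or more items make dict raise a ValueError.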

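So if we apply this, we get a Series whose index holds the c3 keys we want and whose values are dictionaries, which we can convert into a dictionary of dictionaries using .to_dict():

>>> result = df.groupby("c3")[["c1", "c2"]].apply(lambda x: dict(x.values))
>>> result
c3
alpha    {'three': '0.7458197179794177', 'one': '0.6799...
beta     {'one': '0.12943266757277916', 'three': '0.289...
gamma    {'three': '0.7018221838129789', 'one': '0.3664...
zeta                       {'three': '0.7195852795555373'}
dtype: object
>>> result.to_dict()
{'zeta': {'three': '0.7195852795555373'}, 'gamma': {'three': '0.7018221838129789', 'one': '0.36642834809341274', 'two': '0.5690944224514624'}, 'beta': {'one': '0.12943266757277916', 'three': '0.28944292691097817'}, 'alpha': {'three': '0.7458197179794177', 'one': '0.679926178687387', 'two': '0.11495090934413166'}}

Very nice. I wonder whether this is faster than what I posted. I would expect groupby to be very fast, but the lambda might slow it down. I'm too lazy to time it, though.

@StevenRumbalski: Me too. :-) I tried to see whether I could get the same result using only vectorized operations, but bounced off; someone else may have something cleverer. But I think you've put your finger on the big problem (too many iterations), and everything beyond that is secondary by comparison.

@DSM I know how to use lambda functions for sorting, but what is actually happening between ".apply" and ".to_dict()"?

@O.rka: I've added some explanation that breaks it down step by step.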
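
On the timing question raised in the comments, here is a rough sketch for measuring a single-pass loop against a groupby-based version yourself. The groupby version is written with a comprehension rather than apply. The frame size, the random data and the helper names are assumptions, and no timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # illustrative size, not taken from the question
df = pd.DataFrame({
    "c1": rng.choice(["one", "two", "three"], size=n),
    "c2": rng.random(n),
    "c3": rng.choice(["alpha", "beta", "gamma", "zeta"], size=n),
})

def with_itertuples(frame):
    """Single pass over the rows, filling nested dicts as we go."""
    out = defaultdict(dict)
    for num_word, val, greek in frame.itertuples(index=False, name=None):
        out[greek][num_word] = val
    return dict(out)

def with_groupby(frame):
    """Let groupby split the rows, then build one inner dict per group."""
    return {
        greek: dict(zip(sub["c1"], sub["c2"]))
        for greek, sub in frame.groupby("c3")
    }

# In both versions a later row overwrites an earlier one for a repeated
# (c3, c1) pair, so the two results should match.
assert with_itertuples(df) == with_groupby(df)

for fn in (with_itertuples, with_groupby):
    seconds = timeit.timeit(lambda: fn(df), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```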