Converting a nested dataframe with sorted unique values into a nested dictionary in Python
I am trying to take a nested dataframe and convert it into a nested dictionary.
Here is my original dataframe, with the following unique values:
Input: df.head(5)
Output:
  reviewerName                                 title  reviewerRatings
0      Charles      Harry Potter Book Seven News:...              3.0
1    Katherine  Harry Potter Boxed Set, Books...                  5.0
2         Lora      Harry Potter and the Sorcerer...              5.0
3         Cait      Harry Potter and the Half-Blo...              5.0
4        Diane      Harry Potter and the Order of...              5.0
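Input: len(df['reviewerName'].unique())
Output: 66130
Each of the 66130 unique values appears multiple times (for example, "Charles" appears 3 times). I therefore used the 66130 unique "reviewerName" values as the keys of a new nested dataframe, and then used "title" and "reviewerRatings" as a second layer of key:value pairs in the same nested dataframe.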
Input: df = df.set_index(['reviewerName', 'title']).sort_index()
Output:
                                                reviewerRatings
reviewerName title
Charles      Harry Potter Book Seven News:...               3.0
             Harry Potter and the Half-Blo...               3.5
             Harry Potter and the Order of...               4.0
Katherine    Harry Potter Boxed Set, Books...               5.0
             Harry Potter and the Half-Blo...               2.5
             Harry Potter and the Order of...               5.0
...
230898 rows x 1 columns
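As a follow-up, I tried to convert the nested dataframe into a nested dictionary.
The new nested dataframe above shows "reviewerRatings" in the first header row (third column) and "reviewerName" and "title" in the second header row (first and second columns). When I run the df.to_dict() method below, the output has the form {reviewerRatingsIndexName: {(reviewerName, title): reviewerRatings}}.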
Input: df.to_dict()
Output:
{'reviewerRatings':
    {
        ('Charles', 'Harry Potter Book Seven News:...'): 3.0,
        ('Charles', 'Harry Potter and the Half-Blo...'): 3.5,
        ('Charles', 'Harry Potter and the Order of...'): 4.0,
        ('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0,
        ('Katherine', 'Harry Potter and the Half-Blo...'): 2.5,
        ('Katherine', 'Harry Potter and the Order of...'): 5.0,
        ...}
}
What I actually want is output of the form {reviewerName: {title: reviewerRating}}, which is exactly how I sorted the nested dataframe:
{'Charles':
    {'Harry Potter Book Seven News:...': 3.0,
     'Harry Potter and the Half-Blo...': 3.5,
     'Harry Potter and the Order of...': 4.0},
 'Katherine':
    {'Harry Potter Boxed Set, Books...': 5.0,
     'Harry Potter and the Half-Blo...': 2.5,
     'Harry Potter and the Order of...': 5.0},
 ...}
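Is there a way to manipulate the nested dataframe or the nested dictionary so that running the df.to_dict() method produces {reviewerName: {title: reviewerRating}}?
Thanks.

Use groupby with a lambda function to build a dictionary for each reviewerName, then convert the resulting Series with to_dict.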
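A minimal sketch of that idea, assuming reviewerName, title and reviewerRatings are ordinary columns. The three sample rows and the name `d` are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Illustrative sample data; only the column names come from the question.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book A', 'Book B', 'Book A'],
    'reviewerRatings': [3.0, 3.5, 5.0],
})

# For each reviewer, pair titles with ratings in a dict. The apply() call
# yields a Series of dicts indexed by reviewerName, and the final to_dict()
# turns that Series into a dict of dicts.
d = (df.groupby('reviewerName')[['title', 'reviewerRatings']]
       .apply(lambda g: dict(zip(g['title'], g['reviewerRatings'])))
       .to_dict())

print(d)
```

Selecting the two columns with a list (double brackets) before apply keeps the grouping column out of each group.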
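There are two ways. You can use groupby with to_dict, or iterate over the rows with collections.defaultdict. It is worth noting that the latter is not necessarily less efficient.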
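groupby + to_dict
Construct a series from each groupby object and convert it to a dictionary, which gives a series of dictionary values. Finally, convert this to a dictionary of dictionaries via another to_dict call.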
res = df.groupby('reviewerName')\
        .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
        .to_dict()
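The snippet above expects reviewerName and title to be ordinary columns. If the dataframe has already been indexed with set_index(['reviewerName', 'title']) as in the question, one possible variant is to group on the index level instead. This is a sketch under that assumption, with made-up sample rows, and is not taken from the original answers.

```python
import pandas as pd

# Illustrative sample data, indexed the same way as in the question.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book A', 'Book B', 'Book A'],
    'reviewerRatings': [3.0, 3.5, 5.0],
}).set_index(['reviewerName', 'title']).sort_index()

# Group the ratings Series by the outer index level. Each group is a Series
# still indexed by (reviewerName, title), so drop the outer level to leave
# title as the only key before converting to a dict.
res = {
    name: group.droplevel('reviewerName').to_dict()
    for name, group in df['reviewerRatings'].groupby(level='reviewerName')
}

print(res)
```

Calling df.reset_index() first and then using the column-based snippet should give the same result.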
collections.defaultdict
Define a defaultdict of dict objects and iterate over the dataframe row by row.
from collections import defaultdict
res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings
The resulting defaultdict does not need to be converted back to a regular dict, since defaultdict is a subclass of dict.
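A small sketch of what that means in practice, using made-up keys. One caveat worth knowing: looking up a missing key on a defaultdict silently creates an empty entry, so wrapping the result in dict() can still be useful if strict lookups are wanted.

```python
from collections import defaultdict

res = defaultdict(dict)
res['Charles']['Book A'] = 3.0  # illustrative entry

# defaultdict passes anywhere a dict is expected.
print(isinstance(res, dict))  # True

# Reading a missing key creates it as a side effect.
_ = res['Nobody']
print('Nobody' in res)  # True

# dict() gives a plain dictionary that raises KeyError on missing keys.
plain = dict(res)
print(type(plain) is dict)  # True
```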
Performance benchmarking
Benchmarking is setup- and data-dependent. You should test with your own data to see what works best.
# Python 3.6.5, Pandas 0.19.2
from collections import defaultdict
from random import sample

import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)
df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    # select both columns with a list (double brackets)
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

# %timeit is an IPython/Jupyter magic command
%timeit jez(df)   # 33.5 ms per loop
%timeit jpp1(df)  # 17 ms per loop
%timeit jpp2(df)  # 21.1 ms per loop
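The %timeit lines only work inside IPython or Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard timeit module. The function name with_defaultdict, the smaller row count and the repeat settings below are arbitrary choices for illustration, and the timings it prints will differ from the figures quoted above.

```python
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd

# Small illustrative dataframe with the same column names as above.
rng = np.random.RandomState(0)
n = 1000
df = pd.DataFrame({
    'reviewerName': rng.choice(['Charles', 'Lora', 'Katherine'], n),
    'title': [f'Book_{i}' for i in range(n)],
    'reviewerRatings': rng.randint(0, 6, n),
})

def with_defaultdict(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

# Run 3 batches of 10 calls each, keep the fastest batch,
# and divide by 10 to get the time per call.
number = 10
best = min(timeit.repeat(lambda: with_defaultdict(df), number=number, repeat=3)) / number
print(f'{best * 1e3:.2f} ms per call')
```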