Python 3.x: How to split a DataFrame column whose data has no uniform format
I have a column named Users in a dataframe, and its values do not follow a uniform format. I am working on a data-cleaning project because the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1@email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2@email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2@email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below to split the dataframe as follows:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1@email.com","EmpType":"Full" "Name":"Rick","Email":"name_2@email.com","Dept":"HR"
B "Name":"John","Email":"name_2@email.com","EmpType":"Full","Dept":"Sales"
Splitting the above df further on "," with the same query, I get this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1@email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2@email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2@email.com" "EmpType":"Full" "Dept":"Sales"
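Because this dataframe looks messy, I would like to get the output shown below. I think the best way to name the columns is to use the key that is already in each value (for example, the "Name" in "Name":"Martin"); if we use df.rename, the column names will not match.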
Company  Name_1  Email_1           EmpType_1  Dept_1  Name_2  Email_2           Dept_2
1        Martin  name_1@email.com  Full               Rick    name_2@email.com  HR
2        John    name_2@email.com  Full       Sales
Is there any way to get the above output from the original dataframe?

Use:
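import ast
import pandas as pd

# Parse each Users string into a list of dicts, then give every dict its own row.
df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
# Expand the dicts into columns, number the users per company, and pivot to wide form.
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)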
Details:
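First, we use ast.literal_eval to evaluate the strings in the Users column, then use DataFrame.explode on the Users column to create the dataframe d.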
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1@email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2@email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2@email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
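A minimal sketch of this first step, assuming the Users column holds strings like those in the question. The sample frame is rebuilt from the question's data and is only there to make the snippet runnable.

```python
import ast
import pandas as pd

# Sample frame rebuilt from the question; Users holds strings, not lists.
df = pd.DataFrame({
    "company": ["A", "B"],
    "Users": [
        '[{"Name":"Martin","Email":"name_1@email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2@email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2@email.com","EmpType":"Full","Dept":"Sales" }]',
    ],
})

# Turn each string into a real Python list of dicts.
df["Users"] = df["Users"].apply(ast.literal_eval)

# One row per dict; the company value is repeated for each user.
d = df.explode("Users").reset_index(drop=True)
print(d)
```

After this step every cell in Users is a single dict, which is the shape the next step expects.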
Create a new dataframe from the Users column in d, and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1@email.com Full NaN
1 A Rick name_2@email.com NaN HR
2 B John name_2@email.com Full Sales
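The expansion of the dicts into columns can be tried in isolation. This sketch assumes a frame already in the exploded shape; the intermediate name `users` is introduced here for readability and is not part of the answer.

```python
import pandas as pd

# Hypothetical frame in the shape produced by the explode step.
d = pd.DataFrame({
    "company": ["A", "A", "B"],
    "Users": [
        {"Name": "Martin", "Email": "name_1@email.com", "EmpType": "Full"},
        {"Name": "Rick", "Email": "name_2@email.com", "Dept": "HR"},
        {"Name": "John", "Email": "name_2@email.com", "EmpType": "Full", "Dept": "Sales"},
    ],
})

# pop() removes Users from d and returns it. Building a frame from the list of
# dicts gives one column per key, with NaN wherever a dict lacks that key.
users = pd.DataFrame(d.pop("Users").tolist())

# Both frames share the default 0..n-1 index, so join aligns them row by row.
d = d.join(users)
print(d)
```

Because missing keys simply become NaN, the dicts do not need to share the same set of fields.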
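Use DataFrame.groupby on the company column, then use groupby.cumcount to create a counter for each group, then use set_index to set the index of d to company + counter. Then use unstack to reshape the dataframe, creating MultiIndex columns.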
print(d)
         Name          Email                               EmpType    Dept
         1       2     1                 2                 1     2    1      2
company
A        Martin  Rick  name_1@email.com  name_2@email.com  Full  NaN  NaN    HR
B        John    NaN   name_2@email.com  NaN               Full  NaN  Sales  NaN
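The counter and the reshape are easier to follow on a smaller frame. This is a sketch with only two of the fields; the names `counter` and `wide` are illustrative and do not appear in the answer.

```python
import pandas as pd

# Hypothetical long-format frame: one row per user.
d = pd.DataFrame({
    "company": ["A", "A", "B"],
    "Name": ["Martin", "Rick", "John"],
    "Dept": [None, "HR", "Sales"],
})

# Number the users within each company (1, 2, ...). The numbers are cast to
# strings so they can later be joined into column names.
counter = d.groupby("company").cumcount().add(1).astype(str)

# Index by (company, counter), then move the counter level into the columns.
wide = d.set_index(["company", counter]).unstack()
print(wide)
```

Companies with fewer users than the largest group get NaN in the columns for the missing positions.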
Finally, use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1@email.com name_2@email.com Full NaN NaN HR
B John NaN name_2@email.com NaN Full NaN Sales NaN
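A small sketch of the flattening step on its own, assuming two-level columns of the form (field, counter). The frame here is built by hand for illustration.

```python
import pandas as pd

# Hypothetical wide frame with two-level columns: (field, counter).
cols = pd.MultiIndex.from_product([["Name", "Email"], ["1", "2"]])
wide = pd.DataFrame(
    [["Martin", "Rick", "name_1@email.com", "name_2@email.com"],
     ["John", None, "name_2@email.com", None]],
    index=pd.Index(["A", "B"], name="company"),
    columns=cols,
)

# Each column label is a tuple such as ("Name", "1"); "_".join turns it into
# "Name_1". Every level must hold strings for str.join to accept the tuple.
wide.columns = wide.columns.map("_".join)
print(wide.columns.tolist())
```

This is why the counter is converted with astype(str) before unstacking: str.join raises a TypeError on integer labels.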
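Thank you @Shubham Sharma. Your answer works when I limit the data to 80 rows. However, with more than 90 rows I get "ValueError: malformed node or string". I am not sure whether the problem is in the data itself.
Hi, I appreciate your help. I finally found that the problem was in the data. I removed the bad data and fixed it.
@JackJack Great! Happy coding.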
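For anyone hitting the same ValueError, one way to locate the offending rows is to try parsing each value and collect the failures. This is only a sketch: the helper `find_unparseable` is hypothetical, and the comments do not say what was actually wrong with the data. The JSON literal `true` and a missing value are used here merely as two inputs that ast.literal_eval rejects.

```python
import ast
import pandas as pd

def find_unparseable(series):
    """Return the index labels whose value ast.literal_eval cannot parse."""
    bad = []
    for label, value in series.items():
        try:
            ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            bad.append(label)
    return bad

# Illustrative data, not taken from the question.
users = pd.Series([
    '[{"Name":"Martin","Dept":"HR"}]',
    '[{"Name":"Rick","Active":true}]',  # JSON true is not a Python literal
    None,                               # missing value
])
print(find_unparseable(users))
```

The returned labels can be passed to `.loc` to inspect the rows before deciding whether to repair or drop them.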