Python 3.x: How to split a DataFrame column whose data has no uniform format


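I have a column named Users in a dataframe, and its contents do not follow a uniform format. I am working on a data cleaning project because the data is unreadable as it stands.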

   company                Users
    A   [{"Name":"Martin","Email":"name_1@email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2@email.com","Dept":"HR"}]
    B   [{"Name":"John","Email":"name_2@email.com","EmpType":"Full","Dept":"Sales" }]
I used the code below to split the dataframe as follows:

df2 = df
# Split each Users string on the boundary between two dicts, one new column per piece
df2 = df2.join(df['Users'].str.split('},{', expand=True).add_prefix('Users'))

company                   Users0                                                     Users1
    A   "Name":"Martin","Email":"name_1@email.com","EmpType":"Full"              "Name":"Rick","Email":"name_2@email.com","Dept":"HR"
    B   "Name":"John","Email":"name_2@email.com","EmpType":"Full","Dept":"Sales" 
Splitting the dataframe above further on "," with the same approach, I get this output:

  Company      Users01               Users02        Users03                Users10             Users11            Users12                             
  1     "Name":"Martin" "Email":"name_1@email.com"  "EmpType":"Full" "Name":"Rick"  "Email":"name_2@email.com"  "Dept":"HR" 
  2     "Name":"John"   "Email":"name_2@email.com"  "EmpType":"Full"  "Dept":"Sales" 
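Since this dataframe looks messy, I would like to get the output below. I think the best way to name the columns is to take the key from each value itself (for example, "Name" from "Name":"Martin"); if we use df.rename, the column names will not line up with the data.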

Company  Name_1        Email_1    EmpType_1 Dept_1  Name_2  Email_2         Dept_2                             
  1     Martin    name_1@email.com   Full           Rick   name_2@email.com  "HR" 
  2     John     name_2@email.com    Full   Sales
Is there any way to get the output above from the original dataframe?

Use:

import ast
import pandas as pd

# Parse each string in Users into a list of dicts
df['Users'] = df['Users'].apply(ast.literal_eval)

d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:

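First, we use ast.literal_eval to evaluate the strings in the Users column, then use DataFrame.explode on the Users column to create the dataframe d: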

print(d)
  company                                                                              Users
0       A                 {'Name': 'Martin', 'Email': 'name_1@email.com', 'EmpType': 'Full'}
1       A                        {'Name': 'Rick', 'Email': 'name_2@email.com', 'Dept': 'HR'}
2       B  {'Name': 'John', 'Email': 'name_2@email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
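
The parsing and exploding step can be tried in isolation. The following is a minimal sketch, assuming the Users cells are strings shaped like the sample in the question; the names parsed and exploded are introduced here for illustration only.

```python
import ast

import pandas as pd

# Sample data mirroring the question; each Users cell is a string, not a list.
df = pd.DataFrame({
    "company": ["A", "B"],
    "Users": [
        '[{"Name":"Martin","Email":"name_1@email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2@email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2@email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})

# Turn each string into a real Python list of dicts.
parsed = df["Users"].apply(ast.literal_eval)

# One row per dict; reset_index gives a clean 0..n-1 index for the later join.
exploded = df.assign(Users=parsed).explode("Users").reset_index(drop=True)

print(exploded["company"].tolist())  # ['A', 'A', 'B']
print(exploded.loc[1, "Users"])
```

Company A had two users, so it now occupies two rows, while company B keeps one.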
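Next, we create a new dataframe from the Users column of d (each dict becomes one row) and use DataFrame.join to join this new dataframe with d: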

print(d)
  company    Name             Email EmpType   Dept
0       A  Martin  name_1@email.com    Full    NaN
1       A    Rick  name_2@email.com     NaN     HR
2       B    John  name_2@email.com    Full  Sales
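
To see why keys that are absent from a dict end up as NaN, here is a small sketch of the pop-and-join step on its own. It assumes d already holds one dict per row, as after the explode step; records, fields and wide are illustrative names, not part of the answer.

```python
import pandas as pd

# Hypothetical stand-in for d after the explode step.
d = pd.DataFrame({
    "company": ["A", "A", "B"],
    "Users": [
        {"Name": "Martin", "Email": "name_1@email.com", "EmpType": "Full"},
        {"Name": "Rick", "Email": "name_2@email.com", "Dept": "HR"},
        {"Name": "John", "Email": "name_2@email.com", "EmpType": "Full", "Dept": "Sales"},
    ],
})

# pop removes the Users column from d and returns it as a Series.
records = d.pop("Users").tolist()

# Each dict becomes a row; keys missing from a dict become NaN.
fields = pd.DataFrame(records)

# join aligns on the index, which is why d needs a clean 0..n-1 index here.
wide = d.join(fields)
print(wide)
```

Because the join is index-based, skipping reset_index(drop=True) after explode would leave duplicate index labels and misalign the rows.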
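Then we use DataFrame.groupby on the company column and cumcount to create a counter for each group, and use set_index to set the index of d to company + counter. After that, unstack reshapes the dataframe, producing MultiIndex columns: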

print(d)
           Name                   Email                   EmpType        Dept     
              1     2                 1                 2       1    2      1    2
company                                                                           
A        Martin  Rick  name_1@email.com  name_2@email.com    Full  NaN    NaN   HR
B          John   NaN  name_2@email.com               NaN    Full  NaN  Sales  NaN
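
The counter is what makes the reshape possible, so it may help to look at it separately. This is a minimal sketch on a reduced, hypothetical version of d with only the Name field; counter and reshaped are illustrative names.

```python
import pandas as pd

# Reduced, hypothetical version of d with a single field.
d = pd.DataFrame({
    "company": ["A", "A", "B"],
    "Name": ["Martin", "Rick", "John"],
})

# cumcount numbers the rows inside each company from 0; add(1) makes it 1-based.
# astype(str) matters later, because '_'.join only accepts strings.
counter = d.groupby("company").cumcount().add(1).astype(str)
print(counter.tolist())  # ['1', '2', '1']

# company + counter identifies every row uniquely, so unstack can
# pivot the counter level into the columns.
reshaped = d.set_index(["company", counter]).unstack()
print(reshaped.columns.tolist())  # [('Name', '1'), ('Name', '2')]
```

Company B has only one user, so its ('Name', '2') cell is filled with NaN.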
Finally, we use Index.map with '_'.join to flatten the MultiIndex columns:

print(d)
         Name_1 Name_2           Email_1           Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company                                                                                     
A        Martin   Rick  name_1@email.com  name_2@email.com      Full       NaN    NaN     HR
B          John    NaN  name_2@email.com               NaN      Full       NaN  Sales    NaN
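
The flattened result groups columns by field (Name_1, Name_2, ...), whereas the question asked for them grouped by user (Name_1, Email_1, ...). The sketch below shows the flattening on a small hand-built frame and then one possible way to reorder the columns; the reordering is an addition for illustration and is not part of the answer above.

```python
import pandas as pd

# Small hand-built frame with MultiIndex columns, similar to the unstack result.
columns = pd.MultiIndex.from_product([["Name", "Email"], ["1", "2"]])
d = pd.DataFrame(
    [["Martin", "Rick", "name_1@email.com", "name_2@email.com"],
     ["John", None, "name_2@email.com", None]],
    index=pd.Index(["A", "B"], name="company"),
    columns=columns,
)

# Each column label is a tuple such as ("Name", "1"); "_".join makes it "Name_1".
d.columns = d.columns.map("_".join)
print(d.columns.tolist())  # ['Name_1', 'Name_2', 'Email_1', 'Email_2']

# Optional: group the columns by user number instead of by field.
# sorted is stable, so the field order within each user is preserved.
ordered = sorted(d.columns, key=lambda name: int(name.rsplit("_", 1)[1]))
d = d[ordered]
print(d.columns.tolist())  # ['Name_1', 'Email_1', 'Name_2', 'Email_2']
```

Sorting on the integer suffix rather than on the string keeps user 10 after user 9 if a company has that many users.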

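Thank you @Shubham Sharma. Your answer works when I limit the data to 80 rows. But with more than 90 rows I get "ValueError: malformed node or string". I'm not sure whether the problem is with the data itself.

Hi, I appreciate your help. I finally found that the problem was in the data. I took it out and fixed it.

@JackJack Awesome! Happy coding.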
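
For anyone hitting the same error, one way to locate the offending rows is to parse each cell inside a try/except and keep the rows that fail. The sketch below is only an assumption about how that could be done: it swaps in json.loads for ast.literal_eval because the sample data looks like JSON, and the truncated second row is a made-up example of bad data, not taken from the thread.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "company": ["A", "B"],
    "Users": [
        '[{"Name":"John","Email":"name_2@email.com"}]',
        '[{"Name":"Rick","Email":',  # deliberately truncated, hypothetical bad row
    ],
})


def try_parse(text):
    """Return the parsed list, or None when the cell is not valid JSON."""
    try:
        return json.loads(text)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError; TypeError covers NaN or None cells.
        return None


parsed = df["Users"].apply(try_parse)

# Rows whose Users cell could not be parsed.
bad_rows = df[parsed.isna()]
print(bad_rows.index.tolist())  # [1]
```

Once the bad rows are known they can be inspected and repaired or dropped before running the rest of the pipeline.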