Python: How to optimize time when converting a list to a dataframe? (Part 2)

Tags: python, pandas, dataframe, numpy, time

My previous question did not get any correct answers.

Let me explain the example further.

Consider the dataframe more precisely as:

First Name   Last Name    Country  Address      Age  Age-Group  Photo1 Photo2 Phototype
Mark         Shelby        US      Petersburg    42   Adult     1.jpg  2.jpg    PP
Andy         Carnot        GE      Freiburg      16    Teen     1.jpg           PP
When converted to CSV, I want the output to be:

N,Mark,Shelby,US
AG,42,Adult
AD,Petersburg
PH,1.jpg,PP
PH,2.jpg,PP
N,Andy,Carnot,GE
AG,16,Teen
AD,Freiburg
PH,1.jpg,PP
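The prefix characters PH, AG, AD, and N are not a fixed mapping; they can be any characters.

This works fine when I loop over the list below, map each entry, and convert the result to a dataframe. For large datasets, however, it takes a lot of time. The exact code for this process is in the previous question.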

[['N','First Name','Last Name', 'Country'],
 ['AG','Age','Age-Group'],
 ['AD','Address'],
 ['PH','Photo1','Phototype'],
 ['PH','Photo2','Phototype'], 
 ]
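Use:

First, define a dictionary whose keys are the first values of the final lists and whose values are lists of the column names (or column-name prefixes) for that row type: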

d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photo','Phototype']}
Then filter the DataFrame by the lists in the dictionary:

out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}
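To see what this filter selects, here is a minimal sketch. The DataFrame is rebuilt by hand from the table in the question, and the `selected` name is only for illustration. Note that the prefix `Photo` also matches `Phototype`, which is harmless here because `Phototype` belongs to the same group anyway.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    'First Name': ['Mark', 'Andy'],
    'Last Name': ['Shelby', 'Carnot'],
    'Country': ['US', 'GE'],
    'Address': ['Petersburg', 'Freiburg'],
    'Age': [42, 16],
    'Age-Group': ['Adult', 'Teen'],
    'Photo1': ['1.jpg', '1.jpg'],
    'Photo2': ['2.jpg', None],
    'Phototype': ['PP', 'PP'],
})

d = {'N': ['First Name', 'Last Name', 'Country'],
     'AG': ['Age', 'Age-Group'],
     'AD': ['Address'],
     'PH': ['Photo', 'Phototype']}

# str.startswith with a tuple keeps a column if its name starts with ANY of the prefixes.
out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}

# Column names picked for each key (illustrative helper variable).
selected = {k: list(v.columns) for k, v in out.items()}
print(selected)
```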
For PH, the data needs to be reshaped from wide to long:

out['PH'] = (out['PH'].melt('Phototype',
                           value_name='Photo',
                           ignore_index=False)
                      # keyword form: the positional axis argument was removed in pandas 2.0
                      .drop(columns='variable')[['Photo','Phototype']]
                      .dropna(subset=['Photo']))
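As a small illustration of why ignore_index=False matters, the sketch below runs the same reshape on just the photo columns (the frame `ph` and the result name `long_ph` are made up for this example). The original row labels survive the melt, which is what later allows the rows to be sorted back next to their N, AG, and AD rows.

```python
import pandas as pd

# Only the photo-related columns from the sample data.
ph = pd.DataFrame({'Photo1': ['1.jpg', '1.jpg'],
                   'Photo2': ['2.jpg', None],
                   'Phototype': ['PP', 'PP']})

long_ph = (ph.melt('Phototype', value_name='Photo', ignore_index=False)
             .drop(columns='variable')[['Photo', 'Phototype']]
             .dropna(subset=['Photo']))   # drop the missing Photo2 of the second person

# The index still holds the original row labels (0 appears twice: two photos for Mark).
print(long_ph)
```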
Finally, give every piece the same column names, join them with concat, and sort to get the correct order:

out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}

df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')
print (df)
  level_0           0       1   2
0       N        Mark  Shelby  US
0      AG          42   Adult    
0      AD  Petersburg            
0      PH       1.jpg      PP    
0      PH       2.jpg      PP    
1       N        Andy  Carnot  GE
1      AG          16    Teen    
1      AD    Freiburg            
1      PH       1.jpg      PP   
Last, create lists of different lengths by removing the empty strings:

fin = [x[x!= ''].tolist() for x in df.to_numpy() ]
print (fin)
[['N', 'Mark', 'Shelby', 'US'],
 ['AG', 42, 'Adult'],
 ['AD', 'Petersburg'], 
 ['PH', '1.jpg', 'PP'], 
 ['PH', '2.jpg', 'PP'], 
 ['N', 'Andy', 'Carnot', 'GE'], 
 ['AG', 16, 'Teen'], 
 ['AD', 'Freiburg'], 
 ['PH', '1.jpg', 'PP']]
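Since the question asks for CSV output, one possible way to write these ragged lists is the standard library csv module, which does not require rows of equal length. This is only a sketch: the list is copied from the output above, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Ragged rows as produced above.
fin = [['N', 'Mark', 'Shelby', 'US'],
       ['AG', 42, 'Adult'],
       ['AD', 'Petersburg'],
       ['PH', '1.jpg', 'PP'],
       ['PH', '2.jpg', 'PP'],
       ['N', 'Andy', 'Carnot', 'GE'],
       ['AG', 16, 'Teen'],
       ['AD', 'Freiburg'],
       ['PH', '1.jpg', 'PP']]

# Write to an in-memory buffer; for a real file, open it with newline='' instead.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
writer.writerows(fin)

csv_text = buf.getvalue()
print(csv_text)
```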
EDIT: To match Photo followed by a number, use a regex. Instead of startswith, use contains with the list values joined by | (regex OR):

d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':[r'Photo\d+','Phototype']}

out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out)
{'N':   First Name Last Name Country
0       Mark    Shelby      US
1       Andy    Carnot      GE, 'AG':    Age Age-Group
0   42     Adult
1   16      Teen, 'AD':       Address
0  Petersburg
1    Freiburg, 'PH':   Photo1 Photo2 Phototype
0  1.jpg  2.jpg        PP
1  1.jpg    NaN        PP}
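EDIT: The trick is to add ^ to the start of each string and $ to the end so that only exact matches are returned; the regex is still needed so that Photo + "number" works correctly: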

print (df)
  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  \
0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   
1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN   

  Phototype Age Detail Address Detail  
0        PP      Young            Far  
1        PP  Too Young           Near  


d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':[r'Photo\d+','Phototype']}

d = {k: [rf'^{x}$' for x in v] for k, v in d.items()}
print (d)
{'N': ['^First Name$', '^Last Name$', '^Country$'], 
 'AG': ['^Age$', '^Age-Group$'], 
 'AD': ['^Address$'], 
 'PH': ['^Photo\\d+$', '^Phototype$']}

out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}

print (out['AG'])
   Age Age-Group
0   42     Adult
1   16      Teen

print (out['AD'])
      Address
0  Petersburg
1    Freiburg
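To make the effect of the anchors visible, here is a short comparison on the column names alone (the variable names `loose` and `anchored` are only for illustration). Without ^ and $, the pattern Age also picks up Age Detail, because contains does a substring search.

```python
import pandas as pd

# Column names of the extended sample frame, including the two "Detail" columns.
cols = pd.Index(['First Name', 'Last Name', 'Country', 'Address', 'Age',
                 'Age-Group', 'Photo1', 'Photo2', 'Phototype',
                 'Age Detail', 'Address Detail'])

patterns = ['Age', 'Age-Group']

# Unanchored: any column containing "Age" or "Age-Group" as a substring matches.
loose = cols[cols.str.contains('|'.join(patterns))]

# Anchored: each alternative must match the whole column name.
anchored = cols[cols.str.contains('|'.join(rf'^{p}$' for p in patterns))]

print(loose.tolist())
print(anchored.tolist())
```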

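
In the line out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}, what does str.startswith(tuple(v)) filter?

@AtomStore - The idea is to match Photo1, Photo2, ... when only Photo is defined: column names are filtered by their starting substring, and as a general solution it also matches all the other column names.

@AtomStore - So in the previous answer it was Photo_df = df1.filter(like='Photo') and product_df = df1.filter(like='Description')? So df.columns.str.startswith(tuple(v)) needs to be changed to df.columns.str.contains('|'.join(v))?

@AtomStore - Added to the answer.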