Fast way to split rows on a specific column in a pandas DataFrame


I have the following DataFrame:

import pandas as pd
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"],
                   'Genes':["Canx","LOC101056688 /// Wars "],
                   'cv_filter':[ 0.134,0.290],
                   'Organ' :["LN","LV"]}   )    
df = df[["Probes","Genes","cv_filter","Organ"]]

It looks like this:

In [16]: df
Out[16]:
       Probes                   Genes  cv_filter Organ
0  1415693_at                    Canx      0.134    LN
1  1415693_at  LOC101056688 /// Wars       0.290    LV

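What I want to do is split each row on the Genes column, whose entries are separated by "///", so that every gene gets its own row.

The result I want is: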

       Probes                   Genes  cv_filter Organ
0  1415693_at                    Canx      0.134    LN
1  1415693_at            LOC101056688      0.290    LV
2  1415693_at                    Wars      0.290    LV

I have about 150K rows to process in total. Is there a fast way to do this?

You can first split the Genes column, create a new Series from the pieces, and join it back to the original df:

import pandas as pd
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"],
                   'Genes':["Canx","LOC101056688 /// Wars "],
                   'cv_filter':[ 0.134,0.290],
                   'Organ' :["LN","LV"]}   )    
df = df[["Probes","Genes","cv_filter","Organ"]]  
print(df)
       Probes                   Genes  cv_filter Organ
0  1415693_at                    Canx      0.134    LN
1  1415693_at  LOC101056688 /// Wars       0.290    LV

s = pd.DataFrame([ x.split('///') for x in df['Genes'].tolist() ], index=df.index).stack()
#or you can use approach from comment
#s = df['Genes'].str.split('///', expand=True).stack()

s.index = s.index.droplevel(-1) 
s.name = 'Genes' 
print(s)
0             Canx
1    LOC101056688 
1            Wars 
Name: Genes, dtype: object

# Drop the original column first; otherwise join raises:
# ValueError: columns overlap but no suffix specified: Index([u'Genes'], dtype='object')
df = df.drop('Genes', axis=1)

df = df.join(s).reset_index(drop=True)
print(df[["Probes","Genes","cv_filter","Organ"]])
       Probes          Genes  cv_filter Organ
0  1415693_at           Canx      0.134    LN
1  1415693_at  LOC101056688       0.290    LV
2  1415693_at          Wars       0.290    LV
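
Note that the pieces produced by the split above still carry the spaces that surrounded the separator ("LOC101056688 " and " Wars "), while the desired output has none. As a minimal sketch of an alternative, assuming pandas 0.25 or later (where DataFrame.explode is available), the same reshaping can be written without stack or join. This is not the method used in the answer above; the helper name split_genes and its parameters are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Probes": ["1415693_at", "1415693_at"],
    "Genes": ["Canx", "LOC101056688 /// Wars "],
    "cv_filter": [0.134, 0.290],
    "Organ": ["LN", "LV"],
})


def split_genes(frame, column="Genes", sep="///"):
    """Return a copy of frame with one row per separated value in column."""
    out = frame.copy()
    # Turn each cell into a list of pieces, e.g. ["LOC101056688 ", " Wars "].
    out[column] = out[column].str.split(sep)
    # explode repeats the other columns once per list element.
    out = out.explode(column)
    # Remove the whitespace left around the separator.
    out[column] = out[column].str.strip()
    return out.reset_index(drop=True)


result = split_genes(df)
print(result)
```

For the sample frame this gives three rows whose Genes values are Canx, LOC101056688 and Wars, with the other columns repeated for the two rows that came from the same original row.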

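Why not df['Genes'].str.split('///', expand=True).stack() instead of df['Genes'].str.split('///').apply(pd.Series, 1).stack()? It is about 2x faster.

@AntonProtopopov - Thank you. I added it to my answer as an alternative solution (it is only a bit slower than the DataFrame constructor). You are right, so index was added to the DataFrame constructor.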
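
The speed claims in these comments depend on the pandas version and the data, so it may be worth measuring them locally. The following is a rough sketch, not a rigorous benchmark: it checks that the two variants from the answer produce the same values and times each with timeit. The synthetic series and the function names via_constructor and via_str_split are assumptions made for illustration, and the trailing dropna() is added only so that padding values are removed regardless of how the installed pandas version's stack treats missing entries.

```python
import timeit

import pandas as pd

# Hypothetical synthetic data: the two sample values repeated 5000 times.
genes = pd.Series(["Canx", "LOC101056688 /// Wars "] * 5000, name="Genes")


def via_constructor(s):
    # Split in plain Python, then build a DataFrame with the original index.
    rows = [x.split("///") for x in s.tolist()]
    return pd.DataFrame(rows, index=s.index).stack().dropna()


def via_str_split(s):
    # Let pandas split and expand into columns in one call.
    return s.str.split("///", expand=True).stack().dropna()


a = via_constructor(genes)
b = via_str_split(genes)
same = a.tolist() == b.tolist()
print("same values:", same)

# Timings vary by machine and pandas version, so no figures are quoted here.
for fn in (via_constructor, via_str_split):
    seconds = timeit.timeit(lambda: fn(genes), number=5)
    print(fn.__name__, round(seconds, 3))
```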