Python: splitting a string in a table into rows at every N-th space

python pandas dataframe

I have a dataset, and I am trying to split the OrtoB column so that my data is organized from A to B as many-to-many pairs.

Example dataset

   new_name  Score            OrtoA                                      OrtoB
0         1   3064   g2797.t1 1.000                              YHR165C 1.000
1         2   2820   g2375.t1 1.000                              YJL130C 1.000
2         3   2711   g1023.t1 1.000                              YLR106C 1.000
3         4   2710  g15922.t1 1.000                              YNR016C 1.000
4         5   2568   g3549.t1 1.000                              YDL171C 1.000
5         6   2494  g10464.t1 1.000  YOR153W 1.000 YDR406W 0.585 YOR328W 0.454
6         7   2402  g15604.t1 1.000                YGR032W 1.000 YLR342W 0.679

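So far I have been able to split the string in Python using the code below, following the example in a previously answered post.

However, it only works when I split on every single space. What I am looking for is help splitting at every second space, to get a result like the following.

Desired result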

        OrtoA       OrtoB   
g2797.t1    1   YHR165C 1
g2375.t1    1   YJL130C 1
g1023.t1    1   YLR106C 1
g15922.t1   1   YNR016C 1
g3549.t1    1   YDL171C 1
g10464.t1   1   YOR153W 1
g10464.t1   1   YDR406W 0.585
g10464.t1   1   YOR328W 0.454
g15604.t1   1   YGR032W 1
g15604.t1   1   YLR342W 0.679

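If some of the values you want contain spaces, split first and then merge the pieces that belong together. You can also call split() with no arguments instead of split(' '); that works even when tabs or other whitespace are used.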

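 def concat_pairs(l):
     # join tokens pairwise: ['a', '1', 'b', '2'] -> ['a 1', 'b 2']
     return ["%s %s" % (l[i], l[i + 1]) for i in range(0, len(l) - 1, 2)]

 # split on any whitespace, then merge each pair of tokens
 z['OrtoB'].str.split().apply(concat_pairs)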
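
As a self-contained sketch of the split-then-merge idea above, the following builds a two-row frame from values in the question and turns each list of pairs into rows. The helper name `pair_tokens` is only illustrative, and the use of `Series.explode` assumes pandas 0.25 or later; neither comes from the answer itself.

```python
import pandas as pd

def pair_tokens(tokens):
    """Join tokens two at a time: ['a', '1', 'b', '2'] -> ['a 1', 'b 2']."""
    return [" ".join(tokens[i:i + 2]) for i in range(0, len(tokens) - 1, 2)]

z = pd.DataFrame({
    "OrtoA": ["g3549.t1 1.000", "g15604.t1 1.000"],
    # a tab and a double space on purpose: split() with no arguments handles both
    "OrtoB": ["YDL171C 1.000", "YGR032W\t1.000  YLR342W 0.679"],
})

# one list of "ID score" strings per row
pairs = z["OrtoB"].str.split().apply(pair_tokens)
# explode() repeats the other columns once per list element
result = z.assign(OrtoB=pairs).explode("OrtoB").reset_index(drop=True)
print(result)
```

Because `explode` keeps the other columns aligned automatically, no manual index repetition is needed in this variant.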

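A simpler workaround is re.split, splitting only at the spaces that separate one ID/score pair from the next:

 import re  # split at each space that follows a score (digit) and precedes an ID (uppercase letter)
 z['OrtoB'].apply(lambda s: re.split(r'(?<=\d) (?=[A-Z])', s))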
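
To see what a regex-based split produces on a single cell, here is a small standard-library sketch. The lookaround pattern is an assumption about the data (every score ends in a digit and every ID starts with an uppercase letter), and the name `PAIR_BOUNDARY` is hypothetical.

```python
import re

# Assumed pattern: a space preceded by a digit and followed by an uppercase letter
PAIR_BOUNDARY = re.compile(r"(?<=\d) (?=[A-Z])")

cell = "YOR153W 1.000 YDR406W 0.585 YOR328W 0.454"
# the space inside each "ID score" pair follows a letter, so it is left alone
pairs = PAIR_BOUNDARY.split(cell)
print(pairs)
```

If the IDs or scores in a real dataset do not follow that shape, the pattern would need adjusting.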

Based on the answer you referenced, I got something working:

import pandas as pd
# one token per row, keeping the original row label as the index
s = df['OrtoB'].str.split(expand=True).stack().dropna()
s.index = s.index.droplevel(-1)
# merge each 2 consecutive OrtoB tokens into 1 value, separated by ' '
s = pd.DataFrame(data=s.values.reshape(-1, 2), index=s.index[::2]).apply(' '.join, axis=1)
del df['OrtoB']
s.name = 'OrtoB'
df.join(s)
Out[148]: 
   new_name  Score            OrtoA          OrtoB
0         1   3064   g2797.t1 1.000  YHR165C 1.000
1         2   2820   g2375.t1 1.000  YJL130C 1.000
2         3   2711   g1023.t1 1.000  YLR106C 1.000
3         4   2710  g15922.t1 1.000  YNR016C 1.000
4         5   2568   g3549.t1 1.000  YDL171C 1.000
5         6   2494  g10464.t1 1.000  YOR153W 1.000
5         6   2494  g10464.t1 1.000  YDR406W 0.585
5         6   2494  g10464.t1 1.000  YOR328W 0.454
6         7   2402  g15604.t1 1.000  YGR032W 1.000
6         7   2402  g15604.t1 1.000  YLR342W 0.679

You can use:

import numpy as np
import pandas as pd

# split the column into lists of tokens
orto = z['OrtoB'].str.split()
# remove all empty lists
orto = orto[orto.astype(bool)]
# number of pairs per row: list length floor-divided by 2
lens = orto.str.len() // 2
# flatten the nested lists into one array
orto2 = np.concatenate(orto.values)
# repeat each remaining index label once per pair
idx = orto.index.repeat(lens)
# build a two-column DataFrame and join each pair into one string
s = pd.DataFrame(orto2.reshape(-1, 2), index=idx).apply(' '.join, axis=1).rename('OrtoB')
# remove the original column and join s
z = z.drop('OrtoB', axis=1).join(s).reset_index(drop=True)
print(z)
   new_name  Score            OrtoA          OrtoB
0         1   3064   g2797.t1 1.000  YHR165C 1.000
1         2   2820   g2375.t1 1.000  YJL130C 1.000
2         3   2711   g1023.t1 1.000  YLR106C 1.000
3         4   2710  g15922.t1 1.000  YNR016C 1.000
4         5   2568   g3549.t1 1.000  YDL171C 1.000
5         6   2494  g10464.t1 1.000  YOR153W 1.000
6         6   2494  g10464.t1 1.000  YDR406W 0.585
7         6   2494  g10464.t1 1.000  YOR328W 0.454
8         7   2402  g15604.t1 1.000  YGR032W 1.000
9         7   2402  g15604.t1 1.000  YLR342W 0.679

Here is my solution:

import numpy as np
import pandas as pd

# split `OrtoB` into lists of "ID score" pairs
df['OrtoB'] = df['OrtoB'].str.findall(r'([A-Z\d]{6,}\s[\d\.]+)')

# now we can use the same technique as in: http://stackoverflow.com/a/40449726/5741205    
def split_list_in_cols_to_rows(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

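    # repeat the index columns once per list element, then flatten the lists
    res = pd.DataFrame({
        col: np.repeat(df[col].values, lens)
        for col in idx_cols
    }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})
    # DataFrame.append was removed in pandas 2.0; pd.concat re-adds rows whose lists were empty
    return pd.concat([res, df.loc[lens == 0, idx_cols]]).fillna(fill_value) \
             .loc[:, df.columns]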

Result:

In [30]: %paste
df['OrtoB'] = df['OrtoB'].str.findall(r'([A-Z\d]{6,}\s[\d\.]+)')
new = split_list_in_cols_to_rows(df, 'OrtoB')

new
## -- End pasted text --
Out[30]:
   new_name  Score            OrtoA          OrtoB
0         1   3064   g2797.t1 1.000  YHR165C 1.000
1         2   2820   g2375.t1 1.000  YJL130C 1.000
2         3   2711   g1023.t1 1.000  YLR106C 1.000
3         4   2710  g15922.t1 1.000  YNR016C 1.000
4         5   2568   g3549.t1 1.000  YDL171C 1.000
5         6   2494  g10464.t1 1.000  YOR153W 1.000
6         6   2494  g10464.t1 1.000  YDR406W 0.585
7         6   2494  g10464.t1 1.000  YOR328W 0.454
8         7   2402  g15604.t1 1.000  YGR032W 1.000
9         7   2402  g15604.t1 1.000  YLR342W 0.679
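
The desired output in the question lists the ID and the score as separate values. A possible follow-up step, sketched here on a small hand-built frame, splits each pair with `str.split(expand=True)`. The column names `OrtoB_id` and `OrtoB_score` are hypothetical and do not appear in the original thread.

```python
import pandas as pd

new = pd.DataFrame({
    "OrtoA": ["g15604.t1 1.000", "g15604.t1 1.000"],
    "OrtoB": ["YGR032W 1.000", "YLR342W 0.679"],
})

# expand=True returns one column per whitespace-separated token
parts = new["OrtoB"].str.split(expand=True)
new["OrtoB_id"] = parts[0]
# convert the score from text to a number
new["OrtoB_score"] = parts[1].astype(float)
print(new[["OrtoB_id", "OrtoB_score"]])
```

The same two lines could be applied to OrtoA if its ID and score also need to be separated.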

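I think something went wrong with the formatting when the example dataset was printed or copy-pasted. The columns don't seem to line up with the column names. This works great; thank you for putting it into a function.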