Python 2.7: creating new columns for duplicate records

Tags: python, python-2.7, pandas

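I have an input file that is generated at runtime, in the following format. Case 1:

The generated file can also look like this. Case 2:

Expected output. Case 1:

Case 2: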

      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN   None      NaN      NaN
1  1234567890     1     A1      200        3   NaN   None      NaN      NaN
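In the input file there may be 0, 1, or 2 rows (but never more than 2) with the same number (for example 1234567890). I am trying to collapse those two rows into a single row, as shown in the output file.

I want to convert my input file into the structure above. How can I do that? I am really new to pandas. Please help me with this. Thanks in advance.

In Case 2:

The structure of the output file must stay the same, i.e. the column names should be identical.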

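I think you need to:

  • first, create a new column with cumcount that numbers the rows
    within each Numbers group

  • then reshape with set_index + unstack

  • finally, convert the MultiIndex in the columns
    to a plain Index
    with a list comprehension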
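The three steps above can be sketched as follows. This is only a minimal illustration: the input frame is hypothetical, reconstructed from the output shown in this thread (the original sample input is not available), and the variable name wide is made up for the example.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown in this thread.
df = pd.DataFrame({
    'Numbers': [1234567890, 123459876, 1234567890],
    'ID': [1, 3, 2],
    'P_ID': ['A1', 'A3', 'A2'],
    'Cores': [200, 1000, 150],
    'Count': [3, 1, 3],
})

# Step 1: number the rows inside each 'Numbers' group
# (0 for the first occurrence, 1 for the second).
df['g'] = df.groupby('Numbers').cumcount()

# Step 2: move the counter into the columns. Each original column becomes
# a (name, g) pair in a MultiIndex; sort so that the g == 0 block comes first.
wide = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)

# Step 3: flatten the MultiIndex into names such as 'ID_1' and 'ID_2'.
wide.columns = ['_'.join((name, str(g + 1))) for name, g in wide.columns]
wide = wide.reset_index()

print(wide)
```

The order of the columns inside each block may differ between pandas versions, so select columns by name rather than by position if the order matters.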

EDIT:

To convert to int, you can use a custom function that converts a column only when no error is raised, so columns containing NaNs are left unchanged:

def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x

df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
df1 = df1.apply(f).reset_index()
print (df1)
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN   None      NaN      NaN
1  1234567890     1     A1      200        3   2.0     A2    150.0      3.0
EDIT1:

Each group has 1 or 2 rows, so the full set of output columns is known in advance and you can use:

def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x

df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
cols = ['ID_1','P_ID_1','Cores_1','Count_1','ID_2','P_ID_2','Cores_2','Count_2']
# reindex(columns=...) replaces reindex_axis, which newer pandas releases no longer provide
df1 = df1.apply(f).reindex(columns=cols).reset_index()
print (df1)
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2  P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN     NaN      NaN      NaN
1  1234567890     1     A1      200        3   NaN     NaN      NaN      NaN
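
To see why the fixed column list keeps the structure stable, here is a small sketch for Case 2, where no number is repeated. The input frame is hypothetical (the original sample is not available) and the names wide and fixed are made up for the example. It uses reindex(columns=...), which fills the same role as the column reindexing in the code above.

```python
import pandas as pd

# Hypothetical Case 2 input: no 'Numbers' value is repeated.
df = pd.DataFrame({
    'Numbers': [123459876, 1234567890],
    'ID': [3, 1],
    'P_ID': ['A3', 'A1'],
    'Cores': [1000, 200],
    'Count': [1, 3],
})

df['g'] = df.groupby('Numbers').cumcount()
wide = df.set_index(['Numbers', 'g']).unstack()
wide.columns = ['_'.join((name, str(g + 1))) for name, g in wide.columns]

cols = ['ID_1', 'P_ID_1', 'Cores_1', 'Count_1',
        'ID_2', 'P_ID_2', 'Cores_2', 'Count_2']

# Only the *_1 columns exist at this point; reindex adds the missing
# *_2 columns filled with NaN and puts everything in the requested order.
fixed = wide.reindex(columns=cols).reset_index()

print(fixed.columns.tolist())
```

Because the column list is spelled out, the output has the same nine columns whether or not the input contains duplicates.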

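This is perfect. Thank you very much. I only noticed that all the integer values are being converted to float. Is there another way to fix this, rather than explicitly casting each column by name?

Hmm, do you need the columns without NaNs converted to int?

Not exactly. I only have a problem with the ID columns, i.e. ID_1 and ID_2.

Yes, but in the output only the column ID_1 can be converted to int. If a column contains at least one NaN value (as ID_2 does), the conversion is not possible, because NaN is always a float.

Also, could you take a moment to explain what you did, if it is not too much trouble?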
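
The point about NaN can be checked with a short sketch. The two Series below are made-up stand-ins for ID_1 and ID_2. The last step assumes pandas 0.24 or later, where the nullable 'Int64' extension dtype is available; it is offered as one possible option, not as part of the answer above.

```python
import numpy as np
import pandas as pd

complete = pd.Series([3.0, 1.0])      # like ID_1: no missing values
with_gap = pd.Series([np.nan, 2.0])   # like ID_2: one missing value

# A column without NaN converts cleanly to an integer dtype.
print(complete.astype(int).dtype)

# A column with NaN cannot be cast to a plain integer dtype,
# because NaN has no integer representation.
try:
    with_gap.astype(int)
    converted = True
except (TypeError, ValueError):
    converted = False
print(converted)

# Assumption: pandas >= 0.24. The nullable 'Int64' dtype (capital I)
# stores integers together with a missing-value marker.
nullable = with_gap.astype('Int64')
print(nullable.dtype)
```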

Without the int conversion, the same steps return the numeric columns as floats:

df['g'] = df.groupby('Numbers').cumcount()
df = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df.columns]
df = df.reset_index()
print (df)
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876   3.0     A3   1000.0      1.0   NaN   None      NaN      NaN
1  1234567890   1.0     A1    200.0      3.0   2.0     A2    150.0      3.0