Python 3.x: Adding columns based on a varying number of rows below
I am working on a machine learning problem for a university project. As input I received an Excel sheet.

I need to access the information in the rows below certain rows (condition: df[c1] != 0) and use it to create new columns. However, the number of rows that follow such a row is not fixed.

I tried various pandas functions (for example, while loops combined with iloc, and iterrows), but nothing seems to work. Now I am wondering whether I need to write a function in which I create a new df for each group of rows below each top element. I assume there must be a better option. I am using Python 3.6 and Pandas 0.25.0.

Given an input table with the columns name, c1 and c2, the output should look like this:
  name  c1              c2              ka         tz         zz
0   ab   1            info  even more info  more info
1   tz   0       more info
2   ka   0  even more info
3   cd   2            info                                more info
4   zz   0       more info
You can do this as follows:
# make sure c1 is of type int (if it isn't already)
# if it is string, just change the comparison further below
df['c1']= df['c1'].astype('int32')
# create two temporary aux columns in the original dataframe
# the first contains 1 for each row where c1 is nonzero
df['nonzero']= (df['c1'] != 0).astype('int')
# the second contains a "group index" to give
# all rows that belong together the same number
df['group']= df['nonzero'].cumsum()
# create a working copy from the original dataframe
df2= df[['c1', 'c2', 'group']].copy()
# add another column which contains the name of the
# column under which the text should appear
df2['col']= df['name'].where(df['nonzero']==0, 'c2')
# add a dummy column with all ones
# (needed to merge the original dataframe
# with the "transposed" dataframe later)
df2['nonzero']= 1
# now the main part
# use the prepared copy and index it on
# group, nonzero(1) and col
df3= df2[['group', 'nonzero', 'col', 'c2']].set_index(['group', 'nonzero', 'col'])
# unstack it, meaning col is "split off" to create a new column
# level (like pivoting), the rest remains in the index
df3= df3.unstack()
# now df3 has a multilevel column index
# to get rid of it and have regular column names
# just rename the columns and remove c2 which
# we get from the original dataframe
df3_names= ['{1}'.format(*tup) for tup in df3.columns]
df3.columns= df3_names
df3.drop(['c2'], axis='columns', inplace=True)
# df3 now contains the "transposed" infos from column c2,
# which should appear in the row for which 'nonzero' contains 1
# to get this, use merge
result= df.merge(df3, left_on=['group', 'nonzero'], right_index=True, how='left')
# if you don't like the NaN values (for the rows with nonzero=0), use fillna
result.fillna('', inplace=True)
# remove the aux columns; the original c2 column from df
# is kept under its own name, so no renaming is needed
result.drop(['group', 'nonzero'], axis='columns', inplace=True)
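The two building blocks of the code above, the cumulative-sum group index and `unstack`, can be tried in isolation. The following is only a minimal sketch: the dataframe is typed in by hand to mirror the example data, and the variable names `is_header`, `long` and `wide` are made up for illustration.

```python
import pandas as pd

# Hand-typed copy of the example data (an assumption for this sketch).
df = pd.DataFrame({
    "name": ["ab", "tz", "ka", "cd", "zz"],
    "c1": [1, 0, 0, 2, 0],
    "c2": ["info", "more info", "even more info", "info", "more info"],
})

# True on every "top" row, i.e. where c1 is nonzero.
is_header = df["c1"] != 0

# The running sum grows by one at each top row, so a top row and all
# rows below it (up to the next top row) share the same group number.
group = is_header.cumsum()
print(group.tolist())

# Keep only the rows below the top rows, labelled with their group.
long = df.loc[~is_header, ["name", "c2"]].assign(group=group[~is_header])

# Index by (group, name) and move the name level into the columns:
# one row per group, one column per name, NaN where a name is absent.
wide = long.set_index(["group", "name"])["c2"].unstack()
print(wide)
```

Because the group number only changes where c1 is nonzero, this works no matter how many rows follow each top row.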
The final `result` dataframe looks like this:
Out[191]:
  name  c1              c2              ka         tz         zz
0   ab   1            info  even more info  more info
1   tz   0       more info
2   ka   0  even more info
3   cd   2            info                                more info
4   zz   0       more info
This result was produced from the following input data:
Out[166]:
  name  c1              c2
0   ab   1            info
1   tz   0       more info
2   ka   0  even more info
3   cd   2            info
4   zz   0       more info
# created by the following code:
import io
import pandas as pd
raw=""" name c1 c2
0 ab 1 info
1 tz 0 more_info
2 ka 0 even_more_info
3 cd 2 info
4 zz 0 more_info"""
# raw string for the regex separator avoids an invalid-escape warning
df= pd.read_csv(io.StringIO(raw), sep=r'\s+', index_col=0)
df['c2']=df['c2'].str.replace('_', ' ')
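For reuse, the same idea could be wrapped in a small function. This is only a sketch under stated assumptions: the name `spread_rows_below` and its parameters are hypothetical, it uses `DataFrame.pivot` in place of the `set_index`/`unstack`/`merge` steps of the answer, and it assumes that names do not repeat within a group and do not clash with existing column names.

```python
import pandas as pd


def spread_rows_below(df, flag="c1", key="name", value="c2"):
    """Hypothetical helper: for every row with a nonzero `flag`, copy the
    `value` of each row below it into a new column named after its `key`."""
    is_header = df[flag] != 0
    group = is_header.cumsum()

    # Rows below the top rows, labelled with the group they belong to.
    detail = df.loc[~is_header, [key, value]].assign(group=group[~is_header])

    # One row per group, one column per key found below the top row.
    wide = detail.pivot(index="group", columns=key, values=value)

    # Re-label the wide rows with the index of their top row, so that a
    # plain index join puts them on the top rows only.
    wide = wide.reindex(group[is_header].tolist())
    wide.index = df.index[is_header]

    # All other rows get NaN from the join; replace it with "".
    return df.join(wide).fillna("")


# Hand-typed copy of the example data (an assumption for this sketch).
df = pd.DataFrame({
    "name": ["ab", "tz", "ka", "cd", "zz"],
    "c1": [1, 0, 0, 2, 0],
    "c2": ["info", "more info", "even more info", "info", "more info"],
})
out = spread_rows_below(df)
print(out)
```

If a name occurs twice below the same top row, `pivot` raises a `ValueError`, so such data would need to be deduplicated or aggregated first. Unlike the step-by-step version, this sketch does not add helper columns to the original `df`.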
What is the row index of "more info"? Should the empty cells be filled with "" or with NaN?