Python: group and transpose rows in a DataFrame based on specific row names
This question is related to an earlier post. I have a DataFrame in which every row holds text scraped from the web containing sports selection information (all in a single column). The solution in the linked post worked well, but I have run into more problems because the text does not follow a consistent pattern. Here is my DataFrame:
print(df):
Col A
Race 1 - Handicap
14 - NAME
3 - NAME
5 - NAME
6 - NAME
4 - NAME
Race Overview: lorem ipsum etc etc
Race 2 - Sprint
12 - NAME
10 - NAME
8 - NAME
11 - NAME
Race Overview: Second lorem ipsum etc etc
Race 3 - Sprint
1 - NAME
14 - NAME
8 - NAME
6 - NAME
Race 4 - Handicap
1 - NAME
14 - NAME
8 - NAME
#Race numbers may run up to 15-20
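For anyone who wants to reproduce the problem, the printout above can be rebuilt as a one-column DataFrame. This is only a sketch: the variable name `rows` is an assumption, and "NAME" is the placeholder text used in the question, not real data.

```python
import pandas as pd

# Rows copied from the printout above, in the same order.
rows = [
    "Race 1 - Handicap",
    "14 - NAME", "3 - NAME", "5 - NAME", "6 - NAME", "4 - NAME",
    "Race Overview: lorem ipsum etc etc",
    "Race 2 - Sprint",
    "12 - NAME", "10 - NAME", "8 - NAME", "11 - NAME",
    "Race Overview: Second lorem ipsum etc etc",
    "Race 3 - Sprint",
    "1 - NAME", "14 - NAME", "8 - NAME", "6 - NAME",
    "Race 4 - Handicap",
    "1 - NAME", "14 - NAME", "8 - NAME",
]

# Single column named exactly as in the question.
df = pd.DataFrame({"Col A": rows})
print(df.shape)
```

Note that the blocks have different lengths (7, 6, 5 and 4 rows), which is what makes a fixed-size reshape unsuitable.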
This is what I am trying to convert it to:
print(df):
Race Name | Selection No | Selection | Race Overview
Race 1 - Handicap | 1 | 14 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 2 | 3 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 3 | 5 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 4 | 6 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 5 | 4 - Name | Race Overview: lorem ipsum etc etc
Race 2 - Sprint | 1 | 12 - Name | Race Overview: Second lorem ipsum etc etc
Race 2 - Sprint | 2 | 10 - Name | Race Overview: Second lorem ipsum etc etc
Race 2 - Sprint | 3 | 8 - Name | Race Overview: Second lorem ipsum etc etc
Race 2 - Sprint | 4 | 11 - Name | Race Overview: Second lorem ipsum etc etc
Race 3 - Sprint | 1 | 1 - Name |
Race 3 - Sprint | 2 | 14 - Name |
Race 3 - Sprint | 3 | 8 - Name |
Race 3 - Sprint | 4 | 6 - Name |
Race 4 - Handicap | 1 | 1 - Name |
Race 4 - Handicap | 2 | 14 - Name |
Race 4 - Handicap | 3 | 8 - Name |
If the pattern were a repeating block of 6 rows, this code would do the transposition:
df2 = (
pd.DataFrame(data = df['Col A'].values.reshape(-1, 6))
.set_index([0, 5])
.stack()
.rename_axis(index=['Race Name','Race Overview','Selection No'])
.to_frame('Selection')
.reset_index()
)
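To illustrate why the fixed-size reshape only suits regular data, here is a small sketch. The `regular` frame is hypothetical (every race has exactly 6 rows), and the 22-element array only stands in for the row count of the DataFrame in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical regular data: race name, 4 selections, overview = 6 rows per race.
regular = pd.DataFrame({"Col A": [
    "Race 1 - Handicap", "14 - NAME", "3 - NAME", "5 - NAME", "6 - NAME",
    "Race Overview: first",
    "Race 2 - Sprint", "12 - NAME", "10 - NAME", "8 - NAME", "11 - NAME",
    "Race Overview: second",
]})

# 12 rows split cleanly into 2 blocks of 6.
blocks = regular["Col A"].to_numpy().reshape(-1, 6)
print(blocks.shape)

# 22 rows (as in the question) cannot be split into blocks of 6.
irregular = np.array(["row"] * 22, dtype=object)
try:
    irregular.reshape(-1, 6)
    reshaped = True
except ValueError:
    reshaped = False
print(reshaped)
```

Even when the row count happens to be a multiple of 6, blocks of varying length would be cut at the wrong places, so the race names would no longer line up in the first column.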
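Is there a way to find the rows between each "Race [0-9] -" row and then run the df2 code above on each of those blocks?
Any help is much appreciated. Thanks!

Use: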
#get Race values by pattern
df['Race Name'] = df['Col A'].where(df['Col A'].str.contains('Race [0-9]+ -'))
#get Selection values by pattern - starting numeric of original column
df['Selection'] = df['Col A'].where(df['Col A'].str.contains('^[0-9]+'))
#get info column
df['Race Overview'] = df['Col A'].where(df['Race Name'].isna() & df['Selection'].isna())
#forward and back filling per helper groups
s1 = df['Selection'].isna().cumsum()
s2 = df['Race Overview'].notna().iloc[::-1].cumsum()
df['Race Name'] = df.groupby(s1)['Race Name'].ffill()
df['Race Overview'] = df.groupby(s2)['Race Overview'].bfill()
#remove rows by missing values and also original column
df = df.dropna(subset=['Race Name', 'Selection']).drop('Col A', axis=1)
#added counter
df.insert(1, 'Selection No', df.groupby('Race Name').cumcount().add(1))
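Hi @jezrael. This code works great, although the only issue is that it doesn't seem to pick up races from "Race 10" onward. Do I need to adjust the
df['Race Name'] = df['Col A'].where(df['Col A'].str.contains('Race [0-9] -'))
code to capture races that go into double digits?
@SOK - Can you check df['Race Name'] = df['Col A'].where(df['Col A'].str.contains('Race [0-9]+ -')) ? The + means one or more digits [0-9].
Perfect! The + is a handy tip, so I appreciate the help @jezrael. Thank you!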
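The difference between the two patterns discussed in the comments can be checked on a few sample strings. This is a minimal sketch; the `names` Series is made up for illustration.

```python
import pandas as pd

names = pd.Series([
    "Race 1 - Handicap",
    "Race 10 - Sprint",
    "Race Overview: text",
])

# Exactly one digit must sit between "Race " and " -".
single = names.str.contains("Race [0-9] -")
# One or more digits are allowed, so double-digit races match too.
multi = names.str.contains("Race [0-9]+ -")

print(single.tolist())  # [True, False, False]
print(multi.tolist())   # [True, True, False]
```

Neither pattern matches the "Race Overview" row, because no digit follows "Race ".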
Output of the answer's code:
print(df)
Race Name Selection No Selection \
4 Race 1 - Handicap 1 14 - NAME
5 Race 1 - Handicap 2 3 - NAME
6 Race 1 - Handicap 3 5 - NAME
7 Race 1 - Handicap 4 6 - NAME
8 Race 1 - Handicap 5 4 - NAME
11 Race 2 - Sprint 1 12 - NAME
12 Race 2 - Sprint 2 10 - NAME
13 Race 2 - Sprint 3 8 - NAME
14 Race 2 - Sprint 4 11 - NAME
17 Race 3 - Sprint 1 1 - NAME
18 Race 3 - Sprint 2 14 - NAME
19 Race 3 - Sprint 3 8 - NAME
20 Race 3 - Sprint 4 6 - NAME
22 Race 4 - Handicap 1 1 - NAME
23 Race 4 - Handicap 2 14 - NAME
24 Race 4 - Handicap 3 8 - NAME
Race Overview
4 Race Overview: lorem ipsum etc etc
5 Race Overview: lorem ipsum etc etc
6 Race Overview: lorem ipsum etc etc
7 Race Overview: lorem ipsum etc etc
8 Race Overview: lorem ipsum etc etc
11 Race Overview: Second lorem ipsum etc etc
12 Race Overview: Second lorem ipsum etc etc
13 Race Overview: Second lorem ipsum etc etc
14 Race Overview: Second lorem ipsum etc etc
17 NaN
18 NaN
19 NaN
20 NaN
22 NaN
23 NaN
24 NaN
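One thing to be aware of: the answer back-fills the overview using groups built from the overview rows themselves, so a race without an overview that is followed by a race with one would presumably inherit the later race's overview. That does not occur in the sample data. As an alternative sketch (not the answer's method), each row can be assigned to a block that starts at a race header, and the name and overview looked up within that block. The function name `tidy_races` and its `col` parameter are hypothetical.

```python
import pandas as pd

def tidy_races(df, col="Col A"):
    """Hypothetical helper: one output row per selection, grouped by race block."""
    s = df[col]
    is_race = s.str.contains(r"^Race [0-9]+ -", na=False)
    is_sel = s.str.contains(r"^[0-9]+", na=False)

    # Every race header starts a new block; rows before the first header get id 0.
    block = is_race.cumsum()

    # "first" picks the first non-missing value per block, or NaN if there is none.
    race_name = s.where(is_race).groupby(block).transform("first")
    overview = s.where(~is_race & ~is_sel).groupby(block).transform("first")

    # Keep only selection rows that belong to some race.
    keep = is_sel & (block > 0)
    out = pd.DataFrame({
        "Race Name": race_name,
        "Selection": s,
        "Race Overview": overview,
    })[keep]

    # Running counter of selections within each race block.
    out.insert(1, "Selection No", out.groupby(block[keep]).cumcount() + 1)
    return out.reset_index(drop=True)

# Small made-up sample: the first race has no overview, the second one does.
sample = pd.DataFrame({"Col A": [
    "Race 1 - Handicap", "3 - NAME",
    "Race 2 - Sprint", "7 - NAME", "9 - NAME", "Race Overview: two",
    "Race 10 - Sprint", "1 - NAME",
]})
print(tidy_races(sample))
```

Grouping by the block id rather than by the race name also keeps the counter correct if two races happen to share the same name.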