Python pandas DataFrame (from CSV) with multiple header rows throughout the data


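I am working with a DataFrame created from a CSV file. The data contains header rows throughout: each header row identifies the rows beneath it, up to the next header row.

The data looks like this: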

2001|     |colour |Price | Quantity sold
Shoes|
Blank  | High heal Shoes| red |£22|44
Blank  | Low heal Shoes|red |£22|44
Slippers|
Blank  | High heal Slippers| red |£22|44
Blank  | High heal Slippers| blue |£22|44
Blank  | Low heal Slippers| red |£22|44
2002|   |colour |Price | Quantity sold
Shoes|
Blank  | High heal Shoes| red |£22|44
Blank  | Low heal Shoes|red |£22|44
Slippers|
Blank  | High heal Slippers| red |£22|44
Blank  | High heal Slippers| blue |£22|44
Blank  | Low heal Slippers| red |£22|44
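What kind of structure is this?

I need to read through this DataFrame and, using the header rows (2001, 2002, and so on), get all the data for a particular item (say, slippers) for each year. Even just adding the corresponding year next to each data row would help.

I would appreciate any help.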

Use:

import pandas as pd

df = pd.read_csv('test.csv')

# name of the first column (here '2001', the first year, which was read as a header)
col = df.columns[0]

# forward fill the first column with the last non-missing value
df[col] = df[col].ffill()
# convert the first column to numeric; everything that is not a year becomes NaN
num = pd.to_numeric(df[col], errors='coerce')
# forward fill the years; rows before the first repeated year row get the first column's name
df['Year'] = num.ffill().fillna(col)
# rename columns
df = df.rename(columns={col:'Shoes', 'Unnamed: 1':'Names'})
# keep only data rows: drop the repeated year rows and the rows with no colour
df = df[num.isnull() & df['colour'].notnull()].reset_index(drop=True)
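For anyone who wants to try the idea without the original file, here is a minimal self-contained sketch. The in-memory CSV is hypothetical: it is only modelled on the sample in the question, and it assumes a comma separator, the literal text Blank in otherwise empty cells, and made-up prices. It uses mask() to turn Blank into NaN, which should have the same effect as the str.strip().replace('Blank', np.nan) step asked about in the comments.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file (assumed layout, not the real data)
raw = """2001,,colour,Price,Quantity sold
Shoes,,,,
Blank,High heel Shoes,red,22,44
Blank,Low heel Shoes,red,22,44
Slippers,,,,
Blank,High heel Slippers,red,22,44
Blank,High heel Slippers,blue,22,44
2002,,colour,Price,Quantity sold
Shoes,,,,
Blank,High heel Shoes,red,23,40
Slippers,,,,
Blank,Low heel Slippers,red,24,41
"""

df = pd.read_csv(io.StringIO(raw))

col = df.columns[0]                      # '2001': the first year was read as a header
labels = df[col].str.strip()
labels = labels.mask(labels == "Blank")  # treat the literal 'Blank' as missing
year = pd.to_numeric(labels, errors="coerce")  # NaN except on repeated year rows

tidy = df.assign(
    # category labels (Shoes, Slippers) carried down over the blank cells
    Category=labels.where(year.isna()).ffill(),
    # years carried down; the first block takes the year from the column name
    Year=year.ffill().fillna(int(col)).astype(int),
).rename(columns={"Unnamed: 1": "Names"})

# keep only real data rows: not a year row and with a colour present
tidy = tidy[year.isna() & tidy["colour"].notna()]
tidy = tidy.drop(columns=col).reset_index(drop=True)

slippers = tidy[tidy["Category"] == "Slippers"]
print(slippers)
```

Because the category and the year end up as ordinary columns, selecting one item across all years is a plain boolean filter, as in the last two lines.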
Edit:



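Thanks for the reply. I don't understand what is happening on one particular line; I hope you don't mind me asking some questions. What does this line do: df[col] = df[col].str.strip().replace('Blank', np.nan).ffill() and what exactly does the forward fill do?

No problem. If my solution doesn't work, the cause is probably the actual format of the file, so could you share a sample file with the real separator and the real blank values?

ffill() replaces each NaN with the last known non-NaN value, so for 1,2,NaN,NaN,4,7,NaN it returns 1,2,2,2,4,7,7

Thanks. Here is a link to a demo file with formatted data. Thanks for your help.

I'll check the code later.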
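The forward fill described in the comments can be checked on a tiny Series. This is only an illustrative sketch using the same numbers as the comment; the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, 4, 7, np.nan])

# each NaN is replaced by the closest non-NaN value above it
filled = s.ffill()
print(filled.tolist())

# a NaN with nothing above it has no value to copy, so it stays NaN
leading = pd.Series([np.nan, 3, np.nan]).ffill()
print(leading.tolist())
```

The second case is why the answer follows ffill() with fillna(col): the rows of the first block have no year row above them, so their year has to come from the column name.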
print (df)
           Shoes       Names  colour price Quantity sold  Year
0   Type A shoes  Sub type A     red    22             5  2001
1   Type A shoes  Sub type A   green    11             5  2001
2   Type A shoes  Sub type A  yellow    44             5  2001
3   Type A shoes  Sub type B     red    33             5  2001
4   Type A shoes  Sub type B   green    66             5  2001
5   Type A shoes  Sub type B  yellow    22             5  2001
6   Type B shoes  Sub type A     red    11             5  2001
7   Type B shoes  Sub type A   green    44             5  2001
8   Type B shoes  Sub type A  yellow    33             5  2001
9   Type B shoes  Sub type B     red    66             5  2001
10  Type B shoes  Sub type B   green    21             5  2001
11  Type B shoes  Sub type B  yellow    22             5  2001
12  Type A shoes  Sub type A     red    22             5  2002
13  Type A shoes  Sub type A   green    11             5  2002
14  Type A shoes  Sub type A  yellow    44             5  2002
15  Type A shoes  Sub type B     red    33             5  2002
16  Type A shoes  Sub type B   green    66             5  2002
17  Type A shoes  Sub type B  yellow    22             5  2002
18  Type B shoes  Sub type A     red    11             5  2002
19  Type B shoes  Sub type A   green    44             5  2002
20  Type B shoes  Sub type A  yellow    33             5  2002
21  Type B shoes  Sub type B     red    66             5  2002
22  Type B shoes  Sub type B   green    21             5  2002
23  Type B shoes  Sub type B  yellow    22             5  2002
# the shared demo file is tab-separated
df = pd.read_csv('testV2.csv', sep='\t')

# name of the first column (here '2001', the first year, which was read as a header)
col = df.columns[0]

# forward fill the first column with the last non-missing value
df[col] = df[col].ffill()
# convert the first column to numeric; everything that is not a year becomes NaN
num = pd.to_numeric(df[col], errors='coerce')
# forward fill the years; rows before the first repeated year row get the first column's name
df['Year'] = num.ffill().fillna(col)
# rename columns
df = df.rename(columns={col:'Top Category', 'Unnamed: 1':'Names'})
# keep only data rows: drop the year rows and the repeated 'Top Category' header rows
df = df[num.isnull() & (df['Top Category'] != 'Top Category')].reset_index(drop=True)
print (df)

   Top Category   Names Colour Price Sold  Year
0        Item 1  Type 1      -     2  NaN  2001
1        Item 2  Type 1      -     2  NaN  2001
2        Item 3  Type 1    red     2    5  2001
3        Item 3  Type 2   blue     2    5  2001
4        Item 3  Type 3  green     2    5  2001
5        item 4  Type 1    red     2    5  2001
6        item 4  Type 2   blue     3  NaN  2001
7        item 4  Type 3  green     3  NaN  2001
8        Item 1  Type 1      -     3  NaN  2002
9        Item 2  Type 1      -     3  NaN  2002
10       Item 3  Type 1    red     3    5  2002
11       Item 3  Type 2   blue     3    5  2002
12       Item 3  Type 3  green     3    5  2002
13        Item4  Type 1    red     3  NaN  2002
14        Item4  Type 2   blue     3  NaN  2002
15        Item4  Type 3  green     3  NaN  2002
16       Item 1  Type 1      -     3  NaN  2003
17       Item 2  Type 1      -     3  NaN  2003
18       Item 3  Type 1    red     3    5  2003
19       Item 3  Type 2   blue     3    5  2003
20       Item 3  Type 3  green     3    5  2003
21        Item4  Type 1    red     3  NaN  2003
22        Item4  Type 2   blue     3  NaN  2003
23        Item4  Type 3  green     3  NaN  2003
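Once the frame has this shape, the original goal (all data for one item, per year) reduces to a filter and a groupby. The sketch below builds a small hypothetical frame with the same column names as the output above rather than reading the real file. It also assumes Price and Sold arrive as strings, which is likely when a column also held repeated header text.

```python
import pandas as pd

# Hypothetical rows shaped like the cleaned output above
tidy = pd.DataFrame({
    "Top Category": ["Item 3", "Item 3", "Item 4", "Item 3", "Item 4"],
    "Names": ["Type 1", "Type 2", "Type 1", "Type 1", "Type 1"],
    "Price": ["2", "2", "2", "3", "3"],
    "Sold": ["5", "5", "5", "5", None],
    "Year": [2001, 2001, 2001, 2002, 2002],
})

# convert the text columns to numbers before aggregating
tidy["Price"] = pd.to_numeric(tidy["Price"], errors="coerce")
tidy["Sold"] = pd.to_numeric(tidy["Sold"], errors="coerce")

# all rows for one item, across every year
item3 = tidy[tidy["Top Category"] == "Item 3"]

# total quantity sold for that item in each year
sold_per_year = item3.groupby("Year")["Sold"].sum()
print(sold_per_year)
```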