Fixing broken HTML tables extracted with BS4 in Python

python pandas

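I am parsing HTML tables from regulatory filings. This is tricky because the HTML is often broken, which produces badly constructed tables. Here is an example of a table after I loaded it into a pandas DataFrame: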

                0   1    2     3   4         5  \
0             NaN NaN  NaN   NaN NaN       NaN   
1            Name NaN  Age   NaN NaN  Position   
2    Aylwin Lewis NaN  NaN  59.0 NaN       NaN   
3    John Morlock NaN  NaN  58.0 NaN       NaN   
4  Matthew Revord NaN  NaN  50.0 NaN       NaN   
5  Charles Talbot NaN  NaN  48.0 NaN       NaN   
6      Nancy Turk NaN  NaN  49.0 NaN       NaN   
7      Anne Ewing NaN  NaN  49.0 NaN       NaN   

                                                   6  
0                                                NaN  
1                                                NaN  
2    Chairman, Chief Executive Officer and President  
3    Senior Vice President, Chief Operations Officer  
4  Senior Vice President, Chief Legal Officer, Ge...  
5  Senior Vice President and Chief Financial Officer  
6  Senior Vice President, Chief People Officer an...  
7        Senior Vice President, New Shop Development

I wrote the following Python code to try to fix the table:

#dropping empty rows
df = df.dropna(how='all',axis=0)

#dropping columns with fewer than 2 non-empty values
df = df.dropna(thresh=2, axis=1)

#resetting dataframe index
df = df.reset_index(drop = True)

#set found_name variable to stop the loop once it finds the name column
found_name = 0

#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():

    #only loop if we have not found a name column yet
    if found_name == 0: 

        #convert the row to string
        text_row = str(row)

        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row",row.Index," as header.")

            #changing column names
            df.columns = df.iloc[row.Index]

            #dropping first rows
            df = df.iloc[row.Index + 1 :]

            #changing found_name to 1
            found_name = 1

            #reindex
            df = df.reset_index(drop = True)
            print("Attempted to clean dataframe:")
            print(df)

This is the table I get:

0            Name   NaN                                                NaN
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development

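My main problem is that the headers "Age" and "Position" have disappeared because they are not aligned with their columns. I am parsing many tables with this script, so I cannot fix them by hand. How can I repair the data at this point?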
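To see why the headers vanish, it can help to count the non-empty cells per column. The following is a minimal sketch on a shortened, hand-built copy of the table above (the two-row data and the variable names `non_null` and `kept` are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

nan = np.nan
# Shortened copy of the table: the header cells sit in columns 2 and 5,
# while the matching data sits in columns 3 and 6.
df = pd.DataFrame([
    [nan, nan, nan, nan, nan, nan, nan],
    ["Name", nan, "Age", nan, nan, "Position", nan],
    ["Aylwin Lewis", nan, nan, 59.0, nan, nan, "Chairman"],
    ["John Morlock", nan, nan, 58.0, nan, nan, "Senior Vice President"],
])

# Number of non-empty cells in each column.
non_null = df.notna().sum()
print(non_null.tolist())

# thresh=2 keeps only the columns with at least two non-empty cells,
# so columns 2 and 5 (one header cell each) are dropped.
kept = df.dropna(how="all", axis=0).dropna(thresh=2, axis=1)
print(kept.columns.tolist())
```

Columns 2 and 5 contain nothing but the header text, so the threshold filter removes them before the header row is ever inspected.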

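Don't drop the almost-empty columns at the start; we need them later. Once the header row containing "Name" is found, collect all of its non-empty elements and, after dropping the empty columns from the remaining data, set them as the column headers.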

import pandas as pd

#dropping empty rows
df = df.dropna(how='all',axis=0)

#resetting dataframe index
df = df.reset_index(drop = True)

#set found_name variable to stop the loop once it finds the name column
found_name = 0

#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():

    #only loop if we have not found a name column yet
    if found_name == 0: 

        #convert the row to string
        text_row = str(row)

        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row",row.Index," as header.")

            #collect column names
            headers = [c for c in row if not pd.isnull(c)][1:]

            #dropping first rows
            df = df.iloc[row.Index + 1 :]

            #dropping columns that are entirely empty
            df = df.dropna(how='all', axis=1)

            #setting column names
            df.columns = (headers + ['col'] * (len(df.columns) - len(headers)))[:len(df.columns)]

            #changing found_name to 1
            found_name = 1

            #reindex
            df = df.reset_index(drop = True)
            print("Attempted to clean dataframe:")
            print(df)

Result:

             Name   Age                                           Position
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development
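Since the script runs over many tables, the same steps could be wrapped in a function. The following is only a sketch of the approach above, not a drop-in replacement: `promote_header_row` and its parameters `keyword` and `filler` are hypothetical names, and it drops only columns that are entirely empty.

```python
import numpy as np
import pandas as pd


def promote_header_row(df, keyword="Name", filler="col"):
    """Use the first row containing `keyword` as the header (hypothetical helper)."""
    # Drop fully empty rows first, as in the answer.
    df = df.dropna(how="all", axis=0).reset_index(drop=True)
    for pos in range(len(df)):
        # Non-empty cells of the candidate header row.
        cells = [c for c in df.iloc[pos] if not pd.isnull(c)]
        if any(keyword in str(c) for c in cells):
            # Keep the rows below the header and drop all-empty columns.
            body = df.iloc[pos + 1:].dropna(how="all", axis=1)
            n = len(body.columns)
            # Pad with dummy names or truncate so the lengths match.
            body.columns = (cells + [filler] * (n - len(cells)))[:n]
            return body.reset_index(drop=True)
    return df  # no header row found: return the frame with empty rows removed


nan = np.nan
raw = pd.DataFrame([
    [nan] * 7,
    ["Name", nan, "Age", nan, nan, "Position", nan],
    ["Aylwin Lewis", nan, nan, 59.0, nan, nan, "Chairman"],
    ["John Morlock", nan, nan, 58.0, nan, nan, "Senior Vice President"],
])
clean = promote_header_row(raw)
print(clean.columns.tolist())
```

Returning a new frame instead of reassigning `df` inside the loop also removes the need for the `found_name` flag.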

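Thank you very much for your help! I get the following error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements. What should I do?

This error occurs when the header row that was found has fewer non-empty entries than the remaining dataset has columns. In that case you can pad the list of column names with dummy names, e.g.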

df.columns = headers + ['col'] * (len(df.columns) - len(headers))

instead of

df.columns=headers

@user1029296: I updated my answer to handle the case where too few or too many column headers are found.
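One side effect of padding with a repeated `'col'` is that several columns can end up sharing the same label, which makes later selection by name ambiguous. A possible variation, shown here as a sketch with the hypothetical helper `fit_headers`, numbers the dummy names by position:

```python
def fit_headers(headers, n_columns, filler="col"):
    """Pad or truncate `headers` to exactly `n_columns` names (hypothetical helper)."""
    # Numbered dummy names (col2, col3, ...) keep the padded labels distinct.
    padded = list(headers) + [f"{filler}{i}" for i in range(len(headers), n_columns)]
    return padded[:n_columns]


too_few = fit_headers(["Name", "Age"], 3)
too_many = fit_headers(["Name", "Age", "Position", "Extra"], 3)
print(too_few)
print(too_many)
```

It would be used as `df.columns = fit_headers(headers, len(df.columns))`.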