Python: linking nested tables in pandas


I am able to read the inner tables directly using the code below:

import pandas as pd
url = 'https://s3.amazonaws.com/todel162/AAKAR.html'
tables = pd.read_html(url, header=0, match='Number of Booked Apartment')
df_one = tables[1]
df_two = tables[2]
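But how do I link the inner tables to the main table? For example, the df_one frame mentioned above is linked to serial number 1 (outer). Is there a way to extract the outer table so that only serial numbers 1 and 2 are selected?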


Update:

There is a section called "Building Details". If you visit the page, you will see the first serial number, as shown below:

Sr.No.  Project Name    Name    Proposed Date of Completion Number of Basement's    Number of Plinth    Number of Podium's  Number of Slab of Super Structure   Number of Stilts    Number of Open Parking  Number of Closed Parking
1   SRUSHTI COMPLEX A and B     0   1   0   5   1   48  1
The second serial number is:

Sr.No.  Project Name    Name    Proposed Date of Completion Number of Basement's    Number of Plinth    Number of Podium's  Number of Slab of Super Structure   Number of Stilts    Number of Open Parking  Number of Closed Parking
2   SRUSHTI COMPLEX C and D     0   1   0   5   1   51  1
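The df_one dataframe is linked to Sr.No. 1, while df_two is linked to Sr.No. 2.

I would like to add the columns of Sr.No. 1 and Sr.No. 2 to df_one and df_two respectively.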

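The pandas documentation notes that you should expect to do some manual cleanup after calling pd.read_html(). I don't know how to extend this code to HTML that may be completely different. That said, does this achieve what you want?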

# Read df 
df_other=pd.read_html(url, header=0, match='Number of Plinth')

# To keep only the targeted columns; have a look at df_other -  it's cluttered.
targeted_columns = ['Sr.No.', 'Project Name', 'Name', 'Proposed Date of Completion',
       'Number of Basement\'s', 'Number of Plinth', 'Number of Podium\'s',
       'Number of Slab of Super Structure', 'Number of Stilts',
       'Number of Open Parking', 'Number of Closed Parking']

# 'Project Name'=='SRUSHTI COMPLEX' is an easy way to extract the two dataframes of interest. Also resetting index and dropping.
df_other = df_other[0].loc[df_other[0]['Project Name']=='SRUSHTI COMPLEX',targeted_columns].reset_index(drop=True)

# Cast Sr.No. to int: this is needed for the merge step later, since Sr.No. in df_one and df_two is int
df_other['Sr.No.'] = df_other['Sr.No.'].astype(int)

# Extract the two rows as dataframes that correspond to each frame you mentioned
df_other_one = df_other.iloc[[0]]
df_other_two = df_other.iloc[[1]]
Once that is done, you can use merge to join the dataframes:

df_one_ = df_one.merge(df_other_one, on='Sr.No.')
print(df_one_)

     Sr.No. Apartment Type  Carpet Area (in Sqmts)  Number of Apartment  \
0       1          Shops                   70.63                    6   

   Number of Booked Apartment     Project Name     Name  \
0                           0  SRUSHTI COMPLEX  A and B   

  Proposed Date of Completion Number of Basement's Number of Plinth  \
0                         NaN                    0                1   

  Number of Podium's Number of Slab of Super Structure Number of Stilts  \
0                  0                                 5                1   

  Number of Open Parking Number of Closed Parking  
0                     48                        1 


df_two_ = df_two.merge(df_other_two, on='Sr.No.')
print(df_two_)


     Sr.No. Apartment Type  Carpet Area (in Sqmts)  Number of Apartment  \
0       2           1BHK                 1409.68                   43   

   Number of Booked Apartment     Project Name     Name  \
0                           4  SRUSHTI COMPLEX  C and D   

  Proposed Date of Completion Number of Basement's Number of Plinth  \
0                         NaN                    0                1   

  Number of Podium's Number of Slab of Super Structure Number of Stilts  \
0                  0                                 5                1   

  Number of Open Parking Number of Closed Parking  
0                     51                        1 
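The merge above only lines up when the Sr.No. key has the same dtype on both sides, which is why the answer casts it with astype(int) first. Here is a minimal sketch of that step using made-up stand-in frames. The names inner and outer and their values are illustrative assumptions, not data scraped from the page:

```python
import pandas as pd

# Hypothetical stand-ins for the scraped frames; values are illustrative only.
inner = pd.DataFrame({"Sr.No.": [1], "Apartment Type": ["Shops"]})
# Assumption: the cluttered outer table comes back with string keys.
outer = pd.DataFrame({"Sr.No.": ["1", "2"], "Name": ["A and B", "C and D"]})

# Align the key dtype before merging; recent pandas versions typically refuse
# to merge an integer key against a string key instead of silently matching.
outer["Sr.No."] = outer["Sr.No."].astype(int)

# The inner join keeps only the outer row whose Sr.No. matches the inner table.
linked = inner.merge(outer, on="Sr.No.", how="inner")
print(linked)
```

Because the join is an inner join, the outer row for Sr.No. 2 is dropped here, so each inner table ends up carrying only the columns of its own parent row.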

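What do you mean by inner table, outer table and main table? It would be very useful if you could show the dataframes in question, for example with df.head(), along with the expected result.

You basically just need to parse information from the HTML. Relying on the pandas.read_html function is a strange way to do that. Try using selectors such as #DivBuilding > div > table > tbody > tr:nth-child(2), or even regular expressions.

What if Project Name == 'SRUSHTI COMPLEX' is not known in advance?

@shantanuo If the table structure is the same, we can use iloc to select the corresponding rows.

How about using the contains method instead of "== 'SRUSHTI COMPLEX'"? df_other = df_other[0][df_other[0]['Project Name'].str.contains('SRUSHTI COMPLEX', na=False)].reset_index(drop=True)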
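The three row-selection options raised in these comments (exact match, str.contains and iloc) can be compared on a small stand-in frame. The data below is made up for illustration, with None standing in for the clutter rows that read_html tends to return. It is a sketch, not the real table:

```python
import pandas as pd

# Hypothetical stand-in for the cluttered outer table; None mimics junk rows.
outer = pd.DataFrame({
    "Sr.No.": ["1", "2", None],
    "Project Name": ["SRUSHTI COMPLEX", "SRUSHTI COMPLEX", None],
    "Name": ["A and B", "C and D", None],
})

# Exact match: requires knowing the full project name in advance.
exact = outer[outer["Project Name"] == "SRUSHTI COMPLEX"]

# Substring match: na=False treats missing values as non-matches,
# so the mask stays purely boolean and can be used for indexing.
mask = outer["Project Name"].str.contains("SRUSHTI", na=False)
partial = outer[mask].reset_index(drop=True)

# Positional selection: no name needed, but it assumes the rows of
# interest are the first two rows of the table.
first_two = outer.iloc[:2]

print(partial)
```

The iloc variant is the only one that works when the project name is unknown, at the cost of depending on the row order staying the same across pages.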