Python 3.x: Scraping two separate tables and merging them into one with BeautifulSoup


I'm trying to scrape the box office tables on this site, but I'm stuck on combining the two separate tables into one DataFrame. (I don't know why they are split, but they should be merged into one identical table.)

URL: https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019

How do I handle the columns when there are two separate tables but no distinctive identifier for each one?

When I scrape the columns with
soup.select('table > thead > tr > th')
each column appears twice, so I just want to cut the column list before it repeats.

For example:

Columns: [Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare, Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare]
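One way to cut the list before the repetition, assuming the headers always come back doubled exactly as in the example above, is to slice off the second half:

```python
# Hypothetical header list, doubled as in the example above
columnlist = ["Rank", "Movie", "Worldwide Box Office",
              "Domestic Box Office", "International Box Office",
              "DomesticShare"] * 2

# Keep only the first half, i.e. cut before the repetition
columnlist = columnlist[:len(columnlist) // 2]
print(columnlist)
```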


import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"

rq = requests.get(URL)
soup = bs(rq.content, 'html.parser')

# Both tables repeat the same header, so keep only the first half
columnlist = [column.text for column in soup.select('table > thead > tr > th')]
columnlist = columnlist[:len(columnlist) // 2]

# Collect one list of cell texts per row (not one list per cell)
alldfcontents = []
for content in soup.select('tbody > tr'):
    tds = content.find_all('td')
    alldfcontents.append([td.text for td in tds])

df = pd.DataFrame(alldfcontents, columns=columnlist)
This is the DataFrame I want to end up with:

Columns: Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare
Factors: 1, Avengers Endgame, ... 
         ...
         100, ~, ...
I hope to use it for machine learning later.
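Before any machine-learning use, the scraped money and percentage columns will need converting from strings to numbers. A minimal cleaning sketch, assuming the column names shown above and using illustrative values:

```python
import pandas as pd

# Illustrative frame with scraped-style strings (hypothetical values)
df = pd.DataFrame({
    "Worldwide Box Office": ["$2,615,368,375", "$1,122,281,059"],
    "DomesticShare": ["29.49%", "37.88%"],
})

# Strip "$" and "," and convert dollar amounts to integers
df["Worldwide Box Office"] = (
    df["Worldwide Box Office"].str.replace(r"[$,]", "", regex=True).astype("int64")
)

# Strip the trailing "%" and convert shares to floats
df["DomesticShare"] = df["DomesticShare"].str.rstrip("%").astype(float)
print(df.dtypes)
```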

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Read the url
URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"
data = requests.get(URL).text

# Parse the page
soup = BeautifulSoup(data, "html.parser")

# Find the tables you want
table = soup.findAll("table")[1:]

# Read them into pandas
df = pd.read_html(str(table))

# Concatenate both tables, renumbering the rows
df = pd.concat([df[0], df[1]], ignore_index=True)

df

  Rank  Movie                                       Worldwide Box Office  Domestic Box Office  International Box Office  DomesticShare
0    1  Avengers: Endgame                                 $2,615,368,375         $771,368,375            $1,844,000,000         29.49%
1    2  Captain Marvel                                    $1,122,281,059         $425,152,517              $697,128,542         37.88%
2    3  Liu Lang Di Qiu                                     $692,163,684                  NaN              $692,163,684            NaN
3    4  How to Train Your Dragon: The Hidden World          $518,846,075         $160,346,075              $358,500,000         30.90%
4    5  Alita: Battle Angel                                 $402,976,036          $85,710,210              $317,265,826         21.27%
5    6  Shazam!                                             $358,308,992         $138,067,613              $220,241,379         38.53%
This should do what you need; you just have to merge the two tables together after reading the correct HTML tags with pandas.
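As a side note on the merging step, passing ignore_index=True to pd.concat renumbers the combined rows instead of keeping each table's own 0-based index. A small sketch with dummy frames standing in for the two scraped tables (illustrative data only):

```python
import pandas as pd

# Two small frames standing in for the scraped tables (illustrative data only)
top = pd.DataFrame({"Rank": [1, 2], "Movie": ["A", "B"]})
bottom = pd.DataFrame({"Rank": [3, 4], "Movie": ["C", "D"]})

# Without ignore_index=True the result keeps both 0..1 index runs;
# with it, rows are renumbered 0..3
combined = pd.concat([top, bottom], ignore_index=True)
print(combined)
```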