Python 3.x: Scraping two separate tables and merging them into one with BeautifulSoup


I'm trying to scrape the box office tables on this site, but I'm stuck on combining the two separate tables into one DataFrame. (I don't know why they are split, but they should be merged into one identical table.)

URL: https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019

How do I handle the columns when there are two separate tables but no distinctive identifier for each one?

When I scrape the columns with
soup.select('table > thead > tr > th')
each column appears twice, so I just want to cut the column list before it repeats.

For example:

Columns: [Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare, Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare]
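One way to cut the list before the repetition, assuming the headers always come back doubled exactly as in the example above, is to slice off the second half:

```python
# Hypothetical header list, doubled as in the example above
columnlist = ["Rank", "Movie", "Worldwide Box Office",
              "Domestic Box Office", "International Box Office",
              "DomesticShare"] * 2

# Keep only the first half, i.e. cut before the repetition
columnlist = columnlist[:len(columnlist) // 2]
print(columnlist)
```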


import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"

rq = requests.get(URL)
soup = bs(rq.content, 'html.parser')

# Both tables repeat the same header, so keep only the first half
columnlist = [column.text for column in soup.select('table > thead > tr > th')]
columnlist = columnlist[:len(columnlist) // 2]

# Collect one list of cell texts per row (not one list per cell)
alldfcontents = []
for content in soup.select('tbody > tr'):
    tds = content.find_all('td')
    alldfcontents.append([td.text for td in tds])

df = pd.DataFrame(alldfcontents, columns=columnlist)
This is the DataFrame I want to end up with:

Columns: Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare
Factors: 1, Avengers Endgame, ... 
         ...
         100, ~, ...
I hope to use it for machine learning later.
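Before any machine-learning use, the scraped money and percentage columns will need converting from strings to numbers. A minimal cleaning sketch, assuming the column names shown above and using illustrative values:

```python
import pandas as pd

# Illustrative frame with scraped-style strings (hypothetical values)
df = pd.DataFrame({
    "Worldwide Box Office": ["$2,615,368,375", "$1,122,281,059"],
    "DomesticShare": ["29.49%", "37.88%"],
})

# Strip "$" and "," and convert dollar amounts to integers
df["Worldwide Box Office"] = (
    df["Worldwide Box Office"].str.replace(r"[$,]", "", regex=True).astype("int64")
)

# Strip the trailing "%" and convert shares to floats
df["DomesticShare"] = df["DomesticShare"].str.rstrip("%").astype(float)
print(df.dtypes)
```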

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Read the url
URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"
data = requests.get(URL).text

# Parse the page
soup = BeautifulSoup(data, "html.parser")

# Find the tables you want
table = soup.findAll("table")[1:]

# Read them into pandas
df = pd.read_html(str(table))

# Concatenate both tables, renumbering the rows
df = pd.concat([df[0], df[1]], ignore_index=True)

df

  Rank  Movie                                       Worldwide Box Office  Domestic Box Office  International Box Office  DomesticShare
0    1  Avengers: Endgame                                 $2,615,368,375         $771,368,375            $1,844,000,000         29.49%
1    2  Captain Marvel                                    $1,122,281,059         $425,152,517              $697,128,542         37.88%
2    3  Liu Lang Di Qiu                                     $692,163,684                  NaN              $692,163,684            NaN
3    4  How to Train Your Dragon: The Hidden World          $518,846,075         $160,346,075              $358,500,000         30.90%
4    5  Alita: Battle Angel                                 $402,976,036          $85,710,210              $317,265,826         21.27%
5    6  Shazam!                                             $358,308,992         $138,067,613              $220,241,379         38.53%
This should do what you need; you just have to merge the two tables together after reading the correct HTML tags with pandas.
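As a side note on the merging step, passing ignore_index=True to pd.concat renumbers the combined rows instead of keeping each table's own 0-based index. A small sketch with dummy frames standing in for the two scraped tables (illustrative data only):

```python
import pandas as pd

# Two small frames standing in for the scraped tables (illustrative data only)
top = pd.DataFrame({"Rank": [1, 2], "Movie": ["A", "B"]})
bottom = pd.DataFrame({"Rank": [3, 4], "Movie": ["C", "D"]})

# Without ignore_index=True the result keeps both 0..1 index runs;
# with it, rows are renumbered 0..3
combined = pd.concat([top, bottom], ignore_index=True)
print(combined)
```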