Python scraping multiple web pages, but the results are overwritten by the last URL


I want to scrape all the URLs from several web pages. It works, but only the results from the last page get saved to the file.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']

for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
    links.append(link.get('href'))

filename = 'output.csv'

with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)
What am I missing?


It would be even nicer if I could use a csv file containing all the URLs instead of a list, but nothing I've tried has come close…

You are only using the soup of the last of your URLs. You should move your second loop inside the first one. Also, you are getting every element that matches the regex, and there are some elements outside the table you are actually trying to scrape.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2']

links = []
for url in urls:
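    # note: this requests.get call is unused below; the page is actually fetched with urlopen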
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # Get only the movies from the list, otherwise you will also append the "coming soon" section; that is why select_one is used
    for link in soup.select_one('ol.list_products').findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))


filename = 'output.csv'

with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)
The result:

/movie/woman-at-war
/movie/destroyer
/movie/aquaman
/movie/bumblebee
/movie/between-worlds
/movie/american-renegades
/movie/mortal-engines
/movie/spider-man-into-the-spider-verse
/movie/the-quake
/movie/once-upon-a-deadpool
/movie/all-the-devils-men
/movie/dead-in-a-week-or-your-money-back
/movie/blood-brother-2018
/movie/ghostbox-cowboy
/movie/robin-hood-2018
/movie/creed-ii
/movie/outlaw-king
/movie/overlord-2018
/movie/the-girl-in-the-spiders-web
/movie/johnny-english-strikes-again
/movie/hunter-killer
/movie/bullitt-county
/movie/the-night-comes-for-us
/movie/galveston
/movie/the-oath-2018
/movie/mfkz
/movie/viking-destiny
/movie/loving-pablo
/movie/ride-2018
/movie/venom-2018
/movie/sicario-2-soldado
/movie/black-water
/movie/jurassic-world-fallen-kingdom
/movie/china-salesman
/movie/incredibles-2
/movie/superfly
/movie/believer
/movie/oceans-8
/movie/hotel-artemis
/movie/211
/movie/upgrade
/movie/adrift-2018
/movie/action-point
/movie/solo-a-star-wars-story
/movie/feral
/movie/show-dogs
/movie/deadpool-2
/movie/breaking-in
/movie/revenge
/movie/manhunt
/movie/avengers-infinity-war
/movie/supercon
/movie/love-bananas
/movie/rampage
/movie/ready-player-one
/movie/pacific-rim-uprising
/movie/tomb-raider
/movie/gringo
/movie/the-hurricane-heist
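
As for the csv part of the question, reading the list of URLs from a file instead of hard-coding it should only need the standard csv module. A minimal sketch, assuming a hypothetical urls.csv with one URL per row in the first column:

import csv

# assumption: 'urls.csv' is a hypothetical file you maintain yourself,
# with one URL per row in the first column
urls = []
with open('urls.csv', newline='') as infile:
    for row in csv.reader(infile):
        if row:  # skip empty rows
            urls.append(row[0])

# the "for url in urls:" loop above can then stay exactly the same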

Hey, this is my first answer, so I'll do my best to help you.

The data gets overwritten because you iterate over the URLs in one loop and then iterate over the soup object in a separate loop.

That will always leave you with only the last soup object once the loop finishes, so it is better to either append each soup object to a list inside the url loop, or actually query the soup object inside the url loop:

soup_obj_list = []
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    soup_obj_list.append(soup)
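
If you go with the first option and collect the soups, the objects in soup_obj_list still need to be queried afterwards; a minimal sketch of that second step, reusing the regex from your question, could be:

import re

links = []
for soup in soup_obj_list:
    # same href pattern as in the question
    for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))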

Hope this solves your first problem. Can't really help with the csv issue, though.

Thank you so much, this was really helpful. I hadn't thought of excluding the "coming soon" section!