Python 3.x: BeautifulSoup doesn't show me the full link


When I try to get the links on a web page, bs4 does not capture the whole link; it stops just before **?ref**….

I'll explain the problem with code:

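import requests
from bs4 import BeautifulSoup

imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
site = requests.get(imdb_link)
soup = BeautifulSoup(site.text, 'lxml')

# Print the href of the first <a> inside every "titleColumn" cell of the chart table
for items in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    link = items.find("a").get('href')
    print(link)

The output is: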

/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
...and so on..
But this is wrong, as you can see by looking at the web page, where the links should be:

/title/tt0111161/?ref_=adv_li_tt
/title/tt0068646/?ref_=adv_li_tt
...and so on...
How can I get the whole link? I mean the ?ref_=adv_li_tt part.


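I'm using Python 3.7.4.

In general, it might be interesting to work out how to get the full link. I think you would need selenium so that JavaScript can run on the page; without it you don't get the full links as they appear on the rendered page. Simply adding the prefix https://www.imdb.com is perfectly usable: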

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
    soup = bs(r.content, 'lxml')
    links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
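
As a possible variation on gluing the prefix on by hand, the standard library's urllib.parse.urljoin can resolve each relative href against the site root. This is only a minimal sketch: SAMPLE_HTML is a hypothetical, hand-written stand-in for the chart markup (so it runs offline), and BASE_URL and absolute_links are names made up for illustration. It uses the built-in html.parser so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.imdb.com"

# Hypothetical stand-in for the chart markup, so the sketch runs without a network call.
SAMPLE_HTML = """
<table class="chart">
  <tr><td class="titleColumn"><a href="/title/tt0111161/">first title</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0068646/">second title</a></td></tr>
</table>
"""


def absolute_links(html, base_url=BASE_URL):
    """Return absolute URLs for every link inside a .titleColumn cell."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin resolves a relative href such as "/title/.../" against base_url
    return [urljoin(base_url, a["href"]) for a in soup.select(".titleColumn a")]


links = absolute_links(SAMPLE_HTML)
for link in links:
    print(link)
```

With a live response you would pass r.content instead of SAMPLE_HTML. Compared with string concatenation, urljoin also copes with hrefs that are already absolute.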
You can have selenium load the page so that the content is rendered, and then pass it to bs4 to get the links as they appear on the page:

from selenium import webdriver
from bs4 import BeautifulSoup as bs

d = webdriver.Chrome()
d.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
soup = bs(d.page_source, 'lxml')
d.quit()
links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]

Thank you very much. In fact, I had already been using Selenium for this, but thanks to you I now realize that for my purposes I only need the prefix.