BeautifulSoup only returns partial URLs for some websites

This is the basic code I'm using to request a page:

from bs4 import BeautifulSoup, SoupStrainer
import requests

def get_url(url):
    # Fetch the page and parse its HTML
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data, 'html.parser')

    # Print the href attribute of every anchor tag
    for link in soup.find_all('a'):
        print(link.get('href'))
    
For one site (Xinhua) this returns the full URLs, but for another site it does not return the complete hyperlinks, only partial ones.
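The difference usually comes down to how each site writes its anchor tags. Here is a minimal sketch (the HTML snippet is hypothetical) showing that BeautifulSoup returns href values exactly as they appear in the markup, so relative links come out as partial URLs:

from bs4 import BeautifulSoup

# Hypothetical markup: one absolute href and one relative href
html = '<a href="https://example.com/full">full</a> <a href="/partial">partial</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

# Output:
# https://example.com/full
# /partial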

I don't know why I'm having this problem or how to fix it.
Has anyone had a similar problem, or does anyone know how to solve it?

I suspect you're looking for urljoin:

from bs4 import BeautifulSoup, SoupStrainer
import requests
from urllib.parse import urljoin

def get_url(url):
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data, 'html.parser')

    # urljoin resolves relative hrefs against the page URL
    # and leaves absolute hrefs unchanged
    for link in soup.find_all('a'):
        print(urljoin(url, link.get('href')))
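To illustrate how urljoin resolves different kinds of hrefs (the URLs below are hypothetical):

from urllib.parse import urljoin

base = 'https://example.com/news/index.html'
print(urljoin(base, '/story.html'))           # https://example.com/story.html
print(urljoin(base, 'story.html'))            # https://example.com/news/story.html
print(urljoin(base, 'https://other.com/x'))   # https://other.com/x (absolute hrefs pass through)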

You might also consider iterating over a set of the links, to avoid duplicate results:

for link in set(soup.find_all('a')):
    print(urljoin(url, link.get('href')))

This is not a bug or a problem; it is just how that particular site writes its HTML. For links within the root URL, the root URL itself is not included in the href. Thanks @Rusticus!
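Note that set(soup.find_all('a')) deduplicates identical anchor tags, not URLs. If what you actually want is unique absolute URLs, a small variation (a sketch, meant to run inside get_url with the soup and url from above) would be:

# Collect unique absolute URLs, skipping anchors without an href
links = {urljoin(url, a.get('href')) for a in soup.find_all('a') if a.get('href')}
for link in sorted(links):
    print(link)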