Python: unable to convert a webpage into an HTML file in which all links are absolute

Tags: python, html, python-3.x, web-scraping, python-requests

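I have written a script that saves a webpage as an HTML file that looks very much like the original page. The one problem I can't solve is that the HTML file contains relative URLs, such as

/organizations/11-unilever?group=8831

whereas the absolute link is

https://en.eyeka.com/organizations/11-unilever?group=8831

I tried: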

import requests

link = "https://en.eyeka.com/contests/8831/results"

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    
    with open("results.html", "wb") as f:
        f.write(res.content)

How can I convert a webpage into an HTML file in which all links are absolute (full) links?

I checked the page you linked. I suggest having your script find every href="/ in the page and replace it with href="https://en.eyeka.com/

So, something like this:

import requests

link = "https://en.eyeka.com/contests/8831/results"

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
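    # convert the response body to a string
    res = res.text
    # prefix root-relative hrefs with the site's base URL
    res = res.replace('href="/', 'href="https://en.eyeka.com/')
    # encode back to bytes, since the file is opened in binary mode
    res = res.encode()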
    with open("results.html", "wb") as f:
        f.write(res)

Because these are relative paths, each full link can only be the site's base URL + the relative path.
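The "base URL + relative path" rule can be tried out on its own before touching the page. This is only a small sketch using the standard library's urllib.parse.urljoin; the helper name to_absolute and the example.com inputs are made up for illustration, while the page URL and the first href come from the question.

```python
from urllib.parse import urljoin

# Page URL from the question; it serves as the base for resolution.
PAGE_URL = "https://en.eyeka.com/contests/8831/results"


def to_absolute(href: str, base: str = PAGE_URL) -> str:
    """Resolve href against base; URLs that are already absolute come back unchanged."""
    return urljoin(base, href)


# Root-relative link from the question
print(to_absolute("/organizations/11-unilever?group=8831"))
# Already absolute link (illustrative input)
print(to_absolute("https://example.com/page"))
# Protocol-relative link (illustrative input)
print(to_absolute("//cdn.example.com/app.js"))
```

Unlike a plain string replace on href="/, this also leaves protocol-relative links such as //cdn.example.com/app.js pointing at their own host.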

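You can use a regular expression to find and replace the relative URLs:

import re
import requests
import urllib3

link = "https://en.eyeka.com/contests/8831/results"

with requests.Session() as s:
    # urllib3.get_host (exported by urllib3 1.x) returns (scheme, host, port)
    base_url = "://".join(
        urllib3.get_host(link)[:2]
    )  # get the base url (https://en.eyeka.com)
    s.headers[
        "user-agent"
    ] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
    text = s.get(link).text
    text = re.sub(
        r"(href=[\"\'])\/", rf"\g<1>{base_url}/", text, count=0, flags=re.MULTILINE
    )  # replace all relative URLs with absolute URLs
    with open("results.html", "w", encoding="utf-8") as f:
        f.write(text)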
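A possible variation on the regex idea, sketched under a few assumptions: instead of prefixing a fixed base URL, each matched value is resolved with urllib.parse.urljoin, and src attributes are handled as well as href. The function name absolutize and the sample markup are invented for illustration (only the first href and the page URL come from the question), and because this is a text-level rewrite it would also touch any href= or src= that happens to appear inside inline scripts.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src='...'; the lookbehind skips names such as data-href.
ATTR_RE = re.compile(r"""(?<![\w-])(href|src)=(["'])(.*?)\2""", re.IGNORECASE)


def absolutize(html: str, page_url: str) -> str:
    """Return html with every href/src value resolved against page_url."""

    def repl(match: re.Match) -> str:
        attr, quote, value = match.groups()
        # urljoin leaves absolute URLs (https:, mailto:, ...) unchanged
        return f"{attr}={quote}{urljoin(page_url, value)}{quote}"

    return ATTR_RE.sub(repl, html)


# Illustrative markup, not taken from the real page
sample = (
    '<a href="/organizations/11-unilever?group=8831">Unilever</a>'
    '<a href="https://example.com/">external</a>'
    "<script src='/assets/app.js'></script>"
)
print(absolutize(sample, "https://en.eyeka.com/contests/8831/results"))
```

In a script like the one above, absolutize(text, link) would take the place of the re.sub call.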

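My answer uses extra libraries, but it does not assume that it only has to work for the example URL you provided. It also does not assume that every URL is relative.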

import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

link = "https://en.eyeka.com/contests/8831/results"

with requests.Session() as s:
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"}
    s.headers = headers
    res = s.get(link)

u = urlparse(res.url)
domain = u.scheme + "://" + u.netloc  # extract domain will work with http or https since we are also extracting the scheme
soup = BeautifulSoup(res.content, "html.parser")

for a in soup.find_all('a'):  # loop through all <a> elements
    if "href" in a.attrs:  # not every <a> element has an href attribute
        if "http" not in a.attrs['href']:  # skip links that are already absolute
            a.attrs['href'] = domain + a.attrs['href']

with open("results.html", "w", encoding='utf-8') as f:
    f.write(str(soup))
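One way to check a saved file is to list the href and src values that are still relative. The sketch below relies only on the standard library's html.parser; the class name RelativeLinkFinder, the helper find_relative_links and the sample markup are hypothetical, and fragment-only links such as #top are reported too because they carry no scheme or host.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit


class RelativeLinkFinder(HTMLParser):
    """Collect href/src values that have neither a scheme nor a host."""

    def __init__(self) -> None:
        super().__init__()
        self.relative: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                parts = urlsplit(value)
                # No scheme and no host means the value is still relative
                if not parts.scheme and not parts.netloc:
                    self.relative.append(value)


def find_relative_links(html: str) -> List[str]:
    """Return the relative href/src values found in html, in document order."""
    finder = RelativeLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.relative


# Illustrative markup, not taken from the real page
sample = (
    '<a href="/organizations/11-unilever?group=8831">x</a>'
    '<a href="https://en.eyeka.com/">y</a>'
    '<a href="#top">z</a>'
)
print(find_relative_links(sample))
```

Running it on the contents of results.html would give a quick list of anything the conversion missed.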

Comments:

I tried your solution and ended up with this error: f.write(res.content) AttributeError: 'str' object has no attribute 'content'.

That's because res was converted to a string three lines earlier, so it no longer has 'content'. The code doesn't work.

Sorry, I typed the wrong usage. Please see the re-edited answer.

Your code still doesn't work. I suggest running your code.

I just tested your script. The links appear to have been converted to absolute ones, but that is only what I see when I hover over them; when I click such links, they lead to the wrong address.

Sorry, I forgot the ":", fixed!

Your solution seems to work. It appears to duplicate the base URL in the contributor section, but that content is generated via JavaScript, so it's unclear why. To fix this, you may need to strip the base URL from the JSON object containing the user/URL pairs.

This doesn't modify scripts, stylesheets, links and other things, which makes the page unusable locally.

@38 The scripts and stylesheets use absolute links. If you run the code, it does work locally.