包含python和beautiful soup的格式错误的html
我正在使用Python3.5、BS4.6、selenium 3.6和phantomjs来清理这个站点。脚本在我位于美国的服务器上运行,我想抓取一个德国站点。但我遇到了一个问题。我下载的html如下所示:包含python和beautiful soup的格式错误的html,python,html,web-scraping,beautifulsoup,Python,Html,Web Scraping,Beautifulsoup,我正在使用Python3.5、BS4.6、selenium 3.6和phantomjs来清理这个站点。脚本在我位于美国的服务器上运行,我想抓取一个德国站点。但我遇到了一个问题。我下载的html如下所示: <div class="col-md-40 product-highlights-container"><div class="product-filters"><select class="colorfilter__select"><option va
<div class="col-md-40 product-highlights-container"><div class="product-filters"><select class="colorfilter__select"><option value="{"ebootisId":"HW102581-1","color":"Midnight Black","colorCode":"000000","colorGroup":"Schwarz","colorGroupCode":"000000","deliveryTime":"2-3 Werktage","default":true,"images":[{"small":"/img/dist/HW102581-1_ZU102869_S_1.png","medium":"/img/dist/HW102581-1_ZU102869_M_1.png","large":"/img/dist/HW102581-1_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":1086,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}">Midnight Black</option><option value="{"ebootisId":"HW102581-2","color":"Arctic Silver","colorCode":"c7ccd0","colorGroup":"Silber","colorGroupCode":"c0c0c0","deliveryTime":"2-3 Werktage","default":false,"images":[{"small":"/img/dist/HW102581-2_ZU102869_S_1.png","medium":"/img/dist/HW102581-2_ZU102869_M_1.png","large":"/img/dist/HW102581-2_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":503,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}">Arctic Silver</option><option value="{"ebootisId":"HW102581-3","color":"Orchid Grey","colorCode":"9d9dad","colorGroup":"Grau","colorGroupCode":"dcdcdc","deliveryTime":"2-3 Werktage","default":false,"images":[{"small":"/img/dist/HW102581-3_ZU102869_S_1.png","medium":"/img/dist/HW102581-3_ZU102869_M_1.png","large":"/img/dist/HW102581-3_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":500,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id"
您的代码可以更改如下。它将创建一个名为
pretty.html
的输出文件,其中包含prettify
版本的html:
from bs4 import BeautifulSoup
from selenium import webdriver
link = 'https://tarife.mediamarkt.de/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet'
filename = 'output.html'
driver = webdriver.PhantomJS() #executable_path=path_to_pjs)
driver.get(link)
with open(filename, "wb") as f_output:
f_output.write(driver.page_source.encode('utf-8'))
page_soup = BeautifulSoup(driver.page_source, "lxml")
with open('pretty.html', 'w') as f_output:
f_output.write(page_soup.prettify())
driver.close()
给你一个
开始:
<div class="col-md-40 product-highlights-container">
<div class="product-filters">
<select class="colorfilter__select">
<option value='{"ebootisId":"HW102581-1","color":"Midnight Black","colorCode":"000000","colorGroup":"Schwarz","colorGroupCode":"000000","deliveryTime":"2-3 Werktage","default":true,"images":[{"small":"/img/dist/HW102581-1_ZU102869_S_1.png","medium":"/img/dist/HW102581-1_ZU102869_M_1.png","large":"/img/dist/HW102581-1_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":1075,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}'>
什么是链接
?然后我们就可以重现这个问题了。谢谢你,真是妙不可言。我遇到了一个小ascii问题,但可以通过使用open(str(I)+“u pretty.html”,“wb”)作为f_输出来解决它:f_输出。write(page_soup.prettify(encoding='utf-8'))
。
<div class="col-md-40 product-highlights-container">
<div class="product-filters">
<select class="colorfilter__select">
<option value='{"ebootisId":"HW102581-1","color":"Midnight Black","colorCode":"000000","colorGroup":"Schwarz","colorGroupCode":"000000","deliveryTime":"2-3 Werktage","default":true,"images":[{"small":"/img/dist/HW102581-1_ZU102869_S_1.png","medium":"/img/dist/HW102581-1_ZU102869_M_1.png","large":"/img/dist/HW102581-1_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":1075,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}'>