Python BeautifulSoup parses HTML into a single-line string
For some reason, when I parse an HTML page with BeautifulSoup and write the page to a txt file, it strips the HTML formatting and puts everything on a single line. When I then search the file with a regular expression, it finds a match and prints the line, but that prints the entire page because it is all one line. How can I stop it from doing this?
Here is my code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
import re
import urllib.request

def main():
    #Open the PID file and read the PID's
    URLList = []
    PID = [open("PID.txt").read().split()]
    for list in PID:
        for code in list:
            URLList.append("http://www.abb.com/productdetails/" + code)
    pageNo = 1
    for URL in URLList:
        fh = open("html.txt", "a")
        fh.write("\n\n\n\n\n")
        webPage = urllib.request.urlopen(URL)
        soup = BeautifulSoup(webPage.read())
        print("Page", pageNo, "retrieved")
        fh.write(str(soup.prettify().encode("utf-8")))
        pageNo += 1
        fh.close()
    output = open('html.txt', 'r')
    for line in output:
        line = line.rstrip()
        if re.search('NetDepth', line):
            print(line)

if __name__ == "__main__": main()
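Basically, what I need it to do is open a file of UPCs/PIDs, go to the website, and open the page for each one. That part works fine. Then I want to get the HTML and put all of it in a txt file. From there, I want to search that file for certain elements, such as div tags or the ProductNetDepth id. The problem is that when it finds one of these elements, it prints the whole document, because it thinks the document is a single line. I only want the line of HTML that contains the match.
Here is part of the website's source code: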
<div class="Dimensions pisEvenRow">
    <div id="ProductNetLength" class="detailPageLeftColumn">
        Product Net Length:
    </div>
    <div class="detailPageRightColumn">
        <div>68 mm</div>
    </div>
</div>
<div class="Dimensions pisOddRow">
    <div id="ProductNetDepth" title="Depth of a single unpacked product" class="detailPageLeftColumn">Product Net Depth:</div>
    <div class="detailPageRightColumn">
        <div>67.5 mm</div>
    </div>
</div>
<div class="Dimensions pisEvenRowLast">
    <div id="ProductNetWeight" title="Weight of a single unpacked product" class="detailPageLeftColumn">Product Net Weight:</div>
    <div class="detailPageRightColumn">
        <div>0.041 kg</div>
    </div>
Here is what it looks like after BeautifulSoup writes it to the file:
ijQoI5DAFDwZHYnHo-npjlC99WMTQ6qWYJ8fkDP8ddGyBe9DZa4IVC3j3aFtg7m85t7V9lKauOCgTq5CZ7cJneFTTH12Nx8mLxeKkAmLee2awza0rGQucVII-WdAyptFtKvKDBSLWhBUFTU7WLzD7DN4tAZzUEbQDGL2VHY5A0&t=635706797508895128"/>\xc2\xa0Loading Images..\r\n </div>\n</div>\n</div>\n<div class="pisDetailPageTitle">General Information</div>\n<div class="pisOddRow">\n<div class="detailPageLeftColumn">\n<span>Extended Product Type:\r\n </span>\n</div>\n<div class="detailPageRightColumn">\r\n E213-25-001\r\n </div>\n</div>\n<div class="pisEvenRow">\n<div class="detailPageLeftColumn">\n<span>Product ID:\r\n </span>\n</div>\n<div class="detailPageRightColumn">\r\n 2CCA703041R0001\r\n </div>\n</div>\n<div class="pisOddRow">\n<div class="detailPageLeftColumn">\n<span>EAN:\r\n </span>\n</div>\n<div class="detailPageRightColumn">\r\n 7612270938711\r\n </div>\n</div>\n<div class="pisEvenRow">\n<div class="detailPageLeftColumn">\n<span>Catalog Description:\r\n </span>\n</div>\n<div class="detailPageRightColumn">\r\n E213-25-10 Change over switch 25A 1CO 250VAC\r\n </div>\n</div>\n<div class="pisOddRowLast">\n<div class="detailPageLeftColumn">\n<span>Long Description:\r\n </span>\n</div>\n<div class="detailPageRightColumn">\r\n Change over switches according DIN EN 60669-1, VDE 0632 Part 1, Rated currents: 16/25 A, 250 VACPDC, Contacts: 1 CO/2 CO, Module width: 0,5/1\r\n </div>\n</div>\n<div class="pisDetailPageTitle">\r\n Categories\r\n </div>\n<div class="pisEvenRowLast" id="pisEvenRowLast">\n<ul class="pisCategoryList">\n<span>Products</span><span class="CategorySeperator">\xc2\xbb</span>\n<li> Low Voltage Products and Systems\r\n </li>\n<span class="CategorySeperator">\xc2\xbb</span>\n<li> Modular DIN Rail Products\r\n </li>\n<span class="CategorySeperator">\xc2\xbb</span>\n<li> Modular DIN Rail Components MDRCs\r\n </li>\n<span class="CategorySeperator">\xc2\xbb</span>\n<li> Command Devices\r\n </li>\n</ul>\n</div>\n<div class="displayNone" id="PisDiv_PlaceHolder1">\xc2\xa0</div>\n<div 
class="pisDetailPageTitle" id="Ordering">Ordering</div>\n<div class="Ordering pisOddRow">\n<div class="detailPageLeftColumn" id="Ean">\r\n EAN:\r\n </div>\n<div class="detailPageRightColumn">\n<div>7612270938711</div>\n</div>\n</div>\n<div class="Ordering pisEvenRow">\n<div class="detailPageLeftColumn" id="MinimumOrderQuantity">\r\n Minimum Order Quantity:\r\n </div>\n<div class="detailPageRightColumn">\n<div>10 piece</div>\n</div>\n</div>\n<div class="Ordering pisOddRowLast">\n<div class="detailPageLeftColumn" id="CustomsTariffNumber">\r\n Customs Tariff Number:\r\n
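Any help would be great. I have tried everything from prettify to splitting the text into lines myself, but nothing seems to work. I would like the file to keep the same formatting as the source so that I can easily search it and pull out the items I need. Thanks for your help, and if possible, please don't just give me an answer; could you also explain what you did?

I tried this simple script to extract NetDepth, and it works fine: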
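from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # Python 3 location of urlopen

# Replace the placeholder with a real product page URL
soup = bs(urlopen('<insert url here>').read(), 'html.parser')
# A whitespace text node sits between the two divs, hence next_sibling twice
print(soup.find(id="ProductNetDepth").next_sibling.next_sibling.div.text)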
If you look at the structure of the HTML, the div containing the mm measurement is a sibling of the div whose id is ProductNetDepth, so that is what I built on.
If you are not familiar with soup's search functions, you should take a look at their documentation, which is very well written.
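The sibling relationship described above can also be expressed with `find_next_sibling`, which skips the whitespace text node between the two divs, so the lookup does not depend on chaining `next_sibling` twice. This is only a minimal sketch: the helper name `get_value_by_label_id` and the `SAMPLE_HTML` constant are made up for illustration, and the markup is copied from the excerpt in the question rather than fetched from the site.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the page excerpt in the question (not fetched live).
SAMPLE_HTML = """
<div class="Dimensions pisOddRow">
<div id="ProductNetDepth" title="Depth of a single unpacked product" class="detailPageLeftColumn">Product Net Depth:</div>
<div class="detailPageRightColumn">
<div>67.5 mm</div>
</div>
</div>
"""


def get_value_by_label_id(html, label_id):
    """Hypothetical helper: return the text of the value cell that follows
    the label div with the given id, or None if either div is missing."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(id=label_id)
    if label is None:
        return None
    # find_next_sibling("div") ignores text nodes and returns the next <div> tag
    value_cell = label.find_next_sibling("div")
    if value_cell is None:
        return None
    # strip=True drops the surrounding newlines and indentation
    return value_cell.get_text(strip=True)


print(get_value_by_label_id(SAMPLE_HTML, "ProductNetDepth"))
```

Returning None for a missing id is a design choice for the sketch; on real pages you may prefer to raise an error so that a changed page layout is noticed.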
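
There are several possible solutions to your problem, but I will demonstrate the simplest one.
First, let me review the problem statement and your solution.
Problem statement: print every line of the requested HTML pages that contains a specific phrase (in this case, "NetDepth").
Attempted solution: you request the HTML with urllib, prettify it with BeautifulSoup, write it to a text file, and finally reopen the text file and use a regular expression to extract the lines that match.
In my opinion, this solution is a bit heavy-handed for what we actually need. There is no real reason to write the HTML to a file and then read it back from the file again.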
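As a rough sketch of that idea, the prettified HTML can be searched line by line in memory, with no intermediate file. The function name `matching_lines` and the inline `SAMPLE_HTML` string are assumptions made for illustration; in the real script the HTML would come from `urllib.request.urlopen(URL).read()`. The last lines also show why the file in the question ends up as one line: in Python 3, calling `str()` on a bytes object returns its `b'...'` representation, in which every newline is written as the two characters `\n`.

```python
import re

from bs4 import BeautifulSoup


def matching_lines(html, pattern):
    """Hypothetical helper: return the lines of the prettified HTML
    that match the given regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    pretty = soup.prettify()  # a str containing real newline characters
    return [line.strip() for line in pretty.splitlines() if re.search(pattern, line)]


# Inline stand-in for a downloaded page (based on the excerpt in the question).
SAMPLE_HTML = (
    '<div class="Dimensions pisOddRow">'
    '<div id="ProductNetDepth" class="detailPageLeftColumn">Product Net Depth:</div>'
    '<div class="detailPageRightColumn"><div>67.5 mm</div></div>'
    '</div>'
)

for line in matching_lines(SAMPLE_HTML, "NetDepth"):
    print(line)

# Why html.txt became a single line: str() of bytes yields the repr,
# so the newline turns into a backslash followed by the letter n.
as_repr = str("a\nb".encode("utf-8"))
print(as_repr)  # prints b'a\nb' with a literal backslash-n
```

If the HTML still needs to be saved, one option is to open the file with `open("html.txt", "a", encoding="utf-8")` and write `soup.prettify()` directly, without `encode` and `str`, so that the newlines are kept.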