How to create a web scraper from a txt file in Python
Suppose I used an online tool such as an HTML source-code viewer: I enter a link and it generates the HTML source. Then I select only the tags I want, like this:
<li class='item'><a class='list-link' href='https://foo1.com'><img src='https://foo1.com/imgfoo1.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo2.com'><img src='https://foo1.com/imgfoo2.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo3.com'><img src='https://foo1.com/imgfoo3.jpg' /></a></li>
This is where the error occurs:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 225, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/usr/lib/python2.7/dist-packages/bs4/builder/_htmlparser.py", line 157, in prepare_markup
exclude_encodings=exclude_encodings)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 352, in __init__
markup, override_encodings, is_html, exclude_encodings)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 228, in __init__
self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 280, in strip_byte_order_mark
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
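This ValueError is a NumPy behavior, not a BeautifulSoup bug: comparing a multi-element NumPy array with `==` returns an element-wise boolean array, and using that array in an `if` condition is ambiguous. A minimal sketch of what happens inside `strip_byte_order_mark` when it receives an array instead of a string (the values here are illustrative):

```python
import numpy as np

# page_html is an array of byte strings, so data[:2] is a 2-element
# sub-array rather than the first two bytes of a string.
data = np.array([b'<li', b"class='item'"])
result = data[:2] == b'\xfe\xff'   # element-wise comparison -> boolean array

try:
    if result:                     # bool() of a multi-element array
        pass
except ValueError as e:
    print(e)                       # "The truth value of an array ... is ambiguous"
```

So the root cause is that `page_html` is a NumPy array; BeautifulSoup expects a plain string.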
The problem is that when I type page_html in the terminal, this is its value:
array(['<li', "class='item'><a", "class='list-link'",
"href='https://foo1.com'><img",
"src='https://foo1.com/imgfoo1.jpg'", '/></a></li><li',
"class='item'><a", "class='list-link'",
"href='https://foo2.com'><img",
"src='https://foo1.com/imgfoo2.jpg'", '/></a></li><li',
"class='item'><a", "class='list-link'",
"href='https://foo3.com'><img",
"src='https://foo1.com/imgfoo3.jpg'", '/></a></li>'],
dtype='|S34')
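Those array elements are exactly the markup split on whitespace, which is what you get when the file is loaded with something like `numpy.loadtxt(..., dtype=str)` rather than read as text. A quick check (using the first `<li>` from the snippet above):

```python
# Whitespace-splitting the markup reproduces the tokens seen in the array,
# which suggests the file was loaded with a NumPy text loader.
html = ("<li class='item'><a class='list-link' href='https://foo1.com'>"
        "<img src='https://foo1.com/imgfoo1.jpg' /></a></li>")

tokens = html.split()
print(tokens)
# ['<li', "class='item'><a", "class='list-link'",
#  "href='https://foo1.com'><img", "src='https://foo1.com/imgfoo1.jpg'",
#  '/></a></li>']
```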
Just read the file as you normally would; there is no need to use NumPy:
with open("urlcontainer.txt") as f:
page = f.read()
soup = BeautifulSoup(page, "html.parser")
Then proceed with your parsing.
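For example, assuming the goal is to collect the link and image URLs from those `li.item` elements, a minimal sketch (the markup is inlined here in place of reading `urlcontainer.txt`):

```python
from bs4 import BeautifulSoup

# page would normally come from: open("urlcontainer.txt").read()
page = ("<li class='item'><a class='list-link' href='https://foo1.com'>"
        "<img src='https://foo1.com/imgfoo1.jpg' /></a></li>"
        "<li class='item'><a class='list-link' href='https://foo2.com'>"
        "<img src='https://foo1.com/imgfoo2.jpg' /></a></li>")

soup = BeautifulSoup(page, "html.parser")

# CSS selectors pick out each anchor and its nested image.
links = [a["href"] for a in soup.select("li.item a.list-link")]
images = [img["src"] for img in soup.select("li.item img")]

print(links)   # ['https://foo1.com', 'https://foo2.com']
print(images)  # ['https://foo1.com/imgfoo1.jpg', 'https://foo1.com/imgfoo2.jpg']
```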