Parsing HTML with Python and BeautifulSoup4

python html parsing

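I need to scrape the names of all the high schools on this website, along with the cities they are in, using BeautifulSoup4. I have added my non-working code below. Thanks a lot!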

There are a number of errors in your program. Below is a working version that should serve as a basis for further refinement:

import requests # much better than using urllib2
from bs4 import BeautifulSoup # you forgot the `from`

url = "http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas" 
# you don't need () around it
r = requests.get(url) 
# does everything all at once, no need to call `opener` and `read()`
contents = r.text # get the HTML contents of the page

soup = BeautifulSoup(contents, "html.parser") # name the parser explicitly
for item in soup.find_all('li'): # 'li' and 'il' are different things...
    print(item.get_text())       # you need to iterate over all the elements
                                 # found by `find_all()`

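That's it. This gets the text of every `<li>` item on the page. As you will see when you run the program, there are a lot of irrelevant results, such as the table of contents, the menu items on the left, and the footer. I will leave it to you to work out how to get only the school names and how to separate out the county names and other noise.

For reference, read the BeautifulSoup documentation carefully. It will answer many of your questions.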
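One way to cut down the noise described above is to search only inside the article body instead of the whole page. The following is a minimal sketch under stated assumptions, not a tested scraper for the real page: the sample HTML is invented, `content_list_items` is a hypothetical helper, and the `mw-content-text` id and `toc` class are assumptions about the markup that should be checked against the page source.

```python
from bs4 import BeautifulSoup

# Invented sample standing in for the real page; the ids and class
# names are assumptions to be checked against the actual HTML.
SAMPLE_HTML = """
<div id="mw-navigation"><ul><li>Main page</li><li>Random article</li></ul></div>
<div id="mw-content-text">
  <div class="toc"><ul><li>1 Anderson County</li></ul></div>
  <h2>Anderson County</h2>
  <ul>
    <li>Example High School, Exampleville</li>
    <li>Sample Academy, Sampleton</li>
  </ul>
</div>
<div id="footer"><ul><li>Privacy policy</li></ul></div>
"""

def content_list_items(html):
    """Return the text of <li> items found inside the article body only."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id="mw-content-text")
    if body is None:
        # Container not found: return nothing rather than guess.
        return []
    # Remove the table of contents so its entries are not mistaken for schools.
    for toc in body.find_all("div", class_="toc"):
        toc.decompose()
    return [li.get_text(strip=True) for li in body.find_all("li")]

print(content_list_items(SAMPLE_HTML))
```

Against the real page, the same function could be called with `r.text` from the `requests.get(url)` call in the answer, once the container id has been confirmed.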
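Can you provide more information? Mainly: what is wrong with the output you are getting?

I get the following error when I run the code: Traceback (most recent call last): File "grapWebpage2.py", line 12, in print BeautifulSoup.get_text(item) TypeError: unbound method get_text() must be called with BeautifulSoup instance as first argument (got Tag instance instead). I am using Python 2.7. Also, I am able to get all the text between the tags. That is not the problem. I want to know how to dig deeper and get only the text I specify.

@user3827516 Your question was not clear at all, because you said your code did not work at all (which is true), so my answer shows how to make the code you posted work. To dig deeper, you have to identify patterns in the HTML that occur only in the list of school names and nowhere else. Wikipedia's markup is usually very clean, so this exercise should not be too complicated.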
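Once the relevant list items have been isolated, the school name and the city still have to be separated. Here is a minimal sketch, assuming each entry has the form "School name, City"; that format is an assumption about the page, not something confirmed in this thread, and `split_school_entry` is a hypothetical helper.

```python
def split_school_entry(text):
    """Split 'School name, City' into a (name, city) tuple.

    Assumes the city follows the last comma. Returns (text, None)
    when the entry contains no comma at all.
    """
    name, sep, city = text.rpartition(",")
    if not sep:
        return text.strip(), None
    return name.strip(), city.strip()

# Invented entries for illustration only.
entries = ["Example High School, Exampleville", "Sample Academy"]
for entry in entries:
    print(split_school_entry(entry))
```

Splitting on the last comma rather than the first keeps school names that themselves contain commas intact, at the cost of mishandling entries where the city is followed by further comma-separated details.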