Python cannot read columns when scraping HTML
I am trying to extract data from a table. I used the following code:
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table",{ "class" : "wikitable" })
for row in table.findAll('tr')[1:]:
col = row.findAll('th')
Vehicle = col[0].string
Year1 = col[2].string
Year2 = col[3].string
Year3 = col[4].string
Year4 = col[5].string
Year5 = col[6].string
Year6 = col[7].string
Year7 = col[8].string
Year8 = col[9].string
Year9 = col[10].string
Year10 = col[11].string
Year11 = col[12].string
Year12 = col[13].string
Year13 = col[14].string
Year14 = col[15].string
Year15 = col[16].string
Year16 = col[17].string
record =(Vehicle,Year1,Year2,Year3,Year4,Year5,Year6,Year7,Year8,Year9,Year10,Year11,Year12,Year13,Year14,Year15,Year16)
print "|".join(record)
I get this error:
File "scrap1.ph", line 13
col = row.findAll('th')
^
IndentationError: expected an indented block
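Can anyone tell me what I am doing wrong?

In addition to @traceur's point about the indentation error, here are a few things that simplify the code considerably: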
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
soup = BeautifulSoup(mech.open(url))
table = soup.find("table", class_="wikitable")
for row in table('tr')[1:]:
    print "|".join(col.text.strip() for col in row.find_all('th'))
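Note that instead of from BeautifulSoup import BeautifulSoup (BeautifulSoup version 3), you should use from bs4 import BeautifulSoup (version 4), because version 3 is no longer maintained.
Also note that you can pass mech.open(url) directly to the BeautifulSoup constructor instead of reading the response manually.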
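As a rough illustration of the simplification described above, here is a minimal sketch that collects every cell of a row in one pass instead of assigning sixteen numbered variables. It makes several assumptions that are not part of the thread: it is written for Python 3 (the thread uses Python 2 print statements), it parses a small hypothetical HTML snippet (SAMPLE_HTML, with made-up placeholder values) rather than fetching the Wikipedia page, the helper name table_rows is invented for this example, and it reads both th and td cells, on the guess that the data cells of such a table are td elements.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page so the sketch runs offline.
# The vehicle names and numbers are placeholders, not real data.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Vehicle</th><th>2011</th><th>2012</th></tr>
  <tr><th>Model A</th><td>100</td><td>150</td></tr>
  <tr><th>Model B</th><td>20</td><td>35</td></tr>
</table>
"""


def table_rows(html):
    """Return each data row of the first wikitable as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        # Assumption: a row may mix <th> (row label) and <td> (data) cells.
        cells = row.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


if __name__ == "__main__":
    for record in table_rows(SAMPLE_HTML):
        print("|".join(record))
```

Note that every statement inside the for loop is indented by the same amount, which is exactly what the IndentationError in the question is complaining about.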
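Hope this helps.

I still see an indentation error with your script. Please help me get rid of it.

@Auguster hm, there is no indentation problem here. Please check that you pasted the code correctly.

I pasted the same code, but I get this error:
File "scrap1.py", line 10
print "|".join(col.text.strip() for col in row.find_all('th'))
^
IndentationError: expected an indented block

@Auguster indent the line that starts with print.

All I had to do was put the print line on the same line as the for loop. This is probably because I am editing on Windows and running the code with cygwin.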
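For anyone hitting the same error, the following sketch shows one way to check a piece of source text for indentation problems without running it. It is an assumption-laden illustration, not a diagnosis of the setup above: it targets Python 3, the helper name indentation_problem and the three sample strings are invented here, and it is not known whether mixed tabs and spaces were the actual cause in the Windows and cygwin case.

```python
def indentation_problem(source):
    """Return the line number of an indentation error in source, or None.

    The source is only compiled, never executed, so undefined names
    such as `rows` below do not matter.
    """
    try:
        compile(source, "<snippet>", "exec")
    except IndentationError as exc:  # TabError is a subclass of IndentationError
        return exc.lineno
    return None


# Loop body not indented: the situation reported in the question.
BAD = "for row in rows:\nprint(row)\n"

# Loop body indented with four spaces.
GOOD = "for row in rows:\n    print(row)\n"

# One line indented with a tab, the next with eight spaces.
MIXED = "if True:\n\tx = 1\n        y = 2\n"

if __name__ == "__main__":
    for name, src in [("BAD", BAD), ("GOOD", GOOD), ("MIXED", MIXED)]:
        print(name, indentation_problem(src))
```

If an editor silently converts between tabs and spaces, code that looks aligned on screen can still fail this check, so it may be worth configuring the editor to insert spaces only.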