Python 2.7: parsing an HTML table with BS4

Tags: python-2.7, html-parsing, web-scraping, beautifulsoup

I have been trying different methods of extracting data from this site (), but I can't seem to get any of them to work. I've tried working with the indices that are given, but I can't get that to work either. I think I've tried too many things at this point, so if anyone could point me in the right direction I would really appreciate it.

Eventually I want to extract all of the information and export it to a .csv file, but for now I'm just trying to get the name and position to print so I can get started.

Here is my code:

import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')

for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)

--FINAL UPDATE-- the working version (added after the fix in the answer below), which also writes the results to a csv file:

import urllib2
from bs4 import BeautifulSoup
import csv

url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')

f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])
# variable to check length of rows
x = (len(table.findAll('tr')) - 1)
# set to run through x
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone, )
    f.writerow(player)
Here is the error I get:

line 14, in name = col[1].string
IndexError: list index out of range

--UPDATE--

OK, I've made some progress. It now lets me run from the start through to the end, but it has to be told how many rows the table has. How can I get it to just keep going through them until the end?

Updated code:

import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')


for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)

I can't run your script because of firewall permissions, but I believe the problem is in this line:

col = row.findAll('tr')

row is a tr tag, and you are asking BeautifulSoup to find all tr tags within that tr tag. What you probably meant to do is:

col = row.findAll('td')

Also, since the actual text is not sitting directly inside the tds but is buried in nested divs and a tags, use the getText method instead of .string:
name = col[1].getText()
position = col[3].getText()
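
To make the difference between .string and getText concrete, here is a tiny made-up cell, roughly the shape the answer describes (the markup is invented for illustration, not copied from the real page):

from bs4 import BeautifulSoup

# an invented cell: the visible text sits inside a nested div/a, with a second element beside it
html = '<table><tr><td><div><a href="#">John Smith</a></div><span>*</span></td></tr></table>'
cell = BeautifulSoup(html).td

print cell.string    # prints None: .string gives up when the tag holds more than one child
print cell.getText() # prints "John Smith*": getText collects all of the nested text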

It only took me about 8 hours or so to figure it out. Learning is fun. Thanks for the help, Kevin! The question now includes the code that writes the scraped data out to a csv file. The next step is to take the data and filter out certain positions.
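
For that filtering step, a rough sketch of one way to do it afterwards: read the csv back in and keep only the rows whose position is in a chosen set (the file name matches the code above; the set of positions and the output file name are just examples):

import csv

wanted = set(['QB', 'RB', 'WR'])   # example positions to keep

with open('2000scrape.csv') as source, open('2000scrape_filtered.csv', 'w') as dest:
    reader = csv.reader(source)
    writer = csv.writer(dest)
    writer.writerow(next(reader))   # copy the header row unchanged
    for row in reader:
        if row[1] in wanted:        # column 1 is "Position" in the scraped file
            writer.writerow(row)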


Ah, that makes sense. Thanks a lot. OK, so I made the changes you suggested, and I'm definitely making progress: it prints most of the results on the page. It starts with Adrian Dingle rather than the first name in the column, though, but then prints the complete list, with the | and the positions. Then it returns this error: File "nfltest.py", line 14, in name = col[1].getText() IndexError: list index out of range. I've tried playing with the indices again, but I can't seem to get rid of the error. Is it me, or is this table just formatted strangely?
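
That second IndexError most likely means some tr on the page has fewer td cells than expected (a repeated header or a spacer row), not that your indices are wrong. The length check in the sketch further up is one way to deal with it; an equivalent alternative, reusing the soup and table objects from the code above, is to catch the error and move on:

for row in table.findAll('tr'):
    col = row.findAll('td')
    try:
        print "|".join([col[1].getText(), col[3].getText()])
    except IndexError:
        # header/spacer rows do not have these cells; ignore them
        pass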