
Scraping Wikipedia data with Python


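I am trying to retrieve three columns (NFL team, player name, college team) from the Wikipedia page below. I am new to Python and have been trying to use BeautifulSoup to do this. I only need the rows for QBs, but I cannot even get all the columns regardless of position. This is what I have so far; it outputs nothing, and I am not entirely sure why. I believe it is due to the a tags, but I don't know what to change. Any help would be appreciated.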

import urllib2
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""

table = soup.find("table", { "class" : "wikitable sortable" })

#print table

#output = open('output.csv','w')

for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
        #NFL = cells[1].find(text=True)
        #player = cells[2].find(text = True)
        #pos = cells[3].find(text=True)
        #college = cells[4].find(text=True)
        #write_to_file = player + " " + NFL + " " + college + " " + pos
        #print write_to_file

    #output.write(write_to_file)

#output.close()

I know a lot of it is commented out; that is because I have been trying to find where it fails.

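Here is what I would do:

  • Find the Player selections paragraph
  • Get the next wikitable using find_next_sibling()
  • Find all tr tags inside it
  • For each row, find the td and th tags and get the desired cells by index

Here is the code: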

filter_position = 'QB'
player_selections = soup.find('span', id='Player_selections').parent
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])

    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        continue

    if position != filter_position:
        continue

    print nfl_team, name, position, college

This is the output (filtered to quarterbacks only):

Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
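
The doubled names in the output above (for example "Ryan, MattMatt Ryan†") suggest that each player cell holds a hidden sort key in addition to the visible link. Below is a minimal Python 3 sketch of the same steps that takes the name from the link text instead of the whole cell. It is only an illustration: SAMPLE_HTML is a made-up, trimmed stand-in for the draft table, the function name quarterbacks is hypothetical, and the real page's markup and column order may differ, so check them before relying on the indexes.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page (an assumption, not a copy of it).
# Cell order mirrors the indexes used in the answer: 3 = team, 4 = player,
# 5 = position, 6 = college.
SAMPLE_HTML = """
<h2><span id="Player_selections">Player selections</span></h2>
<table class="wikitable sortable">
<tr><th></th><th>Rnd.</th><th>Pick No.</th><th>NFL team</th>
    <th>Player</th><th>Pos.</th><th>College</th></tr>
<tr><th></th><th>1</th><td>1</td><td>Miami Dolphins</td>
    <td><span style="display:none">Long, Jake</span><a href="/wiki/Jake_Long">Jake Long</a></td>
    <td>OT</td><td>Michigan</td></tr>
<tr><th></th><th>1</th><td>3</td><td>Atlanta Falcons</td>
    <td><span style="display:none">Ryan, Matt</span><a href="/wiki/Matt_Ryan">Matt Ryan</a>†</td>
    <td>QB</td><td>Boston College</td></tr>
</table>
"""


def quarterbacks(html, position="QB"):
    """Return (team, player, college) tuples for rows matching `position`."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the section heading, then the first wikitable that follows it.
    heading = soup.find("span", id="Player_selections").parent
    table = heading.find_next_sibling("table", class_="wikitable")

    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all(["td", "th"])
        if len(cells) < 7:
            continue  # not a full data row
        if cells[5].get_text(strip=True) != position:
            continue
        # Prefer the visible link text so a hidden sort key is not included.
        link = cells[4].find("a")
        name_source = link if link is not None else cells[4]
        rows.append((
            cells[3].get_text(strip=True),
            name_source.get_text(strip=True),
            cells[6].get_text(strip=True),
        ))
    return rows


for team, name, college in quarterbacks(SAMPLE_HTML):
    print(team, name, college)
```

To run this against the live page, the HTML would have to be fetched first (for example with urllib.request and a User-Agent header, as in the question) and passed to the function in place of SAMPLE_HTML.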