For loop to pass a variable through a URL in Python
Tags: python, python-2.7, web-scraping, beautifulsoup, python-requests

I am very new to Python, and I am trying to teach myself by doing some simple web scraping to get football statistics. I have managed to get the data for one page at a time, but I have not figured out how to add a loop to my code so that it scrapes multiple pages at once (or multiple positions/years/conferences). I have searched this site and others quite a bit, but I can't seem to find the right answer. Here is my code:
import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=1&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'data-table1'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&#39;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

#for line in list_of_rows: print ', '.join(line)

outfile = open("./2014.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
writer.writerows(list_of_rows)
outfile.close()

Here is my attempt at adding a variable to the URL and building a loop:

import csv
import requests
from BeautifulSoup import BeautifulSoup

pagelist = ["1", "2", "3"]

x = 0
while (x < 500):
    url = "http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p="+str(x)).read(),'html'+"&d-447263-s=RUSHING_ATTEMPTS_PER_GAME_AVG&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=RUSHING&conference=null&qualified=false"
    response = requests.get(url)
    html = response.content

    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39;', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

    outfile = open("./2014.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Att", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Long", "1st", "1st%", "20+", "40+", "FUM"])
    writer.writerows(list_of_rows)
    x = x + 0
outfile.close()
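
Assuming you only want to change the page number, you can put a %s placeholder in the URL and fill it in with string formatting on each pass of a for loop.

Note that you should open the file in append mode ("ab") rather than write mode ("wb"), because write mode overwrites the existing contents, as you experienced. In append mode, new content is written at the end of the file.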
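
To make the difference between the two modes concrete, here is a minimal sketch. It is written for Python 3, where csv files are opened in text mode ("w" / "a" with newline="") instead of the "wb" / "ab" used in this thread's Python 2.7 code. The `pages` dictionary and both helper functions are hypothetical stand-ins for the scraped rows, not part of the original script.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the rows scraped from each page.
pages = {
    1: [["1", "Player A"], ["2", "Player B"]],
    2: [["3", "Player C"]],
}


def write_mode_each_pass(path):
    # Reopening with "w" truncates the file on every pass.
    for page in sorted(pages):
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(pages[page])


def append_mode_each_pass(path):
    # Reopening with "a" keeps earlier rows and adds new ones at the end.
    for page in sorted(pages):
        with open(path, "a", newline="") as f:
            csv.writer(f).writerows(pages[page])


def read_rows(path):
    with open(path, newline="") as f:
        return list(csv.reader(f))


tmpdir = tempfile.mkdtemp()
w_path = os.path.join(tmpdir, "write.csv")
a_path = os.path.join(tmpdir, "append.csv")

write_mode_each_pass(w_path)
append_mode_each_pass(a_path)

print(read_rows(w_path))  # only the last page's rows remain
print(read_rows(a_path))  # rows from every page, in order
```

One thing to keep in mind with append mode: running the script a second time adds the same rows again, so you may prefer to open the file once in write mode before the loop and keep it open until all pages are written.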
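
This is beyond the scope of the question and more of a friendly code-improvement suggestion, but consider splitting the script into smaller functions that each do one thing, such as fetching the data from the site, writing it to CSV, and so on.

Thanks very much for the help, Jomel. The code works when I print to the screen, but when I try to save to CSV, each page seems to overwrite the previous one in the file, so I end up with only the last page. Is there a way to append each page's data to the first page's data without overwriting it?

@JasonC See my revised answer. You just need to open the file in append mode so that the contents are not overwritten on each pass.
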
import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?tabSeq=0&season=2014&seasonType=REG&experience=&Submit=Go&archive=false&d-447263-p=%s&conference=null&statisticCategory=PASSING&qualified=false'

for p in ['1','2','3']:
    url = url_template % p
    response = requests.get(url)
    html = response.content

    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39;', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

    outfile = open("./2014Passing.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
    writer.writerows(list_of_rows)
    outfile.close()
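
The loop above varies only the page number, while the question also mentions several years and categories. One possible way to cover those is to build each URL from its parameters and loop over every combination. This is a sketch in Python 3 using only the standard library. The `build_url` helper and the choice of seasons are assumptions for illustration; the parameter names are copied from the URLs in this thread, and nothing here confirms which values the site accepts.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "http://www.nfl.com/stats/categorystats"


def build_url(season, category, page):
    """Hypothetical helper: assemble one stats URL from its varying parts."""
    params = [
        ("tabSeq", 0),
        ("season", season),
        ("seasonType", "REG"),
        ("experience", ""),
        ("Submit", "Go"),
        ("archive", "false"),
        ("d-447263-p", page),
        ("conference", "null"),
        ("statisticCategory", category),
        ("qualified", "false"),
    ]
    return BASE_URL + "?" + urlencode(params)


seasons = [2013, 2014]              # assumed values, for illustration only
categories = ["PASSING", "RUSHING"]  # both appear in the URLs in this thread
pages = [1, 2, 3]

# product() yields every (season, category, page) combination.
urls = [build_url(s, c, p) for s, c, p in product(seasons, categories, pages)]

print(len(urls))
print(urls[0])
```

Each URL in `urls` could then be passed to `requests.get` in the same way as `url` in the loop above.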

url_template = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'

for page in [1,2,3]:
    url = url_template % page
    response = requests.get(url)

    # Rest of the processing code can go here

    outfile = open("./2014.csv", "ab")
    writer = csv.writer(outfile)
    writer.writerow(...)
    writer.writerows(list_of_rows)
    outfile.close()
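
As for the suggestion about splitting the script into smaller functions, a rough sketch of one way to do it follows. It is written for Python 3 and makes several assumptions so that it runs without the network or third-party packages: the standard library's `HTMLParser` stands in for BeautifulSoup, the `fetch` argument stands in for `requests.get`, and `FAKE_SITE` with its example.invalid URLs is made-up data. All function and class names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Turn one page of HTML into a list of rows of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_pages(fetch, url_template, pages):
    """Fetch every page and return all rows in one list."""
    rows = []
    for page in pages:
        rows.extend(parse_rows(fetch(url_template % page)))
    return rows


def write_csv(outfile, header, rows):
    """Write the header once, then every row."""
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)


# Made-up responses standing in for the real site.
FAKE_SITE = {
    "http://example.invalid/stats?p=1":
        "<table><tr><th>Rk</th><th>Player</th></tr>"
        "<tr><td>1</td><td>Player A</td></tr></table>",
    "http://example.invalid/stats?p=2":
        "<table><tr><td>2</td><td>Player B</td></tr></table>",
}

rows = scrape_pages(FAKE_SITE.__getitem__, "http://example.invalid/stats?p=%s", [1, 2])
buffer = io.StringIO()
write_csv(buffer, ["Rk", "Player"], rows)
print(buffer.getvalue())
```

Because all rows are collected before anything is written, the file only needs to be opened once and the header appears a single time, so the overwrite problem from the comments does not come up. For real use, `fetch` could be something like `lambda url: requests.get(url).text`, and `buffer` an actual file opened with `open(path, "w", newline="")`.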