Python + BeautifulSoup: exporting to CSV
I'm having some trouble automatically scraping data from a table in a Wikipedia article. First, I got an encoding error. I specified UTF-8 and the error went away, but many characters in the scraped data are not displayed correctly. You can tell from the code that I'm a complete novice:
from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/Anderson_Silva"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
Result = ""
Record = ""
Opponent = ""
Method = ""
Event = ""
Date = ""
Round = ""
Time = ""
Location = ""
Notes = ""
table = soup.find("table", { "class" : "wikitable sortable" })
f = open('output.csv', 'w')
for row in table.findAll("tr"):
    cells = row.findAll("td")
    #For each "tr", assign each "td" to a variable.
    if len(cells) == 10:
        Result = cells[0].find(text=True)
        Record = cells[1].find(text=True)
        Opponent = cells[2].find(text=True)
        Method = cells[3].find(text=True)
        Event = cells[4].find(text=True)
        Date = cells[5].find(text=True)
        Round = cells[6].find(text=True)
        Time = cells[7].find(text=True)
        Location = cells[8].find(text=True)
        Notes = cells[9].find(text=True)
    write_to_file = Result + "," + Record + "," + Opponent + "," + Method + "," + Event + "," + Date + "," + Round + "," + Time + "," + Location + "\n"
    write_to_unicode = write_to_file.encode('utf-8')
    print write_to_unicode
    f.write(write_to_unicode)
f.close()
As pswaminathan pointed out, using the csv module will help a lot. Here is how I did it:
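import csv

table = soup.find('table', {'class': 'wikitable sortable'})
with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in table.findAll('tr'):
        # Encode each cell's text as UTF-8 (Python 2: csv writes byte strings)
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        # Write only rows that have exactly 10 td cells
        if len(cells) == 10:
            csvwriter.writerow(cells)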
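One reason the csv module helps is quoting: a field that contains a comma is quoted automatically, whereas joining fields with "," by hand splits it into extra columns. A minimal sketch, written for Python 3 (the code above targets Python 2) and using a made-up sample row rather than real scraped data:

```python
import csv
import io

# Hypothetical sample row (not real scraped data): the last field
# contains commas and the third contains a non-ASCII character.
row = ["Win", "1-0", "José Example", "Las Vegas, Nevada, United States"]

# Naive joining: the commas inside the last field create extra columns.
naive_line = ",".join(row)

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
csv_line = buffer.getvalue().rstrip("\n")

# Parse both lines back to compare the resulting column counts.
naive_fields = next(csv.reader([naive_line]))
csv_fields = next(csv.reader([csv_line]))

print(len(naive_fields))  # 6 columns: the location was split apart
print(len(csv_fields))    # 4 columns: the location stayed in one field
```

When writing to a real file in Python 3, the usual pattern is open(path, 'w', newline='', encoding='utf-8'), so that strings can be passed to writerow without calling encode first.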
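Discussion
- Using the csv module, I created a csvwriter object connected to the output file.
- By using the with statement, I don't need to worry about closing the output file when I'm done: it is closed automatically after the with block.
- In my code, cells is a list of UTF-8 encoded text extracted from the td tags inside each tr tag.
- I used the construct c.text, which is more concise than c.find(text=True).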
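Note that the two constructs are not strictly equivalent when a cell contains nested tags. A small sketch, assuming Python 3 with bs4 installed and a hypothetical HTML fragment standing in for a Wikipedia table cell:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a cell whose text is split between a link
# and the plain text that follows it.
html = "<table><tr><td><a href='#'>Las Vegas</a>, Nevada</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# string=True is the current name of the text=True argument used above
# (renamed in Beautiful Soup 4.4.0); it matches the first text node only.
first_node = cell.find(string=True)

# .text concatenates every text node inside the cell.
full_text = cell.text

print(first_node)  # Las Vegas
print(full_text)   # Las Vegas, Nevada
```

Here find(string=True) returns only the first text node, while .text returns all of the text in the cell, which may matter for cells that mix links and plain text.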