Python BeautifulSoup table scraping


python html web-scraping


I am trying to build a table scraper with BeautifulSoup. I wrote the following Python code:

import urllib2
from bs4 import BeautifulSoup

url = "http://dofollow.netsons.org/table1.htm"  # change to whatever your url is

page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

for i in soup.find_all('form'):
    print i.attrs['class']

I need to scrape Nome, Cognome, and Email.

Loop over the table rows (tr tags) and get the text of the inner cells (td tags):
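The code for this step did not survive in the page as captured here. A minimal Python 3 sketch of such a loop follows, assuming two header rows followed by data rows with three cells each. The inline SAMPLE_HTML is a made-up stand-in for the real page, and format_rows is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two header rows, then data rows.
SAMPLE_HTML = """
<table>
  <tr><td colspan="3">Elenco</td></tr>
  <tr><td>Nome</td><td>Cognome</td><td>Email</td></tr>
  <tr><td>Massimo</td><td>Allegri</td><td>Allegri.Massimo@alitalia.it</td></tr>
  <tr><td>Alessandra</td><td>Anastasia</td><td>Anastasia.Alessandra@alitalia.it</td></tr>
</table>
"""


def format_rows(html, header_rows=2):
    """Return one formatted line per data row, skipping the header rows."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tr in soup.find_all("tr")[header_rows:]:
        tds = tr.find_all("td")
        lines.append("Nome: %s, Cognome: %s, Email: %s"
                     % (tds[0].text, tds[1].text, tds[2].text))
    return lines


for line in format_rows(SAMPLE_HTML):
    print(line)
```

With this sample the fields come out single-spaced. The double spaces in the output quoted next presumably reflect leading whitespace in the real page's cells.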

Prints:

Nome:  Massimo, Cognome:  Allegri, Email:  Allegri.Massimo@alitalia.it
Nome:  Alessandra, Cognome:  Anastasia, Email:  Anastasia.Alessandra@alitalia.it
...

FYI, the [2:] slice here skips the two header rows.
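To illustrate the slice: find_all returns a list-like result, so [2:] simply drops its first two entries. A hard-coded count breaks if the header layout changes. One possible alternative, assuming the header cells are marked up as th rather than td (which may not hold for the OP's page), is to skip any row that lacks enough td cells. The sample HTML and the data_rows helper below are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical table whose two header rows use <th> cells.
SAMPLE_HTML = """
<table>
  <tr><th colspan="3">Contacts</th></tr>
  <tr><th>Nome</th><th>Cognome</th><th>Email</th></tr>
  <tr><td>Massimo</td><td>Allegri</td><td>Allegri.Massimo@alitalia.it</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr")

# Slicing drops the first two <tr> elements, whatever they contain.
sliced = rows[2:]


def data_rows(all_rows, min_cells=3):
    """Yield the <td> cells of each row that has at least min_cells of them."""
    for tr in all_rows:
        tds = tr.find_all("td")
        if len(tds) >= min_cells:
            yield tds


filtered = [[td.get_text(strip=True) for td in tds] for tds in data_rows(rows)]
print(len(rows), len(sliced), filtered)
```

The filter costs a few more lines than the slice but does not depend on knowing how many header rows there are.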

UPD: here is how to save the results to a txt file:

with open('output.txt', 'w') as f:
    for tr in soup.find_all('tr')[2:]:
        tds = tr.find_all('td')
        f.write("Nome: %s, Cognome: %s, Email: %s\n" % \
              (tds[0].text, tds[1].text, tds[2].text))
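If the data is headed for a spreadsheet or another program, CSV may be more convenient than free-form text. Below is a sketch using the standard csv module under the same two-header-row assumption. The write_csv helper and the sample HTML are hypothetical, and the output goes to an in-memory buffer only so the example runs without touching disk:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two header rows, then data rows.
SAMPLE_HTML = """
<table>
  <tr><td colspan="3">Elenco</td></tr>
  <tr><td>Nome</td><td>Cognome</td><td>Email</td></tr>
  <tr><td>Massimo</td><td>Allegri</td><td>Allegri.Massimo@alitalia.it</td></tr>
  <tr><td>Alessandra</td><td>Anastasia</td><td>Anastasia.Alessandra@alitalia.it</td></tr>
</table>
"""


def write_csv(html, out, header_rows=2):
    """Write Nome/Cognome/Email rows from html to the file-like object out."""
    soup = BeautifulSoup(html, "html.parser")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Nome", "Cognome", "Email"])
    count = 0
    for tr in soup.find_all("tr")[header_rows:]:
        tds = tr.find_all("td")
        # Keep the first three cells and trim surrounding whitespace.
        writer.writerow([td.get_text(strip=True) for td in tds[:3]])
        count += 1
    return count


buffer = io.StringIO()
written = write_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

To write a real file, pass open('output.csv', 'w', newline='', encoding='utf-8') instead of the buffer.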

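# Library
from bs4 import BeautifulSoup

# Empty list to collect one dict per table row
tabs = []

# File handling: read the saved HTML page
with open('/home/rakesh/showHW/content.html', 'r') as fp:
    html_content = fp.read()

table_doc = BeautifulSoup(html_content, 'html.parser')

# Parse the HTML content: the first three cells of each row
for tr in table_doc.table.find_all('tr'):
    tds = tr.find_all('td')
    tabs.append({
        'Nome': tds[0].string,
        'Cognome': tds[1].string,
        'Email': tds[2].string
    })

print(tabs)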
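One caveat with .string: it returns the text only when a tag has a single child, and None when the cell mixes text with other markup, whereas get_text() concatenates all text inside the tag. A small sketch with made-up cells:

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one with plain text, one mixing text and markup.
html = "<table><tr><td>Massimo</td><td>Allegri <b>*</b></td></tr></table>"
tds = BeautifulSoup(html, "html.parser").find_all("td")

plain = tds[0].string            # single text child: the text itself
mixed = tds[1].string            # several children: None
mixed_text = tds[1].get_text()   # concatenated text of all descendants

print(repr(plain), repr(mixed), repr(mixed_text))
```

If cells on the target page may contain links or formatting, get_text() is likely the safer choice for building the dictionaries.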

The original link posted by the OP is dead... but here is how you could scrape table data with gazpacho:

Step 1 - import Soup and download the html:

from gazpacho import Soup

url = "https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists"
soup = Soup.get(url)

Step 2 - find the table and table rows:

table = soup.find("table", {"class": "wikitable sortable"}, mode="first")
trs = table.find("tr")[1:]

Step 3 - parse each row with a function to extract the desired data:

def parse_tr(tr):
    return {
        "name": tr.find("td")[0].text,
        "country": tr.find("td")[1].text,
        "medals": int(tr.find("td")[-1].text)
    }

data = [parse_tr(tr) for tr in trs]
sorted(data, key=lambda x: x["medals"], reverse=True)

Can you explain why [2:] is needed in the first line?
@AZhao Sure, it is right there in the answer: it skips the two header rows.
