Python table scraping returns incomplete results

python web-scraping

I am trying to scrape this page with BeautifulSoup and extract the table from it. Here is my code:

from bs4 import BeautifulSoup
import urllib.request
base_url = "http://bifr.nic.in/asp/list.asp"

page = urllib.request.urlopen(base_url)
soup = BeautifulSoup(page, "html.parser")

table = soup.find("table",{"class":"forumline"})
tr = table.find_all("tr")
for rows in tr:
    print(rows.get_text())

It raises no error, but when I run it I only get the first row of the table:

List of Companies

Case
            No
Company
            Name









 359  2000   A & F OVERSEAS LTD.





 359  2000   A & F OVERSEAS LTD.
 359  2000   A & F OVERSEAS LTD.

That is the result I get. I don't understand what is happening.

The page's HTML markup probably contains errors. Try html5lib instead of html.parser. It has to be installed first:

pip install html5lib


soup = BeautifulSoup(page, "html5lib")
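To check whether the parser is really the cause, it can help to count how many rows each installed parser recovers from the same markup. The following is a minimal diagnostic sketch, not part of the original answer: the helper name count_rows_per_parser and the sample markup are made up for illustration, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <tr> tags found} for every installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; skip it instead of failing.
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts


# Hypothetical, well-formed sample. On the real page, pass the downloaded HTML instead.
sample = """<table class="forumline">
<tr><td>99</td><td>1988</td><td>A B C PRODUCTS LTD.</td></tr>
<tr><td>103</td><td>1989</td><td>A INFRASTRUCTURE LTD.</td></tr>
</table>"""

print(count_rows_per_parser(sample))
```

If the counts differ widely between parsers on the real page, the markup is most likely malformed and the more lenient parser is the safer choice.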

Try the following to get all the data from that table:

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("http://bifr.nic.in/asp/list.asp")
soup = BeautifulSoup(page, "html5lib")
table = soup.select_one("table.forumline")
for items in table.select("tr")[4:]:
    data = ' '.join([item.get_text(" ",strip=True) for item in items.select("td")])
    print(data)

Partial output:

359 2000 A & F OVERSEAS LTD.
99 1988 A B C PRODUCTS LTD.
103 1989 A INFRASTRUCTURE LTD.
3 2006 A V ALLOYS LTD.
13 1988 A V J WIRES LTD.
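The [4:] slice above assumes that exactly four header rows come before the data. A possible alternative, sketched here under the assumption that every data row has at least three non-empty cells, is to filter rows by their cell count. The helper name extract_data_rows, the min_cells parameter, and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_data_rows(html, parser="html.parser", min_cells=3):
    """Return the cell texts of every row that has at least min_cells non-empty cells."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", class_="forumline")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        # Title and header rows have fewer cells, so they are filtered out here.
        if len(cells) >= min_cells and all(cells):
            rows.append(cells)
    return rows


# Hypothetical sample that imitates the title row, header row, and two data rows.
sample = """<table class="forumline">
<tr><td colspan="3">List of Companies</td></tr>
<tr><td>Case No</td><td>Company Name</td></tr>
<tr><td>99</td><td>1988</td><td>A B C PRODUCTS LTD.</td></tr>
<tr><td>103</td><td>1989</td><td>A INFRASTRUCTURE LTD.</td></tr>
</table>"""

for row in extract_data_rows(sample):
    print(" ".join(row))
```

On the real page, pass the downloaded HTML and "html5lib" as the parser. The threshold may need adjusting if the header rows there also have three or more cells.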

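That is not the issue. The question clearly says the results are incomplete. Again, the only result I get is 359 2000 A & F OVERSEAS LTD., although, yes, in the proper format.

Wow! How did you do that? It gives the correct format for all the rows. Can you explain briefly?

Glad to help, Sandy. First, html5lib fixes the broken HTML. What .get_text(" ", strip=True) does is strip the unwanted whitespace, and ' '.join() then turns the list into a regular string. Check the link below for more information.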
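The effect of get_text(" ", strip=True) and ' '.join() can be seen on a small inline snippet. This is only an illustrative sketch: the markup below is made up and is not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical row with stray whitespace and a <br> inside the first cell.
html = """<table><tr>
<td>
   Case<br>No </td>
<td>  99 </td><td> 1988 </td><td>  A B C PRODUCTS LTD.  </td>
</tr></table>"""

row = BeautifulSoup(html, "html.parser").find("tr")

# strip=True trims each text fragment and drops the empty ones;
# the " " separator keeps the fragments from running together.
cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]

# Without a separator, the two fragments of the first cell are glued together.
no_separator = row.find("td").get_text(strip=True)

# ' '.join() turns the list of cell strings into one line.
line = " ".join(cells)

print(cells)
print(no_separator)
print(line)
```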