How to scrape a specific table from a website using Python (beautifulsoup4 and requests, or any other library)?

python web-scraping

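The link to the website is in the code below. I want to scrape this table:

Top 10 European Union companies by revenue according to Fortune (2016).

Please share code that does this. Here is what I have so far: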

import requests
from bs4 import BeautifulSoup

def web_crawler(url):
    page = requests.get(url)
    plain_text = page.text
    soup = BeautifulSoup(plain_text,"html.parser")
    tables = soup.findAll("tbody")[1]
    print(tables)

soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")

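Following what @FanMan said, here is some simple code to help you get started. Keep in mind that you will need to clean it up and do the rest of the work yourself.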

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Collect the text nodes of every non-empty paragraph on the page
temp_datastore = list()
for text in soup.findAll('p'):
    w = text.findAll(text=True)  # all strings nested inside this <p> tag
    if len(w) > 0:
        temp_datastore.append(w)

Some documentation worth reading: Beautiful Soup, Requests, and urllib.
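The loop in this answer calls a search method on each paragraph element, which is possible because every element returned by a search is itself a searchable tag. Below is a minimal sketch of that idea, run on a tiny hand-written HTML string instead of the live page. `SAMPLE_HTML` and `paragraph_texts` are hypothetical names used only for illustration, and the sketch uses `find_all(string=True)`, the newer spelling of `findAll(text=True)` in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline
SAMPLE_HTML = "<div><p>The <b>EU</b> economy.</p><p></p><p>Second paragraph.</p></div>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

paragraph_texts = []
for paragraph in soup.find_all("p"):
    # Each paragraph is a Tag, so it supports find_all just like soup does
    pieces = paragraph.find_all(string=True)
    if pieces:  # skip paragraphs that contain no text at all
        paragraph_texts.append("".join(pieces))

print(paragraph_texts)
```

Joining the pieces gives one string per paragraph rather than a list of fragments, which is usually easier to clean up afterwards.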

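Your first problem is that your url is not defined correctly. After that, you need to find the table you want to extract and its class. In this case the class is "wikitable", and the table you want is the first one with that class. I have started the code for you, so it will give you the data extracted from the table. Web scraping is a good thing to learn, but if you are just starting out with programming, practice on something simpler first.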

import requests
from bs4 import BeautifulSoup

def webcrawler():

    url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
    page = requests.get(url)
    soup = BeautifulSoup(page.text,"html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    print(tables)

webcrawler()
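Picking a table by position (`[0]`) breaks if the page layout changes, so one possible refinement is to pick it by its title and then turn it into rows of cell text. The sketch below assumes the title is stored in a `<caption>` element inside the table, which may not hold for the real Wikipedia page. `SAMPLE_HTML`, `find_table_by_caption`, and `table_to_rows` are hypothetical names introduced here, and the sample data is made up so the code runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real HTML (made-up data)
SAMPLE_HTML = """
<table class="wikitable">
  <caption>Some other table</caption>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
<table class="wikitable">
  <caption>Fortune top 10 (2016)</caption>
  <tr><th>Rank</th><th>Corporation</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
  <tr><td>2</td><td>Sample AG</td></tr>
</table>
"""

def find_table_by_caption(html, keyword):
    """Return the first wikitable whose caption contains keyword, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table", class_="wikitable"):
        caption = table.find("caption")
        # Assumption: the table title lives in a <caption> element
        if caption and keyword.lower() in caption.get_text().lower():
            return table
    return None

def table_to_rows(table):
    """Convert a table tag into a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

table = find_table_by_caption(SAMPLE_HTML, "Fortune")
rows = table_to_rows(table)
for row in rows:
    print(row)
```

To try this against the live page, pass `requests.get(url).text` as the `html` argument. If the function returns `None`, the title is probably not in a caption and the table has to be located some other way, for example by position as in the code above.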

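You need to do some reading before asking. We are here to help, not to teach. Please add the code you have already tried and explain what went wrong with it. I will be happy to help you at that point.

@FanMan Sorry for not including code; I am actually new to Stack Overflow. Anyway, I did not follow your answer. Basically, I want to fetch the table and its contents. Also, the Wikipedia link I provided has several tables, and I only want one specific table, the one titled "Fortune top 10 E.U. corporations by revenue (2016)".

@FanMan I would also like to ask about the for loop in your answer, where you call text.findAll. I do not know why, but this does not work in my PyCharm: I can call findAll on soup (the BeautifulSoup variable), but not on text (an element obtained from soup).

I have added my answer. The answer you mentioned is not mine.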