How to build a DataFrame from web scraping in Python

Tags: python, pandas, beautifulsoup, python-requests, tabulate

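I can get data from a web page by web scraping in Python. The data is extracted into a list, but I don't know how to convert that list into a DataFrame. Is there a way to scrape data from the web directly into a DataFrame? Here is my code: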

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
from pandas import DataFrame
import lxml

# GET the response from the web page using requests library
res = requests.get("https://www.worldometers.info/coronavirus/")

# PARSE and fetch content using the BeautifulSoup class of the bs4 library
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))

# Here dumping the fetched data to have a look
print( tabulate(df[0], headers='keys', tablefmt='psql') )
print(df[0])

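Well, read_html returns a list of DataFrames (as per the documentation), so you have to get the "first" (and only) element of that list.

I would just add this at the end (after you call read_html):

df = df[0]

Then you can check its info, which gives: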

df.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 207 entries, 0 to 206
# Data columns (total 10 columns):
# Country,Other       207 non-null object
# TotalCases          207 non-null int64
# NewCases            59 non-null object
# TotalDeaths         144 non-null float64
# NewDeaths           31 non-null float64
# TotalRecovered      154 non-null float64
# ActiveCases         207 non-null int64
# Serious,Critical    112 non-null float64
# Tot Cases/1M pop    205 non-null float64
# Deaths/1M pop       142 non-null float64
# dtypes: float64(6), int64(2), object(2)
# memory usage: 16.3+ KB
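
The list-versus-DataFrame distinction described above can be reproduced offline. The following is a minimal sketch, assuming a tiny made-up table in place of the real page (the `HTML` string and its values are hypothetical, not data from worldometers.info). It wraps the markup in `StringIO` because recent pandas releases warn when a literal HTML string is passed directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the scraped page; the real page is much larger.
HTML = """
<table>
  <thead>
    <tr><th>Country</th><th>TotalCases</th></tr>
  </thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>20</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, even when only one table is found.
tables = pd.read_html(StringIO(HTML))
print(type(tables), len(tables))

# Indexing the list yields the DataFrame itself.
df = tables[0]
print(type(df))
print(df)
```

Calling `tabulate` or `print` on `tables[0]` in the question's code worked for the same reason: the index picks the single DataFrame out of the list.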

import requests
import pandas as pd

r = requests.get("https://www.worldometers.info/coronavirus/")
df = pd.read_html(r.content)[0]
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
df.to_csv("data.csv", index=False)
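
A slightly more defensive variant of the same idea might look like the sketch below. The helper names `fetch_html` and `first_table` and the 10-second timeout are assumptions made for illustration, not part of the answer above. The demonstration runs on a small hypothetical HTML string so that it does not depend on the network.

```python
from io import StringIO

import pandas as pd
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


def first_table(html: str) -> pd.DataFrame:
    """Parse the HTML and return the first <table> as a DataFrame."""
    # read_html raises ValueError if the document contains no table.
    tables = pd.read_html(StringIO(html))
    return tables[0]


# Offline demonstration on a tiny hypothetical page.
sample = (
    "<table><thead><tr><th>Country</th><th>TotalCases</th></tr></thead>"
    "<tbody><tr><td>A</td><td>10</td></tr></tbody></table>"
)
df = first_table(sample)
print(df)
```

Against the real page, the two helpers would be combined as `first_table(fetch_html("https://www.worldometers.info/coronavirus/"))`, after which `df.to_csv("data.csv", index=False)` works as in the answer.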

Thank you for the solution. Very straightforward, and I learned something new today.

I can only accept one solution, but both solutions work for me. Thank you anyway.
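
For the case raised in the question, where the scraped data already sits in a plain Python list, a DataFrame can also be built by hand without read_html. This is a minimal sketch under stated assumptions: the `HTML` string is a hypothetical stand-in for the page, and the stdlib `html.parser` backend is used so that lxml is not required.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page.
HTML = """
<table>
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table")

# Column names come from the <th> cells.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Each data row becomes a list of cell strings; rows without <td> are skipped.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

# A list of row lists plus column names is enough to build the DataFrame.
df = pd.DataFrame(rows, columns=headers)

# Cells are scraped as text, so numeric columns need an explicit conversion.
df["TotalCases"] = pd.to_numeric(df["TotalCases"])
print(df)
```

Compared with read_html, this route takes more code but gives control over which cells are kept and how each one is cleaned.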