Scraping a table from Wikipedia using Python? (Python, Pandas, Web Scraping, BeautifulSoup)

python pandas web-scraping

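I am trying to get table data from the Wikipedia page whose URL appears in the code below. I tried pandas' pd.read_html, but it does not work for the table I am struggling to scrape (confirmed COVID-19 cases in Nepal by district).

I also tried scraping the data with BeautifulSoup and pandas, but without success: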

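import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', {'class': 'wikitable'})
# This is the call that fails: table is a BeautifulSoup Tag, not a string
dfs = pd.read_html(table)
dfs[0]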

This works. You need to convert the table to a string for read_html to work properly.

For some reason, the rowspan and colspan attributes show up as "2;", and I couldn't find a good way to fix that. pd.read_html() doesn't like it, so I just used .replace().
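The workaround described above (convert the table to a string, then strip the stray semicolon from the span attributes) can be sketched roughly as follows. This is a minimal illustration, not the answerer's exact code: the inline `SAMPLE_HTML` table and the helper name `strip_span_semicolons` are hypothetical stand-ins so the example runs without network access, and a regular expression is used in place of a plain `.replace()` so that only the span attributes are touched.

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, with malformed span values.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th rowspan="2;">District</th><th colspan="2;">Cases</th></tr>
  <tr><th>Confirmed</th><th>Recovered</th></tr>
  <tr><td>District A</td><td>5</td><td>1</td></tr>
  <tr><td>District B</td><td>4</td><td>0</td></tr>
</table>
"""


def strip_span_semicolons(html: str) -> str:
    """Turn rowspan="2;" / colspan="2;" into rowspan="2" / colspan="2"."""
    return re.sub(r'((?:row|col)span=")(\d+);(")', r"\1\2\3", html)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# read_html cannot take the Tag itself, so serialize it to a string first,
# clean the span attributes, and wrap the result in a file-like object.
cleaned = strip_span_semicolons(str(table))
df = pd.read_html(StringIO(cleaned))[0]
print(df)
```

For the real page, `SAMPLE_HTML` would be replaced by `requests.get(url).text`. Wrapping the string in `StringIO` is a choice made here because newer pandas releases warn when a literal HTML string is passed directly.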

In theory, this should accomplish the same thing while being shorter and simpler, but it has the same problem with rowspan:

import pandas as pd

dfs = pd.read_html("https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal", flavor="lxml")
print(dfs[0])  # whatever the index of the table is
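Since the snippet above leaves the table index open, one possible way to avoid guessing it is to let read_html filter tables by their text (`match`) or by HTML attributes (`attrs`). The following is only a sketch on a hypothetical two-table page (`PAGE` is made up for illustration); it does not address the rowspan problem itself.

```python
from io import StringIO

import pandas as pd

# Hypothetical page containing two tables; only the second is of interest.
PAGE = """
<table class="infobox">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
<table class="wikitable">
  <tr><th>District</th><th>Cases</th></tr>
  <tr><td>District A</td><td>3</td></tr>
</table>
"""

# Keep only tables whose text matches the given string or regex.
matched = pd.read_html(StringIO(PAGE), match="District")

# Alternatively, keep only tables carrying a specific HTML attribute.
by_class = pd.read_html(StringIO(PAGE), attrs={"class": "wikitable"})

print(matched[0])
print(by_class[0])
```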

This appears to be a possible bug in read_html (pandas version 1.0.3).