Python: How do I convert a Wikipedia table into a DataFrame?
I want to apply some statistics to data tables taken directly from specific web pages. This tutorial helped me create a DataFrame from a table on a web page. However, I want to do the same with geographic data, such as the population and GDP of several countries. I found some tables on Wikipedia, but it does not work well and I don't understand why. Here is my code, following the tutorial mentioned above:
import requests
import lxml.html as lh
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print ([len(T) for T in tr_elements[:12]])
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    #If row is not of size 10, the //tr data is not from our table
    if len(T)!=10:
        break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Skip the first column; for the others, try a numeric conversion
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
print('Data gathering: done!')
print('Column length:')
print([len(C) for (title,C) in col])
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
print(df.head())
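The column lengths should not be zero, and the output format differs from the one in the tutorial. Do you know how to fix this? Or is there another data source that does not return this odd output format?

Use pandas to read the data from the URL instead of using requests (this returns a list of DataFrames, one per table):
tables = pd.read_html(url)

As the print statement that reports the row lengths shows, the row length for the first line of output is not 10. It is 5. Your code breaks out of the loop on the first iteration instead of filling col. Change this statement: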
if len(T)!=10:
    break
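to
if len(T)!=5:
    break
That should fix the problem.

A note on the tuples: col stores (name, list) tuples, and a tuple itself cannot be modified in place. Appending to the list inside it, as col[i][1].append(data) does, still works because only the inner list changes. If you prefer entries that you can replace, store lists instead by changing the append in the header loop to col.append([name, []]). Also, break exits the for loop entirely, which is why the column lists end up with no data.

When doing this kind of thing you also have to look at the HTML. The table is not formatted as nicely as one would hope: for example, the cells contain a lot of newlines and images of flags, and you can see the format differ from row to row.

It looks like you want a simple way to do this, so I have added an approach that uses bs4. You will have to make some edits to make the result look nicer.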
import requests
import bs4 as bs
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
column_names = []
data = []
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the html in the soup object
soup = bs.BeautifulSoup(page.content, 'html.parser')
#Gets the table html
table = soup.find_all('table')[0]
#gets the table header
thead = table.find_all('th')
#Puts the header into the column names list. We will use this for the dict keys later
for th in thead:
    column_names.append(th.get_text())
#gets all the rows of the table
rows = table.find_all('tr')
#I do not take the first row as it is the header
for row in rows[1:]:
    #Creates a list with each index being a different entry in the row.
    #The children of a row include the newline text nodes between cells,
    #so the cells sit at the odd indices (this depends on the page's markup).
    values = [r for r in row]
    #Gets each value that we care about
    rank = values[1].get_text()
    country = values[3].get_text()
    pop = values[5].get_text()
    date = values[7].get_text()
    source = values[9].get_text()
    temp_list = [rank,country,pop,date,source]
    #Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
    data.append(dict(zip(column_names, temp_list)))
print(column_names)
df = pd.DataFrame(data)
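
The cell text collected this way still carries trailing newlines, and a plain int() call like the one in the question leaves values such as '1,234' as strings because of the thousands separator. Below is a minimal sketch of one possible cleanup step. The helper name clean_cell and the sample row are hypothetical and the values are made up for illustration, not taken from the Wikipedia page.

```python
def clean_cell(text):
    """Strip surrounding whitespace; convert strings like '1,234' to int."""
    text = text.strip()
    # Drop thousands separators before checking whether the cell is numeric.
    digits = text.replace(",", "")
    if digits.isdecimal():
        return int(digits)
    # Anything that is not purely digits is returned as cleaned text.
    return text


# Hypothetical raw row, shaped like the output of get_text() on each cell.
raw_row = ["1\n", " Exampleland\n", "1,000\n", "1 Jul 2019\n", "Official estimate\n"]
cleaned = [clean_cell(cell) for cell in raw_row]
print(cleaned)
```

Applying a helper like this to temp_list before building the dictionary would give numeric columns that pandas can aggregate directly.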
TL;DR: pd.read_html(url) gives you a list of the tables on the page, which you can then index.
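
As a rough illustration of that one-liner, here is a sketch that runs pd.read_html on a small inline HTML string instead of the live page, so it works offline. The sample HTML and its values are made up, and the example assumes an HTML parser that pandas can use (for instance lxml) is installed.

```python
import io

import pandas as pd

# A tiny stand-in for a web page with two tables (hypothetical data).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>Aland</td><td>1,200</td></tr>
  <tr><td>2</td><td>Bland</td><td>850</td></tr>
</table>
<table>
  <tr><th>Note</th></tr>
  <tr><td>second table</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds.
tables = pd.read_html(io.StringIO(SAMPLE_HTML))
print(len(tables))

# Pick the table you need by its position in the list.
df = tables[0]
print(df)
```

For the real page you would pass the URL (or the downloaded HTML wrapped in io.StringIO) and then inspect the returned list to find which index holds the population table, since that position depends on the page's current layout.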