Python: How do I convert a Wikipedia table to a DataFrame?

python html python-3.x pandas dataframe

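I want to compute some statistics on data tables taken directly from specific web pages. A tutorial helped me build a DataFrame from a table on a web page. However, I want to do the same with geographic data, such as the population and GDP of several countries.

I found some tables on Wikipedia, but it does not work well and I do not understand why. Here is my code, which follows the tutorial mentioned above: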

import requests
import lxml.html as lh
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'


#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print ([len(T) for T in tr_elements[:12]])

#Create empty list
col=[]
i=0 #For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
    
    
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=10:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

print('Data gathering: done!')
print('Column length:')
print([len(C) for (title,C) in col])

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

print(df.head())

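The column lengths should not be zero. The format is different from the one in the tutorial. Do you know how to make this work? Or is there another data source that does not return this odd output format?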

Use pandas to read the data from the URL instead of using requests:

df = pd.read_html(url)

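As the print statement on line 16 of your code shows, the row corresponding to the first entry of the output has length 5, not 10. Your code therefore breaks out of the loop on its first iteration instead of filling col.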

Changing this statement:

if len(T)!=10:
    break

to

if len(T)!=5:
    break

should fix the problem.
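The effect of the hard-coded width can be reproduced without any scraping. The following is only a minimal sketch: the rows are made-up placeholder values, and `collect` is a hypothetical helper that imitates the loop from the question. It also shows one possible alternative, taking the expected width from the header row instead of hard-coding it.

```python
# Placeholder rows standing in for the parsed <tr> elements (made-up values).
rows = [
    ["Rank", "Country", "Population", "Date", "Source"],
    ["1", "Country A", "100", "2019", "Source A"],
    ["2", "Country B", "200", "2019", "Source B"],
]


def collect(rows, expected_width):
    """Imitate the question's loop: stop at the first row of unexpected width."""
    columns = [(name, []) for name in rows[0]]
    for row in rows[1:]:
        if len(row) != expected_width:
            break
        for (_, values), cell in zip(columns, row):
            values.append(cell)
    return columns


hard_coded = collect(rows, 10)             # width from the tutorial's table
from_header = collect(rows, len(rows[0]))  # width taken from the header row

print([len(values) for _, values in hard_coded])   # [0, 0, 0, 0, 0]
print([len(values) for _, values in from_header])  # [2, 2, 2, 2, 2]
```

Deriving the width from the header avoids editing the number every time the code is pointed at a different table.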

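On line 52 (col[i][1].append(data)) you are working with the contents of a tuple. Tuples are immutable in Python, although a list stored inside a tuple can still be appended to.

To avoid the issue, use a list.

Change line 25 to col.append([name, []])

Also, break exits the for loop entirely, which is why your lists end up with no data.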
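The difference between leaving the loop and skipping a single row can be seen in a small, self-contained sketch. The rows below are hypothetical placeholders, and `keep_rows` is a helper invented for this illustration, not part of the original code.

```python
# Hypothetical row data: the middle row is malformed (too few cells).
rows = [
    ["1", "Country A", "100"],
    ["Notes"],
    ["2", "Country B", "200"],
]
EXPECTED_WIDTH = 3


def keep_rows(rows, stop_on_mismatch):
    """Collect rows of the expected width, either stopping or skipping on a mismatch."""
    kept = []
    for row in rows:
        if len(row) != EXPECTED_WIDTH:
            if stop_on_mismatch:
                break     # abandons every remaining row
            continue      # skips only this row
        kept.append(row)
    return kept


with_break = keep_rows(rows, stop_on_mismatch=True)
with_continue = keep_rows(rows, stop_on_mismatch=False)
print(len(with_break), len(with_continue))  # 1 2
```

Whether continue is appropriate depends on the page: if the rows after the table belong to a different table, stopping may still be the desired behaviour.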

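When doing this kind of thing you also have to look at the HTML. The table is not formatted as nicely as one would hope. For example, it contains a lot of newlines as well as flag images. You can see how the formatting differs from row to row.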
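One way to cope with the stray newlines and spacing is to normalise each cell's text before converting it. This is a rough sketch only: `clean_cell` and `to_int_if_possible` are hypothetical helper names, and the raw strings are invented examples of what text extracted from such cells might look like.

```python
def clean_cell(text):
    """Collapse newlines, tabs and non-breaking spaces into single spaces."""
    return " ".join(text.split())


def to_int_if_possible(text):
    """Return an int for strings such as '1,234'; otherwise return the text unchanged."""
    digits = text.replace(",", "")
    return int(digits) if digits.isdecimal() else text


# Invented examples of raw cell text with leftover whitespace.
raw_cells = ["\n1\n", "\xa0Country A\n", "1,234,567\n"]
cleaned = [to_int_if_possible(clean_cell(cell)) for cell in raw_cells]
print(cleaned)  # [1, 'Country A', 1234567]
```

Removing the thousands separators first matters because int('1,234') raises a ValueError, which the bare except in the question would silently swallow.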

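It looks like you want a simple way to do this, so I looked into it. I have added a way to do it with bs4. You will have to make some edits to make the result look nicer.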

import requests
import bs4 as bs
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'

column_names = []
data = []
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the html in the soup object
soup = bs.BeautifulSoup(page.content, 'html.parser')
#Gets the table html
table = soup.find_all('table')[0]
#gets the table header
thead = table.find_all('th')
#Puts the header into the column names list. We will use this for the dict keys later
for th in thead:
    column_names.append(th.get_text())

#gets all the rows of the table
rows  = table.find_all('tr')
#Skip the first row, as it is the header
for row in rows[1:]:
    #Creates a list with each index being a different entry in the row. 
    values = [r for r in row]
    #Gets each values that we care about
    rank = values[1].get_text()
    country = values[3].get_text()
    pop = values[5].get_text()
    date = values[7].get_text()
    source = values[9].get_text()
    temp_list = [rank,country,pop,date,source]
    #Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
    data.append(dict(zip(column_names, temp_list)))
print(column_names)

df = pd.DataFrame(data)
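The odd indices (1, 3, 5, 7, 9) in the code above presumably work because iterating over a row also yields the whitespace strings between its cells. A variant that asks for the cell tags directly does not depend on that spacing. The sketch below is an assumption-laden illustration: the HTML is a made-up stand-in for the real page, with invented file names and values.

```python
import bs4 as bs
import pandas as pd

# Made-up HTML standing in for the Wikipedia table (flag images included).
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td><img src="flag_a.png" alt=""> Country A</td><td>100</td></tr>
  <tr><td colspan="3">Note row spanning the table</td></tr>
  <tr><td>2</td><td><img src="flag_b.png" alt=""> Country B</td><td>200</td></tr>
</table>
"""

soup = bs.BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Header names come from the first row only; strip=True drops the newlines.
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]

data = []
for row in rows[1:]:
    # Select only the cell tags, ignoring whitespace between them.
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    if len(cells) != len(column_names):
        continue  # skip rows that do not match the header width
    data.append(dict(zip(column_names, cells)))

df = pd.DataFrame(data)
print(df)
```

All values are still strings at this point, so numeric columns would need a separate conversion step.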

TL;DR: pd.read_html(url) gives you a list of the tables on the page, which you can then index.
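As a rough illustration of indexing into that list, the sketch below parses a small made-up HTML string instead of the live page, so it runs offline. It assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed. The `match` argument is a real `read_html` parameter that keeps only tables containing the given text.

```python
import io

import pandas as pd

# Made-up page with two tables, standing in for the Wikipedia article.
html = """
<table>
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Country A</td></tr>
  <tr><td>2</td><td>Country B</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Region</th></tr>
  <tr><td>X</td><td>North</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(html))
first = tables[0]

# Alternatively, keep only the tables whose text matches a string or regex.
regions = pd.read_html(io.StringIO(html), match="Region")[0]

print(len(tables))
print(first)
print(regions)
```

For the real page, the same calls would take the URL instead of the StringIO object, and the index of the wanted table has to be found by inspecting the list.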
