Extracting columns from HTML using Python (BeautifulSoup)

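I need to extract information from the page at the URL in the code below. I need Date, Price, Open, High, Low, and Change %. I am new to Python, so I am stuck at this step: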

import requests
from bs4 import BeautifulSoup
from datetime import datetime

url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)

soup=BeautifulSoup(r.content,'lxml')

g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})

d=[]

for item in g_data:
    Table_Values = item.find_all('tr')
    N=len(Table_Values)-1

    for n in range(N):
        k = (item.find_all('td', {'class':'first left bold noWrap'})[n].text)

        print(item.find_all('td', {'class':'first left bold noWrap'})[n].text)

I have a few questions here:

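The Price column can be tagged with either class redFont or class greenFont. How can I specify that I want the items tagged with class='redFont' and/or 'greenFont'? Change % can also have the classes redFont and greenFont. The other columns are tagged differently; how do I extract them?

Is there a way to extract columns from the table?

Ideally, I would like to have a DataFrame with the columns Date, Price, Open, High, Low, Change %.

Thanks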

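Here is a way to convert an HTML table into a nested list.

The solution is to find the specific table, then loop over each tr in that table and build a sublist from the text of every td in that tr. The code that does this is a nested list comprehension: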

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)

This gets all the data from the table:

[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
 ['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
 ['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
 ['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
 ['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
 ['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
 ['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
 ['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
 ['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
 ['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
 ['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
 ['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
 ['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
 ['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
 ['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
 ['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
 ['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
 ['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
 ['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
 ['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
 ['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
 ['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
 ['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
 ['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
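
On the question about cells tagged with redFont or greenFont: a minimal sketch, assuming the cells carry those class names as described in the question. Passing a list to class_ in find_all matches a cell that has any of the listed classes. The sample HTML and the variable names below are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on the table described in the question.
SAMPLE_HTML = """
<table id="curr_table">
  <tr><th>Date</th><th>Price</th><th>Open</th><th>Change %</th></tr>
  <tr>
    <td class="first left bold noWrap">Jun 08, 2016</td>
    <td class="redFont">3.3614</td>
    <td>3.4411</td>
    <td class="bold redFont">-2.34%</td>
  </tr>
  <tr>
    <td class="first left bold noWrap">Jun 05, 2016</td>
    <td class="greenFont">3.5280</td>
    <td>3.5280</td>
    <td class="bold greenFont">0.11%</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "curr_table"})

# A list passed to class_ matches cells that have ANY of the listed classes.
colored = [td.text for td in table.find_all("td", class_=["redFont", "greenFont"])]

# Selecting by position avoids depending on the colour classes at all:
# column 1 is Price in this sample, whatever class the cell happens to have.
prices = [row.find_all("td")[1].text for row in table.find_all("tr")[1:]]

print(colored)
print(prices)
```

Selecting by column position is usually the more robust of the two, because a cell's colour class changes with the sign of the daily move while its position in the row does not.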

If you want to convert this into a DataFrame, just get the table headers and add them:

import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint

url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]

#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]

#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)

print(df)

You will then get a DataFrame that looks like this:

            Date   Price    Open    High     Low Change %
0   Jun 08, 2016  3.3596  3.4411  3.4465  3.3584   -2.40%
1   Jun 07, 2016  3.4421  3.4885  3.5141  3.4401   -1.36%
2   Jun 06, 2016  3.4896  3.5265  3.5295  3.4840   -1.09%
3   Jun 05, 2016  3.5280  3.5280  3.5280  3.5280    0.11%
4   Jun 03, 2016  3.5240  3.5910  3.5947  3.5212   -1.91%
5   Jun 02, 2016  3.5926  3.6005  3.6157  3.5765   -0.22%
6   Jun 01, 2016  3.6007  3.6080  3.6363  3.5755   -0.29%
7   May 31, 2016  3.6111  3.5700  3.6383  3.5534    1.11%
8   May 30, 2016  3.5713  3.6110  3.6167  3.5675   -1.11%
9   May 27, 2016  3.6115  3.5824  3.6303  3.5792    0.81%
10  May 26, 2016  3.5825  3.5826  3.5857  3.5757   -0.03%
11  May 25, 2016  3.5836  3.5702  3.6218  3.5511    0.34%
12  May 24, 2016  3.5713  3.5717  3.5903  3.5417   -0.04%
13  May 23, 2016  3.5728  3.5195  3.5894  3.5121    1.49%
14  May 20, 2016  3.5202  3.5633  3.5663  3.5154   -1.24%
15  May 19, 2016  3.5644  3.5668  3.6197  3.5503   -0.11%
16  May 18, 2016  3.5683  3.4877  3.5703  3.4854    2.28%
17  May 17, 2016  3.4888  3.4990  3.5300  3.4812   -0.32%
18  May 16, 2016  3.5001  3.5309  3.5366  3.4944   -0.96%
19  May 13, 2016  3.5340  3.4845  3.5345  3.4630    1.39%
20  May 12, 2016  3.4855  3.4514  3.5068  3.4346    0.95%
21  May 11, 2016  3.4528  3.4755  3.4835  3.4389   -0.66%
22  May 10, 2016  3.4758  3.5155  3.5173  3.4623   -1.15%
23  May 09, 2016  3.5164  3.5010  3.6766  3.4906    0.40%
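
Every cell in a DataFrame built this way is still a string. A minimal sketch of one way to convert the columns to proper types, assuming the column names and the date format shown in the output above. The two sample rows are copied from that output, and the helper name convert_types is hypothetical.

```python
import pandas as pd

def convert_types(df):
    """Return a copy with dates parsed and numeric columns converted from strings."""
    out = df.copy()
    # Dates look like "Jun 08, 2016".
    out["Date"] = pd.to_datetime(out["Date"], format="%b %d, %Y")
    for col in ["Price", "Open", "High", "Low"]:
        out[col] = pd.to_numeric(out[col])
    # Strip the trailing percent sign; the values stay in percent units.
    out["Change %"] = pd.to_numeric(out["Change %"].str.rstrip("%"))
    return out

# Two rows copied from the output above, used here as sample input.
sample = pd.DataFrame(
    [["Jun 07, 2016", "3.4421", "3.4885", "3.5141", "3.4401", "-1.36%"],
     ["Jun 06, 2016", "3.4896", "3.5265", "3.5295", "3.4840", "-1.09%"]],
    columns=["Date", "Price", "Open", "High", "Low", "Change %"],
)
typed = convert_types(sample)
print(typed.dtypes)
```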

I have already answered how to parse the table from this site, but since you want a DataFrame, you can simply use pandas.read_html.
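
A minimal sketch of that approach, assuming the page is fetched with requests and the table has the id curr_table as in the answer above. The helper names html_to_frame and fetch_frame are hypothetical, and the inline sample HTML only stands in for the real page so the parsing step can be tried offline. read_html also needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd
import requests

def html_to_frame(html, table_id="curr_table"):
    """Parse the table with the given id out of an HTML string."""
    # read_html returns a list of DataFrames, one per matching table.
    return pd.read_html(StringIO(html), attrs={"id": table_id})[0]

def fetch_frame(url, table_id="curr_table"):
    """Download the page with requests, then hand the HTML to read_html."""
    r = requests.get(url)
    r.raise_for_status()
    return html_to_frame(r.text, table_id)

# Stand-in markup (not the real page) for an offline check of the parsing step.
SAMPLE_HTML = """
<table id="curr_table">
  <tr><th>Date</th><th>Price</th><th>Change %</th></tr>
  <tr><td>Jun 07, 2016</td><td>3.4421</td><td>-1.36%</td></tr>
  <tr><td>Jun 06, 2016</td><td>3.4896</td><td>-1.09%</td></tr>
</table>
"""

df = html_to_frame(SAMPLE_HTML)
print(df)

# Against the live site this would be (assuming the page still serves that table):
# df = fetch_frame("http://www.investing.com/currencies/usd-brl-historical-data")
```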

This will give you:

            Date   Price    Open    High     Low Change %
0   Jun 08, 2016  3.3609  3.4411  3.4465  3.3584   -2.36%
1   Jun 07, 2016  3.4421  3.4885  3.5141  3.4401   -1.36%
2   Jun 06, 2016  3.4896  3.5265  3.5295  3.4840   -1.09%
3   Jun 05, 2016  3.5280  3.5280  3.5280  3.5280    0.11%
4   Jun 03, 2016  3.5240  3.5910  3.5947  3.5212   -1.91%
5   Jun 02, 2016  3.5926  3.6005  3.6157  3.5765   -0.22%
6   Jun 01, 2016  3.6007  3.6080  3.6363  3.5755   -0.29%
7   May 31, 2016  3.6111  3.5700  3.6383  3.5534    1.11%
8   May 30, 2016  3.5713  3.6110  3.6167  3.5675   -1.11%
9   May 27, 2016  3.6115  3.5824  3.6303  3.5792    0.81%
10  May 26, 2016  3.5825  3.5826  3.5857  3.5757   -0.03%
11  May 25, 2016  3.5836  3.5702  3.6218  3.5511    0.34%
12  May 24, 2016  3.5713  3.5717  3.5903  3.5417   -0.04%
13  May 23, 2016  3.5728  3.5195  3.5894  3.5121    1.49%
14  May 20, 2016  3.5202  3.5633  3.5663  3.5154   -1.24%
15  May 19, 2016  3.5644  3.5668  3.6197  3.5503   -0.11%
16  May 18, 2016  3.5683  3.4877  3.5703  3.4854    2.28%
17  May 17, 2016  3.4888  3.4990  3.5300  3.4812   -0.32%
18  May 16, 2016  3.5001  3.5309  3.5366  3.4944   -0.96%
19  May 13, 2016  3.5340  3.4845  3.5345  3.4630    1.39%
20  May 12, 2016  3.4855  3.4514  3.5068  3.4346    0.95%
21  May 11, 2016  3.4528  3.4755  3.4835  3.4389   -0.66%
22  May 10, 2016  3.4758  3.5155  3.5173  3.4623   -1.15%
23  May 09, 2016  3.5164  3.5010  3.6766  3.4906    0.40%

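You can usually pass the URL directly, but for this particular site we get a 403 error with urllib2 (the library read_html uses), so we need to use requests to fetch the HTML.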

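Thanks, the code looks very compact, but I get the following error message: "ImportError: lxml not found, please install it". When I pip install lxml, I get "Requirement already up-to-date".

Are you using Python 2 or 3?

Python 3. I am using Anaconda (not sure if that matters). I followed your first link and used it to extract a different date range. I will try to combine these routines to extract the data as a DataFrame.

Then install lxml with pip3. You can also pass flavor="bs4".

No problem, you can specify bs4, lxml, html5lib or xml.

Hi, it works fine. One question: why do you load pprint? I am not familiar with this module, and it does not seem to be used here.

Oh, you are right. I used it in the first answer to display the list nicely; it is not used in the version that converts the HTML to a DataFrame.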