
Pandas: Scraping a table into a DataFrame with BeautifulSoup


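I am trying to scrape data from a coin catalog.

I need to get the table on one of its pages into a DataFrame.

So far, I have the following code: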

import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/view/45518').read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    print(row)                    # I need to save this data instead of printing it 
It produces the following output:

[]
['', '', '1882', '', '108,000', 'UNC', '—']
[' ', '', '1883', '', '786,000', 'UNC', '~ $3.99']
[' ', " \n\n\n\n\t\t\t\t\t\t\t$('subGraph55337').on('click', function(event) {\n\t\t\t\t\t\t\t\tLightview.show({\n\t\t\t\t\t\t\t\t\thref : '/en/catalog/ajax/subgraph?id=55337',\n\t\t\t\t\t\t\t\t\trel : 'ajax',\n\t\t\t\t\t\t\t\t\toptions : {\n\t\t\t\t\t\t\t\t\t\tautosize : true,\n\t\t\t\t\t\t\t\t\t\ttopclose : true,\n\t\t\t\t\t\t\t\t\t\tajax : {\n\t\t\t\t\t\t\t\t\t\t\tevalScripts : true\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t} \n\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\tevent.stop();\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t", '1884', '', '4,604,000', 'UNC', '~ $2.08–$4.47']
[' ', '', '1885', '', '1,314,000', 'UNC', '~ $3.20']
['', '', '1886', '', '444,000', 'UNC', '—']
[' ', '', '1888', '', '413,000', 'UNC', '~ $2.88']
[' ', '', '1889', '', '568,000', 'UNC', '~ $2.56']
[' ', " \n\n\n\n\t\t\t\t\t\t\t$('subGraph55342').on('click', function(event) {\n\t\t\t\t\t\t\t\tLightview.show({\n\t\t\t\t\t\t\t\t\thref : '/en/catalog/ajax/subgraph?id=55342',\n\t\t\t\t\t\t\t\t\trel : 'ajax',\n\t\t\t\t\t\t\t\t\toptions : {\n\t\t\t\t\t\t\t\t\t\tautosize : true,\n\t\t\t\t\t\t\t\t\t\ttopclose : true,\n\t\t\t\t\t\t\t\t\t\tajax : {\n\t\t\t\t\t\t\t\t\t\t\tevalScripts : true\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t} \n\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\tevent.stop();\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t", '1890', '', '2,137,000', 'UNC', '~ $1.28–$4.79']
['', '', '1891', '', '605,000', 'UNC', '—']
[' ', '', '1892', '', '205,000', 'UNC', '~ $4.47']
[' ', '', '1893', '', '754,000', 'UNC', '~ $4.79']
[' ', '', '1894', '', '532,000', 'UNC', '~ $3.20']
[' ', '', '1895', '', '423,000', 'UNC', '~ $2.40']
['', '', '1896', '', '174,000', 'UNC', '—']
But when I try to save it to a DataFrame and export it to Excel, the result contains only the last row:

         0
0         
1         
2     1896
3         
4  174,000
5      UNC
6        —
Try this:

l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
pd.DataFrame(l, columns=["A", "B", ...])
Try:

Output:

   Year  Mintage Quality    Price
0  1882  108,000     UNC        —
1  1883  786,000     UNC  ~ $4.03
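The code for this answer is not preserved here. The following is only a minimal sketch of an approach that yields a table of this shape. It is pieced together from the fragments quoted in the follow-up below (`res`, the `if row:` check, and the filtered list comprehension). The inline `HTML` string is a hypothetical stand-in for the catalog page so that the sketch runs offline, and the column names are taken from the output above.

```python
import bs4 as bs
import pandas as pd

# Hypothetical stand-in for the catalog page (assumption, not the real markup).
HTML = """
<table class="subs noBorders evenRows">
  <tr><th></th><th></th><th>Year</th><th></th><th>Mintage</th><th>Quality</th><th>Price</th></tr>
  <tr><td></td><td></td><td>1882</td><td></td><td>108,000</td><td>UNC</td><td>—</td></tr>
  <tr><td> </td><td></td><td>1883</td><td></td><td>786,000</td><td>UNC</td><td>~ $3.99</td></tr>
</table>
"""

soup = bs.BeautifulSoup(HTML, "html.parser")
table = soup.find("table")

res = []  # accumulates one list per table row
for tr in table.find_all("tr"):
    # Strip whitespace from each cell and drop cells that end up empty.
    cells = [td.text.strip() for td in tr.find_all("td")]
    row = [c for c in cells if c]
    if row:  # skips the header row, which has <th> cells only
        res.append(row)

df = pd.DataFrame(res, columns=["Year", "Mintage", "Quality", "Price"])
print(df)
```

The key difference from the code in the question is that each row is appended to a list inside the loop, and the DataFrame is built once after the loop finishes.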

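Just a heads up: this part of Rakesh's code means that only HTML rows containing text end up in the DataFrame, because a row is not appended if it is an empty list: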

if row:
    res.append(row)
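That was a problem in my use case, where I wanted to compare the row indices of the HTML table and the DataFrame later. I only needed to change it to: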

res.append(row)
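Also, if a cell in a row is empty, it is not included, which then messes up the columns. So I changed

row = [tr.text.strip() for tr in td if tr.text.strip()]

to

row = [d.text.strip() for d in td]

But other than that, it works for me. Thanks :)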
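To illustrate the point about empty cells, here is a small sketch comparing the two list comprehensions. The two-row `HTML` table and the column names are hypothetical and chosen only to make the misalignment visible.

```python
import bs4 as bs
import pandas as pd

# Hypothetical table: row 1 has an empty last cell, row 2 an empty second cell.
HTML = """
<table>
  <tr><td>1882</td><td>108,000</td><td>UNC</td><td></td></tr>
  <tr><td>1883</td><td></td><td>UNC</td><td>~ $3.99</td></tr>
</table>
"""

soup = bs.BeautifulSoup(HTML, "html.parser")
rows = soup.find_all("tr")

# Dropping empty cells shifts the remaining values to the left.
filtered = [
    [td.text.strip() for td in tr.find_all("td") if td.text.strip()]
    for tr in rows
]

# Keeping empty cells preserves each value's column position.
kept = [[td.text.strip() for td in tr.find_all("td")] for tr in rows]

print(pd.DataFrame(filtered))
df = pd.DataFrame(kept, columns=["Year", "Mintage", "Quality", "Price"])
print(df)
```

In the filtered version, the `UNC` of the second row lands in the second column, where the mintage belongs. In the unfiltered version it stays in the third column and the empty cell is kept as an empty string.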

pandas already has a built-in way to convert tables on the web into DataFrames:

table = soup.find_all('table')
df = pd.read_html(str(table))[0]
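A self-contained variant of the same idea, as a sketch: it parses a hypothetical inline `HTML` string rather than the real page, and assumes an HTML parser that pandas can use (such as `lxml`) is installed. The string is wrapped in `StringIO` because recent pandas versions warn when a literal HTML string is passed to `read_html`.

```python
from io import StringIO

import pandas as pd

# Hypothetical table markup standing in for the scraped page.
HTML = """<table>
<tr><th>Year</th><th>Mintage</th><th>Quality</th></tr>
<tr><td>1882</td><td>108,000</td><td>UNC</td></tr>
<tr><td>1883</td><td>786,000</td><td>UNC</td></tr>
</table>"""

# read_html returns a list with one DataFrame per <table> found.
tables = pd.read_html(StringIO(HTML))
df = tables[0]
print(df)
```

Note that `read_html` treats `,` as a thousands separator by default, so the mintage column comes back as integers rather than strings.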

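How did you save it to the DataFrame?

Good catch, I didn't mention that. I just added two more lines: df = pd.DataFrame(row) and df.to_excel('coins.xlsx').

The data is being overwritten in the for loop.

You can also use df['col'].str.strip('\n') to remove the \n.

Hi, Rakesh. Thanks for your answer. It works great for me too. I chose phi's answer because it helped me with my work yesterday :) Cheers!

This is awesome! Thank you.

This should be the accepted answer. There is no point in using BeautifulSoup in this use case.

This seems right, but it behaves inconsistently and loses a lot of table rows.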