Pandas read_html is creating a DataFrame with 2x the number of columns
I created a DataFrame by pulling some data from a site:
import datetime as dt
import pandas as pd

df = pd.read_html('http://finviz.com/insidertrading.ashx?tc=1', header=0)[4].set_index('Date')
Then I wrote it to an HTML file, using the date as the file name:
today_date = dt.date.today().isoformat()
html_name = 'Insider Trading/{}_buys.html'.format(today_date)
df.to_html(html_name)
When I open the HTML file, it looks like this (but with borders around the rows and columns). Very clean, no errors:
Ticker Owner Relationship Transaction Cost #Shares Value ($) #Shares Total SEC Form 4
Date
Sep 30 PIH Fundamental Global Investors, 10% Owner Buy 6.25 700 4375 352202 Sep 30 06:28 PM
Sep 28 PIH Fundamental Global Investors, 10% Owner Buy 6.05 36400 220220 351502 Sep 30 06:28 PM
Sep 30 FSTR Vizi Bradley Director Buy 12.00 14419 173028 801209 Sep 30 05:21 PM
Sep 29 FSTR Vizi Bradley Director Buy 12.00 11292 135504 786790 Sep 30 05:21 PM
Sep 28 FSTR Vizi Bradley Director Buy 11.83 9500 112385 775498 Sep 30 05:21 PM
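In the rendering above, Date sits on its own line below the column names. The following is a minimal sketch, using a few made-up rows rather than the real scraped data, of how to_html writes a second header row when the index has a name, and how moving the index back into a regular column with reset_index and passing index=False gives a single header row instead. The helper count_rows is hypothetical and exists only for this illustration. Whether the extra header row explains the doubled columns on read-back depends on the pandas version, so treat this as one possible cause rather than a confirmed diagnosis.

```python
import pandas as pd

# Made-up rows standing in for the scraped table (illustrative only).
df = pd.DataFrame(
    {
        "Date": ["Sep 30", "Sep 28"],
        "Ticker": ["PIH", "PIH"],
        "Cost": [6.25, 6.05],
    }
).set_index("Date")

# With a named index, to_html writes two header rows:
# one for the column names and one for the index name.
nested_html = df.to_html()

# Moving the index back into a regular column and suppressing the
# index column yields a single header row.
flat_html = df.reset_index().to_html(index=False)


def count_rows(html: str) -> int:
    """Count <tr> elements in an HTML string (hypothetical helper)."""
    return html.count("<tr")


# The first count includes one more header row than the second.
print(count_rows(nested_html))
print(count_rows(flat_html))
```

Writing the flat version to disk (for example with df.reset_index().to_html(html_name, index=False)) keeps Date as an ordinary column, which can be turned back into the index after reading.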
Now, when I try to read the HTML file back into a DataFrame with the following code:
import pandas as pd
df = pd.read_html('Insider Trading/2016-09-30_buys.html')[0]
Unnamed: 0 Ticker Owner Relationship Transaction \
0 Sep 30 PIH Fundamental Global Investors, 10% Owner Buy
1 Sep 28 PIH Fundamental Global Investors, 10% Owner Buy
2 Sep 30 FSTR Vizi Bradley Director Buy
3 Sep 29 FSTR Vizi Bradley Director Buy
4 Sep 28 FSTR Vizi Bradley Director Buy
Cost #Shares Value ($) #Shares Total SEC Form 4 Date \
0 6.25 700 4375 352202 Sep 30 06:28 PM NaN
1 6.05 36400 220220 351502 Sep 30 06:28 PM NaN
2 12.00 14419 173028 801209 Sep 30 05:21 PM NaN
3 12.00 11292 135504 786790 Sep 30 05:21 PM NaN
4 11.83 9500 112385 775498 Sep 30 05:21 PM NaN
Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
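(Reading the HTML file returns only one DataFrame, which is why I use [0].)
I get twice as many columns as the original, 20 instead of 10, and the extra 10 columns have names like "Unnamed: 11".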
What could I be doing wrong?
Based on a suggestion, I also tried this code:
df = pd.read_html('http://finviz.com/insidertrading.ashx?tc=1', header = 0, attrs = {'class': 'body-table'})[0].set_index('SEC Form 4')
But I seem to hit the same problem.
If you look at the source of that page, you will see that it has tables nested in tables nested in tables. Without digging further, I would suspect that is the cause. You can simply exclude the columns where all values are null.
Thanks for the clarification. I thought I had made some silly coding mistake.
By specifying an attribute you can also get the data "cleaner" from the start. For example, attrs={'class': 'body-table'}, which appears to be the table you are trying to read.
Thanks, I will try that now. The strange thing is that when I first read the HTML page into a DataFrame, the DataFrame was clean. It was only after writing it to an HTML file and then reading that file back into a DataFrame that I got this mess.
Actually, even after adding attrs={'class': 'body-table'}, I get the same problem. I will update the question.
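The suggestion in the comments to exclude the columns where all values are null can be sketched as follows. The frame below is made up to imitate the shape of the messy result (a few real-looking columns plus all-NaN ones); it is not the actual output. Renaming "Unnamed: 0" back to Date is an assumption based on the output shown in the question, where the dates end up in that column.

```python
import numpy as np
import pandas as pd

# Made-up frame imitating the messy result (illustrative only).
messy = pd.DataFrame(
    {
        "Unnamed: 0": ["Sep 30", "Sep 28"],
        "Ticker": ["PIH", "PIH"],
        "Cost": [6.25, 6.05],
        "Date": [np.nan, np.nan],
        "Unnamed: 11": [np.nan, np.nan],
    }
)

clean = (
    messy.dropna(axis=1, how="all")          # drop columns that are entirely NaN
    .rename(columns={"Unnamed: 0": "Date"})  # assumed: dates landed in this column
    .set_index("Date")
)

print(clean)
```

This is a cleanup after the fact rather than a fix for the cause, so it would also drop any genuine column that happens to be completely empty.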