Creating a DataFrame from a scraped JavaScript table list
I used Selenium to scrape a dynamic JavaScript table containing position and salary information for federal employees. (Note: this is all public-domain data, so no worries re: personal information.) I'm trying to get it into a pandas DataFrame for analysis. My problem is that the data I get out of Selenium is a list, which prints as:
[u'DOE,JON'], [u'14'], [u'SK'], [u'$176,571.00'], [u'$2,000.00'], [u'SECURITIES AND EXCHANGE COMMISSION'], [u'WASHINGTON'], [u'GENERAL ATTORNEY'], [u'2012']], ...
What I would like to end up with is a DataFrame that handles an arbitrary number of records.
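I've tried converting this list to a dictionary, using the zip() function, passing the column names as a tuple and the data as a list, and so on, all to no avail, although it has been a nice tour of Python features. Once the data has been retrieved, what should the next step be? Or should I be reading the data in differently?
At the moment, the scraper code is: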
from selenium import webdriver
path_to_chromedriver = '/Users/xxx/Documents/webdriver/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
url = 'http://www.fedsdatacenter.com/federal-pay-rates/index.php'
browser.get(url)
inputAgency = browser.find_element_by_id('a')
inputYear = browser.find_element_by_id('y')
# Fill in the search form: agency name, and 'All' for the year
inputAgency.send_keys('SECURITIES AND EXCHANGE COMMISSION')
inputYear.send_keys('All')
# Submit the form, then pick the 4th option of the rows-per-page dropdown
browser.find_element_by_css_selector('input[type="submit"]').click()
browser.find_element_by_xpath('//*[@id="example_length"]/label/select/option[4]').click()
# Collect the innerHTML of every cell in the results table
SMRtable = browser.find_element_by_id('example')
scrapedData = []
for td in SMRtable.find_elements_by_xpath('.//td'):
    scrapedData.append([td.get_attribute('innerHTML')])
    print td.get_attribute('innerHTML')
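For the flat list that the scraper produces, one possible next step is to group the cells into fixed-width rows. The sketch below is only an illustration: the helper name cells_to_rows is hypothetical, the record width of 9 is inferred from the sample output in the question, and the column names are the ones used in the answer in this thread.

```python
# Column names as used in the answer; the 9-field record width is an
# assumption inferred from the sample output printed in the question.
COLUMNS = ['NAME', 'GRADE', 'SCALE', 'SALARY', 'BONUS',
           'AGENCY', 'LOCATION', 'POSITION', 'YEAR']


def cells_to_rows(cells, width=len(COLUMNS)):
    """Group a flat list of one-element lists into rows of `width` values."""
    flat = [cell[0] for cell in cells]  # unwrap ['DOE,JON'] -> 'DOE,JON'
    if len(flat) % width != 0:
        raise ValueError('cell count is not a multiple of the row width')
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# One record, shaped like the scraper output shown in the question.
scrapedData = [['DOE,JON'], ['14'], ['SK'], ['$176,571.00'], ['$2,000.00'],
               ['SECURITIES AND EXCHANGE COMMISSION'], ['WASHINGTON'],
               ['GENERAL ATTORNEY'], ['2012']]

rows = cells_to_rows(scrapedData)
records = [dict(zip(COLUMNS, row)) for row in rows]
print(records[0]['NAME'])
```

With pandas available, pd.DataFrame(rows, columns=COLUMNS) would then turn the grouped rows into a DataFrame.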
You can do this with pandas alone.
So first, open the page's source (View Page Source).
Look at lines 14807-14826:
// data table initialization
$(document).ready(function() {
    $('#example').dataTable( {
        "bPaginate": true,
        "bFilter": false,
        "bProcessing": true,
        "bServerSide": true,
        "aoColumns": [
            null,
            null,
            null,
            { "sType": 'currency' }, // set currency columns to allow sorting
            { "sType": 'currency' }, // set second column to currency to allow sorting
            null,
            null,
            null,
            null
        ],
        "sAjaxSource": "output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all"
    } );
} );
This means the page uses DataTables, and the data is loaded as JSON from an Ajax source.
So instead of scraping the HTML, you can get clean, tidy JSON from:
output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all
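The final link (with %20 in place of each space) is:
http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all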
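The percent-encoding of the spaces does not have to be done by hand. As a minimal sketch, assuming Python 3's urllib.parse (the thread itself uses Python 2.7), the query string can be assembled like this. The helper name build_url is hypothetical, and the parameters n, l and o are simply left empty as in the page's sAjaxSource setting.

```python
from urllib.parse import quote, urlencode

BASE = 'http://www.fedsdatacenter.com/federal-pay-rates/output.php'


def build_url(agency, year='all'):
    """Build the Ajax source URL, encoding spaces as %20 rather than '+'."""
    # n, l and o are kept empty, mirroring the sAjaxSource value in the page.
    params = [('n', ''), ('a', agency), ('l', ''), ('o', ''), ('y', year)]
    # quote_via=quote makes urlencode emit %20 for spaces.
    return BASE + '?' + urlencode(params, quote_via=quote)


url = build_url('SECURITIES AND EXCHANGE COMMISSION')
print(url)
```

By default urlencode would encode the spaces as plus signs, which is why quote_via is passed explicitly here.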
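The response is JSON, so you can parse it with pandas read_json (see the listing at the end), which gives a DataFrame df with an aaData column.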
Then build a new DataFrame from the aaData column, using a list comprehension:
df1 = pd.DataFrame([ x for x in df['aaData'] ])
Set the column names:
df1.columns = ['NAME','GRADE','SCALE','SALARY','BONUS','AGENCY','LOCATION','POSITION','YEAR']
print df1.head()
                  NAME GRADE SCALE       SALARY  BONUS  \
0      ZUVER,SHAHEEN H    14    SK  $170,960.00  $0.00
1           ZUR,MIA C.    14    SK  $164,875.00  $0.00
2  ZUNDEL,JENNET LEONG    14    SK  $204,638.00  $0.00
3     ZUKOWSKI,DAVID W    04    SK   $38,382.00  $0.00
4              ZOU,FAN    14    SK  $166,650.00  $0.00

                               AGENCY       LOCATION  \
0  SECURITIES AND EXCHANGE COMMISSION     WASHINGTON
1  SECURITIES AND EXCHANGE COMMISSION     WASHINGTON
2  SECURITIES AND EXCHANGE COMMISSION  SAN FRANCISCO
3  SECURITIES AND EXCHANGE COMMISSION         BOSTON
4  SECURITIES AND EXCHANGE COMMISSION     WASHINGTON

                                   POSITION  YEAR
0                          GENERAL ATTORNEY  2014
1                          GENERAL ATTORNEY  2014
2                                ACCOUNTING  2014
3  ADMIN AND OFFICE SUPPORT STUDENT TRAINEE  2014
4         INFORMATION TECHNOLOGY MANAGEMENT  2014
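As an alternative to the list comprehension followed by a separate column assignment, the aaData rows can be passed to the DataFrame constructor together with the column names. The sketch below runs offline on a trimmed stand-in for the server response: the two rows are copied from the output above, while the exact JSON types of the other fields are an assumption. The SALARY_NUM column and the to_number helper are hypothetical additions, shown only because the currency strings are awkward to analyse as text.

```python
import json

import pandas as pd

COLUMNS = ['NAME', 'GRADE', 'SCALE', 'SALARY', 'BONUS',
           'AGENCY', 'LOCATION', 'POSITION', 'YEAR']

# Trimmed stand-in for the JSON response (assumed shape, two rows only).
sample = json.dumps({
    'sEcho': 0,
    'iTotalRecords': '7072900',
    'iTotalDisplayRecords': '19919',
    'aaData': [
        ['ZUVER,SHAHEEN H', '14', 'SK', '$170,960.00', '$0.00',
         'SECURITIES AND EXCHANGE COMMISSION', 'WASHINGTON',
         'GENERAL ATTORNEY', '2014'],
        ['ZUKOWSKI,DAVID W', '04', 'SK', '$38,382.00', '$0.00',
         'SECURITIES AND EXCHANGE COMMISSION', 'BOSTON',
         'ADMIN AND OFFICE SUPPORT STUDENT TRAINEE', '2014'],
    ],
})

payload = json.loads(sample)
# Each inner list of aaData becomes one row; names are attached in the same call.
df1 = pd.DataFrame(payload['aaData'], columns=COLUMNS)


def to_number(series):
    """Strip '$' and ',' from currency strings and convert to float."""
    return series.str.replace(r'[$,]', '', regex=True).astype(float)


# Hypothetical extra column, kept separate so the original text is preserved.
df1['SALARY_NUM'] = to_number(df1['SALARY'])
print(df1[['NAME', 'SALARY', 'SALARY_NUM']])
```

Keeping GRADE as a string preserves values such as 04, which would lose the leading zero if converted to an integer.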
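Great, thanks! I still need to get a better handle on JavaScript. Actually, I've found another limitation, which means scraping may still be necessary: although "iTotalDisplayRecords" is "19919", the resulting DataFrame contains only 100 rows, matching the maximum 100-row option of the row select element. Do you know of any way around this?
You can try this URL: http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all&sEcho=4&iColumns=9&sColumns=&iDisplayStart=0&iDisplayLength=100000
Maybe also try changing the last number, 100000.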
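If the server does not accept one very large iDisplayLength, a different approach is to request the records page by page by stepping iDisplayStart. The sketch below only generates the URLs and makes no request. It assumes Python 3 and that the server honours these parameters, which is untested here. The helper name page_urls is hypothetical, and the total of 19919 is the iTotalDisplayRecords value quoted above.

```python
from urllib.parse import quote, urlencode

BASE = 'http://www.fedsdatacenter.com/federal-pay-rates/output.php'


def page_urls(agency, total, page_size=100, year='all'):
    """Return one URL per page of `page_size` records, covering `total` records."""
    urls = []
    for start in range(0, total, page_size):
        # Same parameters as the URL suggested above, with a moving start offset.
        params = [('n', ''), ('a', agency), ('l', ''), ('o', ''), ('y', year),
                  ('sEcho', 4), ('iColumns', 9), ('sColumns', ''),
                  ('iDisplayStart', start), ('iDisplayLength', page_size)]
        urls.append(BASE + '?' + urlencode(params, quote_via=quote))
    return urls


urls = page_urls('SECURITIES AND EXCHANGE COMMISSION', total=19919)
print(len(urls))
print(urls[-1])
```

Each URL could then be loaded with read_json and the aaData rows of all pages combined into a single DataFrame.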
import pandas as pd
df = pd.read_json("http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all")
print df.head()
                                              aaData  iTotalDisplayRecords  \
0  [ZUVER,SHAHEEN H, 14, SK, $170,960.00, $0.00, ...                 19919
1  [ZUR,MIA C., 14, SK, $164,875.00, $0.00, SECUR...                 19919
2  [ZUNDEL,JENNET LEONG, 14, SK, $204,638.00, $0....                 19919
3  [ZUKOWSKI,DAVID W, 04, SK, $38,382.00, $0.00, ...                 19919
4  [ZOU,FAN, 14, SK, $166,650.00, $0.00, SECURITI...                 19919

   iTotalRecords  sEcho
0        7072900      0
1        7072900      0
2        7072900      0
3        7072900      0
4        7072900      0