Scraping a table generated by ASP and JavaScript using Python 2.7, BeautifulSoup, and Selenium


javascript asp.net python-2.7 selenium


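I need to scrape a JavaScript-generated table and write some of the data to a CSV file. I am limited to Python 2.7, Beautiful Soup, and/or Selenium. The closest code to what I need is in another question, but what I get back is an empty list. The site I am looking at is:

Source:

For example, one of the records looks like this: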

 <tr>
 <td class="flagmay"><a href="javascript:dataWin('STAGE','119901','Colorado River at Winchell')" class="tablink">Colorado River at Winchell</a></td>
<td align="left" class="flagmay">Jan 12 2016  5:55PM</td><td align="right" class="flagmay">2.48</td><td align="right" class="flagmay">4.7</td></tr>



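What I am trying to write to the CSV should look like this:

Station | Station ID | Time | Stage | Flow

Colorado River at Winchell | 119901 | Jan 12 2016 5:55PM | 2.48 | 4.7

Can anyone give me some pointers? Thanks in advance.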

Try this:

I am using the pandas, requests, and BeautifulSoup4 libraries, and I have tested that the code works with Python 2.7.11 and 3.5.1.

import requests
import pandas
from bs4 import BeautifulSoup

url = 'http://hydromet.lcra.org/repstage.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')

# convert the html table data into pandas data frames, skip the heading so that it is easier to add a column
df = pandas.read_html(str(tables[1]), skiprows={0}, flavor="bs4")[0]

# loop over the table to find out station id and store it in a dict obj
a_links = soup.find_all('a', attrs={'class': 'tablink'})
stnid_dict = {}
for a_link in a_links:
    cid = ((a_link['href'].split("dataWin('STAGE','"))[1].split("','")[0])
    stnid_dict[a_link.text] = cid

# add the station id column from the stnid_dict object above
df.loc[:, (len(df.columns)+1)] = df.loc[:, 0].apply(lambda x: stnid_dict[x])
df.columns = ['Station', 'Time', 'Stage', 'Flow', 'StationID']

# added custom order of columns to add in csv, and to skip row numbers in the output file
df.to_csv('station.csv', columns=['Station', 'StationID', 'Time', 'Stage', 'Flow'], index=False)

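This script creates a CSV file named station.csv in the current working directory, which is the script's own directory if you run it from there.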
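Since the asker is limited to Beautiful Soup and may not be able to install pandas, the same idea can be sketched with only BeautifulSoup and the standard csv module. This is a minimal sketch, not the answer author's method: `extract_rows`, `write_csv`, `STATION_ID_RE`, and `SAMPLE_HTML` are hypothetical names, the sample row is copied from the question, and the code assumes every station row has a link with class tablink followed by the time, stage, and flow cells. It runs on the inline sample rather than fetching the live page.

```python
import csv
import re

from bs4 import BeautifulSoup

# Sample row copied from the question; a real run would pass the fetched page text.
SAMPLE_HTML = """
<table>
<tr>
<td class="flagmay"><a href="javascript:dataWin('STAGE','119901','Colorado River at Winchell')" class="tablink">Colorado River at Winchell</a></td>
<td align="left" class="flagmay">Jan 12 2016  5:55PM</td><td align="right" class="flagmay">2.48</td><td align="right" class="flagmay">4.7</td></tr>
</table>
"""

# Captures the second quoted argument of dataWin(...), i.e. the station ID.
STATION_ID_RE = re.compile(r"dataWin\('[^']*',\s*'([^']*)'")


def extract_rows(html):
    """Return one [station, station_id, time, stage, flow] list per station link."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for link in soup.find_all('a', class_='tablink'):
        match = STATION_ID_RE.search(link.get('href', ''))
        table_row = link.find_parent('tr')
        if match is None or table_row is None:
            continue  # not a station row; skip it instead of raising
        # Collapse runs of whitespace inside each cell to single spaces.
        cells = [' '.join(td.get_text().split()) for td in table_row.find_all('td')]
        if len(cells) < 4:
            continue  # row does not have the assumed four columns
        station, time, stage, flow = cells[:4]
        rows.append([station, match.group(1), time, stage, flow])
    return rows


def write_csv(rows, path):
    """Write the header and rows to path (hypothetical helper)."""
    # Python 3 form; on Python 2.7 open the file with mode 'wb' and drop newline=''.
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Station', 'StationID', 'Time', 'Stage', 'Flow'])
        writer.writerows(rows)


if __name__ == '__main__':
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

Starting from the links and walking up to the enclosing row avoids depending on the table's position in the page, so a layout change is more likely to give an empty result than an exception.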

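If you are not planning to test the UI, I don't understand what Selenium is for in this whole story. Why not fetch the page with urllib (or an equivalent) and then parse it with BeautifulSoup? What does your soup call look like?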

url = ("http://hydromet.lcra.org/repframe.html")
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
soup.prettify()
table1 = soup.find("table", class_="tablink")

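The CSV file is empty, and I don't know how to get the station ID from this part of the HTML: href="javascript:dataWin('STAGE','119901','Colorado River at Winchell')"

Shouldn't the URL be

url = ("http://hydromet.lcra.org/repstage.asp")

?

First, let me say that this is only my second attempt at scraping, so thank you for your patience. If I change the URL, I get IndexError: list index out of range.

Alan, thank you very much. If I get permission to install pandas, I will be able to test it.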
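On the soup.find call quoted in the comments: in the sample record from the question, class="tablink" sits on the a element, not on a table, so a lookup for a table with that class has nothing to match. The sketch below illustrates this on an inline fragment based on that record. It also shows one plausible source of the IndexError, namely indexing a result list that is shorter than expected; that is an assumption, since the failing line was not posted. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Fragment based on the record shown in the question.
HTML = """<table><tr>
<td class="flagmay"><a href="javascript:dataWin('STAGE','119901','Colorado River at Winchell')" class="tablink">Colorado River at Winchell</a></td>
<td align="right" class="flagmay">2.48</td>
</tr></table>"""

soup = BeautifulSoup(HTML, 'html.parser')

# No <table> carries class="tablink", so this lookup returns None.
table = soup.find('table', class_='tablink')

# The class is on the <a> elements, so search for those instead.
links = soup.find_all('a', class_='tablink')
link_texts = [a.get_text() for a in links]

# Indexing a result list blindly raises IndexError when the page has
# fewer tables than expected; checking the length first avoids that.
tables = soup.find_all('table')
second_table = tables[1] if len(tables) > 1 else None

print(table)
print(link_texts)
print(second_table)
```

Printing len(tables) for the page actually fetched is a quick way to check whether the expected table is present before indexing into the list.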