Python 3.x: Web scraping from an .aspx site using Python

Tags: python-3.x, beautifulsoup, python-requests

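I'm trying to scrape some data from this website (the URL is wa_url in the code below):

With my approach I can get the first 11 pages, but for some reason it quits after page 11. I've read other posts about .aspx sites, but haven't seen anything that matches my situation.

I'm new to this, so my code is a bit verbose, but it gets the job done, sort of. I've been tweaking the headers and a few other things, but I can't get past page 11. It makes no sense to me.

I'm fairly sure the problem lies in the viewstate and viewgenerator header parameters. I don't know how to get these inside the loop for the page you want to go to, so I'm using the same values for all pages. For some reason this approach works up to and including page 11, and then it breaks. That's odd, because it looks like every page has a different viewstate value.

Thanks in advance.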

import pandas as pd
import re
import numpy as np
import urllib
from requests import Session
from bs4 import BeautifulSoup
import time
import requests


# List of pages to loop over
page_list = ['Page$1','Page$2','Page$3','Page$4','Page$5','Page$6','Page$7','Page$8','Page$9','Page$10',
             'Page$11','Page$12','Page$13','Page$14','Page$15','Page$16','Page$17','Page$18','Page$19','Page$20']
wa_url = 'https://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx'

# Getting header elements from url
session = requests.Session()
session.headers.update({
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
})
val_get = session.get(wa_url)
soup = BeautifulSoup(val_get.content, "html.parser")

tags = soup.find_all('input')
# Header elements I need for the POST request
view_state = tags[3]['value']
view_generator = tags[4]['value']
evnt_validation = tags[6]['value']



no_emps = []
date = []

#Looping through pages of WARN database
for page in page_list:
    
    data = {
    # Form data header stuff
    "__EVENTTARGET": "ucPSW$gvMain",
    "__EVENTARGUMENT": page,
    "__LASTFOCUS": "",
    "__VIEWSTATE": view_state,
    "__VIEWSTATEGENERATOR": view_generator,
    "__VIEWSTATEENCRYPTED": "",
    "__EVENTVALIDATION": evnt_validation,
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding":"gzip, deflate, br",
    "Accept-Language":"en-US,en;q=0.9",
    "Cache-Control":"max-age=0",
    "Connection":"keep-alive",
    "Content-Type":"application/x-www-form-urlencoded",
    "Cookie":"_ga=GA1.2.1011893740.1592948578; _gid=GA1.2.1433455844.1592948578",
    "Host":"fortress.wa.gov",
    "Origin":"https://fortress.wa.gov",
    "Referer":"https://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx",
    "Sec-Fetch-Dest":"document",
    "Sec-Fetch-Mode":"navigate",
    "Sec-Fetch-Site":"same-origin",
    "Sec-Fetch-User":"?1",
    "Upgrade-Insecure-Requests":"1"
    }
    
    # Getting data from each page
    session = requests.Session()
    session.headers.update({
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
    })
    
    get_warn_data = session.post(wa_url, data=data)
    soup = BeautifulSoup(get_warn_data.content, "html.parser")
    
    # Getting all rows of data and desired table data after some cleaning up
    work = soup.find_all('tr')
    work = [a.get_text('@') for a in work]
    work = [re.sub(r'\n', '', a) for a in work]
    work = [re.sub(r'^@|@$', '', a) for a in work]
    work = [a.split('@') for a in work]
    
        
    work = [a for a in work if len(a) == 7]
    no_emps_u = [a[3] for a in work]
    date_use = [a[6] for a in work]
    
    no_emps.append(no_emps_u)
    date.append(date_use)
    
# Dynamically Updating header values with stuff in current html
# Only applicable for page2 and on
if page != 'Page$1':
    tags = soup.find_all('input')
    view_state = tags[3]['value']
    view_generator = tags[4]['value']
    evnt_validation = tags[6]['value']
else:
    pass
    
# Wrapping up results into lists
from pandas.core.common import flatten
WA_WARN_no_emps = list(flatten(no_emps))
WA_WARN_date = list(flatten(date))
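
The lookups tags[3], tags[4] and tags[6] above depend on the order of the <input> elements, which may not be the same on every page. Below is a minimal sketch of looking the fields up by name instead. It uses the standard library's html.parser on a made-up HTML fragment so that it runs offline; InputCollector, collect_inputs, SAMPLE_HTML and exampleTextBox are illustrative names and values, not part of the site or of any library.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collects name -> value for every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are stored as empty strings
            self.fields[name] = attributes.get("value") or ""


def collect_inputs(html):
    """Return a dict of all named <input> fields found in the HTML text."""
    collector = InputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


# Made-up page fragment: the values and the text-box name are placeholders
SAMPLE_HTML = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="exampleTextBox" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="0A1B2C3D" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
</form>
"""

fields = collect_inputs(SAMPLE_HTML)
# Lookup by name keeps working even if the order of the inputs changes
view_state = fields["__VIEWSTATE"]
view_generator = fields["__VIEWSTATEGENERATOR"]
evnt_validation = fields["__EVENTVALIDATION"]
print(view_state, view_generator, evnt_validation)
```

With BeautifulSoup, which the question already uses, the same lookup could be written as soup.find('input', {'name': '__VIEWSTATE'})['value'].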

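Update: you can use the example below to get all pages from the site (67 in total). It reads all <input> values dynamically, so it gets the correct __VIEWSTATE etc.

Comments:

Thank you, Andrej. I figured I would need to add something at the end of the for loop. Hopefully one day I'll be able to navigate HTML like you do; until then, regex it is. So you use the previous page's viewstate header value for the next page, right? Is that the only way to run the POST request?

@wolf7687 Yes, as far as I know that is how .aspx pages work. You also have to remove some <input> data from the request, in this case ucPSW$btnSearchCompany, otherwise the server gets confused.

The example:
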
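import requests
from bs4 import BeautifulSoup


url = 'https://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx'
# Load the first page with a plain GET
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def get_data(soup, page_num):
    # Copy every <input> of the current page into the form payload,
    # including the hidden __VIEWSTATE and __EVENTVALIDATION fields
    data = {}
    for i in soup.select('input'):
        data[i['name']] = i.get('value', '')
    # The search button must not be sent, otherwise the server gets confused
    del data['ucPSW$btnSearchCompany']
    # Ask the grid for the requested page
    data['__EVENTTARGET'] = 'ucPSW$gvMain'
    data['__EVENTARGUMENT'] = 'Page${}'.format(page_num)
    data['__LASTFOCUS'] = ''
    return data

page = 1
while True:
    print('Page {}...'.format(page))

    total = 1
    # Data rows only: rows of the grid that have cells and do not
    # contain a nested table (such as the pager)
    for total, tr in enumerate(soup.select('#ucPSW_gvMain > tr:not(:has(table)):has(td)'), 1):
        tds = [td.get_text(strip=True) for td in tr.select('td')]
        print('{:<3}{:<50}{:<25}{:<15}{:<15}{:<15}{:<15}{:<15}'.format(total, *tds))

    # A full page has 15 rows; anything else means this was the last page
    if total % 15:
        break

    page += 1
    # POST the fields of the page just parsed to request the next page
    soup = BeautifulSoup(requests.post(url, data=get_data(soup, page)).content, 'html.parser')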

Prints:

Page 1...
1  Safran Cabin Materials, LLC                       Marysville and Newport   6/23/2020      85             Layoff         Permanent      6/24/2020      
2  Swissport Fueling                                 SeaTac                   5/8/2020       69             Layoff         Permanent      6/19/2020      
3  Swissport USA, Inc                                SeaTac                   5/22/2020      62             Layoff         Permanent      6/19/2020      
4  Swissport USA, Inc                                SeaTac                   3/20/2020      167            Layoff         Temporary      6/19/2020      
5  Tool Gauge and Machine Works                      Tacoma                   6/17/2020      59             Layoff         Permanent      6/18/2020      
6  Hyatt Corporation Motif Seattle                   Seattle                  3/14/2020      91             Layoff         Temporary      6/18/2020      
7  Jacobsen Daniel's Enterprise, Inc                 Tacoma                   6/12/2020      1              Layoff         Permanent      6/18/2020      
8  Benchmark Stevenson, LLC d/b/a Skamania Lodge     Stevenson                3/18/2020      185            Layoff         Temporary      6/17/2020      
9  Seattle Art Museum                                Seattle                  7/5/2020       76             Layoff         Temporary      6/16/2020      
10 Chihuly Garden & Glass                            Seattle                  3/21/2020      97             Layoff         Temporary      6/16/2020      
11 Seattle Center                                    Seattle                  3/21/2020      182            Layoff         Temporary      6/16/2020      
12 Sekisui Aerospace                                 Renton and Sumner        6/12/2020      111            Layoff         Permanent      6/15/2020      
13 Pioneer Human Services                            Seattle                  8/14/2020      59             Layoff         Permanent      6/15/2020      
14 Crista Senior Living                              Shoreline                8/16/2020      156            Closure        Permanent      6/15/2020      
15 Hyatt Corporation / Hyatt Regency Bellevue        Bellevue                 3/15/2020      223            Layoff         Temporary      6/15/2020      
Page 2...
1  Toray Composite Materials America, Inc            Tacoma                   8/8/2020       146            Layoff         Permanent      6/12/2020      
2  Embassy Suites Seattle Bellevue                   Seattle                  6/1/2020       57             Layoff         Temporary      6/12/2020      
3  Triumph Aerospace Structures                      Spokane                  6/15/2020      12             Layoff         Permanent      6/11/2020      
4  Hyatt Corporation / Hyatt Regency Lake Washington Renton                   6/30/2020      129            Layoff         Temporary      6/9/2020       
5  Lamb Weston, Inc                                  Connell, WA              6/15/2020      360            Layoff         Temporary      6/8/2020       
6  Lamb Weston, Inc                                  Warden                   6/15/2020      300            Layoff         Temporary      6/8/2020       

... and so on.
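
The point made in the comments, that each POST has to carry the fields of the page received just before it, can be sketched without touching the network. In the sketch below, build_postback mirrors get_data from the answer but works on a plain dict; crawl and fake_post are hypothetical names. fake_post is only a stand-in for the server and simply hands back a new __VIEWSTATE string for each page, which is an assumption made for illustration and not how the real site computes it.

```python
def build_postback(fields, page_num):
    """Form payload asking the grid for page_num, based on the current page's fields."""
    data = dict(fields)
    # The search button must not be submitted with a paging request
    data.pop('ucPSW$btnSearchCompany', None)
    data['__EVENTTARGET'] = 'ucPSW$gvMain'
    data['__EVENTARGUMENT'] = 'Page${}'.format(page_num)
    data['__LASTFOCUS'] = ''
    return data


def crawl(first_fields, post, last_page):
    """Request pages 2..last_page, always posting the fields of the page just received."""
    fields = first_fields
    sent = []
    for page_num in range(2, last_page + 1):
        payload = build_postback(fields, page_num)
        sent.append(payload)
        # Replace the old fields with those of the new page before the next request
        fields = post(payload)
    return sent


def fake_post(payload):
    """Hypothetical server stand-in: returns the <input> fields of the requested page."""
    page = payload['__EVENTARGUMENT'].split('$')[1]
    return {
        '__VIEWSTATE': 'state-for-page-' + page,
        'ucPSW$btnSearchCompany': 'Search',
    }


first_page_fields = {
    '__VIEWSTATE': 'state-for-page-1',
    'ucPSW$btnSearchCompany': 'Search',
}
sent = crawl(first_page_fields, fake_post, 4)
for payload in sent:
    print(payload['__EVENTARGUMENT'], payload['__VIEWSTATE'])
```

Each request for page N goes out with the state taken from page N-1. The code in the question instead sends the values read from the first page with every request.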