Python web scraping a website that requires login
I'm looking for some help scraping a website that requires a login. Basically, the site provides trading card prices (pulled from eBay, I believe), but in a format that lets you search further back than the 90 days available on eBay's own site. The login URL is the URL I search from. I searched previous posts and found one that I tried to replicate, without success. The code is below; any help would be much appreciated.

Tags: python, authentication, web-scraping
#https://stackoverflow.com/questions/47438699/scraping-a-website-with-python-3-that-requires-login
import requests
from lxml import html
from bs4 import BeautifulSoup
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
from datetime import datetime
from datetime import date
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from urllib.parse import quote
Product_name = []
Price = []
Date_sold = []
url = "https://www.pwccmarketplace.com/login"
values = {"email": "xyz@abc.com",
          "password": "password"}
session = requests.Session()
r = session.post(url, data=values)
Search_name = input("Search for: ")
Exclude_terms = input("Exclude these terms (- in front of each, no spaces): ")
qstr = quote(Search_name)
qstrr = quote(Exclude_terms)
Number_pages = int(input("Number of pages you want searched (Number -1): "))
pages = np.arange(1, Number_pages)
for page in pages:
    params = {"Category": 6, "deltreeid": 6, "do": "Delete Tree"}
    url = "https://www.pwccmarketplace.com/market-price-research?q=" + qstr + "+" + qstrr + "&year_min=2004&year_max=2020&price_min=0&price_max=10000&sort_by=date_desc&sale_type=auction&items_per_page=250&page=" + str(page)
    result = session.get(url, data=params)
    soup = BeautifulSoup(result.text, "lxml")
    search = soup.find_all('tr')
    sleep(randint(2,10))
    for container in search:
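The code continues, but the rest is not relevant to this question.

A token is sent in the payload when performing the POST on https://members.pwccmarketplace.com/login. This token is located in an input tag and can be scraped using BeautifulSoup: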
import requests
from bs4 import BeautifulSoup
session = requests.Session()
email = "your@email.com"
password = "your_password"
r = session.get("https://members.pwccmarketplace.com/login")
soup = BeautifulSoup(r.text, "html.parser")
token = soup.find("input", { "name": "_token"})["value"]
r = session.post(
    "https://members.pwccmarketplace.com/login",
    data={
        "_token": token,
        "redirect": "",
        "email": email,
        "password": password,
        "remember": "true"
    }
)
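
As a follow-up, the search loop from the question could reuse the logged-in session above. What follows is only a minimal sketch: the search URL and query parameter names are copied from the question's own URL, while `build_search_url` and `fetch_pages` are hypothetical helper names, and it assumes the cookies set by the login on members.pwccmarketplace.com are accepted by the search page, which I have not verified. It builds the query string with `urlencode` instead of concatenating strings, so a space in the search terms is encoded as `+` rather than `%20`.

```python
import time
from urllib.parse import urlencode

# Search URL and parameter names are taken from the question's code.
SEARCH_URL = "https://www.pwccmarketplace.com/market-price-research"


def build_search_url(search_name, exclude_terms, page, items_per_page=250):
    """Build the URL for one page of search results (hypothetical helper)."""
    # Join search and exclusion terms with a space; urlencode turns it into "+".
    query = " ".join(part for part in (search_name, exclude_terms) if part)
    params = {
        "q": query,
        "year_min": 2004,
        "year_max": 2020,
        "price_min": 0,
        "price_max": 10000,
        "sort_by": "date_desc",
        "sale_type": "auction",
        "items_per_page": items_per_page,
        "page": page,
    }
    return SEARCH_URL + "?" + urlencode(params)


def fetch_pages(session, search_name, exclude_terms, number_pages, delay_seconds=2.0):
    """Yield the HTML of each result page using an already logged-in session."""
    for page in range(1, number_pages + 1):
        response = session.get(build_search_url(search_name, exclude_terms, page))
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        yield response.text
        time.sleep(delay_seconds)  # pause between requests, as the question's loop does
```

Each yielded string can then be passed to `BeautifulSoup(html, "lxml")` as in the question. Note that `range(1, number_pages + 1)` fetches exactly `number_pages` pages, whereas `np.arange(1, Number_pages)` in the question stops one page short.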
Brilliant, Bertrand, that works a treat. Now on to the next problem :D