Python: automated crawling with Selenium
Tags: python, selenium, web, web-crawler

Based on this code, I want to crawl this website. Starting from the original URL, Selenium should open the first link and save its text to a txt file, then go back to the original URL, open the second link, and keep going.

The problem is that the css_selector value of the first link is #viewHeightDiv > table > tbody > tr:nth-child(1) > td.s_tit > a and that of the second link is #viewHeightDiv > table > tbody > tr:nth-child(3) > td.s_tit > a. The only difference between them is the number after nth-child, and it does not seem to follow a rule: it goes 1, 3, 5, 9, ... So I'm stuck here.

You can use a locator such as td.s_tit > a, which does not include the row index.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

OUTPUT_FILE_NAME = 'output0.txt'

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)


def get_text():
    # Open the list page and wait for the first link to become visible
    driver.get("http://law.go.kr/precSc.do?tabMenuId=tab67")
    elem = wait.until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, "#viewHeightDiv > table > tbody > tr:nth-child(1) > td.s_tit > a")))
    # Remember the first word of the title, then open the post
    title = elem.text.strip().split(" ")[0]
    elem.click()
    # Wait until the post heading contains that word, then read the body text
    wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#viewwrapCenter h2"), title))
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which newer Selenium 4 releases no longer provide
    content = driver.find_element(By.CSS_SELECTOR, "#viewwrapCenter").text
    return content


def main():
    # The with block closes the file even if get_text() raises
    with open(OUTPUT_FILE_NAME, 'w') as open_output_file:
        open_output_file.write(get_text())


main()
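To scrape all the posts, you don't need Selenium. You can do everything with the requests and BeautifulSoup libraries. The links are matched by:
td.s_tit > a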
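To save the posts to files, change the last part of the code to the version at the end of this answer, replacing /yourfullpath/ with your own path, such as "C://files/" or "/Users/myuser/files/".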
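Comments:

I'm still confused... and how do I put all the text into one txt file... by using 'a' mode on the text file? I really don't understand this code, I'm lost. So will this code scrape the body text of all the links? I can't tell whether the code is working. It needs more research or explanation.

@Kwanhehwang Yes, the code scrapes everything from all the posts and stores it in posts. You can use that later to write to a file or for anything else. The code simply fetches the data by sending requests to the URLs. Read a bit about the requests library and how the web works and you will understand.

How did you get the response URL? Also, I'm trying to create a txt file from t for each item, but it only creates one file...

@kwanhehwang I've replied; glad to help.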
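On the question of putting all the text into one txt file: the sketch below is one possible way, not part of the original answer. It assumes posts is a list of dictionaries with url and text keys, as the answer's code builds it; the function names and the sample data are made up for illustration. Opening the file once in 'w' mode and writing inside the loop is enough. 'a' (append) mode only matters when the file is reopened for each post.

```python
import os
import tempfile


def write_posts_to_one_file(posts, path):
    """Write every post into one text file, opened once in 'w' mode."""
    with open(path, "w", encoding="utf-8") as f:
        for post in posts:
            f.write("%s\n%s\n\n" % (post["url"], post["text"]))


def append_post(post, path):
    """Add a single post to the end of the file; 'a' mode keeps what is already there."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("%s\n%s\n\n" % (post["url"], post["text"]))


# Hypothetical sample data standing in for the scraped posts
sample_posts = [
    {"url": "http://example.com/1", "text": "first post"},
    {"url": "http://example.com/2", "text": "second post"},
]

# Write to a temporary directory so the demo leaves nothing behind
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "all_posts.txt")
    write_posts_to_one_file(sample_posts[:1], path)
    append_post(sample_posts[1], path)
    with open(path, encoding="utf-8") as f:
        combined = f.read()

print(combined)
```

Reopening the same file in 'w' mode for every post would overwrite it each time, which leaves only the last post in the file.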
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Request the first page of results: pg is the page number, outmax the number of items per page (50)
    response = requests.post(
        "http://law.go.kr/precScListR.do?q=*&section=evtNm&outmax=50&pg=1&fsort=21,10,30&precSeq=0&dtlYn=N")
    # Parse html using BeautifulSoup
    page = BeautifulSoup(response.text, "html.parser")

    # Find the "go to last page" element and parse the last page number (for outmax=50)
    # out of its "onclick" attribute
    onclick = str(page.select(".paging > a:last-child")[0].attrs["onclick"])
    last_page_number = int(''.join([n for n in onclick if n.isdigit()]))

    # To test, uncomment the line below to get items from the first page only
    # last_page_number = 1

    # Go through all pages and collect post numbers in items
    items = []
    for i in range(1, last_page_number + 1):
        if i > 1:
            # Go to the next page. outmax stays at 50 so the page count computed above still applies,
            # and the new response is re-parsed so that the links come from this page
            response = requests.post(
                "http://law.go.kr/precScListR.do?q=*&section=evtNm&outmax=50&pg=%d&fsort=21,10,30&precSeq=0&dtlYn=N" % i)
            page = BeautifulSoup(response.text, "html.parser")
        # Get all links
        links = page.select("#viewHeightDiv .s_tit a")
        # Loop over the links and collect post numbers
        for link in links:
            # Parse the post number from the "onclick" attribute
            items.append(''.join([n for n in link.attrs["onclick"] if n.isdigit()]))

    # Open all posts and collect them in posts, a list of dictionaries with keys: number, url and text
    posts = []
    for item in items:
        url = "http://law.go.kr/precInfoR.do?precSeq=%s&vSct=*" % item
        response = requests.get(url)
        t = BeautifulSoup(response.text, "html.parser").find('div', attrs={'id': 'contentBody'}).text
        posts.append({'number': item, 'url': url, 'text': t})
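A side note on parsing the post number: joining every digit in the onclick value works as long as the attribute contains exactly one number. The actual onclick format on the site is not shown here, so the sample value and the function name openDetail below are purely hypothetical. This is a small sketch, using the standard re module, of how one might take only the first run of digits instead.

```python
import re


def extract_post_number(onclick):
    """Return the first run of digits in an onclick value, or None if there is none."""
    match = re.search(r"\d+", onclick)
    return match.group(0) if match else None


def join_all_digits(onclick):
    """The approach used in the script: concatenate every digit in the string."""
    return "".join(n for n in onclick if n.isdigit())


# Hypothetical onclick value with two numeric arguments
sample = "openDetail('205387', 2);"
print(join_all_digits(sample))      # 2053872
print(extract_post_number(sample))  # 205387
```

If the real attribute only ever holds one number, both functions give the same result and the original one-liner is fine.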
    # Replacement for the last part of the script above: also collect each title and save one file per post.
    # posts is a list of dictionaries with keys: number, url, text and title
    posts = []
    for item in items:
        url = "http://law.go.kr/precInfoR.do?precSeq=%s&vSct=*" % item
        response = requests.get(url)
        parsed = BeautifulSoup(response.text, "html.parser")
        text = parsed.find('div', attrs={'id': 'contentBody'}).text
        title = parsed.select_one("h2").text
        posts.append({'number': item, 'url': url, 'text': text, 'title': title})
        # Write inside the loop so that every post gets its own file
        with open('/yourfullpath/' + title + '.text', 'w') as f:
            f.write(text)
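One thing to watch when the title is used as a file name: titles can contain characters such as / or ? that are not valid in paths, and two posts may share a title. The sketch below is an assumption-laden illustration, not part of the original answer. The helper names safe_file_name and save_posts, the .txt extension and the sample posts are all made up; it only shows one way to clean the title and prefix the post number so that each post ends up in its own file.

```python
import os
import re
import tempfile


def safe_file_name(title, number):
    """Build a file name from the post number and a cleaned-up title."""
    # Replace path separators, reserved characters and whitespace runs with "_"
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    # Prefixing the post number keeps names unique even when titles repeat
    return "%s_%s.txt" % (number, cleaned[:80]) if cleaned else "%s.txt" % number


def save_posts(posts, directory):
    """Write each post to its own file in directory and return the paths."""
    paths = []
    for post in posts:
        path = os.path.join(directory, safe_file_name(post["title"], post["number"]))
        with open(path, "w", encoding="utf-8") as f:
            f.write(post["text"])
        paths.append(path)
    return paths


# Hypothetical posts that share a title containing awkward characters
sample_posts = [
    {"number": "101", "title": "A/B: test?", "text": "first"},
    {"number": "102", "title": "A/B: test?", "text": "second"},
]

# A temporary directory stands in for /yourfullpath/
with tempfile.TemporaryDirectory() as directory:
    saved = [os.path.basename(p) for p in save_posts(sample_posts, directory)]

print(saved)
```

With the scraped data, save_posts(posts, '/yourfullpath/') would be called once after the loop instead of opening a file inside it.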