Python 3.x: Scrape all links from a website using BeautifulSoup or Selenium

Tags: python-3.x, selenium, web-scraping, scrapy

I want to scrape all the links from a website and filter them so that I can wget them later.

The problem is that, given a URL

URL = "https://stackoverflow.com/questions/"
my scraper should crawl it and return URLs such as:

https://stackoverflow.com/questions/51284071/how-to-get-all-the-link-in-page-using-selenium-python
https://stackoverflow.com/questions/36927366/how-to-get-the-link-to-all-the-pages-of-a-website-for-data-scrapping 
https://stackoverflow.com/questions/46468032/python-selenium-automatically-load-more-pages
Currently, I am using code borrowed from StackOverflow:

import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    if depth == 10:
        return url
    else:
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        newlink = soup.find('a')
        if newlink is None:  # find() returns a single Tag or None, so test for None rather than len()
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    results = []
    for link in links:  # collect into a separate list; appending to `links` while iterating over it never terminates
        try:
            results.append(recursiveUrl(url, link, 0))
        except Exception:
            pass
    return links + results
links = getLinks("https://www.businesswire.com/portal/site/home/news/")
print(links)
I think that, instead of crawling through every page, it would be better to go through all the hyperlinks provided on the page.

I also referred to this:

link = "https://www.businesswire.com/news"

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = link  # note: `link` already contains the scheme, so URL below becomes malformed ("http://https://...")
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url= URL + url
            print (url)
            yield Request(url, callback=self.parse)
But this is too old and does not work properly.
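
For reference, BaseSpider and HtmlXPathSelector were removed from Scrapy long ago. A minimal sketch of the same spider against the current Scrapy API (untested, with the start URL taken from the snippet above):

import scrapy

class LinksSpider(scrapy.Spider):
    name = "links"
    allowed_domains = ["www.businesswire.com"]
    start_urls = ["https://www.businesswire.com/news"]

    def parse(self, response):
        # response.xpath() replaces HtmlXPathSelector; .getall() replaces .extract()
        for href in response.xpath("//a/@href").getall():
            # response.urljoin() resolves relative links against the current page
            url = response.urljoin(href)
            print(url)
            yield scrapy.Request(url, callback=self.parse)

It can be run without a project scaffold via: scrapy runspider spider.py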

Web scraping is new to me, so I may be stuck on some basic fundamentals.


Let me know how to approach this problem.

One solution, using requests and bs4:

import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# Find all <a> in your HTML that have a not null 'href'. Keep only 'href'.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
If you want to keep only the question links, then:

print(
    [
        link
        if link.startswith("https://stackoverflow.com")
        else f"https://stackoverflow.com{link}"
        for link in links
        if "/questions/" in link
    ]
)
Output:

[
    "#",
    "https://stackoverflow.com",
    "#",
    "/teams/customers",
    "https://stackoverflow.com/advertising",
    "#",
    "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2fquestions%2f",
    "https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent",
    "https://stackoverflow.com",
...
[
    "https://stackoverflow.com/questions/ask",
    "https://stackoverflow.com/questions/61523359/assembly-nasm-print-ascii-table-using-a-range-determined-by-input",
    "https://stackoverflow.com/questions/tagged/assembly",
    "https://stackoverflow.com/questions/tagged/input",
    "https://stackoverflow.com/questions/tagged/range",
    "https://stackoverflow.com/questions/tagged/ascii",
    "https://stackoverflow.com/questions/tagged/nasm",
    "https://stackoverflow.com/questions/61523356/can-i-inject-an-observable-from-a-parent-component-into-a-child-component",
    "https://stackoverflow.com/questions/tagged/angular",
    "https://stackoverflow.com/questions/tagged/redux",
...
]
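
As a side note, urllib.parse.urljoin from the standard library is a more robust way to absolutize the relative hrefs than the f-string above, since it also resolves fragments and dot segments; a small sketch reusing the links list from the solution:

from urllib.parse import urljoin

base = "https://stackoverflow.com/questions/"
# urljoin resolves relative hrefs ("/questions/...", "#", "../tags") against the base URL
question_links = [urljoin(base, link) for link in links if "/questions/" in link]
print(question_links)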

I think this may work. You have to install the Selenium dependencies and download the Firefox Selenium driver (geckodriver). Next, run this script:
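
(The script itself did not survive on the scraped page. What follows is a minimal reconstruction, not the original: it assumes Selenium 4 with Firefox/geckodriver available and keeps only hrefs containing "/questions/", which would produce output of the shape shown below.)

import sys
from selenium import webdriver
from selenium.webdriver.common.by import By

# default to the questions listing; an explicit URL can be passed as an argument
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/questions/"

driver = webdriver.Firefox()
try:
    driver.get(url)
    # collect every anchor with an href and keep only question links
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href]"):
        href = a.get_attribute("href")
        if href and "/questions/" in href:
            print(href)
finally:
    driver.quit()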

OUTPUT:

python stackoverflow.py
https://stackoverflow.com/questions/61519440/how-do-i-trigger-a-celery-task-from-django-admin
https://stackoverflow.com/questions/61519439/how-to-add-rows-in-consecutive-blocks-in-excel
https://stackoverflow.com/questions/61519437/not-null-constraint-failed-api-userlog-browser-info-id-when-i-want-to-add-show
https://stackoverflow.com/questions/61519435/dart-parse-date-with-0000
https://stackoverflow.com/questions/61519434/is-there-a-way-to-reduce-the-white-pixels-in-a-invereted-image
https://stackoverflow.com/questions/61519433/querying-datastore-using-some-of-the-indexed-properties
https://stackoverflow.com/questions/61519431/model-checkpoint-doesnt-create-a-directory
https://stackoverflow.com/questions/61519430/why-is-the-event-dispatched-by-window-not-captured-by-other-elements
https://stackoverflow.com/questions/61519426/live-sass-complier-in-vs-code-unfortunately-stopped-working-while-coding
....
Comments:
"I think this scrapes the URLs from the main page. Does it also work for the links on the second page? What if I use the second URL I mentioned, will it work without changing anything?"
"OK, so you should do this in a loop, as I said, until a 'next page' no longer exists..."
"I see, I will loop over the page numbers and then save all the links."
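
A sketch of the page-number loop suggested in the comments; it assumes the listing accepts a page query parameter (as stackoverflow.com/questions does), and the page range here is only an example:

import requests
from bs4 import BeautifulSoup

all_links = []
for page in range(1, 4):  # loop over page numbers instead of following "next" links
    html = requests.get("https://stackoverflow.com/questions/", params={"page": page}).content
    soup = BeautifulSoup(html, "html.parser")
    # keep only hrefs that point at questions, as in the answer above
    all_links += [
        a["href"] for a in soup.find_all("a", href=True) if "/questions/" in a["href"]
    ]

print(len(all_links))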