Python web scraping: URL does not change when searching


python web-scraping


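I am trying to scrape a web page (the Udacity course list). I need to get the courses that are shown when a search query is entered. For example, if I enter python, 17 courses appear as results, and I need to fetch only those courses. The search query is not passed as part of the URL (it is not a GET request), so the HTML content does not change either. How can I get these results without going through the whole course list? In the code below I fetch all the course links, download each course page, and look for the search term in that page's text. But it does not give the results I expect.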

import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')

# Lowercase the input so it can match the lowercased page text below.
search_term = input("enter the course:").strip().lower()
for link in courses:
    # Download each course page and search its visible text for the term.
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()

    if search_term in text_from_html(html).lower():
        print('\n'+link.text)
        print("https://in.udacity.com" + link['href'])

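Read a tutorial on how to use requests (for making HTTP requests) and BeautifulSoup (for processing HTML). This will teach you what you need to download a page and extract data from its HTML.

You will use a search function such as find_all to locate all the elements in the page's HTML with class=course-summary-card. The content you want is inside that element; after reading the links above, you should be able to work out the rest easily ;)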

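By the way, a useful tool while you are learning to do this is the "Inspect element" feature (available in Chrome and Firefox), which you can open by right-clicking an element in the browser. It lets you see the source around the element you want to extract, so you can find details such as its class or id, its parent div, and so on, and then select it in BeautifulSoup/lxml/etc.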
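As a rough illustration of the approach described above, the sketch below runs BeautifulSoup's find_all on a small inline HTML snippet instead of the live site. The sample markup, the use of a div with class course-summary-card, the href values, and the extract_cards helper are all assumptions made for illustration; the real page structure should be checked with Inspect element first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page. The tag names, class
# names and hrefs are assumptions; verify them with "Inspect element".
SAMPLE_HTML = """
<div class="course-summary-card">
  <a class="capitalize" href="/course/python-foundation">Python Foundation</a>
</div>
<div class="course-summary-card">
  <a class="capitalize" href="/course/vr-foundations">VR Foundations</a>
</div>
<div class="footer"><a href="/about">About</a></div>
"""


def extract_cards(html):
    """Return (title, href) pairs for the first link inside each course card."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # class_ (with a trailing underscore) filters on the HTML class attribute.
    for card in soup.find_all("div", class_="course-summary-card"):
        link = card.find("a")
        if link is not None:
            results.append((link.get_text(strip=True), link.get("href")))
    return results


if __name__ == "__main__":
    for title, href in extract_cards(SAMPLE_HTML):
        print(title, href)
```

The same selection could probably be written as a CSS selector, soup.select("div.course-summary-card a"), if that reads better to you.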

Using requests and BeautifulSoup:

Output:

VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.

AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems

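Edit:

As @Martin Evans explained, the Ajax call behind the search is not doing what you think it is; it is probably keeping a count of searches, i.e. how many users searched for AI. The page basically filters the courses by the keyword in the search term: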
import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
# Course links are selected as <a> tags with class "capitalize".
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"

for course in courses:
    # re.escape treats the term as literal text; IGNORECASE ignores case.
    if re.search(re.escape(search_term), course.text, re.IGNORECASE):
        print(course.text)

Output:
VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.

AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
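One caveat with an unanchored match is that a short term such as "AI" also matches inside longer words, for example the "ai" in "Blockchain". A minimal sketch of whole-word matching follows. The match_whole_word helper is hypothetical, the titles are copied from the output above, and whether this reproduces the site's own search results is not known.

```python
import re

# A few titles taken from the output listed above.
TITLES = [
    "Artificial Intelligence for Trading",
    "AI Programming with Python",
    "Blockchain Developer Nanodegree program",
    "Knowledge-Based AI: Cognitive Systems",
]


def match_whole_word(term, titles):
    """Return the titles that contain term as a whole word, ignoring case."""
    # \b marks a word boundary, so "AI" no longer matches inside "Blockchain".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]


if __name__ == "__main__":
    for title in match_whole_word("AI", TITLES):
        print(title)
```

Note that \b only works as intended when the term starts and ends with a letter, digit, or underscore, so a term like "C++" would need different handling.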

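The Udacity page actually returns all of the available courses when you request it. When you enter a search, the page simply filters the data it already has. That is why you see no change to the URL when you search. Inspecting the page with the browser's developer tools confirms this. It also explains why the "search" is so fast.
So, if you are searching for a given course, you just need to filter the results yourself. For example: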
import requests
from bs4 import BeautifulSoup

req = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(req.content, "html.parser")
a_tags = soup.find_all("a", class_="capitalize")

print("Number of courses:", len(a_tags))
print()

for a_tag in a_tags:
    course = a_tag.text

    if "python" in course.lower():
        print(course)

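This displays all of the courses with Python in the title:
Number of courses: 225

Python Foundation
AI Programming with Python
Programming Foundations with Python
Data Structures and Algorithms in Python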
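Comments:

Welcome to StackOverflow. Please take the time to read the linked guidance on posting and revise your question accordingly. These tips may also be useful.

@Merin Why did you unaccept my answer?

I think there should be a better option than Beautiful Soup and requests for extracting the search results.

I think that if you read more about web scraping, you will find this is probably the simplest way to solve the problem. You may wish there were an easier way, but it is as simple as writing a script that fetches the HTML and extracts text from it.

@Merin Why did you untick my answer?

I tried the approach you suggested (filtering for all courses that contain the search term), but it gives me more courses than the web page lists. For example, when I search for python, 17 courses appear on the web page, but when I filter, more than 17 courses come back.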