Web scraping: How to extract text from an HTML tag and filter on the text it contains?

Tags: web-scraping, beautifulsoup, python-3.4

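What I'm trying to do is get the text from a specific tag on each linked page, and return the HTML only if that text contains certain words. For example: if the text contains "chemical", return that link; if not, skip it.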

Here is my code:

import requests
from bs4 import BeautifulSoup
import webbrowser

jobsearch = input("What type of job?: ")
location = input("What is your location: ")
url = ("https://ca.indeed.com/jobs?q=" + jobsearch + "&l=" + location)
base_url = 'https://ca.indeed.com/'

r = requests.get(url)
rcontent = r.content
prettify = BeautifulSoup(rcontent, "html.parser")

all_job_url = []

def get_all_joblinks():
    for tag in prettify.find_all('a', {'data-tn-element':"jobTitle"}):
        link = tag['href']
        all_job_url.append(link)

def filter_links():

    for eachurl in all_job_url:
        rurl = requests.get(base_url + eachurl)
        content = rurl.content
        soup = BeautifulSoup(content, "html.parser")
        summary = soup.find('td', {'class':'snip'}).get_text()
        print(summary)

def search_job():

    while True:

        if prettify.select('div.no_results'):
            print("no job matches found")
            break
        else:
            # opens the web page of job search if entries are found
            webbrowser.open_new(url)
            break

get_all_joblinks()
filter_links()

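In your get_all_joblinks function you appear to be collecting all of the links from a ca.indeed.com page. Here is how you might check whether one typical link mentions "chemical" somewhere in its body element: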

>>> import requests
>>> import bs4
>>> page = requests.get('https://jobs.sanofi.us/job/-/-/507/4895612?utm_source=indeed.com&utm_campaign=sanofi%20sem%20campaign&utm_medium=job_aggregator&utm_content=paid_search&ss=paid').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> body = soup.find('body').text
>>> chemical_present = 'chemical' in body.lower()
>>> chemical_present
True
I hope this is what you're looking for.
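
The check above can be wrapped in a small function so it can be reused for every job page. This is only a minimal sketch: the name `mentions_keyword` and the sample HTML are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup


def mentions_keyword(html, keyword):
    """Return True if the page text mentions keyword (case-insensitive)."""
    soup = BeautifulSoup(html, "html.parser")
    # Fall back to the whole document if the page has no <body> element.
    container = soup.body or soup
    return keyword.lower() in container.get_text().lower()


# Hypothetical sample page, not taken from a real job posting.
sample = "<html><body><p>Handles Chemical inventory.</p></body></html>"
print(mentions_keyword(sample, "chemical"))
print(mentions_keyword(sample, "biology"))
```

Lower-casing both sides makes the match case-insensitive, so "Chemical" and "CHEMICAL" count as well. Note that this is a plain substring test, so "chemical" also matches "biochemical".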

Edit, in response to the comment:

>>> import webbrowser
>>> from urllib import parse
>>> job_type = 'engineer'
>>> location = 'Toronto'
>>> url = "https://ca.indeed.com/jobs?q=" + job_type + "&l=" + location
>>> base_url = '%s://%s' % parse.urlparse(url)[0:2]
>>> page = requests.get(url).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.find_all('a', {'data-tn-element':"jobTitle"}):
...     job_page = requests.get(base_url+link['href']).content
...     job_soup = bs4.BeautifulSoup(job_page, 'lxml')
...     body = job_soup.find('body').text
...     if 'chemical' in body.lower():
...         webbrowser.open(base_url+link['href'])
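
The session above builds the search URL by string concatenation, which can break if the job type or location contains spaces or commas. A possible alternative, sketched here with the standard library only, is to let `urlencode` escape the query values and `urljoin` resolve each relative `href`. The function name `build_search_url` and the example `href` value are assumptions for illustration, not something taken from the site.

```python
from urllib.parse import urlencode, urljoin


def build_search_url(job_type, location, site="https://ca.indeed.com"):
    """Build the search URL with the query values percent-encoded."""
    query = urlencode({"q": job_type, "l": location})
    return site + "/jobs?" + query


search_url = build_search_url("chemical engineer", "Toronto, ON")
print(search_url)

# urljoin resolves a relative href against the page it was found on.
# The href below is a made-up placeholder.
job_url = urljoin(search_url, "/rc/clk?jk=abc123")
print(job_url)
```

With `urljoin` there is no need to compute `base_url` by hand, and an `href` that is already absolute is passed through unchanged.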

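Yes, I want to extract all of the links and then filter them based on the specific text they contain. After filtering, I want to display the links that passed the filter. Is that possible?

Please see the edit. I should probably warn you that this code takes a long time to run.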
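
Most of the run time is likely spent waiting on one HTTP request per job link, so one way to shorten it might be to download the pages concurrently and then return the matching links as a list rather than opening each one in a browser. The following is a rough sketch under stated assumptions: `filter_links`, its `fetch` and `matches` parameters, and the fake pages are all hypothetical, and the stand-in data is there only so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_links(links, fetch, matches, max_workers=8):
    """Return the links whose fetched page satisfies matches(), in input order."""
    links = list(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the order of `links` while the downloads overlap.
        pages = list(pool.map(fetch, links))
    return [link for link, page in zip(links, pages) if matches(page)]


# Stand-in pages so the sketch runs without network access. A real fetch
# could be something like: lambda url: requests.get(url, timeout=10).text
fake_pages = {
    "/job/1": "<body>Chemical engineer wanted</body>",
    "/job/2": "<body>Software developer</body>",
    "/job/3": "<body>Lab work with CHEMICAL reagents</body>",
}

kept = filter_links(
    fake_pages,
    fetch=fake_pages.get,
    matches=lambda html: "chemical" in html.lower(),
)
print(kept)  # ['/job/1', '/job/3']
```

Keeping `max_workers` small is a deliberate choice here, since firing many simultaneous requests at a single site may get the client throttled or blocked.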