How to get the text of an <a> tag that contains a specific URL

Tags: text, scrapy, href, contains

I have a question that I can't find the answer to, and it may be an interesting one. I am looking for a link like this one:

    <a href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml">Prosta delovna mesta v Sandozu</a>

However, I am inside a loop, and all I have is a reference to the URL. I have tried the following options:

    response.xpath('//a[@href=url_orig]/text()')

    word = "career"
    response.xpath('//a[contains(@href, "%s")]/text()').extract() % word

Thanks, everyone.
Marko


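from urllib.parse import urljoin, urlparse

from scrapy import Request
from scrapy.utils.response import get_base_url

#JobItem is the project's own Item class; its definition is not shown here.

def parse(self, response):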


    #We take all urls; they are marked by "href". They point either to pages on our website or to other websites.
    urls = response.xpath('//@href').extract()

    #Base url.
    base_url = get_base_url(response) 

    #Loop through all urls on the webpage.
    for url in urls:

        #If the url points to a picture, a document, an archive ... we ignore it. We might have to change that, because some companies publish job vacancies as PDF files.
        if url.endswith((
            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', 
            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', 

            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', 
            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', 

            #music and video
            '.mp3', '.mp4', '.mpg', '.ai', '.avi',
            '.MP3', '.MP4', '.MPG', '.AI', '.AVI',

            #compressions and other
            '.zip', '.rar', '.css', '.flv',
            '.ZIP', '.RAR', '.CSS', '.FLV',
        )):
            continue

        #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it. 
        #However in this case we exclude good urls like http://www.mdm.si/company#employment
        if any(x in url for x in ['?', '%', '&', '#']):
            continue

        #Ignore ftp.
        if url.startswith("ftp"):
            continue

        #Keep the href exactly as it appears on the page, so that the <a> tag can be looked up by it later.
        url_orig = url

        #If url doesn't start with "http", it is a relative url, and we add the base url to get the absolute url.
        # -- It is true that we may get some strange urls, but it is fine for now.
        if not (url.startswith("http")):
            url = urljoin(base_url,url)

        #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.         
        if (urlparse(url).netloc == urlparse(base_url).netloc):

            #The main part. We look for webpages, whose urls include one of the employment words as strings.

            # -- Instruction. 
            # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
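            if any(x in url for x in [
                #Example words taken from the instruction above; replace them with your own list.
                'jobs', 'vacancies', 'career', 'employment',
            ]):
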
                #We found url that includes one of the magic words. We check, if we have found it before. If it is new, we add it to the list "jobs_urls".
                if url not in self.jobs_urls:
                    self.jobs_urls.append(url)
                    item = JobItem()
                    item["link"] = url
                    #item["term"] = response.xpath('//a[@href=url_orig]/text()').extract() 
                    #item["term"] = response.xpath('//a[contains(@href, "career")]/text()').extract()

                    #We return the item.
                    yield item

            #We don't use an "else" clause, because we also want to explore the employment webpage itself to find further employment webpages.
            #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py. 
            yield Request(url, callback = self.parse)
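
The filtering steps in the loop above can also be read as one small function. The following is only a sketch using the standard library, not part of the original spider: the name `resolve_internal_url` and the shortened `IGNORED_EXTENSIONS` tuple are made up for illustration. It lowercases the href once instead of listing every extension in both cases, and it relies on `urljoin` leaving absolute URLs unchanged.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical, shortened extension list; the parse method above uses a longer one.
IGNORED_EXTENSIONS = ('.jpg', '.png', '.pdf', '.zip', '.css')


def resolve_internal_url(href, base_url):
    """Return the absolute URL for href, or None if the link should be skipped."""
    # Lowercasing once replaces the duplicated upper-case entries.
    if href.lower().endswith(IGNORED_EXTENSIONS):
        return None
    # Same rule as above: links with ?, %, & or # are skipped.
    if any(ch in href for ch in '?%&#'):
        return None
    if href.startswith('ftp'):
        return None
    # urljoin resolves relative links and returns absolute ones unchanged.
    url = urljoin(base_url, href)
    # Stay on the same site by comparing domains (netloc).
    if urlparse(url).netloc != urlparse(base_url).netloc:
        return None
    return url
```

Inside the loop this would be used as `url = resolve_internal_url(url_orig, base_url)`, skipping the link when the result is `None`.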
The URL has to be interpolated into the XPath string:

item["term"] = response.xpath('//a[@href="%s"]/text()' % url_orig).extract()
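
Why the line above works: in `'//a[@href=url_orig]/text()'` the name `url_orig` is just part of the string, and XPath reads it as a child element called `url_orig`, so nothing matches. In the second attempt, `% word` is applied to the list returned by `extract()`, which raises a `TypeError`; the formatting has to happen inside the parentheses, as in `response.xpath('//a[contains(@href, "%s")]/text()' % word).extract()`. The sketch below illustrates the same idea with the standard library's `xml.etree.ElementTree` instead of Scrapy, so that it runs without extra packages. The `PAGE` snippet and both function names are made up for illustration, and since ElementTree has no `contains()`, the substring test is done in Python.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the downloaded page (an assumption for this sketch).
PAGE = (
    '<html><body>'
    '<a href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml">'
    'Prosta delovna mesta v Sandozu</a>'
    '<a href="/about">About</a>'
    '</body></html>'
)


def link_texts_for(root, href):
    """Texts of all <a> elements whose href equals the given string exactly."""
    # The %-formatting happens in Python, before the query is evaluated,
    # so the query contains the URL itself rather than the name of a variable.
    query = './/a[@href="%s"]' % href
    return [a.text for a in root.findall(query)]


def link_texts_containing(root, word):
    """Texts of all <a> elements whose href contains word."""
    # ElementTree has no contains(); the substring test is done in Python instead.
    return [a.text for a in root.iter('a') if word in a.get('href', '')]


root = ET.fromstring(PAGE)
url_orig = "http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml"
print(link_texts_for(root, url_orig))
print(link_texts_containing(root, "career"))
```

The lookup only succeeds if `url_orig` holds the href exactly as it is written in the page, so it needs to be captured for every link before `urljoin` turns it into an absolute URL.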