How to get the text in an <a> tag that contains a specific url

Tags: text, scrapy, href, contains

I have a question I don't know the answer to; it might be interesting. I am looking for a link like this:

    <a href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml">Prosta delovna mesta  v Sandozu</a>
but I am inside a loop and only have the url in a variable. I tried the following options:

    response.xpath('//a[@href=url_orig]/text()').extract()
    response.xpath('//a[@href='url_orig']/text()').extract()

    word = "career"
    response.xpath('//a[contains(@href, "%s")]/text()').extract() % word
None of them work. I am trying to figure out how to put a variable reference, rather than a literal string, into '@href' or the 'contains' function. My code is below. Do you think there is a way?

Thanks, everyone. Marko


# parse() is a method of the scrapy.Spider subclass from the question. The snippet
# relies on these imports (JobItem is the project's item class from items.py):
#
#   from urllib.parse import urljoin, urlparse
#   from scrapy import Request
#   from scrapy.utils.response import get_base_url

def parse(self, response):

    response.selector.remove_namespaces() 



    #We take all urls, marked by "href". These are either pages on our website or links to other websites.
    urls = response.xpath('//@href').extract()


    #Base url.
    base_url = get_base_url(response) 


    #Loop through all urls on the webpage.
    for url in urls:

        #If url points to an image, a document, an archive ... we ignore it. We might have to change that because some companies provide job vacancy information in PDF.
        if url.endswith((
            #images
            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', 
            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', 

            #documents
            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', 
            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', 

            #music and video
            '.mp3', '.mp4', '.mpg', '.ai', '.avi',
            '.MP3', '.MP4', '.MPG', '.AI', '.AVI',

            #compressions and other
            '.zip', '.rar', '.css', '.flv',
            '.ZIP', '.RAR', '.CSS', '.FLV',


        )):
            continue
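
        # (A possible simplification, as an untested assumption: lower-case the url
        #  once and list each extension only once, e.g.
        #  if url.lower().endswith(('.jpg', '.pdf', '.zip')): continue )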


        #If url includes characters like ?, %, & or #, it is LIKELY NOT the one we are looking for, so we ignore it.
        #However, this also excludes good urls like http://www.mdm.si/company#employment.
        if any(x in url for x in ['?', '%', '&', '#']):
            continue
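
        # (If fragment urls like http://www.mdm.si/company#employment should be kept,
        #  one assumed option is to strip only the fragment first:
        #  from urllib.parse import urldefrag; url = urldefrag(url)[0]
        #  and then test only for '?', '%' and '&'.)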

        #Ignore ftp.
        if url.startswith("ftp"):
            continue

        #If url doesn't start with "http", it is a relative url, and we join it with the base url to get the absolute url.
        # -- It is true that we may get some strange urls, but that is fine for now.
        # -- We keep the original href value in url_orig for every url, so the XPath lookup below can use it.
        url_orig = url
        if not url.startswith("http"):
            url = urljoin(base_url, url)


        #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
        if (urlparse(url).netloc == urlparse(base_url).netloc):
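            # (netloc is the host part of the url, e.g.
            #  urlparse("http://www.sandoz.com/careers/").netloc == "www.sandoz.com")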


            #The main part. We look for webpages whose urls include one of the employment words.

            # -- Instruction. 
            # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
            if any(x in url for x in [

                'careers',
                'Careers',

                'jobs',
                'Jobs',

                'employment',                                   
                'Employment', 

                'join_us',
                'Join_Us',
                'Join_us',

                'vacancies',
                'Vacancies',

                'work-for-us',

                'working-with-us',
            ]):
                #We found a url that includes one of the magic words. We check if we have seen it before. If it is new, we add it to the list "jobs_urls".
                if url not in self.jobs_urls:
                    self.jobs_urls.append(url)
                    item = JobItem()
                    item["link"] = url
                    #item["term"] = response.xpath('//a[@href=url_orig]/text()').extract() 
                    #item["term"] = response.xpath('//a[contains(@href, "career")]/text()').extract()


                    #We return the item.
                    yield item

            #We don't add an "else" branch because we also want to explore the employment webpage to find possible new employment webpages.
            #We keep looking for employment webpages until we reach the DEPTH set in settings.py.
            yield Request(url, callback = self.parse)
item["term"] = response.xpath('//a[@href="%s"]/text()' % url_orig).extract()