Scraping AngelList profile descriptions with Python BeautifulSoup


A newbie here.

I am currently stuck with a tedious task where I have to copy/paste certain contents from AngelList and save them in Excel. I have used scrapers before to automate such boring tasks, but this one is fairly difficult and I have not been able to find a way to automate it. Please see the website link below:

https://angel.co/people/all

Apply the filters Location -> USA and Market -> Online Dating. There will be around 550 results (note that the URL does not change when the filters are applied).

Once the filters were applied, I managed to scrape the URLs of all the profiles, so I now have an Excel file containing the 550 profile URLs.

Now the next step is to visit each individual profile and scrape certain information. I am looking for the following fields:

  • Name
  • Description
  • Investments
  • Founder
  • Advisor
  • Locations
  • Markets
  • What I'm looking for

I have tried a number of solutions so far, but none of them has worked. Import.io, Data Miner, and Data Scraper tools have not been of much help.

Please suggest whether there is any VBA code, Python code, or other tool that could help me automate this scraping task.

Complete code for the solution:

Below is the final code, with comments. (Note that it is written for Python 2: it uses urllib2 and print statements.) If anyone still runs into problems, please comment below and I will try to help.

    from bs4 import BeautifulSoup
    import urllib2
    import json
    import csv
    
    def fetch_page(url):
        opener = urllib2.build_opener()
        # changing the user agent as the default one is banned
        opener.addheaders = [('User-Agent', 'Mozilla/43.0.1')]
        return opener.open(url).read()
    
    
    #Create a CSV File.
    f = open('angle_profiles.csv', 'w')
    # Row Headers
    f.write("URL" + "," + "Name" + "," + "Founder" + "," + "Advisor" + "," + "Employee" + "," + "Board Member" + ","
        + "Customer" + "," + "Locations" + "," + "Markets" + "," + "Investments" + "," + "What_iam_looking_for" + "\n")
    
    # URLs to iterate over has been saved in file: 'profiles_links.csv' . I will extract the URLs individually...
    index = 1;
    with open("profiles_links.csv") as f2:
    
        for row in map(str.strip,f2):
            url = format(row)
            print "@ Index: ", index
            index += 1;
    
            # If the URL can't be fetched (e.g. 404), skip it and continue with the rest of the URLs.
            try:
                html = fetch_page(url)
            except Exception, e:
                print "Error 404 @: " , url
                continue
    
            bs = BeautifulSoup(html, "html.parser")
    
            #Extract info from page with these tags..
            name = bs.select(".profile-text h1")[0].get_text().strip()
    
            #description = bs.select('div[data-field="bio"]')[0]['data-value']
    
            founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
    
            advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
    
            employee = map(lambda link: link.get_text().strip(), bs.select('.role_employee a'))
    
            board_member = map(lambda link: link.get_text().strip(), bs.select('.role_board_member a'))
    
            customer = map(lambda link: link.get_text().strip(), bs.select('.role_customer a'))
    
            class_wrapper = bs.body.find('div', attrs={'data-field' : 'tags_interested_locations'})
            count = 1
            locations = {}
            
            if class_wrapper is not None:
                for span in class_wrapper.find_all('span'):
                    locations[count] = span.text
                    count +=1
    
            class_wrapper = bs.body.find('div', attrs={'data-field' : 'tags_interested_markets'})
            count = 1
            markets = {}
            if class_wrapper is not None:
                for span in class_wrapper.find_all('span'):
                    markets[count] = span.text
                    count +=1
            
            what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
    
            user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
    
            # investments are loaded using separate request and response is in JSON format
            json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
    
            investment_records = json.loads(json_data)
    
            investments = map(lambda x: x['company']['company_name'], investment_records)
    
            # Make sure that every variable is in string
    
            name2 = str(name); founder2 = str(founder); advisor2 = str (advisor); employee2 = str(employee)
            board_member2 = str(board_member); customer2 = str(customer); locations2 = str(locations); markets2 = str (markets);
            what_iam_looking_for2 = str(what_iam_looking_for); investments2 = str(investments);
    
            # Replace any , found with - so that csv doesn't confuse it as col separator...
            name = name2.replace(",", " -")
            founder = founder2.replace(",", " -")
            advisor = advisor2.replace(",", " -")
            employee = employee2.replace(",", " -")
            board_member = board_member2.replace(",", " -")
            customer = customer2.replace(",", " -")
            locations = locations2.replace(",", " -")
            markets = markets2.replace(",", " -")
            what_iam_looking_for = what_iam_looking_for2.replace(","," -")
            investments = investments2.replace(","," -")
    
            # Replace u' with nothing
            name = name.replace("u'", "")
            founder = founder.replace("u'", "")
            advisor = advisor.replace("u'", "")
            employee = employee.replace("u'", "")
            board_member = board_member.replace("u'", "")
            customer = customer.replace("u'", "")
            locations = locations.replace("u'", "")
            markets = markets.replace("u'", "")
            what_iam_looking_for = what_iam_looking_for.replace("u'", "")
            investments = investments.replace("u'", "")
    
            # Write the information back to the file... Note \n is used to jump one row ahead...
            f.write(url + "," + name + "," + founder + "," + advisor + "," + employee + "," + board_member + ","
                    + customer + "," + locations + "," + markets + "," + investments + "," + what_iam_looking_for + "\n")
    
The above code can be tested with any of the following links:

    https://angel.co/idg-ventures?utm_source=people
    https://angel.co/douglas-feirstein?utm_source=people
    https://angel.co/andrew-heckler?utm_source=people
    https://angel.co/mvklein?utm_source=people
    https://angel.co/rajs1?utm_source=people
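
A side note: the script above is written for Python 2 (urllib2, print statements). If you are on Python 3, a minimal sketch of an equivalent fetch helper using the standard-library urllib.request could look like this (the rest of the script would need small adjustments as well):

    # Python 3 sketch of the fetch helper used above.
    import urllib.request

    def fetch_page(url):
        # set a browser-like user agent, as the default one is blocked
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode('utf-8')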
    
Happy coding :)

Take a look at Scrapy.

It allows you to write parsers very quickly. Below is an example of a parser I wrote for a site similar to angel.co.

Unfortunately, angel.co is not available for me at the moment. A good starting point:

    $ pip install scrapy
    $ cat > myspider.py <<EOF
    
    import scrapy
    
    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://angel.co']
    
        def parse(self, response):
            # here's selector to extract interesting elements
            for title in response.css('h2.entry-title'):
                # write down here values you'd like to extract from the element
                yield {'title': title.css('a ::text').extract_first()}
    
            # how to find next page
            next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
    
    EOF
    
    $ scrapy runspider myspider.py
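
If you want to save the scraped items to a file instead of only printing them to the console, runspider also accepts an output flag (the file name below is just an example):

    $ scrapy runspider myspider.py -o profiles.json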
    
For my recipe you will need to install BeautifulSoup using pip or easy_install.
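
For example, with pip (the PyPI package that provides the bs4 module is called beautifulsoup4):

    $ pip install beautifulsoup4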

    from bs4 import BeautifulSoup
    import urllib2
    import json
    
    def fetch_page(url):
        opener = urllib2.build_opener()
        # changing the user agent as the default one is banned
        opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
        return opener.open(url).read()
    
    
    html = fetch_page("https://angel.co/davidtisch")
    
    # or load from local file
    #html = open('page.html', 'r').read()
    
    bs = BeautifulSoup(html, "html.parser")
    name = bs.select(".profile-text h1")[0].get_text().strip()
    
    description = bs.select('div[data-field="bio"]')[0]['data-value']
    
    founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
    
    advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
    
    locations = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_locations"] a'))
    
    markets = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_markets"] a'))
    
    what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
    
    user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
    
    # investments are loaded using separate request and response is in JSON format
    json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
    
    investment_records = json.loads(json_data)
    
    investments = map(lambda x: x['company']['company_name'], investment_records)
    

You can use python scrapy for this task. Take a look at this answer to see how to fetch information from multiple pages.

@daniboy000 This is hard for me to follow, as I have only gone through a few Python tutorials and I have no experience with Scrapy.

The Scrapy documentation is very good; in the second example they show you how to do exactly what you want.

@halfer Thanks for the edit.

Thank you for your reply. Since I am new to Python, writing this code would take me a lot of time. Could you provide a working sample for angel.co, with comments, so that I can understand what it does?

I have updated my answer. Just plug in the right CSS selectors.

Installed BS4, figured out how to save the data to a CSV file, and iterated over the 550 URLs. Only one thing is left: description throws an error, and the locations and markets fields come back empty. Please advise; after that the task will be done. Many thanks for your support.

@MuhammadIrfanAli All of the fields can be extracted in several different ways. The expressions I gave work for the two users I tried. If you are still having trouble with some fields, just give me a URL where my code does not work and I will try to fix it.

I have updated the post, please take a look. The description is not that important; locations and markets are the required fields, so it would be great if those could be extracted. Right now I just get []. I tried using bs.find_all(), but the response is still []. Please let me know if you manage to get this working; otherwise I may post it as a separate question. Thanks.

Hi, I managed to extract the required fields. You can find the complete solution in the post. Feel free to suggest improvements :)
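
For reference, the issue discussed in these comments: the recipe selected locations and markets with the CSS selectors div[data-field="tags_interested_locations"] a and div[data-field="tags_interested_markets"] a, which came back empty for some profiles. The final solution at the top of the post instead looks up the wrapper div by its data-field attribute and reads the span tags inside it. A condensed sketch of that approach (assuming bs is the parsed profile page, as in the code above):

    # Sketch (Python 2, matching the code above): extract the "interested locations"
    # tags by attribute lookup instead of a CSS selector.
    wrapper = bs.body.find('div', attrs={'data-field': 'tags_interested_locations'})
    locations = [span.text for span in wrapper.find_all('span')] if wrapper is not None else []
    print locations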