
Python: Scraping data from paginated content

Tags: python, django, beautifulsoup

I have a website that uses AJAX scroll pagination (it loads more content as you scroll). By default it shows 25 items, and I can scrape those.

How do I scrape the data from the paginated content?

I am scraping the data with BeautifulSoup and a cronjob.

My code:

import requests
from bs4 import BeautifulSoup

from myapp.models import VendorDetails  # the Django model being filled (import path assumed)

r = requests.get(url)  # url: the listing page, defined elsewhere
data = r.text
soup = BeautifulSoup(data, 'html.parser')
content = soup.find_all('section', {'class': 'jrcl'})
for c in content:
    # prefer the second link in the section; fall back to the first if absent
    try:
        links = c.select('a')[1]['href']
    except IndexError:
        links = c.select('a')[0]['href']
    web_link = requests.get(links)
    print("web", links)
    content_data = web_link.text
    soup_content = BeautifulSoup(content_data, 'html.parser')
    text = soup_content.find('section', {'class': 'jdlc'})
    vendor = VendorDetails()
    vendor.company = text.select('.fn')[0].text
    vendor.source = links
    vendor.address = text.select('.jadlt')[0].text
    try:
        vendor.contact = text.select('.tel')[0]['href'].replace('tel:', ' ')
        # a second number may not exist, hence the IndexError fallback
        vendor.contact2 = text.select('.tel')[1]['href'].replace('tel:', ' ')
    except IndexError:
        vendor.contact = text.select('.tel')[0]['href'].replace('tel:', ' ')
    vendor.save()
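
For the cronjob side, an entry along these lines would run the scraper on a schedule; the interval, interpreter path, and script name here are assumptions:

# crontab -e: run the scraper every day at 02:00
0 2 * * * /usr/bin/python /path/to/scrape_vendors.py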

I used Selenium with PhantomJS to achieve this. I used window.scroll to load the entire page, and that worked for me.

import time
from selenium import webdriver

def handle(self, *args, **options):
    driver = webdriver.PhantomJS()

    driver.get("http://example.com")
    time.sleep(3)

    # elem = driver.find_element_by_tag_name("body")
    driver.set_window_size(1024, 768)

    no_of_pagedowns = 20

    # keep scrolling to the bottom so the AJAX pagination loads more items
    while no_of_pagedowns:
        # elem.send_keys(Keys.PAGE_DOWN)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        no_of_pagedowns -= 1

    post_elems = driver.find_elements_by_class_name("jcn")
    driver.save_screenshot('testing.png')
    for post in post_elems:
        pass  # operations to be done on each post
    driver.close()
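
Note that PhantomJS has since been deprecated and dropped from newer Selenium releases; a headless Chrome or Firefox driver is a drop-in replacement for webdriver.PhantomJS(). A minimal sketch, assuming chromedriver is installed:

from selenium import webdriver

# drop-in replacement for webdriver.PhantomJS() (assumes chromedriver is on PATH)
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)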

Look at the browser's dev console: every time you scroll, an AJAX request is fired. Make the same AJAX request from your Python code.
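
For example, a minimal sketch of that approach with requests; the endpoint URL, the page parameter, and the X-Requested-With header below are placeholders, so copy the real request from the Network tab of the dev console:

import requests
from bs4 import BeautifulSoup

# Placeholder endpoint and parameter -- replace with the request you see
# in the Network tab while the page loads more items on scroll.
AJAX_URL = 'http://example.com/listings'

page = 1
while True:
    r = requests.get(AJAX_URL,
                     params={'page': page},
                     headers={'X-Requested-With': 'XMLHttpRequest'})
    soup = BeautifulSoup(r.text, 'html.parser')
    sections = soup.find_all('section', {'class': 'jrcl'})
    if not sections:
        break  # an empty page means there is nothing left to load
    for c in sections:
        pass  # parse each section as in the question's code
    page += 1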