
Python: Scraping data from paginated content

Tags: python, django, beautifulsoup

I have a website that uses AJAX scroll pagination (it loads more content as you scroll). By default it shows 25 items, and I can scrape those.

How do I scrape the data from the paginated content?

I am scraping the data with BeautifulSoup and a cronjob.

My code:

import requests
from bs4 import BeautifulSoup

from myapp.models import VendorDetails  # the Django model being filled (import path assumed)

r = requests.get(url)  # url: the listing page, defined elsewhere
data = r.text
soup = BeautifulSoup(data, 'html.parser')
content = soup.find_all('section', {'class': 'jrcl'})
for c in content:
    # prefer the second link in the section; fall back to the first if absent
    try:
        links = c.select('a')[1]['href']
    except IndexError:
        links = c.select('a')[0]['href']
    web_link = requests.get(links)
    print("web", links)
    content_data = web_link.text
    soup_content = BeautifulSoup(content_data, 'html.parser')
    text = soup_content.find('section', {'class': 'jdlc'})
    vendor = VendorDetails()
    vendor.company = text.select('.fn')[0].text
    vendor.source = links
    vendor.address = text.select('.jadlt')[0].text
    try:
        vendor.contact = text.select('.tel')[0]['href'].replace('tel:', ' ')
        # a second number may not exist, hence the IndexError fallback
        vendor.contact2 = text.select('.tel')[1]['href'].replace('tel:', ' ')
    except IndexError:
        vendor.contact = text.select('.tel')[0]['href'].replace('tel:', ' ')
    vendor.save()
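
For the cronjob side, an entry along these lines would run the scraper on a schedule; the interval, interpreter path, and script name here are assumptions:

# crontab -e: run the scraper every day at 02:00
0 2 * * * /usr/bin/python /path/to/scrape_vendors.py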

I used Selenium with PhantomJS to achieve this. I used window.scroll to load the entire page, and that worked for me.

import time
from selenium import webdriver

def handle(self, *args, **options):
    driver = webdriver.PhantomJS()

    driver.get("http://example.com")
    time.sleep(3)

    # elem = driver.find_element_by_tag_name("body")
    driver.set_window_size(1024, 768)

    no_of_pagedowns = 20

    # keep scrolling to the bottom so the AJAX pagination loads more items
    while no_of_pagedowns:
        # elem.send_keys(Keys.PAGE_DOWN)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        no_of_pagedowns -= 1

    post_elems = driver.find_elements_by_class_name("jcn")
    driver.save_screenshot('testing.png')
    for post in post_elems:
        pass  # operations to be done on each post
    driver.close()
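
Note that PhantomJS has since been deprecated and dropped from newer Selenium releases; a headless Chrome or Firefox driver is a drop-in replacement for webdriver.PhantomJS(). A minimal sketch, assuming chromedriver is installed:

from selenium import webdriver

# drop-in replacement for webdriver.PhantomJS() (assumes chromedriver is on PATH)
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)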

Look at the browser's dev console: every time you scroll, an AJAX request is fired. Make the same AJAX request from your Python code.
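
For example, a minimal sketch of that approach with requests; the endpoint URL, the page parameter, and the X-Requested-With header below are placeholders, so copy the real request from the Network tab of the dev console:

import requests
from bs4 import BeautifulSoup

# Placeholder endpoint and parameter -- replace with the request you see
# in the Network tab while the page loads more items on scroll.
AJAX_URL = 'http://example.com/listings'

page = 1
while True:
    r = requests.get(AJAX_URL,
                     params={'page': page},
                     headers={'X-Requested-With': 'XMLHttpRequest'})
    soup = BeautifulSoup(r.text, 'html.parser')
    sections = soup.find_all('section', {'class': 'jrcl'})
    if not sections:
        break  # an empty page means there is nothing left to load
    for c in sections:
        pass  # parse each section as in the question's code
    page += 1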