Python 2.7 Beautiful Soup - unable to scrape links from paginated pages

I am unable to scrape the links to the articles on a paginated web page. In addition, I sometimes get a blank screen as my output. I can't find the problem in the loop, and the CSV file is never created.

from pprint import pprint
import requests
from bs4 import BeautifulSoup
import lxml
import csv
import urllib2

def get_url_for_search_key(search_key):
    for i in range(1,100):
        base_url = 'http://www.thedrum.com/'
        response = requests.get(base_url + 'search?page=%s&query=' + search_key +'&sorted=')%i
        soup = BeautifulSoup(response.content, "lxml")
        results = soup.findAll('a')
        return [url['href'] for url in soup.findAll('a')]
        pprint(get_url_for_search_key('artificial intelligence'))

with open('StoreUrl.csv', 'w+') as f:
    f.seek(0)
    f.write('\n'.join(get_url_for_search_key('artificial intelligence')))
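
For what it's worth, two things stand out in the posted function: the % i is applied to the object returned by requests.get instead of to the URL string, and the return inside the loop exits after the first page, so the pprint call below it is never reached. A minimal corrected sketch, assuming the posted URL pattern is what the site actually expects, could look like this:

from pprint import pprint
import requests
from bs4 import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.thedrum.com/'
    urls = []
    for i in range(1, 100):
        # Format the page number and search term into the URL string first,
        # then fetch the page (the original applied % i to the response object).
        page_url = '%ssearch?page=%s&query=%s&sorted=' % (base_url, i, search_key)
        response = requests.get(page_url)
        soup = BeautifulSoup(response.content, "lxml")
        # Accumulate links across all pages instead of returning on the first pass.
        urls.extend(a['href'] for a in soup.findAll('a') if a.get('href'))
    return urls

pprint(get_url_for_search_key('artificial intelligence'))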

Are you sure you only need the first 100 pages? There may well be more.

Here is my take on your task. It collects the links from all of the pages by precisely catching the 'Next page' button link:

import requests
from bs4 import BeautifulSoup


base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while True:
    # Collect every link found on the current page.
    res.append([url['href'] for url in soup.findAll('a')])

    # Follow the "Next page" button until it no longer appears (last page).
    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
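
One small caveat, purely defensive: if the 'Next page' href turns out to be relative rather than absolute (I haven't verified the site's markup, so this is an assumption), it would need to be joined against the current page URL before the next request, for example:

import urlparse  # Python 2; in Python 3 this lives in urllib.parse

# Hypothetical guard around the request at the bottom of the loop above.
next_url = urlparse.urljoin(response.url, next_button['href'])
response = requests.get(next_url)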
Edit: an alternative approach that collects only the article links:

import requests
from bs4 import BeautifulSoup


base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while True:
    search_results = soup.find('div', class_='search-results')  # narrow the search window to the div holding the article links
    article_link_tags = search_results.findAll('a')  # then collect links from that subtree only
    res.append([url['href'] for url in article_link_tags])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
To print the links, use:

for i in res:
    for j in i:
        print(j)
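
Equivalently, since res is a list of per-page link lists, the same output can be produced in one flattened pass, for instance with itertools.chain (just a convenience, not part of the original answer):

from itertools import chain

# Iterate over every link across all per-page lists in a single loop.
for link in chain.from_iterable(res):
    print(link)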

I kept only the first 100 pages purely for initial testing. The problem is that when I try to print the links based on your solution, a series of "None" is printed below a single link. I simply used

pprint(res.append([url['href'] for url in soup.findAll('a')]))

right after the snippet you provided. I'm not sure whether that is the right way to do it. Very confused.

That certainly isn't correct =) At the end of the day you have a list of lists. To print every link you have to loop over each list of links and then over each link inside each list, i.e. a double loop. Remove that print, and inspect the res variable after the loop has finished. I have added a working print loop, please check.
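
Since the original goal also included writing the links to a CSV file, a minimal sketch for dumping res (assuming the StoreUrl.csv name from the question and one URL per row) could be:

import csv

# Flatten the per-page lists and write one URL per row
# ('wb' is the usual file mode for the csv module on Python 2).
with open('StoreUrl.csv', 'wb') as f:
    writer = csv.writer(f)
    for page_links in res:
        for link in page_links:
            writer.writerow([link])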