Python: getting a list of links (URLs) to different web pages from the Washington Post website


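I am crawling the Washington Post news site to collect a set of URLs: I want to get the list of URLs of the result pages and save it to a text file.

Here is my code: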

import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
text_file = "http://www.washingtonpost.com/newssearch/search.html?st=turkey&submit=Submit"
data = opener.open(text_file).read()

soup = BeautifulSoup(data)
paragraph = soup.find_all('div', attrs = {'class': 'pb-feed-headline'})
for href in paragraph:
    print href
    saveFile = open('wpnewspaper_url_collection.txt','a')
    saveFile.write(href.text.encode('utf-8'))
    saveFile.write('\n')
    saveFile.write('\n')
    saveFile.close()
This is what ends up in the text file (only part of my result is shown):

<div class="pb-feed-headline"><h3><a href="http://www.washingtonpost.com/world/middle_east/turkey-seeks-behind-scene-role-in-nato-coalition/2014/09/14/4e124944-3beb-11e4-a430-b82a3e67b762_story.html">Turkey seeks behind-scene role in NATO coalition</a></h3></div>

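First of all, you are not getting to the actual links: the loop iterates over the div elements, not over the a elements inside them. Also, instead of opening the file on every iteration of the loop, open it once before the loop, write one link per iteration, and use the with context manager to handle the file; it will close the file for you.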

To get the actual links, I would use the CSS selector div.pb-feed-headline a, which matches every link inside a div with the class pb-feed-headline.
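
As a quick offline check of what that selector matches, here is a minimal sketch that runs it against one result in the shape shown in the question. It assumes Python 3 and bs4 with the built-in html.parser backend; the second div (class other) and its example.com URL are made up purely for contrast.

```python
from bs4 import BeautifulSoup

# One headline block in the shape shown in the question, plus a made-up
# div with a different class that the selector should ignore.
SAMPLE_HTML = """
<div class="pb-feed-headline"><h3><a href="http://www.washingtonpost.com/world/middle_east/turkey-seeks-behind-scene-role-in-nato-coalition/2014/09/14/4e124944-3beb-11e4-a430-b82a3e67b762_story.html">Turkey seeks behind-scene role in NATO coalition</a></h3></div>
<div class="other"><a href="http://example.com/ignored">Not a headline</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.pb-feed-headline a" means: any <a> nested at any depth inside
# a <div> whose class list contains "pb-feed-headline".
matched = soup.select("div.pb-feed-headline a")
hrefs = [a.get("href") for a in matched]

for href in hrefs:
    print(href)
```

Only the link inside the pb-feed-headline div is selected, even though it sits one level down inside an h3; the link in the other div is skipped.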

from cookielib import CookieJar
import urllib2

from bs4 import BeautifulSoup


cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

text_file = "http://www.washingtonpost.com/newssearch/search.html?st=turkey&submit=Submit"
soup = BeautifulSoup(opener.open(text_file))

links = soup.select('div.pb-feed-headline a')
with open('wpnewspaper_url_collection.txt', 'w') as f:
    for link in links:
        f.write(link.get('href') + '\n')
As a result, wpnewspaper_url_collection.txt will contain:

http://www.washingtonpost.com/world/middle_east/turkey-seeks-behind-scene-role-in-nato-coalition/2014/09/14/4e124944-3beb-11e4-a430-b82a3e67b762_story.html
http://www.washingtonpost.com/world/middle_east/turkey-could-accept-muslim-brotherhood-leaders/2014/09/16/7653dc0c-3d6b-11e4-a430-b82a3e67b762_story.html
http://www.washingtonpost.com/world/middle_east/us-expands-aid-funding-for-refugee-crisis-in-syria/2014/09/12/97282b20-3a6b-11e4-8601-97ba88884ffd_story.html
http://www.washingtonpost.com/world/middle_east/hagel-visits-turkey-in-bid-to-strategize-over-coordinated-campaign-against-islamic-state/2014/09/08/6913803e-377f-11e4-9c9f-ebb47272e40e_story.html
http://www.washingtonpost.com/49-reasons-why-turkey-doesnt-want-to-fight-the-islamic-state/2014/09/12/b581f8bb-bbd9-44de-8788-9a09bfd0b3f0_story.html
http://www.washingtonpost.com/world/middle_east/cobbling-coalition-for-iraq-syria-no-easy-task/2014/09/12/11f964c4-3af1-11e4-a023-1d61f7f31a05_story.html
http://www.washingtonpost.com/world/middle_east/us-turkey-mull-strategy-against-islamic-militants/2014/09/12/6479a9bc-3a78-11e4-a023-1d61f7f31a05_story.html
http://www.washingtonpost.com/world/middle_east/cobbling-coalition-for-iraq-syria-no-easy-task/2014/09/12/3c9e50be-3ab9-11e4-a023-1d61f7f31a05_story.html
http://www.washingtonpost.com/world/middle_east/arab-allies-pledge-to-fight-islamic-state-group/2014/09/11/c97927fe-3a2b-11e4-a023-1d61f7f31a05_story.html
http://www.washingtonpost.com/national/wild-turkey-maker-reaches-60th-year-in-business/2014/09/10/8ca01c46-38de-11e4-a023-1d61f7f31a05_story.html
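
The code above targets Python 2, where urllib2 and cookielib exist. In Python 3 the same functionality lives in urllib.request and http.cookiejar. A rough port might look like the sketch below. The function names (extract_links, save_links, main) are my own, and whether the search URL and the pb-feed-headline class still match the live site is an assumption I have not verified.

```python
from http.cookiejar import CookieJar
import urllib.request

from bs4 import BeautifulSoup

# Search URL taken from the question; it may no longer be valid.
SEARCH_URL = "http://www.washingtonpost.com/newssearch/search.html?st=turkey&submit=Submit"


def extract_links(html):
    """Return the href of every link inside a div.pb-feed-headline."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select("div.pb-feed-headline a")
    # Skip anchors without an href instead of failing on None.
    return [a["href"] for a in anchors if a.get("href")]


def save_links(links, path):
    """Write one link per line; the file is opened once and closed by `with`."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")


def main():
    # Cookie-aware opener, mirroring the Python 2 version.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    with opener.open(SEARCH_URL) as response:
        html = response.read()
    save_links(extract_links(html), "wpnewspaper_url_collection.txt")
```

Calling main() performs the network request and writes the file. Keeping extract_links and save_links separate from the fetch means they can be tried on a saved HTML snippet without touching the site.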

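Sorry, there is no .text in my actual code; it is just saveFile.write(href.encode('utf-8')).

@alecxe, thank you very much, it really works. Thanks for your help. I have one more question, if you don't mind: can I get a complete list of URLs from all the pages of the Washington Post?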