Python: Scraping a page for URLs using BeautifulSoup


python web-scraping


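I can scrape the headlines from this page without any problem. The URLs are another matter. As I understand it, they are fragments appended to the end of the base URL. What do I need to do to pull the relative URLs so that I can store them in the form base_url + scraped_fragment?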

from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

from bs4 import BeautifulSoup


# Download the page and parse it with the lxml parser.
html = urlopen("http://advances.sciencemag.org/")
soup = BeautifulSoup(html.read().decode("utf-8"), "lxml")
# links = soup.find_all("a", "href")

# Select the headline <div> elements by class and print their text.
headlines = soup.find_all("div", "highwire-cite-title media__headline__title")
for headline in headlines:
    text = headline.get_text()
    print(text)

First, there should be a space between the class names:

highwire-cite-title media__headline__title
               HERE^

In any case, since you need the links, you should locate the a elements and use urljoin() to build absolute URLs:

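from urllib.parse import urljoin  # Python 3; on Python 2 use "from urlparse import urljoin"

import requests
from bs4 import BeautifulSoup


base_url = "http://advances.sciencemag.org"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

# Each headline link carries this class; its href is relative to the site root.
headlines = soup.find_all(class_="highwire-cite-linked-title")
for headline in headlines:
    # Resolve the relative href against the base URL to get an absolute URL.
    print(urljoin(base_url, headline["href"]))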

Prints:

http://advances.sciencemag.org/content/2/4/e1600069
http://advances.sciencemag.org/content/2/4/e1501914
http://advances.sciencemag.org/content/2/4/e1501737
...
http://advances.sciencemag.org/content/2/2
http://advances.sciencemag.org/content/2/1
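
To see why urljoin() is preferable to plain string concatenation here, the following is a small offline sketch of how it resolves different kinds of href values. The relative and example.com inputs are made up for illustration; only the base URL and the /content/2/4/e1600069 path come from the output above.

```python
from urllib.parse import urljoin

base_url = "http://advances.sciencemag.org"

# A root-relative href is attached directly to the host of the base URL.
root_relative = urljoin(base_url, "/content/2/4/e1600069")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "http://example.com/page")

# A plain relative href is resolved against the directory of the base URL.
relative = urljoin(base_url + "/content/2/", "4/e1600069")

print(root_relative)
print(already_absolute)
print(relative)
```

Because urljoin() handles all three cases, the scraping loop does not need to check whether a given href starts with a slash or with a scheme.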


Works great! As a follow-up, if I wanted each URL paired with its headline, e.g. "Van der Waals metal-semiconductor junction: Weak Fermi level pinning enables effective tuning of Schottky barrier", what would that statement look like?

@citramaillo, you can get the title from headline.get_text(). Check it out.

Thanks.
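
A minimal sketch of the pairing suggested in the comment above, combining get_text() with urljoin(). It runs on a small inline HTML sample instead of the live page, so the markup, the titles, the hrefs, and the helper name pair_titles_with_urls are all assumptions for illustration; only the class name and base URL come from the answer. It also uses the built-in "html.parser" so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://advances.sciencemag.org"

# Hypothetical markup standing in for the live page (structure assumed).
sample_html = """
<div><a class="highwire-cite-linked-title" href="/content/1/1/example-a">
  <span>First sample headline</span></a></div>
<div><a class="highwire-cite-linked-title" href="/content/1/1/example-b">
  <span>Second sample headline</span></a></div>
"""


def pair_titles_with_urls(html, base):
    """Return a list of (title, absolute_url) tuples for each headline link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for link in soup.find_all("a", class_="highwire-cite-linked-title"):
        title = link.get_text(strip=True)   # visible text of the link
        url = urljoin(base, link["href"])   # relative href -> absolute URL
        pairs.append((title, url))
    return pairs


for title, url in pair_titles_with_urls(sample_html, base_url):
    print(title, url)
```

With the live page, the same function could be given response.content from the answer's requests.get() call, assuming the page still uses that class name. The resulting tuples can then be written to a CSV file or a database row by row.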