Building a Python web scraper, need help getting the correct output


I am building a web scraper in Python. The purpose of my scraper is to fetch all the site links from this webpage.

I want output like this -

www.thepiratebay.se
www.kat.ph
I am new to Python and scraping, and I am doing this just for practice. Please help me get the correct output.

My code --------------------------------------

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")

soup = BeautifulSoup(r.content, "html.parser")
# find_all takes an attribute dict: {"class": "main-container-2"},
# not the set {"class:", "main-container-2"} with the colon inside the string
data = soup.find_all("div", {"class": "main-container-2"})
for item in data:
    print(item.contents[1].find_all("a"))

My output ---

Call `.get('href')` on each link to print just the URL, like this:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")

soup = BeautifulSoup(r.text, "html.parser")
data = soup.find_all("div", {"class": "main-container-2"})

for i in data:
    for j in i.contents[1].find_all("a"):
        print(j.get('href'))
Full output:

http://www.thepiratebay.se
http://siteanalytics.compete.com/thepiratebay.se
http://quantcast.com/thepiratebay.se
http://www.alexa.com/siteinfo/thepiratebay.se/
http://www.kickass.to
http://siteanalytics.compete.com/kickass.to
http://quantcast.com/kickass.to
http://www.alexa.com/siteinfo/kickass.to/
http://www.torrentz.eu
http://siteanalytics.compete.com/torrentz.eu
http://quantcast.com/torrentz.eu
http://www.alexa.com/siteinfo/torrentz.eu/
http://www.extratorrent.cc
http://siteanalytics.compete.com/extratorrent.cc
http://quantcast.com/extratorrent.cc
http://www.alexa.com/siteinfo/extratorrent.cc/
http://www.yify-torrents.com
http://siteanalytics.compete.com/yify-torrents.com
http://quantcast.com/yify-torrents.com
http://www.alexa.com/siteinfo/yify-torrents.com
http://www.bitsnoop.com
http://siteanalytics.compete.com/bitsnoop.com
http://quantcast.com/bitsnoop.com
http://www.alexa.com/siteinfo/bitsnoop.com/
http://www.isohunt.to
http://siteanalytics.compete.com/isohunt.to
http://quantcast.com/isohunt.to
http://www.alexa.com/siteinfo/isohunt.to/
http://www.sumotorrent.sx
http://siteanalytics.compete.com/sumotorrent.sx
http://quantcast.com/sumotorrent.sx
http://www.alexa.com/siteinfo/sumotorrent.sx/
http://www.torrentdownloads.me
http://siteanalytics.compete.com/torrentdownloads.me
http://quantcast.com/torrentdownloads.me
http://www.alexa.com/siteinfo/torrentdownloads.me/
http://www.eztv.it
http://siteanalytics.compete.com/eztv.it
http://quantcast.com/eztv.it
http://www.alexa.com/siteinfo/eztv.it/
http://www.rarbg.com
http://siteanalytics.compete.com/rarbg.com
http://quantcast.com/rarbg.com
http://www.alexa.com/siteinfo/rarbg.com/
http://www.1337x.org
http://siteanalytics.compete.com/1337x.org
http://quantcast.com/1337x.org
http://www.alexa.com/siteinfo/1337x.org/
http://www.torrenthound.com
http://siteanalytics.compete.com/torrenthound.com
http://quantcast.com/torrenthound.com
http://www.alexa.com/siteinfo/torrenthound.com/
https://demonoid.org/
http://siteanalytics.compete.com/demonoid.pw
http://quantcast.com/demonoid.pw
http://www.alexa.com/siteinfo/demonoid.pw/
http://www.fenopy.se
http://siteanalytics.compete.com/fenopy.se
http://quantcast.com/fenopy.se
http://www.alexa.com/siteinfo/fenopy.se/
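The desired output was just the site hostnames, without the analytics links. A minimal post-processing sketch of the hrefs printed above (the list here is a shortened sample; the analytics hostnames are taken from the output itself):

```python
from urllib.parse import urlparse

# Shortened sample of the hrefs printed above: each site link is
# followed by compete/quantcast/alexa analytics links
hrefs = [
    "http://www.thepiratebay.se",
    "http://siteanalytics.compete.com/thepiratebay.se",
    "http://quantcast.com/thepiratebay.se",
    "http://www.alexa.com/siteinfo/thepiratebay.se/",
    "http://www.kickass.to",
]

# Hostnames of the analytics sites to filter out
ANALYTICS = {"siteanalytics.compete.com", "quantcast.com", "www.alexa.com"}

# Keep only the hostname part of each URL, skipping analytics domains
sites = [urlparse(h).netloc for h in hrefs
         if urlparse(h).netloc not in ANALYTICS]
print(sites)  # ['www.thepiratebay.se', 'www.kickass.to']
```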

Since you are practicing, take a look at regular expressions. This will get only the title links. The needle string is the pattern to match, and the parentheses in `(http://.*?)` enclose the capture group:

# Python 2 code: urllib2 was merged into urllib.request in Python 3
import urllib2
import re

myURL = "http://www.ebizmba.com/articles/torrent-websites"
req = urllib2.Request(myURL)

Needle1 = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(Needle1, urllib2.urlopen(req).read()):
    print(match.group(1))
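The snippet above is Python 2 only. A rough Python 3 sketch of the same regex approach, demonstrated here against an inline HTML sample so it runs without a network request (the live page could instead be fetched with `urllib.request.urlopen(myURL).read().decode()`):

```python
import re

# Inline sample mimicking the page's markup; in a real run, html would
# come from urllib.request or requests
html = ('<p><a href="http://www.thepiratebay.se" rel="nofollow" '
        'target="_blank">The Pirate Bay</a></p>')

# Same needle pattern as above: the parentheses capture the URL
needle = r'<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(needle, html):
    print(match.group(1))  # http://www.thepiratebay.se
```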

What output are you getting now?
Hey, thanks @Kevin. Can you tell me how I can get more precise output like this ---…
@Kevin Agreed, in many cases parsing HTML with regular expressions can be a bad idea, since you have to validate your results to make sure you don't get unpleasant surprises. On the other hand, you get results quickly, so it all depends on the use (and reuse) case.