Building a Python web scraper, need help getting the correct output


I am building a web scraper in Python. The purpose of my scraper is to fetch all the site links from this webpage.

I want output like this -

www.thepiratebay.se
www.kat.ph
I am new to Python and scraping, and I am doing this just for practice. Please help me get the correct output.

My code --------------------------------------

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")

soup = BeautifulSoup(r.content, "html.parser")
# find_all takes an attribute dict: {"class": "main-container-2"},
# not the set {"class:", "main-container-2"} with the colon inside the string
data = soup.find_all("div", {"class": "main-container-2"})
for item in data:
    print(item.contents[1].find_all("a"))

My output ---

Call `.get('href')` on each link to print just the URL, like this:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")

soup = BeautifulSoup(r.text, "html.parser")
data = soup.find_all("div", {"class": "main-container-2"})

for i in data:
    for j in i.contents[1].find_all("a"):
        print(j.get('href'))
Full output:

http://www.thepiratebay.se
http://siteanalytics.compete.com/thepiratebay.se
http://quantcast.com/thepiratebay.se
http://www.alexa.com/siteinfo/thepiratebay.se/
http://www.kickass.to
http://siteanalytics.compete.com/kickass.to
http://quantcast.com/kickass.to
http://www.alexa.com/siteinfo/kickass.to/
http://www.torrentz.eu
http://siteanalytics.compete.com/torrentz.eu
http://quantcast.com/torrentz.eu
http://www.alexa.com/siteinfo/torrentz.eu/
http://www.extratorrent.cc
http://siteanalytics.compete.com/extratorrent.cc
http://quantcast.com/extratorrent.cc
http://www.alexa.com/siteinfo/extratorrent.cc/
http://www.yify-torrents.com
http://siteanalytics.compete.com/yify-torrents.com
http://quantcast.com/yify-torrents.com
http://www.alexa.com/siteinfo/yify-torrents.com
http://www.bitsnoop.com
http://siteanalytics.compete.com/bitsnoop.com
http://quantcast.com/bitsnoop.com
http://www.alexa.com/siteinfo/bitsnoop.com/
http://www.isohunt.to
http://siteanalytics.compete.com/isohunt.to
http://quantcast.com/isohunt.to
http://www.alexa.com/siteinfo/isohunt.to/
http://www.sumotorrent.sx
http://siteanalytics.compete.com/sumotorrent.sx
http://quantcast.com/sumotorrent.sx
http://www.alexa.com/siteinfo/sumotorrent.sx/
http://www.torrentdownloads.me
http://siteanalytics.compete.com/torrentdownloads.me
http://quantcast.com/torrentdownloads.me
http://www.alexa.com/siteinfo/torrentdownloads.me/
http://www.eztv.it
http://siteanalytics.compete.com/eztv.it
http://quantcast.com/eztv.it
http://www.alexa.com/siteinfo/eztv.it/
http://www.rarbg.com
http://siteanalytics.compete.com/rarbg.com
http://quantcast.com/rarbg.com
http://www.alexa.com/siteinfo/rarbg.com/
http://www.1337x.org
http://siteanalytics.compete.com/1337x.org
http://quantcast.com/1337x.org
http://www.alexa.com/siteinfo/1337x.org/
http://www.torrenthound.com
http://siteanalytics.compete.com/torrenthound.com
http://quantcast.com/torrenthound.com
http://www.alexa.com/siteinfo/torrenthound.com/
https://demonoid.org/
http://siteanalytics.compete.com/demonoid.pw
http://quantcast.com/demonoid.pw
http://www.alexa.com/siteinfo/demonoid.pw/
http://www.fenopy.se
http://siteanalytics.compete.com/fenopy.se
http://quantcast.com/fenopy.se
http://www.alexa.com/siteinfo/fenopy.se/
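The desired output was just the site hostnames, without the analytics links. A minimal post-processing sketch of the hrefs printed above (the list here is a shortened sample; the analytics hostnames are taken from the output itself):

```python
from urllib.parse import urlparse

# Shortened sample of the hrefs printed above: each site link is
# followed by compete/quantcast/alexa analytics links
hrefs = [
    "http://www.thepiratebay.se",
    "http://siteanalytics.compete.com/thepiratebay.se",
    "http://quantcast.com/thepiratebay.se",
    "http://www.alexa.com/siteinfo/thepiratebay.se/",
    "http://www.kickass.to",
]

# Hostnames of the analytics sites to filter out
ANALYTICS = {"siteanalytics.compete.com", "quantcast.com", "www.alexa.com"}

# Keep only the hostname part of each URL, skipping analytics domains
sites = [urlparse(h).netloc for h in hrefs
         if urlparse(h).netloc not in ANALYTICS]
print(sites)  # ['www.thepiratebay.se', 'www.kickass.to']
```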

Since you are practicing, take a look at regular expressions. This will get only the title links. The needle string is the pattern to match, and the parentheses in `(http://.*?)` enclose the capture group:

# Python 2 code: urllib2 was merged into urllib.request in Python 3
import urllib2
import re

myURL = "http://www.ebizmba.com/articles/torrent-websites"
req = urllib2.Request(myURL)

Needle1 = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(Needle1, urllib2.urlopen(req).read()):
    print(match.group(1))
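The snippet above is Python 2 only. A rough Python 3 sketch of the same regex approach, demonstrated here against an inline HTML sample so it runs without a network request (the live page could instead be fetched with `urllib.request.urlopen(myURL).read().decode()`):

```python
import re

# Inline sample mimicking the page's markup; in a real run, html would
# come from urllib.request or requests
html = ('<p><a href="http://www.thepiratebay.se" rel="nofollow" '
        'target="_blank">The Pirate Bay</a></p>')

# Same needle pattern as above: the parentheses capture the URL
needle = r'<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(needle, html):
    print(match.group(1))  # http://www.thepiratebay.se
```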

What output are you getting now?
Hey, thanks @Kevin. Can you tell me how I can get more precise output like this ---…
@Kevin Agreed, in many cases parsing HTML with regular expressions can be a bad idea, since you have to validate your results to make sure you don't get unpleasant surprises. On the other hand, you get results quickly, so it all depends on the use (and reuse) case.