Python: counting the number of images on a web page with urllib


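For a class I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?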

import sys
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
  try:
      w =  urllib.request.urlopen(url)
  except IOError:
      sys.stderr.write("Couldn't connect to %s " % url)
      sys.exit(1)
  contents =  str(w.read())
  img_num = len(img_pat.findall(contents))
  return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))


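Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.

If you must do it the regex way, this will do the trick:

img_pat = re.compile('<img.*?>', re.I)

The ? makes the match non-greedy.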
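Putting the non-greedy pattern into the function from the question might look like the sketch below. It is only an illustration under a few assumptions: the helper name count_img_tags, the re.S flag (so that tags spanning several lines still match) and the UTF-8 fallback are additions that do not appear in the thread. The sketch also decodes the response body, because calling str() on the bytes returned by read() in Python 3 yields the b'...' representation rather than the page text.

```python
import re
import sys
import urllib.error
import urllib.request

# Non-greedy: '.*?' stops at the first '>' after '<img', not the last one.
# re.S (an assumption, not in the original) lets '.' cross line breaks.
IMG_PAT = re.compile('<img.*?>', re.I | re.S)


def count_img_tags(html):
    """Count <img ...> tags in an already decoded HTML string."""
    return len(IMG_PAT.findall(html))


def get_img_cnt(url):
    """Download url and return the number of <img> tags found in it."""
    try:
        with urllib.request.urlopen(url) as response:
            # Use the charset from the HTTP headers, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:
        sys.stderr.write("Couldn't connect to %s: %s\n" % (url, exc))
        sys.exit(1)
    return count_img_tags(html)
```

Calling get_img_cnt('http://www.americascup.com/en/schedules/races') would then use the same URL as the question. A regex still counts tags inside comments or scripts, which is one reason an HTML parser is the more robust choice.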


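Never parse HTML with regular expressions; use an HTML parser such as lxml or BeautifulSoup. The img tag count can be obtained with BeautifulSoup or, as in this working example, with lxml and requests:
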
from lxml import etree
import requests


def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)

    return int(root.xpath('count(//img)'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both the BeautifulSoup and the lxml versions print 106.
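Where neither lxml nor BeautifulSoup is installed, the same parser-based count can be sketched with the standard library's html.parser module. This is a swapped-in alternative rather than one of the snippets from the answer, and the names ImgCounter and count_img_tags are hypothetical.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts img start tags while the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> and <img> both arrive
        # as 'img'. Self-closing <img/> tags also reach this method via the
        # default handle_startendtag implementation.
        if tag == "img":
            self.count += 1


def count_img_tags(html):
    """Return the number of img elements in a decoded HTML string."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

Unlike the regex approach, an img tag that only appears inside an HTML comment is not counted here, because the parser reports comments separately from start tags.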

Hope that helps.

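Ahhh, regular expressions.

Your regex pattern

<img.*>

says "find me something that starts with <img and ends with >". Regular expressions are greedy, though: the pattern fills that .* with as much as it possibly can while still leaving a single > character somewhere afterwards to satisfy the pattern. In this case it goes all the way to the end of the page and says "Look! I found a > right there!"

You should get the correct count by making the .* non-greedy, like this:

<img.*?>

Thanks. I don't understand what the ? is doing.

It tells the regex to stop at the first > it encounters rather than the last one. That way it captures every <img ...> tag separately instead of one big match (which may itself contain other tags).

The ? tells the regex to match the .* part with as few characters as possible rather than as many as possible (which is the default). So, to personify the regex a little longer, it ends the match at the first closing > it sees.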
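The difference between the greedy and the non-greedy pattern can be seen on a small string. The snippet below is a minimal sketch; the sample HTML is made up for illustration and is not taken from the page in the question.

```python
import re

# Hypothetical sample markup with two images on a single line.
html = '<img src="a.png"><p>text</p><img src="b.png">'

# Greedy: '.*' runs to the last '>' in the string, so everything is one match.
greedy = re.findall('<img.*>', html, re.I)

# Non-greedy: '.*?' stops at the first '>' after each '<img'.
lazy = re.findall('<img.*?>', html, re.I)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

The single greedy match spans the whole string, which mirrors the count of one reported in the question.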