
(Python) Trying to isolate some data from a website


Basically, the script downloads images from wallbase.cc's random and toplist pages. Essentially, it looks for a seven-digit string that uniquely identifies each image. It feeds that id into a URL and downloads the image. My only problem seems to be isolating the seven-digit string.

What I'm trying to do is search
If you only want to use the standard library, you can use a regular expression:

import re

pattern = re.compile(r'<div id="thumb(.{7})"')

...

for data_id in re.findall(pattern, the_page):
    pass  # do something with data_id
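As a quick sanity check, here is how that pattern behaves against a small inline HTML fragment. The `<div id="thumb…">` markup and the sample ids are assumptions based on the answer above, not wallbase.cc's actual markup:

```python
import re

# Pattern from the answer above: capture the 7 characters after "thumb"
pattern = re.compile(r'<div id="thumb(.{7})"')

# Hypothetical HTML fragment mimicking the markup the answer assumes
the_page = (
    '<div id="thumb1750539" class="thumb">...</div>'
    '<div id="thumb1234567" class="thumb">...</div>'
)

ids = re.findall(pattern, the_page)
print(ids)  # ['1750539', '1234567']
```

Note that `.{7}` matches any seven characters, not just digits; if the ids are guaranteed numeric, `(\d{7})` would be a tighter pattern.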


You may want to use a web scraping library like BeautifulSoup; see, for example, this discussion of web scraping in Python:

import re
import urllib2
from BeautifulSoup import BeautifulSoup

# download and parse HTML
url = 'http://wallbase.cc/toplist'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

# find the links we want
links = soup('a', href=re.compile(r'^http://wallbase\.cc/wallpaper/\d+$'))
for l in links:
    href = l.get('href')
    print href                # u'http://wallbase.cc/wallpaper/1750539'
    print href.split('/')[-1] # u'1750539'
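If BeautifulSoup is not available, the same link extraction can be sketched with the standard library's `html.parser` (Python 3 here; the wallpaper-URL pattern is taken from the answer above, and `sample_html` is a made-up fragment standing in for the real toplist page):

```python
import re
from html.parser import HTMLParser

# URL pattern from the BeautifulSoup answer above
WALLPAPER_RE = re.compile(r'^http://wallbase\.cc/wallpaper/(\d+)$')

class WallpaperLinkParser(HTMLParser):
    """Collect wallpaper ids from <a href="..."> tags matching the pattern."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        m = WALLPAPER_RE.match(href)
        if m:
            self.ids.append(m.group(1))

# Made-up HTML fragment for demonstration
sample_html = '<a href="http://wallbase.cc/wallpaper/1750539">thumb</a>'
parser = WallpaperLinkParser()
parser.feed(sample_html)
print(parser.ids)  # ['1750539']
```

Unlike a bare regex over the whole page, this only inspects actual `href` attributes of `<a>` tags, which is more robust to markup changes.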


I can't help linking to this: