Screen scraping in Python

python regex

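I am currently trying to screen scrape a website and put the information into a dictionary. I am using urllib2 and BeautifulSoup. I cannot figure out how to parse the page source to get what I want and read it into a dictionary. The information I want appears in the source as

<title>Nov 24 | 8:00AM | Sole In. Peace Out. </title>

I am thinking of using a regular expression to read in the line, converting the time and date to a datetime, and then parsing the line to read the data into a dictionary. The dictionary output should be something along the lines of
[
    {
        "date": datetime(2010, 11, 24, 23, 59),
        "title": "Sole In. Peace Out.",
    }
]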

Current code:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
url = 'http://events.cmich.edu/RssStudentEvents.aspx'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

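Sorry for the wall of text, and thank you for your time and help.
Edit: I didn't realize this is not an HTML page, so see Chris's correction. The following applies to HTML pages.
You can use: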
titleTag = soup.html.head.title

Or:
Take a look here:


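Something like this:
import datetime

# Assuming it's the second title element (I've seen worse in HTML).
titletext = soup.findAll('title')[1].string
# Fields are separated by "|"; strip the padding around each one,
# otherwise strptime rejects the trailing space in "Nov 24 ".
parts = [part.strip() for part in titletext.split("|")]
datetext, title = parts[0], parts[2]
# The feed omits the year, so set it explicitly after parsing.
date = datetime.datetime.strptime(datetext, "%b %d").replace(year=2010)
the_final_dict = {'date': date, 'title': title}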

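findAll() returns all instances of the element you search for, so you can treat the result like any other list.
That's about it :)
Edit: small fixes
Edit 2: fixed based on the comments below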
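Since the result can be treated as a list, skipping the feed's own title comes down to slicing. Here is a minimal sketch of that idea, written for Python 3 and using the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup so it runs without extra packages. The SAMPLE_RSS string is made up for illustration and only mimics the shape of the feed; the variable names are hypothetical too.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the feed's structure (not fetched from the site).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Student Events</title>
    <item><title>Nov 24 | 8:00AM | Sole In. Peace Out. </title></item>
    <item><title>Dec 8 | 8:00AM | Apply to be an FYE Mentor</title></item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# Every <title> in document order; the first one belongs to the channel itself.
all_titles = [el.text for el in root.iter("title")]

# Option 1: treat the result as a list and slice off the first entry.
item_titles_by_slice = all_titles[1:]

# Option 2: only collect <title> elements that sit directly inside an <item>.
item_titles_by_path = [el.text for el in root.findall("./channel/item/title")]

print(item_titles_by_slice)
```

Slicing is the quickest fix, but searching inside the item elements is less fragile if the feed ever gains another top-level title.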
>>> soup.findAll('item')[1].title
<title>Nov 24 | 8:00AM | Sole In. Peace Out. </title>
>>> soup.findAll('item')[1].title.text
u'Nov 24 | 8:00AM | Sole In. Peace Out.'
>>> date, _, title = soup.findAll('item')[1].title.text.rpartition(' | ')
>>> date
u'Nov 24 | 8:00AM'
>>> title
u'Sole In. Peace Out.'
>>> from datetime import datetime
>>> date = datetime.strptime(date, "%b %d | %I:%M%p").replace(year=datetime.now().year)
>>> dict(date=date, title=title)
{'date': datetime.datetime(2010, 11, 24, 8, 0), 'title': u'Sole In. Peace Out.'}

If you want more sophisticated year handling, you can add that. You get the idea.
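One possible take on "more sophisticated year handling" is sketched below for Python 3. It assumes the feed only lists current and upcoming events, so a date that would fall far in the past is moved to the next year. The function name parse_event_date, the today parameter, and the 180-day cutoff are all assumptions made for illustration, not anything the feed specifies.

```python
from datetime import datetime, timedelta


def parse_event_date(text, today=None, max_age_days=180):
    """Parse e.g. 'Nov 24 | 8:00AM', guessing the year the feed leaves out."""
    today = today or datetime.now()
    # Try the current year first, then the next one.
    for year in (today.year, today.year + 1):
        try:
            # Prepend the candidate year so strptime always sees a full date.
            parsed = datetime.strptime("%d %s" % (year, text),
                                       "%Y %b %d | %I:%M%p")
        except ValueError:
            continue  # e.g. "Feb 29" in a non-leap year
        # Accept the date unless it lies too far in the past (assumed cutoff).
        if today - parsed <= timedelta(days=max_age_days):
            return parsed
    raise ValueError("could not place %r in a plausible year" % text)


# In late November, a January event most likely belongs to the following year.
print(parse_event_date("Jan 5 | 9:30AM", today=datetime(2010, 11, 20)))
```

Passing today explicitly keeps the function easy to test; in normal use it can be left out.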
Final addition: a generator would be a good way to package this up.
from datetime import datetime
import urllib2
from BeautifulSoup import BeautifulSoup

def whatevers():
    soup = BeautifulSoup(urllib2.urlopen('http://events.cmich.edu/RssStudentEvents.aspx').read())
    for item in soup.findAll('item'):
        date, _, title = item.title.text.rpartition(' | ')
        yield dict(date=datetime.strptime(date, '%b %d | %I:%M%p').replace(year=datetime.now().year), title=title)

for match in whatevers():
    pass  # Use match['date'], match['title'].  a namedtuple might also be neat here.
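The namedtuple idea from the comment above could look roughly like this. It is a Python 3 sketch that works on a plain list of title strings instead of fetching the feed, so it runs on its own. Event, parse_events, and sample_titles are hypothetical names, and the year is passed in explicitly as an assumption.

```python
from collections import namedtuple
from datetime import datetime

# Lightweight record type: fields are read as event.date and event.title.
Event = namedtuple("Event", ["date", "title"])


def parse_events(titles, year):
    """Yield one Event per title string shaped like 'Nov 24 | 8:00AM | Name'."""
    for text in titles:
        # Split on the last ' | ': the left part is date and time, the right the name.
        date_text, _, title = text.strip().rpartition(" | ")
        date = datetime.strptime("%d %s" % (year, date_text),
                                 "%Y %b %d | %I:%M%p")
        yield Event(date=date, title=title)


# Sample input in the shape the feed uses.
sample_titles = ["Nov 24 | 8:00AM | Sole In. Peace Out. "]
for event in parse_events(sample_titles, year=2010):
    print(event.date, event.title)
```

Compared with a dict, a namedtuple gives attribute access and a fixed set of fields, and it still unpacks like a tuple.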

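The first "title" element is actually the one I want to skip, so how do I do that?
It's not HTML. It's RSS. So soup.html.head.title won't work, and soup.findAll('title') is suboptimal.
Did you look at the page he gave you?
"Wall of text"? My answer is more wall-y ;-)
Have you looked at the mechanize module?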
>>> from datetime import datetime
>>> matches = []
>>> for item in soup.findAll('item'):
...     date, _, title = item.title.text.rpartition(' | ')
...     matches.append(dict(date=datetime.strptime(date, '%b %d | %I:%M%p').replace(year=datetime.now().year), title=title))
... 
>>> from pprint import pprint
>>> pprint(matches)
[{'date': datetime.datetime(2010, 11, 24, 8, 0),
  'title': u'The Americana Indian\u2014American Indian in the American Imagination'},
 {'date': datetime.datetime(2010, 11, 24, 8, 0),
  'title': u'Sole In. Peace Out.'},
...
 {'date': datetime.datetime(2010, 12, 8, 8, 0),
  'title': u'Apply to be an FYE Mentor'}]
