Python: get the first tag when parsing HTML

python html parsing

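I am using Python and BeautifulSoup to parse this web page. In the "On the Menu" section, I want to get the URL of the first link.

Here is the code I am using:

monthly_urls = soup.findAll('div',{'id':'accordion_23473'})[0]('a',href=True)

Right now it is getting the second a tag and I don't know why. I would expect it to get at least both tags, but it only gets the second one.

I want to change the code so that it gets the first tag, or so that I can search based on what the tag says and get it that way.

For the second part, I just mean, for example, if the a tag is

<a>new tag</a>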

I ran your code with no problems:

from BeautifulSoup import BeautifulSoup
import requests
response = requests.get('https://rpi.sodexomyway.com/dining-choices/res/sage.html')
soup = BeautifulSoup(response.text)
#output of your code
print soup.findAll('div',{'id':'accordion_23473'})[0]('a',href=True)

>>> [<a href="#">On the Menu</a>,
     <a href="/images/WeeklyMenuRSDH%209-22-14_tcm1068-29436.htm" target="_blank">
                     9/22/2014 - 9/28/2014</a>,
     <a href="/images/WeeklyMenuRSDH%209-29-14_tcm1068-29441.htm" target="_blank">
                     9/29/2014 - 10/5/2014</a>,
     <a href="#">Hours of Operation</a>]

# now get the href
url = dict(soup.findAll('div',{'id':'accordion_23473'})[0]('a',href=True)[1].attrs)['href']
# output
u'/images/WeeklyMenuRSDH%209-22-14_tcm1068-29436.htm'
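
The answer's code targets Python 2 and BeautifulSoup 3. As a rough cross-check that runs on Python 3 without third-party packages, the same idea (collect the links inside one div, then skip the "#" placeholders) can be sketched with the standard library's html.parser. This is a substitute technique, not the answer's method. The LinkCollector class, the sample HTML, and the shortened hrefs are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) for every <a href=...> inside the div with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # <div> nesting depth inside the target div (0 = outside)
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the target div, or track a nested div once inside it.
            if self.depth or attrs.get("id") == self.div_id:
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self._href, self._text = attrs["href"], []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


# Hypothetical markup shaped like the output shown above (hrefs shortened).
html = """
<div id="accordion_23473">
  <a href="#">On the Menu</a>
  <div><a href="/images/week1.htm" target="_blank">
        9/22/2014 - 9/28/2014</a></div>
  <a href="/images/week2.htm">9/29/2014 - 10/5/2014</a>
  <a href="#">Hours of Operation</a>
</div>
<a href="/outside.htm">outside the accordion</a>
"""

collector = LinkCollector("accordion_23473")
collector.feed(html)

# First link that is not a "#" placeholder, or None if there is none.
first_real = next((href for _, href in collector.links if href != "#"), None)
print(first_real)
```

Filtering on href != "#" avoids depending on the position of the placeholder links, which the [1] index in the answer relies on.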

Update: added a filter for the current week

import datetime

def getUnionMenuUrls(soup):
    monthly_urls = soup.findAll('div',{'id':'accordion_23473'})[0]('a',href=True)[1:3] # cut extra links
    today = datetime.date.today() # compare dates only, so the last day of a range still matches
    url = "https://rpi.sodexomyway.com"
    for tag in monthly_urls:
        if ".htm" in tag['href']:
            name = str(tag.text).strip() # the link text is padded with whitespace
            datestrings = name.split(' - ') # split string and get the list of dates
            # convert datestrings to date objects
            date_range = [datetime.datetime.strptime(d.strip(), '%m/%d/%Y').date() for d in datestrings]
            if date_range[0] <= today <= date_range[1]: # check if today in that range
                return url + tag['href']
    return None # no listed week contains today
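
The date-range check can also be tried in isolation, without fetching the page. The sketch below is Python 3 and assumes the links have already been reduced to (text, href) pairs. The pick_current_week name and the injectable today parameter are hypothetical additions for illustration. It compares plain dates, so the last day of a range still matches, and it uses urljoin instead of string concatenation.

```python
import datetime
from urllib.parse import urljoin

BASE_URL = "https://rpi.sodexomyway.com"  # base URL taken from the answer


def pick_current_week(links, today=None):
    """Return the absolute URL of the link whose date range contains `today`.

    `links` is a list of (text, href) pairs whose text looks like
    "9/22/2014 - 9/28/2014". Returns None when no range matches.
    """
    if today is None:
        today = datetime.date.today()
    for text, href in links:
        parts = [p.strip() for p in text.split(" - ")]
        if len(parts) != 2:
            continue  # not a "start - end" label, e.g. "On the Menu"
        try:
            start, end = (
                datetime.datetime.strptime(p, "%m/%d/%Y").date() for p in parts
            )
        except ValueError:
            continue  # the label does not hold two parseable dates
        if start <= today <= end:
            return urljoin(BASE_URL, href)
    return None


# Link texts and hrefs copied from the output shown in the answer.
links = [
    ("On the Menu", "#"),
    ("\n   9/22/2014 - 9/28/2014", "/images/WeeklyMenuRSDH%209-22-14_tcm1068-29436.htm"),
    ("\n   9/29/2014 - 10/5/2014", "/images/WeeklyMenuRSDH%209-29-14_tcm1068-29441.htm"),
    ("Hours of Operation", "#"),
]

print(pick_current_week(links, today=datetime.date(2014, 9, 28)))
```

Passing today explicitly keeps the function easy to test. In real use it can be left out so that the current date is used.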

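Comments:

Can you give an example of the output you want?

You're right, this must not be the part of the code where the problem is. Let me post my full script. I want to get the first link as my output, or, if possible, search for this week's a tag so that it only gets that link. Does that make sense?

I don't think searching by name is a good idea: binding to the page structure is fine, but binding to the content is not. If you need to get the first link in the menu with minimal pain, use monthly_urls = soup.findAll('div',{'id':'accordion_23473'})[0]('a',href=True)[1:2] # line 54

OK, that gets the first link, but I want to get the current week. How would I do that? I have added a function to my main question that gets the date, and I then want to get the link for this week. Any ideas on how to do that?

You can ignore the function I posted. I am just trying to ask whether there is a way to get this week's link. Does that make sense?

I have updated my answer per your request. Note that it returns only the new url, not [name, url].
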
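For the second part of the question, searching by the link text:

import re
# returns the matching text node; its .parent is the enclosing <a> tag
soup.find(text=re.compile('new tag'))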
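
The text search above can be mirrored on links that have already been extracted, which makes it easier to see what the regular expression matches. This Python 3 sketch assumes (text, href) pairs. The find_href_by_text helper and the sample data are hypothetical.

```python
import re


def find_href_by_text(links, pattern):
    """Return the href of the first (text, href) pair whose text matches `pattern`."""
    regex = re.compile(pattern)
    for text, href in links:
        if regex.search(text):  # search() matches anywhere in the text
            return href
    return None


# Hypothetical sample data; the second label is padded like scraped link text.
links = [
    ("On the Menu", "#"),
    ("  new tag\n", "/example/new.htm"),
    ("Hours of Operation", "#"),
]

print(find_href_by_text(links, r"new\s+tag"))
```

Using \s+ between the words tolerates extra spaces or line breaks inside the link text.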
