Trying to find two partial matches in an attribute value using Beautiful Soup (Python)

python html regex

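This is a follow-up to a question that another user helped me solve earlier. It seemed more sensible to post the follow-up as a separate question so that others can find it more easily.

I have a Python script that uses BeautifulSoup to find some web reports on a hosted service.

Right now the script is very strict, and I would like to make it a bit more flexible. I think a regular expression is what I need, but some kind of nested search might work too. I'm open to suggestions.

My current code works like this: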

def search_table_for_report(table, report_name, report_type):
    # Search the rows of the table for the given report name,
    # then grab the download URL for the given type.
    # The [1:] slice skips the first row, i.e. the headers.
    for row in table.find_all('tr')[1:]:
        col = row.find_all('td')

        if report_name in col[0].string:
            print("----- parse out file type request url")
            report_type = report_type.upper()
            # this works, using an exact match
            label = row.find("input", {"aria-label": "Select " + report_name + " I format " + report_type})
            # this doesn't work, using a regex
            # (note: in a non-raw string, "\b" is a backspace character, not a word boundary)
            #label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")})

            print("----- okay found the right checkbox, now grab the href link ----")
            link_url = label.find_next_sibling("a", href=True)["href"]
            return link_url

It searches a table like the following:

<tr class="odd">
 <td header="c1">
  Report Download
 </td>
 <td header="c2">
  <input aria-label="Select Report I format PDF" id="documentChkBx0" name="documentChkBx" type="checkbox" value="5446"/>
  <a href="/a/document.html?key=5446">
   <img alt="Portable Document Format" src="/img/icons/icon_PDF.gif">
   </img>
  </a>
  <input aria-label="Select Report I format XLS" id="documentChkBx1" name="documentChkBx" type="checkbox" value="5447"/>
  <a href="/a/document.html?key=5447">
   <img alt="Excel Spreadsheet Format" src="/img/icons/icon_XLS.gif">
   </img>
  </a>
 </td>
 <td header="c4">
  04/27/2015
 </td>
 <td header="c5">
  05/26/2015
 </td>
 <td header="c6">
  05/26/2015 10:00AM EDT
 </td>
</tr>

I hope I've explained this clearly. Happy to clarify anything that is confusing.

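I solved this. My BS4 search technique was fine; it just needed a slightly smarter regular expression pattern. It works great using the code below! I don't know how to make this search case-insensitive, but it's good enough for now.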

#build the pattern to search on 
#where report_name and report_type are strings passed into the function
regex_criteria = r'.*' + report_name + r'.*' + report_type

#search the value of the "aria-label" attribute 
#across all the inputs on the page
target_input = row.find("input", {"aria-label": re.compile(regex_criteria)})
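
On the case-insensitivity point, one possible approach is to compile the pattern with the `re.IGNORECASE` flag. The sketch below is only an illustration, not part of the original answer: the helper name `build_label_pattern` is hypothetical, and it also assumes that `report_name` and `report_type` should be matched literally (hence `re.escape`) and that the report type should match as a whole word.

```python
import re

def build_label_pattern(report_name, report_type):
    """Compile a case-insensitive pattern (hypothetical helper).

    Matches report_name followed, anywhere later in the string,
    by report_type as a whole word.
    """
    return re.compile(
        # re.escape keeps characters such as "." or "(" in the inputs literal;
        # the raw strings make r"\b" a word boundary rather than a backspace.
        re.escape(report_name) + r".*\b" + re.escape(report_type) + r"\b",
        re.IGNORECASE,
    )

pattern = build_label_pattern("report", "pdf")
print(bool(pattern.search("Select Report I format PDF")))
print(bool(pattern.search("Select Report I format XLS")))
```

The compiled pattern could then be passed in the same way as above, for example `row.find("input", {"aria-label": build_label_pattern(report_name, report_type)})`. Beautiful Soup filters with the pattern's `search()` method, so a leading `.*` should not be needed.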

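Let me know whether I've understood your question correctly. Your strings have the form Select WORD1 bla bla format WORD2, and you need to extract WORD1 and WORD2 from them. Is that right?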
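
If that reading is correct, named capture groups are one way to pull both words out of a label. This is a minimal sketch under that assumption: `LABEL_RE` and `parse_label` are hypothetical names, and the pattern assumes that WORD1 and WORD2 each consist only of word characters.

```python
import re

# Hypothetical pattern for labels shaped like "Select WORD1 ... format WORD2".
LABEL_RE = re.compile(r"Select\s+(?P<name>\w+).*\bformat\s+(?P<type>\w+)")

def parse_label(label):
    """Return (WORD1, WORD2) from a label, or None if it does not fit the shape."""
    match = LABEL_RE.match(label)
    if match is None:
        return None
    return match.group("name"), match.group("type")

print(parse_label("Select Report I format PDF"))
```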
