Python: Finding the links (href) of specific children in BeautifulSoup

I have a string like this:

[<tr><td><big>Motion Picture Sound Editors, USA</big></td></tr>, <tr><th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th></tr>, <tr><td align="center" rowspan="2" valign="middle"><a href="/Sections/Awards/Motion_Picture_Sound_Editors_USA/2010">2010 </a></td><td align="center" rowspan="2" valign="middle"><b>Nominated</b></td><td align="center" rowspan="2" valign="middle">Golden Reel Award</td><td valign="top">Best Sound Editing - Dialogue and ADR in a Feature Film<a href="/name/nm0613398/">Piero Mura</a> (supervising sound editor)<a href="/name/nm0919527/">Christopher T. Welch</a> (supervising dialogue/adr editor)<a href="/name/nm0270704/">Julie Feiner</a> (dialogue editor)<a href="/name/nm0827953/">Beth Sterner</a> (dialogue editor)<a href="/name/nm2628443/">Judah Getz</a> (adr mixer)</td></tr>, <tr><td valign="top">Best Sound Editing - Music in a Feature Film<a href="/name/nm1084134/">Jen Monnar</a> (supervising music editor)</td></tr>, <tr><td colspan="4"> </td></tr>, <tr><td align="center" bgcolor="#ffffdb" colspan="4" valign="top"></td></tr>]
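For each name, I only want the nm# part of its link. Any idea how I can do this while keeping the name associated with its nm#? (For example, Piero Mura would be associated with nm0613398.)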

So far I have extracted the following from it:

(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Piero Mura')
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Christopher T. Welch')
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Julie Feiner')
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Beth Sterner')
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Judah Getz')
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Music in a Feature Film', u'Jen Monnar')
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Music in a Feature Film', u' (supervising music editor)')
using this code:

    award_rows = award_soup.findAll("tr")
    award_data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in award_rows]
    for data in award_data:
        categ = []
        if data == award_data[0]:
            award_show = ''.join(data[0])
        if len(data) == 4 and data != award_data[0]:
            categ = data[3]
            for cat in categ:
                if cat == '&nbsp;':
                    cat = ''
                if cat != categ[0] and len(categ) != 1 and cat[0:2] != ' (':
                    award_shows.append(award_show)
                    years.append(''.join(data[0]))
                    results.append(''.join(data[1]))
                    awards.append(''.join(data[2]))
                    categories.append(''.join(categ[0].replace('&nbsp;','')))
                    recipients.append(cat)
                    print data
                elif cat != categ[0] and len(categ) == 1:
                    award_shows.append(award_show)
                    years.append(''.join(data[0]))
                    results.append(''.join(data[1]))
                    awards.append(''.join(data[2]))
                    categories.append(''.join(categ[0].replace('&nbsp;','')))
                    recipients.append('')

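You can search for all <a> links whose href contains the substring nm followed by digits. Extract that part and store it in a dictionary keyed by the link text: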

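from bs4 import BeautifulSoup
import re

# 'xmlfile' is a file containing the markup shown in the question
with open('xmlfile', 'r') as f:
    soup = BeautifulSoup(f, 'xml')

data = []
# Select only <a> tags whose href contains "nm" followed by digits
for a in soup.find_all('a', attrs={"href": re.compile(r"nm\d+")}):
    # Pull the nm# out of the href and pair it with the link text
    s = re.search(r'nm\d+', a['href']).group(0)
    data.append({a.text: s})

print(data)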
It produces:

[{'Piero Mura': 'nm0613398'}, 
 {'Christopher T. Welch': 'nm0919527'}, 
 {'Julie Feiner': 'nm0270704'}, 
 {'Beth Sterner': 'nm0827953'}, 
 {'Judah Getz': 'nm2628443'}, 
 {'Jen Monnar': 'nm1084134'}]
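
If a single lookup table is more convenient than a list of one-entry dictionaries, the same idea can be sketched as follows. This is only an illustration under a few assumptions: the `html` string is a shortened copy of the markup from the question wrapped in a `<table>`, the helper name `names_to_ids` is hypothetical, and Python's built-in `html.parser` backend is used instead of the `xml` parser.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: the real page has the same structure).
html = """
<table>
<tr><td valign="top">Best Sound Editing - Dialogue and ADR in a Feature Film<a href="/name/nm0613398/">Piero Mura</a> (supervising sound editor)<a href="/name/nm0919527/">Christopher T. Welch</a> (supervising dialogue/adr editor)</td></tr>
<tr><td valign="top">Best Sound Editing - Music in a Feature Film<a href="/name/nm1084134/">Jen Monnar</a> (supervising music editor)</td></tr>
</table>
"""

# "nm" followed by one or more digits, e.g. nm0613398
NM_ID = re.compile(r"nm\d+")


def names_to_ids(markup):
    """Return a dict mapping each linked name to the nm# found in its href.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    # Passing a compiled regex as the href filter keeps only matching links.
    for a in soup.find_all("a", href=NM_ID):
        result[a.get_text(strip=True)] = NM_ID.search(a["href"]).group(0)
    return result


name_ids = names_to_ids(html)
print(name_ids)
```

With this shape, `name_ids["Piero Mura"]` gives the nm# directly. Note that a plain dictionary keeps only one id per name, so two different people with the same name would overwrite each other; a list of (name, id) tuples avoids that.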