BeautifulSoup: scraping from a JavaScript (encoded) variable


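I am scraping a page and cannot get one of the fields, because it is stored in a JavaScript variable.

My question is: how do I scrape the following code, decode it, and save the `<li>` tag contents, using BeautifulSoup and any other Python library?

Here is the code inside the `<script>` tag: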

    <script type="text/javascript">
        var html_audition_details_sidebar = '    \u003Cdiv id\u003D\u0022apply_wrapper\u0022\u003E        \u003Cdiv class\u003D\u0022header\u0022\u003E            \u003Cp\u003EAudition Information\u003C/p\u003E        \u003C/div\u003E        \u003Cdiv class\u003D\u0022text  \u0022\u003E            \u003Cdiv class\u003D\u0022roleContainer \u0022 style\u003D\u0022color: #999\u003B font\u002Dsize: 14px\u003B\u0022\u003E                \u003Cp\u003EOnly official members can see audition information for this job\u003C/p\u003E            \u003C/div\u003E        \u003C/div\u003E        \u003Cdiv class\u003D\u0022applyButton\u0022\u003E            \u003Cp\u003E\u003Ca class\u003D\u0022applyLink\u0022                                        href\u003D\u0022/accounts/login/apply/41680/\u0022\u003ESubscribe Now                  \u003C/a\u003E\u003C/p\u003E        \u003C/div\u003E    \u003C/div\u003E';
        var html_additional_requirements = '';
        var html_role_listing = '\u003Cdiv class\u003D\u0022text callListing loggedout \u0022\u003E    \u003Cp class\u003D\u0022title\u0022\u003E\u003Ca name\u003D\u0022roles\u0022\u003E\u003C/a\u003ESeeking Talent \u003Cspan class\u003D\u0022optional\u0022\u003ESelect a role below for more information and submission instructions.\u003C/span\u003E\u003C/p\u003E    \u003Cdiv class\u003D\u0022castingRoles\u0022\u003E        \u003Cul\u003E                                \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/martinique\u002D159296/\u0022\u003E                    Martinique  (Lead):                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Female, 18\u002D25, Caucasian                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    native French speaker.                \u003C/p\u003E            \u003C/li\u003E                                            \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/justin\u002D159297/\u0022\u003E                    Justin  (Lead):                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Male, 20\u002D25, All Ethnicities                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    comedy and improv skills, hopeless romantic.                
\u003C/p\u003E            \u003C/li\u003E                                            \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/flower\u002Dshop\u002Dsalesperson\u002D159299/\u0022\u003E                    Flower Shop Salesperson :                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Males \u0026amp\u003B Females, 30+, All Ethnicities                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    impatient.                \u003C/p\u003E            \u003C/li\u003E                                            \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/models\u002D159300/\u0022\u003E                    Models  (Supporting):                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Female, 18\u002D35, All Ethnicities                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    small roles, under five lines.                \u003C/p\u003E            \u003C/li\u003E                            \u003C/ul\u003E    \u003C/div\u003E\u003C/div\u003E';
    </script>
    
to get this:

    /casting/untitled-comedy-short-41680/martinique-159296/

Thanks in advance for your help.

If you want to do this in a generic way, you need a library that parses JavaScript. In this example I'll use slimit.

First, load the data:

    from bs4 import BeautifulSoup as Soup
    import slimit
    from slimit.parser import Parser
    from slimit.visitors import nodevisitor    
    
    # Raw string: on Python 3 a plain literal would decode the \uXXXX escapes too early.
    a = r"""<script type="text/javascript">
        var html_audition_details_sidebar = '    \u003Cdiv id\u003D\u0022apply_wrapper\u0022\u003E        \u003Cdiv class\u003D\u0022header\u0022\u003E            \u003Cp\u003EAudition Information\u003C/p\u003E        \u003C/div\u003E        \u003Cdiv class\u003D\u0022text  \u0022\u003E            \u003Cdiv class\u003D\u0022roleContainer \u0022 style\u003D\u0022color: #999\u003B font\u002Dsize: 14px\u003B\u0022\u003E                \u003Cp\u003EOnly official members can see audition information for this job\u003C/p\u003E            \u003C/div\u003E        \u003C/div\u003E        \u003Cdiv class\u003D\u0022applyButton\u0022\u003E            \u003Cp\u003E\u003Ca class\u003D\u0022applyLink\u0022                                        href\u003D\u0022/accounts/login/apply/41680/\u0022\u003ESubscribe Now                  \u003C/a\u003E\u003C/p\u003E        \u003C/div\u003E    \u003C/div\u003E';
        var html_additional_requirements = '';
        var html_role_listing = '\u003Cdiv class\u003D\u0022text callListing loggedout \u0022\u003E    \u003Cp class\u003D\u0022title\u0022\u003E\u003Ca name\u003D\u0022roles\u0022\u003E\u003C/a\u003ESeeking Talent \u003Cspan class\u003D\u0022optional\u0022\u003ESelect a role below for more information and submission instructions.\u003C/span\u003E\u003C/p\u003E    \u003Cdiv class\u003D\u0022castingRoles\u0022\u003E        \u003Cul\u003E                                \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/martinique\u002D159296/\u0022\u003E                    Martinique  (Lead):                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Female, 18\u002D25, Caucasian                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    native French speaker.                \u003C/p\u003E            \u003C/li\u003E                                            \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/justin\u002D159297/\u0022\u003E                    Justin  (Lead):                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Male, 20\u002D25, All Ethnicities                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    comedy and improv skills, hopeless romantic.                
\u003C/p\u003E            \u003C/li\u003E                                            \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/flower\u002Dshop\u002Dsalesperson\u002D159299/\u0022\u003E                    Flower Shop Salesperson :                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Males \u0026amp\u003B Females, 30+, All Ethnicities                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    impatient.                \u003C/p\u003E            \u003C/li\u003E                                            \u003Cli \u003E                \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/models\u002D159300/\u0022\u003E                    Models  (Supporting):                \u003Cspan class\u003D\u0022roletag\u0022\u003E                    Female, 18\u002D35, All Ethnicities                \u003C/span\u003E                \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E                \u003C/a\u003E                \u003Cp class\u003D\u0022role\u002Ddesc\u0022                   style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E                    small roles, under five lines.                \u003C/p\u003E            \u003C/li\u003E                            \u003C/ul\u003E    \u003C/div\u003E\u003C/div\u003E';
    </script>"""
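    # Name the parser explicitly so the result does not depend on what is installed.
    soup = Soup(a, 'html.parser')
    # The <script> tag has a single string child; .string returns it unchanged.
    js_content = soup.find_all('script')[0].string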
    
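Finally, once the value assigned to `html_role_listing` has been pulled out of the parsed JavaScript into `encoded_html` (the parsing step that does this is not shown here), you can easily decode that string, and the result is valid HTML:

    # encoded_html holds the string literal assigned to html_role_listing.
    # Going through ASCII bytes first works on both Python 2 and Python 3.
    decoded_html = encoded_html.encode('ascii').decode('unicode_escape')
    print(decoded_html)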
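
The `unicode_escape` codec only behaves well on pure ASCII input. As a hedged alternative (not part of the answer above), the sketch below decodes just the `\uXXXX` sequences with a regular expression, so text that already contains non-ASCII characters passes through untouched. The function name `decode_u_escapes` and the sample string are assumptions made for illustration; surrogate pairs and other JavaScript escapes such as `\n` or `\'` are deliberately left alone.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_u_escapes(text):
    """Replace each literal backslash-u escape with the character it encodes."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Short made-up sample in the same style as the page's JavaScript strings.
encoded = r'\u003Cp\u003Ecaf\u00E9 \u0026amp\u003B bar\u003C/p\u003E'
decoded = decode_u_escapes(encoded)
print(decoded)
```

Because only the matched sequences are rewritten, this never raises on unexpected characters, at the cost of ignoring every escape form other than `\uXXXX`.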
    
So parse this HTML in turn:

    role_listing = Soup(decoded_html, 'html.parser')
    output = [ anchor.attrs['href'] for anchor in role_listing.select('li a') ]
    print('---')
    print("\n".join(output))
    
The output looks like this:

    '<div class="text callListing loggedout ">    <p class="title"><a name="roles"></a>Seeking Talent <span class="optional">Select a role below for more information and submission instructions.</span></p>    <div class="castingRoles">        <ul>                                <li >                <a href="/casting/untitled-comedy-short-41680/martinique-159296/">                    Martinique  (Lead):                <span class="roletag">                    Female, 18-25, Caucasian                </span>                <span class="applyNow"> </span>                </a>                <p class="role-desc"                   style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;">                    native French speaker.                </p>            </li>                                            <li >                <a href="/casting/untitled-comedy-short-41680/justin-159297/">                    Justin  (Lead):                <span class="roletag">                    Male, 20-25, All Ethnicities                </span>                <span class="applyNow"> </span>                </a>                <p class="role-desc"                   style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;">                    comedy and improv skills, hopeless romantic.                </p>            </li>                                            <li >                <a href="/casting/untitled-comedy-short-41680/flower-shop-salesperson-159299/">                    Flower Shop Salesperson :                <span class="roletag">                    Males &amp; Females, 30+, All Ethnicities                </span>                <span class="applyNow"> </span>                </a>                <p class="role-desc"                   style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;">                    impatient.                
</p>            </li>                                            <li >                <a href="/casting/untitled-comedy-short-41680/models-159300/">                    Models  (Supporting):                <span class="roletag">                    Female, 18-35, All Ethnicities                </span>                <span class="applyNow"> </span>                </a>                <p class="role-desc"                   style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;">                    small roles, under five lines.                </p>            </li>                            </ul>    </div></div>'
    ---
    /casting/untitled-comedy-short-41680/martinique-159296/
    /casting/untitled-comedy-short-41680/justin-159297/
    /casting/untitled-comedy-short-41680/flower-shop-salesperson-159299/
    /casting/untitled-comedy-short-41680/models-159300/
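
If BeautifulSoup is not available, the `li a` selection can be approximated with the standard library's `html.parser`. This is only a sketch under assumptions: the class name `RoleLinkCollector` and the sample markup are made up for illustration, and the nesting logic is far simpler than a real CSS selector engine.

```python
from html.parser import HTMLParser


class RoleLinkCollector(HTMLParser):
    """Collect the href of every <a> that appears inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0   # how many <li> elements are currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_depth += 1
        elif tag == 'a' and self.li_depth > 0:
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == 'li' and self.li_depth > 0:
            self.li_depth -= 1


# Abbreviated stand-in for the decoded role listing (assumed sample data).
sample = ('<p><a name="roles"></a>Seeking Talent</p><ul>'
          '<li ><a href="/casting/a-1/">A</a></li>'
          '<li ><a href="/casting/b-2/">B</a></li></ul>')

collector = RoleLinkCollector()
collector.feed(sample)
collector.close()
print("\n".join(collector.hrefs))
```

The `<a name="roles">` anchor is skipped because it sits outside any `<li>`, which mirrors what the `li a` selector does.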
    
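No JS parser needed

I'm assuming you know that your `<script>` contents will be formatted as in the example, and that you have already scraped the contents of that tag into a variable called `script_text`.

First, we need to get the value of `html_role_listing`, which can be done with a good ol' regular expression: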

    >>> import re
    >>> html_role_listing_match = re.search(r'var html_role_listing = \'(.+)\';$', script_text, re.MULTILINE)
    >>> html_role_listing = html_role_listing_match.group(1)
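
The same regular expression can be wrapped in a small helper that takes the variable name as a parameter and returns `None` when nothing matches, instead of failing on `.group(1)`. This is a sketch under assumptions: `extract_js_string` is a hypothetical name, the sample `script_text` is abbreviated, and each declaration is assumed to sit on a single line ending in `';`.

```python
import re


def extract_js_string(script_text, name):
    """Return the still-escaped body of `var <name> = '...';`, or None."""
    # re.escape guards against regex metacharacters in the variable name.
    pattern = r"var\s+%s\s*=\s*'(.*)';\s*$" % re.escape(name)
    # re.MULTILINE lets $ match at the end of each line, not just the text.
    match = re.search(pattern, script_text, re.MULTILINE)
    return match.group(1) if match else None


# Abbreviated stand-in for the scraped <script> contents (assumed sample data).
script_text = r"""
    var html_additional_requirements = '';
    var html_role_listing = '\u003Cli\u003E\u003Ca href\u003D\u0022/casting/x\u002D1/\u0022\u003EX\u003C/a\u003E\u003C/li\u003E';
"""

raw = extract_js_string(script_text, 'html_role_listing')
print(raw)
print(extract_js_string(script_text, 'missing_name'))
```

If a declaration is wrapped across several lines, `.` will not cross the line break; adding `re.DOTALL` and a non-greedy group would be one way to handle that, at the risk of matching too much when the script holds several strings.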
    
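Then we take advantage of the fact that `\u003C` and similar escape sequences are just as valid in Python Unicode string literals as they are in JS strings, and parse them with `ast.literal_eval`, a safer version of `eval`: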

    >>> import ast
    >>> roles_html = ast.literal_eval("u'%s'" % html_role_listing)
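
To see why `ast.literal_eval` is the safer choice, the sketch below decodes a short sample and then feeds the same helper a string that tries to smuggle in a function call. The helper name `decode_js_literal` and both inputs are assumptions for illustration; the approach only fits bodies without unescaped single quotes or line breaks, which is what the regular expression above captures.

```python
import ast


def decode_js_literal(body):
    """Decode the escaped body of a single-quoted JS string as a Python literal."""
    # Wrapping the body in quotes turns it into Python string-literal source.
    return ast.literal_eval("'%s'" % body)


raw = r'\u003Cdiv class\u003D\u0022title\u0022\u003ESeeking Talent\u003C/div\u003E'
roles_html = decode_js_literal(raw)
print(roles_html)

# A body that closes the quote and appends a call is rejected, not executed.
hostile = "' + __import__('os').getcwd() + '"
try:
    decode_js_literal(hostile)
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

With plain `eval`, the second input would have run the call; `literal_eval` only accepts literal syntax and raises instead.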
    
To convince yourself that this worked, print the first few characters of the result and check that they have been parsed correctly:

    >>> print(roles_html[:10])
    <div class

Now parse the decoded HTML with BeautifulSoup:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(roles_html, 'html.parser')

and get the `href` attributes of those links:

    >>> links = soup.select('li a')
    >>> for link in links:
    ...     print(link.attrs['href'])
    ... 
    /casting/untitled-comedy-short-41680/martinique-159296/
    /casting/untitled-comedy-short-41680/justin-159297/
    /casting/untitled-comedy-short-41680/flower-shop-salesperson-159299/
    /casting/untitled-comedy-short-41680/models-159300/