Web scraping: scraping a table generated by a script


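I have been trying to scrape a table from a website using Python and Beautiful Soup. The problem is that the table is generated by a script, so the HTML I get back contains only the unrendered template: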

<table class="table table-compact table-striped table-topics">
            <thead>
                <tr>
                    <th data-intro="Clicking a topic will allow you to view and ask general technical questions about the topic through SITIS." data-position="bottom">Topic #</th>
                    <th>Program</th>
                    <th>Component</th>
                    <th>Technology Area</th>
                    <th>Title</th>
                    <th data-intro="If there is SITIS activity for a topic a clickable 'QA' will appear in this column." data-position="bottom">SITIS</th>
                </tr>
            </thead>
            <tbody>
                {{#each this.Results}}
                <tr>
                    <td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicNumber}}</a></td>
                    <td>{{this.ProgramTypeName}}</td>
                    <td>{{this.AgencyName}}</td>
                    <td>

                      <div class="icons">
                        {{#if this.TechAreaAirPlatform}}
                          <i class="glyph-icon flaticon-air-platform" data-toggle="tooltip" title="Technology Area: Air Platform"></i>
                        {{/if}}
                        {{#if this.TechAreaChemBioDefense }}
                          <i class="glyph-icon flaticon-chem-bio-defense" data-toggle="tooltip" title="Technology Area: Chem Bio Defense"></i>
                        {{/if}}
                        {{#if this.TechAreaInfoSystems}}
                          <i class="glyph-icon flaticon-info-systems" data-toggle="tooltip" title="Technology Area: Info Systems"></i>
                        {{/if}}
                        {{#if this.TechAreaGroundSea }}
                          <i class="glyph-icon flaticon-ground-sea" data-toggle="tooltip" title="Technology Area: Ground Sea"></i>
                        {{/if}}
                        {{#if this.TechAreaMaterials}}
                          <i class="glyph-icon flaticon-materials" data-toggle="tooltip" title="Technology Area: Materials"></i>
                        {{/if}}
                        {{#if this.TechAreaBioMedical }}
                          <i class="glyph-icon flaticon-bio-med" data-toggle="tooltip" title="Technology Area: Bio Medical"></i>
                        {{/if}}
                        {{#if this.TechAreaSensors }}
                          <i class="glyph-icon flaticon-sensors" data-toggle="tooltip" title="Technology Area: Sensors"></i>
                        {{/if}}
                        {{#if this.TechAreaElectronics }}
                          <i class="glyph-icon flaticon-electronics" data-toggle="tooltip" title="Technology Area: Electronics"></i>
                        {{/if}}
                        {{#if this.TechAreaBattlespace }}
                          <i class="glyph-icon flaticon-battlespace" data-toggle="tooltip" title="Technology Area: Battlespace"></i>
                        {{/if}}
                        {{#if this.TechAreaSpacePlatforms }}
                          <i class="glyph-icon flaticon-space-platform" data-toggle="tooltip" title="Technology Area: Space Platforms"></i>
                        {{/if}}
                          {{#if this.TechAreaHumanSystems }}
                          <i class="glyph-icon flaticon-human-systems" data-toggle="tooltip" title="Technology Area: Human Systems"></i>
                        {{/if}}
                        {{#if this.TechAreaWeapons }} 
                          <i class="glyph-icon flaticon-weapons" data-toggle="tooltip" title="Technology Area: Weapons"></i>
                        {{/if}}
                        {{#if this.TechAreaNuclear }}
                          <i class="glyph-icon flaticon-nuclear" data-toggle="tooltip" title="Technology Area: Nuclear"></i>
                        {{/if}}
                      </div>
                    </td>
                    <td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicTitle}}</a></td>
                    <td>{{#if this.PublishedQuestionCount}}<a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">Q&A</a>{{/if}}</td>
                </tr>
                {{else}}
                <tr>
                    <td colspan="6"><div class="alert alert-warning">No topics were found.</div></td>
                </tr>
                {{/each}}
            </tbody>
        </table>

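Does anyone know whether scraping this table is still feasible? There is a script tag just before the table, and I am wondering whether it is of any use: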

<script id="topics-template" type="text/x-handlebars-template">


Thanks in advance.

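The suggestion in the comments to use Selenium WebDriver is probably the simplest way to solve your problem. It looks like you are trying to scrape a site that generates its content dynamically from templates. The {{...}} placeholders and the text/x-handlebars-template script type point to Handlebars, which is rendered in the browser rather than on the server.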

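You therefore need to drive a real browser so that everything on the page actually loads, because at the moment you are only getting the static HTML. You can install selenium with your package manager, and you then need a driver for the browser you want to automate: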

pip install selenium
pip install chromedriver
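Note: as far as I know, not every web driver can be installed through your package manager, so you may have to download it manually instead. Recent Selenium releases (4.6 and later) bundle Selenium Manager, which can fetch a matching driver automatically.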

Now you can use this function I wrote to scrape the page you want:

# Purpose: load a URL in a real browser and return the rendered HTML as a string.
# Depends on selenium (to run the page's JavaScript), BeautifulSoup, and time.
# Signature: pull_html_page(url: str, write: bool = False) -> str
import time

from bs4 import BeautifulSoup
from selenium import webdriver


def pull_html_page(url, write=False):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Crude pause so the page's scripts can fill in the table;
        # increase it if the table still comes back empty.
        time.sleep(1)
        content = driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()

    if write:
        # Save a prettified copy of the rendered page for inspection.
        clean = BeautifulSoup(content, "html.parser").prettify()
        with open("out.html", "w", encoding="utf-8") as f:
            f.write(clean)

    return content
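
Once the rendered HTML is in hand, the rows can be pulled out with BeautifulSoup. What follows is only a minimal sketch: the helper name parse_topics_table and the sample HTML are made up for illustration, the column order is taken from the template in the question, and it assumes the rendered page keeps the table-topics class.

```python
from bs4 import BeautifulSoup


def parse_topics_table(html):
    """Return one dict per data row of the rendered topics table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table-topics")
    if table is None:
        return []

    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 6:
            # Skips the header row (th cells only) and the
            # single-cell "No topics were found." row.
            continue
        rows.append({
            "topic_number": cells[0].get_text(strip=True),
            "program": cells[1].get_text(strip=True),
            "component": cells[2].get_text(strip=True),
            # Technology areas are icons; their names live in the title attribute.
            "tech_areas": [i["title"] for i in cells[3].find_all("i", title=True)],
            "title": cells[4].get_text(strip=True),
            "has_qa": cells[5].find("a") is not None,
        })
    return rows


# Hypothetical rendered output, for illustration only.
SAMPLE_HTML = """
<table class="table table-compact table-striped table-topics">
  <thead><tr><th>Topic #</th><th>Program</th></tr></thead>
  <tbody>
    <tr>
      <td><a href="/topics?topicId=1">X-001</a></td>
      <td>Program A</td>
      <td>Agency B</td>
      <td><div class="icons"><i class="glyph-icon" title="Technology Area: Sensors"></i></div></td>
      <td><a href="/topics?topicId=1">Example title</a></td>
      <td></td>
    </tr>
  </tbody>
</table>
"""

topics = parse_topics_table(SAMPLE_HTML)
print(topics)
```

In practice the string returned by pull_html_page would be passed in place of SAMPLE_HTML. If the cells still contain {{...}} placeholders, the page had not finished rendering when the source was captured.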
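If this solution does not work well enough for you, or if you only need the dynamically generated data and nothing from the static HTML, you can often use the browser's inspect tool (I prefer Chrome's) to look at the network traffic. Sometimes you can spot a URL that returns a JSON response. Requesting that URL directly saves you from loading the whole page and gives you the data straight from the response.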
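
The JSON route could look roughly like the sketch below, which uses only the standard library. The endpoint URL is a placeholder (the real one has to be found in the browser's network tab), and the payload shape is an assumption: it supposes the response carries a Results list whose items use the same field names as the template in the question (TopicNumber, ProgramTypeName, AgencyName, TopicTitle). The function names are made up for illustration.

```python
import json
from urllib.request import Request, urlopen

# Placeholder: replace with the URL seen in the browser's network tab.
API_URL = "https://example.com/api/topics"


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def rows_from_payload(payload):
    """Flatten the assumed payload shape into (number, program, component, title) tuples."""
    return [
        (
            item.get("TopicNumber"),
            item.get("ProgramTypeName"),
            item.get("AgencyName"),
            item.get("TopicTitle"),
        )
        for item in payload.get("Results", [])
    ]


# Made-up payload so the sketch runs without network access.
sample = json.loads(
    '{"Results": [{"TopicNumber": "X-001", "ProgramTypeName": "Program A",'
    ' "AgencyName": "Agency B", "TopicTitle": "Example title"}]}'
)
rows = rows_from_payload(sample)
print(rows)
```

With a real endpoint, rows_from_payload(fetch_json(API_URL)) would replace the sample payload. The field names should be checked against an actual response first.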


Good luck!

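I ran into a similar problem using C# and found Selenium WebDriver to be a good solution. It actually opens a browser and runs the JavaScript, after which you can navigate the rendered page through its built-in methods (getElement, innerText, and so on).

Thank you very much for the suggestion.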