Web scraping: scraping a table that is generated by a script
I have been trying to scrape a table from a website using Python and Beautiful Soup. The problem I am running into is that the table is generated by a script, so in the page source it looks like this:
<table class="table table-compact table-striped table-topics">
<thead>
<tr>
<th data-intro="Clicking a topic will allow you to view and ask general technical questions about the topic through SITIS." data-position="bottom">Topic #</th>
<th>Program</th>
<th>Component</th>
<th>Technology Area</th>
<th>Title</th>
<th data-intro="If there is SITIS activity for a topic a clickable 'QA' will appear in this column." data-position="bottom">SITIS</th>
</tr>
</thead>
<tbody>
{{#each this.Results}}
<tr>
<td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicNumber}}</a></td>
<td>{{this.ProgramTypeName}}</td>
<td>{{this.AgencyName}}</td>
<td>
<div class="icons">
{{#if this.TechAreaAirPlatform}}
<i class="glyph-icon flaticon-air-platform" data-toggle="tooltip" title="Technology Area: Air Platform"></i>
{{/if}}
{{#if this.TechAreaChemBioDefense }}
<i class="glyph-icon flaticon-chem-bio-defense" data-toggle="tooltip" title="Technology Area: Chem Bio Defense"></i>
{{/if}}
{{#if this.TechAreaInfoSystems}}
<i class="glyph-icon flaticon-info-systems" data-toggle="tooltip" title="Technology Area: Info Systems"></i>
{{/if}}
{{#if this.TechAreaGroundSea }}
<i class="glyph-icon flaticon-ground-sea" data-toggle="tooltip" title="Technology Area: Ground Sea"></i>
{{/if}}
{{#if this.TechAreaMaterials}}
<i class="glyph-icon flaticon-materials" data-toggle="tooltip" title="Technology Area: Materials"></i>
{{/if}}
{{#if this.TechAreaBioMedical }}
<i class="glyph-icon flaticon-bio-med" data-toggle="tooltip" title="Technology Area: Bio Medical"></i>
{{/if}}
{{#if this.TechAreaSensors }}
<i class="glyph-icon flaticon-sensors" data-toggle="tooltip" title="Technology Area: Sensors"></i>
{{/if}}
{{#if this.TechAreaElectronics }}
<i class="glyph-icon flaticon-electronics" data-toggle="tooltip" title="Technology Area: Electronics"></i>
{{/if}}
{{#if this.TechAreaBattlespace }}
<i class="glyph-icon flaticon-battlespace" data-toggle="tooltip" title="Technology Area: Battlespace"></i>
{{/if}}
{{#if this.TechAreaSpacePlatforms }}
<i class="glyph-icon flaticon-space-platform" data-toggle="tooltip" title="Technology Area: Space Platforms"></i>
{{/if}}
{{#if this.TechAreaHumanSystems }}
<i class="glyph-icon flaticon-human-systems" data-toggle="tooltip" title="Technology Area: Human Systems"></i>
{{/if}}
{{#if this.TechAreaWeapons }}
<i class="glyph-icon flaticon-weapons" data-toggle="tooltip" title="Technology Area: Weapons"></i>
{{/if}}
{{#if this.TechAreaNuclear }}
<i class="glyph-icon flaticon-nuclear" data-toggle="tooltip" title="Technology Area: Nuclear"></i>
{{/if}}
</div>
</td>
<td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicTitle}}</a></td>
<td>{{#if this.PublishedQuestionCount}}<a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">Q&A</a>{{/if}}</td>
</tr>
{{else}}
<tr>
<td colspan="6"><div class="alert alert-warning">No topics were found.</div></td>
</tr>
{{/each}}
</tbody>
</table>
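Does anyone know whether it is still possible to scrape this table? There is a script tag just before the table, and I am wondering whether it is useful: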
<script id="topics-template" type="text/x-handlebars-template">
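Thanks in advance.
The suggestion in the comments to use Selenium WebDriver is probably the simplest way to solve your problem. It looks like you are trying to scrape a site that generates its content dynamically from a template (the text/x-handlebars-template script tag suggests a Handlebars template rendered in the browser, rather than a server-side Django template).
So you need to drive a real browser so that everything on the page is actually loaded, because at the moment you are only getting the static HTML. You can install selenium with your package manager, and then you need to install a driver for the browser you want to automate: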
pip install selenium
pip install chromedriver
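Note: I don't think every web driver can be installed through your package manager, so you may have to download the driver manually from the web.
Now you can use this function I wrote to scrape the page you want: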
# purpose: load a url in a real browser and return the rendered html as a string
# depends on: selenium webdriver (runs the page's scripts), time, and BeautifulSoup
# signature: pull_html_page(url: str, write: bool = False) -> str
import time

from bs4 import BeautifulSoup
from selenium import webdriver


def pull_html_page(url, write=False):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # crude fixed wait so the page's scripts have time to fill in the table
        time.sleep(1)
        content = driver.page_source
    finally:
        # always close the browser, even if loading the page fails
        driver.quit()
    if write:
        # save a prettified copy of the rendered page for inspection
        clean = BeautifulSoup(content, "html.parser").prettify()
        with open("out.html", "w", encoding="utf-8") as f:
            f.write(clean)
    return content
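Once the rendered HTML is available, the rows can be pulled out with Beautiful Soup as usual. The following is only a minimal sketch: the helper name extract_topic_rows is made up for illustration, and it assumes the rendered table keeps the table-topics class shown in the question.

```python
from bs4 import BeautifulSoup


def extract_topic_rows(html):
    """Return one list of cell texts per data row of the topics table."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the rendered table still carries the "table-topics" class
    table = soup.find("table", class_="table-topics")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header rows contain only <th> cells, so they yield no <td> and are skipped
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows
```

It would be called as extract_topic_rows(pull_html_page(url)). If the table still contains the raw {{...}} placeholders at that point, the page probably had not finished rendering and a longer wait may be needed.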
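If this solution is not efficient enough for you, or you only need the dynamically generated data and nothing from the static HTML, you can often use the browser's inspect tool (I prefer the one in Chrome) to watch the network traffic. Sometimes you can spot a URL that returns a JSON response; requesting it directly saves the time of loading the page, and you can take the data straight from that response.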
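As a rough sketch of that second approach: the field names below are taken from the template in the question, but the shape of the payload (a JSON object with a Results list) is an assumption, and the URL has to be whatever you find in the network tab. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def topics_from_payload(payload):
    """Flatten a payload shaped like the template's data into plain dicts."""
    topics = []
    # Assumption: the rows live under a top-level "Results" key,
    # mirroring {{#each this.Results}} in the template
    for item in payload.get("Results", []):
        topics.append({
            "number": item.get("TopicNumber"),
            "program": item.get("ProgramTypeName"),
            "component": item.get("AgencyName"),
            "title": item.get("TopicTitle"),
            "has_qa": bool(item.get("PublishedQuestionCount")),
        })
    return topics
```

With the real endpoint in hand, topics_from_payload(fetch_json(url)) would give one dict per topic without starting a browser at all. Some endpoints also expect particular headers or query parameters, which can be copied from the same network-tab entry.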
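Good luck!
I ran into a similar problem using C# and found Selenium WebDriver to be a good way to solve it. It actually opens a browser and loads the JavaScript, and you can then navigate the rendered page through its built-in methods (getElement, innerText, and so on).
Thank you very much for your suggestions.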