Python: scraping Trip Advisor reviews hidden behind "moreLink"

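I have been building a web scraper with BS4 and I am stuck. I am using Trip Advisor as a test case for other data I plan to collect, but I cannot isolate the tag that holds the "full" review. Here is an example:

Note that the first review is cut off at "The wine list is..." with a "More" icon below it. I can easily isolate the partial reviews, but I have not found a way to get BS4 to pull the full review that appears after a simulated "More" click. What tool does this need? Do I have to switch to Selenium?

The raw element looks like this: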

<span class="partnerRvw">
<span class="taLnk hvrIE6 tr475091998 moreLink ulBlueLinks" onclick="  ta.util.cookie.setPIDCookie(4444); ta.call('ta.servlet.Reviews.expandReviews', {type: 'dummy'}, ta.id('review_475091998'), 'review_475091998', '1', 4444);
  ">
More&nbsp; </span>
<span class="ui_icon caret-down"></span>
</span>
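Looking at the HTML after clicking the "More" link, you can see a dynamically added element (class dyn_full_review) that contains the information I need. See the following: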

<div class="review dyn_full_review inlineReviewUpdate provider0 first newFlag" style="display: block;">
<a name="UR475091998" class=""></a>
<div id="UR475091998" class="extended provider0 first newFlag">
<div class="col1of2">
<div class="member_info">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-SRC_475091998" class="memberOverlayLink" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorwidth="90">
<div class="avatar profile_6875524F623CC948F4F9CA95BB4A9567 ">
<a onclick="">

<img src="https://media-cdn.tripadvisor.com/media/photo-l/0d/97/43/bf/joannecarpenter.jpg" class="avatar potentialFacebookAvatar avatarGUID:6875524F623CC948F4F9CA95BB4A9567" width="74" height="74">
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname mbrName_6875524F623CC948F4F9CA95BB4A9567" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">joannecarpenter</span>
</div>
</div>
<div class="location">
Humble, Texas
</div>
</div>
<div class="memberBadging g10n">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-CONT" class="no_cpu" onclick="ta.util.cookie.setPIDCookie('15984'); requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'review_count');" data-anchorwidth="90">
<div class="levelBadge badge lvl_02">
Level <span><img src="https://static.tacdn.com/img2/badges/20px/lvl_02.png" alt="" class="icon" width="20" height="20/"></span> Contributor </div>
<div class="reviewerBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/rev_03.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 reviews</span> </div>
<div class="contributionReviewBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/Foodie.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 restaurant reviews</span>
</div>
</div>
</div>
</div>
<div class="col2of2">
<div class="innerBubble">
<div class="quote"><a href="/ShowUserReviews-g56010-d470148-r475091998-Chez_Nous-Humble_Texas.html#CHECK_RATES_CONT" onclick="ta.setEvtCookie('Reviews','title','',0,this.href); setPID();" id="r475091998">“<span class="noQuotes">Dinner</span>”</a></div>
<div class="rating reviewItemInline">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s50" width="70" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="ratingDate relativeDate" title="April 12, 2017">Reviewed 3 days ago
<span class="new redesigned">NEW</span> </span>
<a class="viaMobile" href="/apps" target="_blank" onclick="ta.util.cookie.setPIDCookie(24687)">
<span class="ui_icon mobile-phone"></span>
via mobile
</a>
</div>
<div class="entry">
<p>
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
</p>
</div>
<div class="rating-list">
<div class="recommend">
<span class="recommend-titleInline noRatings">Visited April 2017</span>
</div>
</div>
<div class="expanded lessLink">
<span class="taLnk collapse ulBlueLinks no_cpu ">
Less&nbsp;
</span>
<span class="textArrow_more ui_icon caret-up"></span>
</div>
<div id="helpfulq475091998_expanded" class="helpful redesigned white_btn_container ">
<span class="isHelpful">Helpful?</span> <div class="tgt_helpfulq475091998 rnd_white_thank_btn" onclick="ta.call('ta.servlet.Reviews.helpfulVoteHandlerOb', event, this, 'LeJIVqd4EVIpECri1GII2t6mbqgqguuuxizSxiniaqgeVtIJpEJCIQQoqnQQeVsSVuqHyo3KUKqHMdkKUdvqHxfqHfGVzCQQoqnQQZiptqH5paHcVQQoqnQQrVxEJtxiGIac6XoXmqoTpcdkoKAUAAv0tEn1dkoKAUAAv0zH1o3KUK0pSM13vkooXdqn3XmffAdvqndqnAfbAo77dbAo3k0npEEeJIV1K0EJIVqiJcpV1U0Ii9VC1rZlU3XozxbZZxE2crHN2TDUJiqnkiuzsVEOxdkXqi7TxXpUgyR2xXvOfROwaqILkrzz9MvzCxMva7xEkq8xXNq8ymxbAq8AzzrhhzCxbx2vdNvEn2fnwEfq8alzCeqi53ZrgnMrHhshTtowGpNSmq89IwiVb7crUJxdevaCnJEqI33qiE5JGErJExXKx5ooItGCy5wnCTx2VA7RvxEsO3'); ta.trackEventOnPage('HELPFUL_VOTE_TEST', 'helpfulvotegiven_v2');">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_white.png" class="helpful_thumbs_up white">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_green.png" class="helpful_thumbs_up green">
<span class="helpful_text">Thank joannecarpenter</span> </div>
</div>
<div class="tooltips vertically_centered">
<div class="reportProblem">
<span id="ReportIAP_475091998" class="problem collapsed taLnk" onclick="ta.trackEventOnPage('Report_IAP', 'Report_Button_Clicked', 'member'); ta.call('ta.servlet.Reviews.iapFlyout', event, this, '475091998')" onmouseover="if (!this.getAttribute('data-first')) {ta.trackEventOnPage('Reviews', 'report_problem', 'hover_over_flag'); this.setAttribute('data-first', 1)} uiOverlay(event, this)" data-tooltip="" data-position="above" data-content="Problem with this review?">
<img src="https://static.tacdn.com/img2/icons/gray_flag.png" width="13" height="14" alt="">
<span class="reportTxt">Report</span> </span>
</div>
</div>
<div class="userLinks">
<div class="sameGeoActivity">
<a href="/members-citypage/joannecarpenter/g56010" target="_blank" onclick="ta.setEvtCookie('Reviews','more_reviews_by_user','',0,this.href); ta.util.cookie.setPIDCookie(19160)">
See all 5 reviews by joannecarpenter for Humble </a>
</div>
<div class="askQuestion">
<span class="taLnk ulBlueLinks" onclick="ta.trackEventOnPage('answers_review','ask_user_intercept_click' ); ta.load('ta-answers', (function() {require('answers/misc').askReviewerIntercept(this, '470148', 'joannecarpenter', '6875524F623CC948F4F9CA95BB4A9567', 'en', '475091998','Chez Nous', 39151)}).bind(this), true);">Ask joannecarpenter about Chez Nous</span>
</div>
</div>
<div class="note">
This review is the subjective opinion of a TripAdvisor member and not of TripAdvisor LLC. </div>
<div class="duplicateReviewsInline">
<div class="previous">joannecarpenter has 1 more review of Chez Nous</div> <ul class="dupReviews">
<li class="dupReviewItem">
<div class="reviewTitle">
<a href="/ShowUserReviews-g56010-d470148-r453237869-Chez_Nous-Humble_Texas.html#REVIEWS">“Joanne Carpenter”</a>
</div>
<div class="rating">
<span class="rate sprite-rating_ss rating_ss"> <img class="sprite-rating_ss_fill rating_ss_fill ss50" width="50" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="date">Reviewed January 18, 2017</span>
</div>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="large">

</div>
<div class="ad iab_inlineBanner">
<div id="gpt-ad-468x60" class="adInner gptAd"></div>
</div>
</div>

Is there a way for BS4 to handle this for me?

Here is a simple example to get you started:

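from selenium import webdriver
from selenium.webdriver.common.by import By

# The original answer used webdriver.PhantomJS(), which Selenium 4 no longer ships;
# any installed browser driver (Firefox here) is used the same way.
driver = webdriver.Firefox()
url = "https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html"
driver.get(url)

# find_element() is the lookup method; get_element_by_class_name() does not exist.
elem = driver.find_element(By.CLASS_NAME, "taLnk")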
...
You can find more information about these methods here:
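To go one step further than locating the element, here is a minimal sketch of clicking "More" and waiting for the expanded review before handing the HTML to BS4. It assumes Firefox with its driver is installed and that the class names from the question (moreLink, dyn_full_review, entry) are still what the page uses; the function names expand_and_get_html and full_review_paragraphs are made up for this illustration.

```python
from bs4 import BeautifulSoup


def expand_and_get_html(url, timeout=10):
    """Open the page, click the first "More" link, return the updated HTML."""
    # Selenium is imported here so the parsing helper below works without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait until the "More" span from the question is clickable, then click it.
        more = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "span.moreLink"))
        )
        more.click()
        # Wait for the dynamically added container holding the full review.
        wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.dyn_full_review"))
        )
        return driver.page_source
    finally:
        driver.quit()


def full_review_paragraphs(html):
    """Return the text of every expanded review paragraph found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.select("div.dyn_full_review div.entry > p")]
```

Calling full_review_paragraphs(expand_and_get_html(url)) would then give the expanded text, provided the selectors still match the live page.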
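Most likely you will also need to inspect several of these pages to see how the HTML varies. For the example you provided, and assuming you can obtain it by simulating the click, the following code selects the paragraph you seem to want: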

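from bs4 import BeautifulSoup

# temp.htm holds the page HTML saved after the "More" click
with open('temp.htm') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# the full review text is the <p> directly inside the element with class "entry"
para = soup.select('.entry > p')
print(para[0].text)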
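Result:

Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!


Note that there are line breaks before and after the paragraph.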
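If those surrounding line breaks get in the way, get_text(strip=True) drops them. A small sketch on a shortened copy of the markup from the question (the review text is abbreviated here, and html.parser is used instead of lxml only to avoid the extra dependency):

```python
from bs4 import BeautifulSoup

# Shortened copy of the "entry" block from the expanded review.
html = """<div class="entry">
<p>
Our favorite restaurant in Houston.
</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
para = soup.select_one(".entry > p")

raw = para.text                      # keeps the newlines inside the <p>
clean = para.get_text(strip=True)    # leading/trailing whitespace removed

print(repr(raw))
print(repr(clean))
```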

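BS4 is just an HTML parser; if you need to interact with the page to get the elements you want, then yes, you need a browser driver such as Selenium.

As far as I know, BS4 is just an HTML parser, so you will need something additional or different to handle this, since the extra data is probably loaded via Ajax. I see two approaches: you can inspect in the browser what the Ajax call is and reproduce it in your code, or you can use something like phantomjs or casperjs to load the whole page for you. The former is probably simpler, unless you expect to pull a lot of different dynamic data from these pages. And of course the usual disclaimer applies about legal restrictions on what you can do with the data you scrape.

What information do you need? Is it contained in the HTML you get when you click "More"?

@BillBell, exactly. After I click "More", HTML is added dynamically and a paragraph appears with the longer review. Thanks!

Thanks for the Selenium solution.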
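For the first approach from the comments (reproducing the Ajax call), a possible starting point is to collect the IDs of the truncated reviews, since the onclick handler in the question passes 'review_475091998' to the expand call. This is only a sketch: the helper name truncated_review_ids is made up, and the actual request URL and parameters are not shown in the question, so they would still have to be read from the browser's network tab.

```python
import re

from bs4 import BeautifulSoup

# The "More" element from the question, with the onclick handler shortened.
SNIPPET = """<span class="partnerRvw">
<span class="taLnk hvrIE6 tr475091998 moreLink ulBlueLinks" onclick="ta.call('ta.servlet.Reviews.expandReviews', {type: 'dummy'}, ta.id('review_475091998'), 'review_475091998', '1', 4444);">
More&nbsp; </span>
<span class="ui_icon caret-down"></span>
</span>"""


def truncated_review_ids(html):
    """Return the numeric review IDs referenced by every "More" link."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for link in soup.select("span.moreLink"):
        # The ID appears in the handler as review_<digits>.
        match = re.search(r"review_(\d+)", link.get("onclick", ""))
        if match:
            ids.append(match.group(1))
    return ids


print(truncated_review_ids(SNIPPET))  # ['475091998']
```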