Scrape nested HTML from search box results with Python
Tags: python, web-scraping, beautifulsoup, css-selectors

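I am trying to understand how to get a list of URLs from the search box results of a website written in Cyrillic. This is the results page, with the search term "imk 1-1251":

I am trying to get only the URLs that sit under tags like this one: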

<div class="ttl mb0"><a href="/notifikacii/2020/03/24/greta-tunberg-veroiatno-bila-bolna-ot-covid-19.443414">Грета Тунберг "вероятно" била болна от COVID-19</a></div>
With BeautifulSoup's find_all('a') I get every link on the page, but I need only the search results and nothing else.

A complete code answer would be most helpful.

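from bs4 import BeautifulSoup
import requests

r = requests.get(
    "https://www.dnes.bg/search.php?q=%EA%EE%F0%EE%ED%E0%E2%E8%F0%F3%F1")
soup = BeautifulSoup(r.content, 'html.parser')
# r.url[:19] is "https://www.dnes.bg"; it is prepended to each relative href.
# The selector matches <a> tags nested in <div class="ttl mb0">.
urls = [f"{r.url[:19]}{item.get('href')}" for item in soup.select(
    "div.ttl.mb0 a")]
print(urls)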
Output:

['https://www.dnes.bg/notifikacii/2020/03/24/greta-tunberg-veroiatno-bila-bolna-ot-covid-19.443414', 'https://www.dnes.bg/koronavirusat/2020/03/24/kitai-otpuska-merkite-a-evropa-i-sasht-zatiagat-rejima.443411', 'https://www.dnes.bg/mish-mash/2020/03/24/etiket-po-vreme-na-koronavirus-ne-pliuite-i-spazvaite-distanciia.443348', 'https://www.dnes.bg/eu/2020/03/24/jivotyt-v-shveciia-na-pylni-oboroti-koronavirus-li.443384', 'https://www.dnes.bg/akoshtete-vqrvaite/2020/03/24/pri-izolaciia-5-syveta-protiv-preiajdane.443357', 'https://www.dnes.bg/akoshtete-vqrvaite/2020/03/24/po-vreme-na-pandemiia-zashto-panicheski-se-prezapasiavame.443402', 'https://www.dnes.bg/obshtestvo/2020/03/24/v-kriza-podkrepiame-merkite-i-vlastta-strah-ni-e-ot-bezrabotica.443342', 'https://www.dnes.bg/notifikacii/2020/03/24/bolnite-v-italiia-namaliavat-no-bolnicite-vse-oshte-sa-pretovareni.443409', 'https://www.dnes.bg/obshtestvo/2020/03/24/bolnite-ot-koronavirus-u-nas-veche-sa-218.443395', 'https://www.dnes.bg/cars/2020/03/24/avtomobilnite-kompanii-shte-zapochnat-da-proizvejdat-ventilatori.443330', 'https://www.dnes.bg/stranata/2020/03/24/deteto-s-pnevmoniia-v-tyrnovskata-bolnica-bez-vaksini.443405', 'https://www.dnes.bg/koronavirusat/2020/03/24/razrabotiha-inhalator-za-cialostno-lechenie-sreshtu-koronavirus.443295', 'https://www.dnes.bg/koronavirusat/2020/03/24/kiril-domuschiev-prebori-koronavirusa-veche-e-dobre.443401', 'https://www.dnes.bg/koronavirusat/2020/03/24/matematicheski-model-shte-pokazva-licata-pod-karantina-v-burgas.443389', 'https://www.dnes.bg/koronavirusat/2020/03/24/osma-jertva-vze-koronavirusyt-v-rumyniia.443331', 'https://www.dnes.bg/balkani/2020/03/24/syrbiia-nastypva-sreshtu-covid-19-s-masovi-testove.443376', 'https://www.dnes.bg/koronavirusat/2020/03/24/blizo-do-kitai-a-samo-1128-zarazeni-s-koronavirus-kak-go-postigna-iaponiia.443374', 'https://www.dnes.bg/sport/2020/03/24/oficialno-olimpiiskite-igri-shte-se-provedat-prez-2021-g.443380', 
'https://www.dnes.bg/sport/2020/03/24/bez-tenis-i-sport-kak-se-podgotvia-viktoriia-tomova-vkyshti.443321', 'https://www.dnes.bg/koronavirusat/2020/03/24/vinovnikyt-za-pandemiiata-ot-covid-19-globalizaciiata.443316']
Please check it.
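
The `f"..."` and `[:19]` pieces of the answer can be looked at in isolation. The following is a minimal sketch, not part of the original answer: the names `page_url`, `href`, `base`, `sliced` and `joined` are made up for illustration, and `urljoin` from the standard library is shown as an assumed alternative that avoids counting characters by hand.

```python
from urllib.parse import urljoin

page_url = "https://www.dnes.bg/search.php?q=%EA%EE%F0%EE%ED%E0%E2%E8%F0%F3%F1"
href = "/notifikacii/2020/03/24/greta-tunberg-veroiatno-bila-bolna-ot-covid-19.443414"

# Slicing: page_url[:19] keeps the first 19 characters, i.e. scheme and host.
base = page_url[:19]

# f-string: each expression inside {} is evaluated and placed into the string.
sliced = f"{base}{href}"

# urljoin resolves a relative href against the page URL, whatever the host length.
joined = urljoin(page_url, href)

print(base)
print(sliced == joined)
```

The slice only works because this particular host happens to be 19 characters long including the scheme, so `urljoin` is probably the safer choice when the same code is reused on another site.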

Another solution:

from simplified_scrapy import SimplifiedDoc, req, utils
url = 'https://www.dnes.bg/search.php?q=%EA%EE%F0%EE%ED%E0%E2%E8%F0%F3%F1'
html = '''
<div class="ttl mb0"><a href="/notifikacii/2020/03/24/greta-tunberg-veroiatno-bila-bolna-ot-covid-19.443414">Грета
    Тунберг "вероятно" била болна от COVID-19</a></div>
'''
doc = SimplifiedDoc(html)
urls = doc.selects('div.ttl mb0').a
urls = [(utils.absoluteUrl(url,u.href),u.text) for u in urls]
print (urls)

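Comments:

Thanks, it works, but I still cannot figure out code elements like f", :19 and other parts that are not explicit. Say I want to get a different tag in a nested structure, what should I follow? For example, I want to extract the title.

Hi, good day dear αԋɱҽԃαМєιcαη. I am a complete beginner on this site. Many thanks for your solution, I can see that it works well for me too. By the way, I tested my setup with the working example, and your solution works and is a great learning asset. Keep up the great work, it rocks ;)

@Julian: what you are asking comes down to not knowing the basics of the language you are using, for example format strings and slicing strings. If you search the community, you will find that your question has been fully answered thousands of times. Answering duplicate questions every day would not make us more productive.

@αԋɱҽԃαМєιcαη: obviously I am not an expert in this, and I could not find anything online that applies directly to my case. However, I believe this is an easy job for someone familiar with the matter. On your comment: I did not find the reason this code works, either here or in the linked explanations, so they do not help me understand how to get a nested tag and isolate it from the other tags. After all, nobody made you answer, and the more often a question is "duplicated", the easier it becomes for anyone to find it in the future.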
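
The follow-up about picking a different piece of a nested structure (for example the title text rather than the URL) could be approached as in the sketch below. It is an illustration under assumptions, not the answerer's code: the `html` sample, the extra `/menu` link and the name `base_url` are hypothetical, and only the `div.ttl.mb0 a` selector comes from the answer.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical sample: one unrelated link plus one search result.
html = '''
<a href="/menu">Menu</a>
<div class="ttl mb0"><a href="/notifikacii/2020/03/24/greta-tunberg-veroiatno-bila-bolna-ot-covid-19.443414">Грета Тунберг "вероятно" била болна от COVID-19</a></div>
'''
base_url = "https://www.dnes.bg/search.php"  # assumed page address

soup = BeautifulSoup(html, "html.parser")

# "div.ttl.mb0 a" means: any <a> nested inside a <div> that has both
# the "ttl" and "mb0" classes. Links outside such a <div> are skipped.
results = [
    (urljoin(base_url, a["href"]), a.get_text(strip=True))
    for a in soup.select("div.ttl.mb0 a")
]
print(results)
```

Each matched tag exposes its attributes through `a["href"]` and its visible text through `a.get_text()`, so switching from URLs to titles is a matter of reading a different property of the same element. To target another nested tag, the selector string is the part to change.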
Output of the simplified_scrapy snippet:
[('https://www.dnes.bg/notifikacii/2020/03/24/greta-tunberg-veroiatno-bila-bolna-ot-covid-19.443414', 'Грета Тунберг "вероятно" била болна от COVID-19')]