Python/bs4: Span tag inside div - text extraction


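I am extracting text from a div tag. The catch is that inside the div there is a closing </span> tag with no matching opening tag. If I do this:
raw = soup.find('div', class_='inside').text
I only get the text that comes before the stray </span> tag.

For example: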

<div class='inside'><div>sth0</div><div>sth1</div></span><div>sth2<div></div>

soup.find('div', class_='inside').text

>>> sth0  sth1 
I get

Katalóg   Obchody a veľkoobchod
instead of:

Katalóg   Obchody a veľkoobchod   Stavebniny   Izolačný materiál...
This is part of my code:

Ing. Milan Kalafut


Maybe I am overlooking something.

I am doing this in Python 2.7 and Python 3.3.
sth0sth1sth2
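Thanks for your answer. I have appended the main part of the problem to my question.
This may be a parser problem. Check this and try using different parsers.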
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as BS

html_text = '<div class="inside"><div class="inside2"><a  href="/katalog/" style="font-size:12px"  title="Katalóg"><span>Katalóg</span></a> <span class="sipka s1">&nbsp;</span> <a  href="/katalog/obchody-a-velkoobchod/" style="font-size:12px"  itemprop="url"  title="Obchody a veľkoobchod"><span itemprop="title" >Obchody a veľkoobchod</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child"  itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a  href="/katalog/stavebniny_1/" style="font-size:12px"  itemprop="url"  title="Stavebniny"><span itemprop="title" >Stavebniny</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child"  itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a  href="/katalog/izolacny-material/" style="font-size:12px"  itemprop="url"  title="Izolačný materiál"><span itemprop="title" >Izolačný materiál</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child"  itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a  href="/katalog/protipoziarne-izolacie/" style="font-size:12px"  itemprop="url"  title="Protipožiarne izolácie"><span itemprop="title" >Protipožiarne izolácie</span></a></span> <span class="sipka s1">&nbsp;</span> Ing. Milan Kalafut</div></div></div><div id="main"><div id="content"><div  itemscope itemtype="http://schema.org/LocalBusiness"  class="business-container"><div id="lavy"><div class="foto s3"><img src="http://s.aimg.sk/katalog/css/images/nologo.gif" alt="Logo nieje k dispozícii" /></div><div id="moznosti">'

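# Alternatively, read the same HTML from a file (a.html holds the markup above):
# html_text = open("a.html", 'r').read()

# The original call named no parser, so bs4 picked whichever one was installed.
# 'html.parser' is an assumption made here to keep the result reproducible;
# parsers differ in how they repair the stray </span>.
firmHtml = BS(html_text, 'html.parser')
raw = firmHtml.find('div', class_='inside').text

print(raw)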
Output:

Katalóg   Obchody a veľkoobchod   Stavebniny   Izolačný materiál   Protipožiarne izolácie   Ing. Milan Kalafut
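
The suggestion to try different parsers can be checked directly. The following is only a sketch and not part of the original answer: the helper name text_by_parser is hypothetical, and it assumes that lxml and html5lib may or may not be installed, so any parser that is missing is skipped. It runs the simplified markup from the question through each available parser and collects the text of the div with class 'inside'.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The simplified markup from the question, including the stray </span>.
HTML = "<div class='inside'><div>sth0</div><div>sth1</div></span><div>sth2<div></div>"


def text_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: text of div.inside} for each installed parser.

    Hypothetical helper for illustration; the parser list is an assumption.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        div = soup.find("div", class_="inside")
        results[name] = div.get_text() if div is not None else None
    return results


if __name__ == "__main__":
    for name, text in text_by_parser(HTML).items():
        print("%-12s %r" % (name, text))
```

If the printed text differs between parsers, the truncation comes from how a particular parser repairs the unmatched </span>, and naming a parser explicitly in the BeautifulSoup call makes the result reproducible across machines.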