How do I extract HTML as text in BeautifulSoup?

python, html, html-parsing, beautifulsoup

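I'm using code that goes through an HTML page and tries to get the data I need with BeautifulSoup. Everything looks fine, but I've hit a wall and I'm stuck.

What I need to do is extract the value 9h7a2m from this line:

D: string-1.string2 15030 9h7a2m string3

The result I get is: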

<p>D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string</p>
<p><span id="more-1203"></span></p>
<p>D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
<p>pinging test is positive but no works</p>
<p>how much time are online?</p>
<p><input aria-required="true" id="author" name="author" size="22" tabindex="1" type="text" value=""/>
<label for="author"><small>Name (required)</small></label></p>
<p><input aria-required="true" id="email" name="email" size="22" tabindex="2" type="text" value=""/>
<label for="email"><small>Mail (will not be published) (required)</small></label></p>
<p><input id="url" name="url" size="22" tabindex="3" type="text" value=""/>
<label for="url"><small>Website</small></label></p>
<p><textarea cols="100%" id="comment" name="comment" rows="10" tabindex="4"></textarea></p>
<p><input id="submit" name="submit" tabindex="5" type="submit" value="Submit Comment"/>
<input id="comment_post_ID" name="comment_post_ID" type="hidden" value="41"/>
<input id="comment_parent" name="comment_parent" type="hidden" value="0"/>
</p>
<p style="display: none;"><input id="akismet_comment_nonce" name="akismet_comment_nonce" type="hidden" value="1709964457"/></p>
<p style="display: none;"><input id="ak_js" name="ak_js" type="hidden" value="99"/></p>

Thanks in advance.

You can extract it with a regular expression:

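import re
from bs4 import BeautifulSoup

data = """your html here"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')

# find the first "p" element and take the text node before its first <br/>
s = soup.find('p').br.previous_sibling
# raw string keeps the regex escapes intact; group 1 is the wanted value
match = re.search(r'string-1\.string2 \d+ (\w+) string3\.string', s)
print(match.group(1))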
This prints:
9h7a2m


UPD (for the real website):


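I get this error: print match.group(1) AttributeError: 'NoneType' object has no attribute 'group'

@Al1nuX It does work :) I tested the code on the input you provided. The actual input you have may be different.

But what I pasted is the exact output I get from the code above. Strangely, when I try your code on its own it works fine, but when I add it to my code it doesn't. I also need to print the whole line first and then extract the word, e.g. D: string-1.string2 15030 9h7a2m string3

@Al1nuX What is the value of s when you run the code?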
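The AttributeError reported above is what you get when re.search finds nothing and returns None. Here is a minimal sketch of one way to guard against that and to print the whole line before the extracted word. The helper name extract_token and the sample lines are made up for illustration, and the pattern is a slightly looser variant of the one in the answer (it stops at "string3"), so treat it as an assumption about the input rather than a tested fix for the real page.

```python
import re

# Looser variant of the answer's pattern: the capture group is the token
# between the number and "string3" (the trailing ".string" is not required).
PATTERN = re.compile(r'string-1\.string2 \d+ (\w+) string3')


def extract_token(line):
    """Return the captured token, or None when the line does not match."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


# Hypothetical sample rows, copied from the output shown in the question.
lines = [
    'D: string-1.string2 15030 9h7a2m string3.string',
    'pinging test is positive but no works',  # does not match the pattern
]

for line in lines:
    token = extract_token(line)
    if token is None:
        continue  # skip rows without the expected layout instead of crashing
    print(line)   # the whole line first
    print(token)  # then the extracted word
```

Checking the match before calling group(1) also makes it easy to log the rows that fail, which helps answer the question of what s actually contains.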
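from urllib.request import urlopen
from bs4 import BeautifulSoup

data = urlopen('your URL here')
soup = BeautifulSoup(data, 'html.parser')

# container that holds the paragraphs with the data rows
entry = soup.find('div', class_="entry")

for p in entry.find_all('p'):
    for row in p.find_all(string=True):  # every text node inside the paragraph
        try:
            # the value is the second-to-last space-separated field
            print(row.split(' ')[-2])
        except IndexError:
            continue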
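
The loop above takes the second-to-last field of every text node, so it also prints a word from unrelated paragraphs such as the comment text. A rough sketch of a stricter variant follows, run on a trimmed copy of the HTML from the question. The "D:" prefix check is an assumption about what marks a data row, and the div with class "entry" is reconstructed by hand here, so adjust both to the real page.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-built sample based on the output in the question.
html = """<div class="entry">
<p>D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string</p>
<p>how much time are online?</p>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
entry = soup.find('div', class_='entry')

tokens = []
for p in entry.find_all('p'):
    for row in p.find_all(string=True):
        parts = row.split()  # split on any whitespace, dropping newlines
        # Assumption: data rows start with "D:" and have at least two fields.
        if len(parts) < 2 or parts[0] != 'D:':
            continue
        print(row.strip())      # the whole line first
        print(parts[-2])        # then the second-to-last field
        tokens.append(parts[-2])
```

Using split() without an argument avoids empty fields when a text node starts with the newline left behind by a br tag.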