How do I extract the text after a <br> in Python?
I am using Python 2.7.8 and am a bit surprised, because I get all of the text except the text that comes after the last <br>. This is my HTML page:
<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p> <!--not getting-->
<p> more </p>
<p>Which of the following is true for variable names in C?<br>
a) They can contain alphanumeric characters as well as special characters<br>
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br>
c) Variable names cannot start with a digit<br>
d) Variable can be of any length</p> <!--not getting -->!
</div>
</body>
</html>
Output:
Found:
a) int number;
Found:
b) float rate;
Found:
c) int variable_count;
Found:
a) They can contain alphanumeric characters as well as special characters
Found:
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found:
c) Variable names cannot start with a digit
However, I do not get the last piece of text in each paragraph, for example:
d) int $main
and
d) Variable can be of any length
which come after the last <br>.
The output I want is:
Found:
a) int number;
Found:
b) float rate;
Found:
c) int variable_count;
Found:
d) int $main
Found:
a) They can contain alphanumeric characters as well as special characters
Found:
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found:
c) Variable names cannot start with a digit
d) Variable can be of any length
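This happens because BeautifulSoup coerces the text into valid XML by closing the <br> tags just before the </p> tag. The prettified version makes this clear: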
<p>
Which of the following is not a valid C variable name?
<br>
a) int number;
<br>
b) float rate;
<br>
c) int variable_count;
<br>
d) int $main;
</br>
</br>
</br>
</br>
</p>
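The result is then as expected:
Found
a) int number;
Found
b) float rate;
Found
c) int variable_count;
Found
d) int $main;
Found
a) They can contain alphanumeric characters as well as special characters
Found
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found
c) Variable names cannot start with a digit
Found
d) Variable can be of any length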
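For comparison, here is a minimal sketch that avoids the nested <br> problem entirely by using only the standard library's HTMLParser. It is not the method from the answer above, and it is written for Python 3 rather than 2.7.8. The class name TextAfterBr, the helper text_after_br and the sample string are made up for illustration. The idea is simply to remember when a <br> was just seen and to keep the next non-empty text node.

```python
from html.parser import HTMLParser


class TextAfterBr(HTMLParser):
    """Collect the text node that directly follows each <br> tag."""

    def __init__(self):
        super().__init__()
        self.after_br = False  # True right after a <br> was seen
        self.found = []        # text pieces that followed a <br>

    def handle_starttag(self, tag, attrs):
        # Only text immediately after a <br> is of interest.
        self.after_br = (tag == "br")

    def handle_endtag(self, tag):
        # A self-closing <br/> also triggers this callback, so ignore it here;
        # closing any other element (e.g. </p>) ends the current piece.
        if tag != "br":
            self.after_br = False

    def handle_data(self, data):
        text = data.strip()
        if self.after_br and text:
            self.found.append(text)
            self.after_br = False


def text_after_br(markup):
    """Return the stripped text that follows each <br> in markup."""
    parser = TextAfterBr()
    parser.feed(markup)
    parser.close()
    return parser.found


# Hypothetical sample, cut down from the page in the question.
sample = """<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>"""

for piece in text_after_br(sample):
    print("Found:", piece)
```

Because the parser never builds a tree, the text after the last <br> is handled the same way as the others, whether the tag is written as <br> or <br/>.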
You can use requests instead of urllib2, and extract the content with the html module of lxml:
from lxml import html
import requests
#request page
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")
#get content in html format
page_content=html.fromstring(page.content)
#recover all text from <p> elements
items=page_content.xpath('//p/text()')
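The code above returns a list of all the text contained in the <p> elements of the document. With it, you only need to index into the list to print what you want.
Add more print statements. When you continue, print what you skipped. Put an else clause on the if statement and print what you skipped.
OK, I am trying...
Why are you still doing it the old way instead of the way I suggested?
I am running into some problems with it because my code is much larger. I solved my last problem thanks to the small cause you pointed out. But here I hit the same situation with your solution too, and I get this: IndexError: list index out of range
@user3440716: It is hard to say without your real input. I guess it is because of br.contents[0]. My last edit should fix it.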
...
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    if len(br.contents) > 0:  # avoid errors if a tag is correctly closed as <br/>
        print 'Found', br.contents[0]