How do I extract the text after a br in Python?


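I am using Python 2.7.8 and am a bit surprised: I get all of the text except the text that comes after the last <br> in each paragraph. My HTML page looks like this: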

<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>

<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>   <!--not getting-->

<p> more </p>

<p>Which of the following is true for variable names in C?<br>
a) They can contain alphanumeric characters as well as special characters<br>
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br>
c) Variable names cannot start with a digit<br>
d) Variable can be of any length</p> <!--not getting -->!

</div>
</body>
</html>
Output:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit
However, I am not getting the last piece of text in each paragraph, for example:

 d) int $main
    and 
 d) Variable can be of any length  
Both of these come after the last <br>.

The result I want is:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;
Found:
d) int $main

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit
d) Variable can be of any length

This is because BeautifulSoup forces the text into valid XML by closing the <br> tags just before the </p>. The prettified version makes this clear:

<p>
 Which of the following is not a valid C variable name?
 <br>
  a) int number;
  <br>
   b) float rate;
   <br>
    c) int variable_count;
    <br>
     d) int $main;
    </br>
   </br>
  </br>
 </br>
</p>
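Reading br.contents[0] for each <br> therefore gives, as expected:

Found
a) int number;
Found
b) float rate;
Found
c) int variable_count;
Found
d) int $main;
Found
a) They can contain alphanumeric characters as well as special characters
Found
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found
c) Variable names cannot start with a digit
Found
d) Variable can be of any length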
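
The underlying idea (each piece of text belongs to the <br> that precedes it) can also be sketched without BeautifulSoup. The following is a minimal illustration using only the standard library's html.parser, written for Python 3 rather than the 2.7.8 used in the question. It is a swapped-in technique, not the code from the answer; the class name BrTextParser and the SAMPLE string (copied from the question's HTML) are assumptions made for the example.

```python
from html.parser import HTMLParser


class BrTextParser(HTMLParser):
    """Collect the text that directly follows each <br> inside a <p>.

    BrTextParser is a hypothetical helper name, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.in_p = False      # True while inside a <p> element
        self.after_br = False  # True once a <br> was seen in the current <p>
        self.found = []        # one entry per <br>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.after_br = False
        elif tag == "br" and self.in_p:
            self.after_br = True
            self.found.append("")  # start a new chunk for this <br>

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
            self.after_br = False

    def handle_data(self, data):
        # Text before the first <br> of a paragraph is ignored on purpose.
        if self.in_p and self.after_br:
            self.found[-1] += data


# One paragraph taken from the question's HTML.
SAMPLE = """<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>"""

parser = BrTextParser()
parser.feed(SAMPLE)
parser.close()

# Strip the newline that follows each <br> and drop empty chunks.
options = [text.strip() for text in parser.found if text.strip()]
for option in options:
    print("Found:", option)
```

Because the parser only tracks "inside a <p>" and "after a <br>", the text after the last <br> is collected like any other chunk, regardless of how the tags would be nested in a tree.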
You can use requests instead of urllib2 and extract the content with the html module of lxml:

from lxml import html
import requests

#request page
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")

#get content in html format
page_content=html.fromstring(page.content)

#recover all text from <p> elements
items=page_content.xpath('//p/text()')
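The code above returns a list of all the text in the document that is contained in <p> elements.

With that, you can simply index into the list to print what you need.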
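
Since the list also holds strings that are not answer options (the question text, " more ", and so on), filtering can be easier than indexing by position. Here is a minimal sketch in Python 3, assuming every option starts with a lowercase letter followed by ")". The items list below is hand-written to approximate what the XPath query might return for the question's HTML; it was not fetched from the site, and option_lines is a hypothetical helper name.

```python
import re

# Hand-written approximation of the XPath result (an assumption, not real output).
items = [
    "Here is a listing of C interview questions on Variable Names",
    "Which of the following is not a valid C variable name?",
    "\na) int number;",
    "\nb) float rate;",
    "\nc) int variable_count;",
    "\nd) int $main;",
    " more ",
]

# An option is assumed to look like "a) ...", "b) ...", and so on.
OPTION_PATTERN = re.compile(r"^[a-z]\)\s")


def option_lines(texts):
    """Return the stripped strings that look like answer options."""
    stripped = (text.strip() for text in texts)
    return [text for text in stripped if OPTION_PATTERN.match(text)]


for line in option_lines(items):
    print("Found:", line)
```

The pattern is deliberately narrow; pages that label options differently would need a different test.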

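Comments:

Add more print statements. Whenever you continue, print what you skipped. Add an else clause to your if statements and print what you skip there too.

OK, I am trying...

Why are you still doing it the old way instead of the way I suggested?

I am running into some problems because my real code is much larger. I solved my previous problem thanks to the small cause you pointed out, but here I hit the same situation with your solution and get: IndexError: list index out of range

@user3440716: It is hard to say without your real input. I think it is caused by br.contents[0]. My last edit should fix it: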
...
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    if len(br.contents) > 0:  # avoid errors if a tag is correctly closed as <br/>
        print 'Found', br.contents[0]