How to extract the text after <br> in Python? (Python, HTML, BeautifulSoup, HTML parsing)

I am using Python 2.7.8 and am a bit surprised, because I get all of the text except the text that comes after the last <br>. My HTML page looks like this:

<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>

<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>   <!--not getting-->

<p> more </p>

<p>Which of the following is true for variable names in C?<br>
a) They can contain alphanumeric characters as well as special characters<br>
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br>
c) Variable names cannot start with a digit<br>
d) Variable can be of any length</p> <!--not getting -->!

</div>
</body>
</html>

Output:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit

However, I am not getting the last piece of text in each paragraph, for example:

 d) int $main
    and 
 d) Variable can be of any length

which come after the last <br>.

The output I want is:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;
Found:
d) int $main

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit
d) Variable can be of any length

This is because BeautifulSoup forces the text into valid XML by closing the <br> tags just before the </p>. The prettified version makes this clear:

<p>
 Which of the following is not a valid C variable name?
 <br>
  a) int number;
  <br>
   b) float rate;
   <br>
    c) int variable_count;
    <br>
     d) int $main;
    </br>
   </br>
  </br>
 </br>
</p>

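Each piece of text is therefore the first child of the preceding <br> tag, so reading br.contents[0] for every <br> gives the expected output:

Found
a) int number;
Found
b) float rate;
Found
c) int variable_count;
Found
d) int $main;
Found
a) They can contain alphanumeric characters as well as special characters
Found
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found
c) Variable names cannot start with a digit
Found
d) Variable can be of any length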
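
The same result can also be sketched without relying on how a particular parser closes <br>. The example below is not the approach from the answer above: it assumes Python 3 and uses only the standard library's html.parser. The class name BrTextCollector and the helper text_after_brs are hypothetical names chosen for illustration, and the sample markup is one paragraph copied from the question.

```python
from html.parser import HTMLParser


class BrTextCollector(HTMLParser):
    """Collect the text that follows each <br> inside a <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while we are inside a <p> element
        self.after_br = False   # True once a <br> was seen in the current <p>
        self.found = []         # one text chunk per <br>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.after_br = False
        elif tag == "br" and self.in_p:
            self.after_br = True
            self.found.append("")  # start a new chunk for this <br>

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
            self.after_br = False

    def handle_data(self, data):
        # Text before the first <br> of a paragraph is ignored.
        if self.in_p and self.after_br:
            self.found[-1] += data


def text_after_brs(markup):
    """Return the stripped, non-empty text chunks that follow <br> tags."""
    parser = BrTextCollector()
    parser.feed(markup)
    parser.close()
    return [chunk.strip() for chunk in parser.found if chunk.strip()]


sample = """<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>"""

for line in text_after_brs(sample):
    print("Found:", line)
```

Because the chunks are gathered from the token stream rather than from a repaired tree, the text after the last <br> is treated like any other, whether the tag is written as <br> or <br/>.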

You can use requests instead of urllib2, and extract the text with the html module of lxml:

from lxml import html
import requests

#request page
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")

#get content in html format
page_content=html.fromstring(page.content)

#recover all text from <p> elements
items=page_content.xpath('//p/text()')

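The code above returns a list of all the text in the document that is contained in <p> elements.

With that, you can simply index into the list to print what you need.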
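
As a rough sketch of that last step, the example below filters and indexes such a list. To stay runnable offline, the items list here is a hand-written stand-in for what the XPath query would return for the first question (an assumption, not real output from the site), and the helper answer_options is a hypothetical name. It assumes Python 3.

```python
import re

# Hand-written stand-in for the list returned by the XPath query (assumption).
items = [
    "Which of the following is not a valid C variable name?",
    "\na) int number;",
    "\nb) float rate;",
    "\nc) int variable_count;",
    "\nd) int $main;",
]

# An answer option starts with a letter from a to d followed by ")".
OPTION = re.compile(r"^[a-d]\)\s")


def answer_options(texts):
    """Strip whitespace and keep only the lines that look like answer options."""
    cleaned = (text.strip() for text in texts)
    return [text for text in cleaned if OPTION.match(text)]


options = answer_options(items)

for option in options:
    print("Found:", option)

# Plain indexing works as well, for example to get the last option.
print(options[-1])
```

Filtering by pattern rather than by fixed position avoids an IndexError when a paragraph has fewer text nodes than expected.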

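Add more print statements. When you continue, print what you skipped. Put an else clause on your if statement and print what you skipped.

OK, I am trying...

Why are you still doing it the old way instead of the way I suggested?

I am running into some problems, because my real code is much larger. Thanks to the small cause you pointed out, I solved my previous problem. But here I hit the same situation with your solution as well, and I get this: IndexError: list index out of range

@user3440716: It is hard to say without your real input. I think it is because of br.contents[0]. My last edit should fix it.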

...
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    if len(br.contents) > 0:  # avoid errors if a tag is correctly closed as <br/>
        print 'Found', br.contents[0]
