Removing whitespace after using .get_text() in Python


I want to strip the whitespace from a bog-standard .html file. I am using Python 3.6.2.

My code so far:

#!/usr/bin/python

import re
import logging
import textwrap

from bs4 import BeautifulSoup

print('opening file....')
with open("./scraped_pages/doc.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
    print('closing file...') 
    fp.close()
    print('..... file closed  ...')
    # print out the original text, in this case html source code
    # print(soup)   
    # only retrieve the text from the document, remove all html tags
    soup = soup.get_text()
    print(soup)

    lines = soup.split("\n")
    #Use the list comprehension syntax [line for line in lines if condition] with lines as the previous result and condition as line.strip() != "" to remove any empty lines from lines.
    no_soup = [line for line in lines if line.strip() != ""]

    # Declare an empty string and use a for-loop to iterate over the previous result. 
    no_empty_soup = ""
    # At each iteration, use the syntax str1 += str2 + "\n" to add the current element in the list str2 and a newline to the empty string str1.
    for line in no_soup:
        no_empty_soup += line + "\n"

    print("no empty lines:\n", no_empty_soup)

    soup = no_empty_soup.strip()
    print(soup)
    
    print(textwrap.dedent(soup))

And the doc.html code:

<!DOCTYPE html>
<html lang="en-GB">
<head>
  <title>Head's title</title>
</head>

<body>
  <p class="title"><b>Body's title</b></p>
  <p class="story">line begins
    <a href="http://example.com/element1" class="element" id="link1">1</a>
    <a href="http://example.com/element2" class="element" id="link2"> 2</a>
    <a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
  <p>     line ends</p>
</body>
</html>

What I want:

Head's title,
Body's title
line begins
1
2
3
line ends

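I don't understand why the whitespace is still there after using .strip() or textwrap.dedent(). Could someone explain?

I would have expected the '1' to end up at the start of its line after .get_text(), just like the 'B' in "Body's title" and the 'l' in "line begins". Any ideas?

Thanks,
Tommy.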
Your list comprehension is missing a .strip(). It should be:

no_soup = [line.strip() for line in lines if line.strip() != ""]

Then it will work.
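
A minimal sketch of why the fix works. The string `raw` is a hand-written stand-in for what `.get_text()` might return (an assumption for illustration, not the actual BeautifulSoup output). Calling `str.strip()` on the whole string only trims its two ends, and `textwrap.dedent()` only removes leading whitespace that is common to every line, so indentation that differs from line to line survives both. Stripping each line individually removes it.

```python
import textwrap

# Hypothetical stand-in for soup.get_text(): indented lines separated by blank lines.
raw = (
    "\n\nHead's title\n\n\n"
    "  Body's title\n"
    "  line begins\n"
    "    1\n"
    "     2\n"
    "    3\n"
    "       line ends\n\n"
)

# strip() on the whole string trims only the very start and very end.
whole = raw.strip()

# dedent() removes only the indentation shared by all non-blank lines.
# "Head's title" has no indentation, so nothing is removed here.
dedented = textwrap.dedent(whole)

# Strip every line and drop the blank ones.
cleaned = "\n".join(line.strip() for line in raw.split("\n") if line.strip() != "")

print(cleaned)
```

Joining with `"\n".join(...)` also replaces the `+=` loop from the question and leaves no trailing newline.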
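@yvesonline Yes, that's great. Thanks.

@yvesonline Actually, the first line still contains a leading space, i.e. " Head's title" rather than "Head's title". I can remove this space with soup = no_empty_soup.strip(), but that may not be pythonic.

This space comes from the print call. When you run print("no empty lines:\n", no_empty_soup), Python separates the objects with a space (the default value of sep). To avoid this, change the print statement to print("no empty lines:\n", no_empty_soup, sep=""). That said, no_empty_soup is your output, with no leading or trailing whitespace on any line, so you don't need no_empty_soup.strip().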
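
The effect of `sep` can be checked with a short sketch. The sample text and the use of `io.StringIO` to capture what `print` writes are assumptions made for illustration, not part of the original script.

```python
import io

# Hypothetical cleaned text, as built by the loop in the question.
no_empty_soup = "Head's title\nBody's title\n"

# Default behaviour: print() inserts sep=" " between its arguments,
# so a space appears right after the "\n" of the first argument.
default_out = io.StringIO()
print("no empty lines:\n", no_empty_soup, file=default_out)

# With sep="" the arguments are written back to back.
no_sep_out = io.StringIO()
print("no empty lines:\n", no_empty_soup, sep="", file=no_sep_out)

print(repr(default_out.getvalue()))
print(repr(no_sep_out.getvalue()))
```

The first captured string has a space in front of "Head's title" and the second does not, although `no_empty_soup` is the same in both calls.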

@yvesonline Thanks. It works as you suggested.