Python: How do I strip HTML code to return only the numeric values?

python html regex web-scraping text


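Assuming each div is saved as a string in the variable div, you can do the following:

number = div.split()[3].split('=')[1]

Every div must have the same format for this to work.

I would recommend Beautiful Soup, a very popular HTML parsing module that is well suited to this case. If every element has a title attribute, you can read it from each div, as the randomFacts function further down does.

Beautiful Soup is my go-to tool for HTML parsing. I hope this helps.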
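To see what the whitespace-splitting one-liner above actually produces, here is a minimal sketch run against one of the sample divs shown on this page. The strip_to_number helper is hypothetical (it does not come from any of the answers), and it assumes that the title attribute is the fourth whitespace-separated token and that no attribute value contains a space.

```python
# One of the sample divs from this page, stored as a string.
div = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>'

# The one-liner as posted: the fourth token is 'title="10"></div>',
# so splitting on '=' leaves the quotes and closing markup attached.
raw = div.split()[3].split('=')[1]
print(raw)  # "10"></div>


def strip_to_number(div_text):
    """Hypothetical helper: keep only the text between the double quotes."""
    value = div_text.split()[3].split('=')[1]
    return value.split('"')[1]


print(strip_to_number(div))  # 10
```

The extra split on the double quote is one possible way to drop the trailing markup; it still breaks if the attributes appear in a different order.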

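Here are 3 possibilities. In the first two versions we check the class before appending to the list, in case there are other divs you don't want to include. In the third method there is no really good way to do that. Unlike adrianp's split method, mine doesn't care where the title is.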

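The third method may be a bit confusing, so let me explain. First, we split the string at every place where title=" appears. We dump the first index of that list, because it is everything before the first title. Then we loop over the remaining pieces and split each one on the first double quote. The number you want is now at the first index of that split. We grab it with an inline pop so that everything stays in one list comprehension, rather than expanding the whole loop and handling the values with specific indices.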
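The steps described above can be sketched on a shortened two-div sample so that the intermediate lists stay readable. The names sample, parts, and values are illustrative only, and the sample is modeled on the divs from this page with the style attribute left out.

```python
# Shortened sample modeled on the divs from this page (style attribute omitted).
sample = ('<div class="bb-fl" title="10"></div>'
          '<div class="bb-fl" title="3"></div>')

# Step 1: split everywhere 'title="' occurs.
parts = sample.split('title="')
# parts is now:
# ['<div class="bb-fl" ', '10"></div><div class="bb-fl" ', '3"></div>']

# Step 2: drop index 0, which is everything before the first title.
parts.pop(0)

# Step 3: split each remaining piece on the double quote and keep the
# first piece, which is the attribute value.
values = [p.split('"').pop(0) for p in parts]
print(values)  # ['10', '3']
```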

To load the HTML remotely, uncomment the commented-out html variable and replace yourURL with the URL you need.

I think I have given you every possible way of doing this, or at least the most obvious ones.

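from bs4 import BeautifulSoup
import requests

def randomFacts(url):
    # Download the page and parse it with the built-in html.parser backend
    r = requests.get(url)
    bs = BeautifulSoup(r.content, 'html.parser')
    # title=True keeps only the divs that have a title attribute,
    # so each['title'] cannot raise a KeyError
    divs = bs.find_all('div', title=True)
    for each in divs:
        print(each['title'])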


What have you tried?

@adrianp I tried using regex to strip out the text.

If any of these solutions solved your problem, please accept it.

This won't work. Using this method returns data like "10">

from bs4 import BeautifulSoup
import re, requests

html = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>'

#html = requests.get(yourURL).content

# possibility 1: BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# assumes that all bb-fl classed divs have a title and all divs have a class
# you may need to unroll this list comprehension and add some extra checks
bs_titleval = [div['title'] for div in soup.find_all('div') if 'bb-fl' in div['class']]
  
print(bs_titleval)


# possibility 2: Regular Expressions ~ not the best way to go
# this isn't going to work if the tag attribute signature changes

title_re = re.compile('<div class="bb-fl" style="[^"]*" title="([0-9]+)">', re.I)

re_titleval = [m.group(1) for m in title_re.finditer(html)]
    
print(re_titleval)


# possibility 3: String Splitting ~ 
# probably the best method if there is nothing extra to weed out

title_sp = html.split('title="')
title_sp.pop(0) # get rid of first index

# title_sp is now ['10"></div>...', '3"></div>...', '18"></div>...', etc...]
sp_titleval = [s.split('"').pop(0) for s in title_sp]

print(sp_titleval)
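
All three lists above hold strings. If actual numbers are needed, a conversion step along these lines might follow. This is only a sketch: it uses a looser pattern that looks for the title attribute alone, which is an alternative to the pattern in possibility 2 rather than a copy of it, and the names html_sample and title_values are hypothetical.

```python
import re

# Two divs modeled on the sample above; the attribute order differs on purpose.
html_sample = ('<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div> '
               '<div title="3" class="bb-fl"></div>')


def title_values(markup):
    """Return every all-digit title="..." value found in markup as an int."""
    return [int(m) for m in re.findall(r'\btitle="(\d+)"', markup)]


print(title_values(html_sample))  # [10, 3]
```

Unlike possibilities 1 and 2, this sketch does not check the class, so it would also pick up numeric titles on any other tag.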