Python: How to handle different formatting inside specific tags with BeautifulSoup
I want to be able to process some of the tags in an HTML file individually. My code works for all of the tags except two (so far). Those two contain two lines of text instead of one. Here is my code:
from bs4 import BeautifulSoup

with open("F:/gpu.txt") as f:
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(f, "html.parser")

sections = soup.find_all("td")
# print(sections[2])
for section in sections:
    # section.parent is the enclosing <tr>; calling it searches that row for the label text
    if section.parent(text="GPU Name:"):
        print(section.text)
    elif section.parent(text="GPU Variant:"):
        print(section.text)
    elif section.parent(text="Bus Interface:"):
        print(section.text)
    elif section.parent(text="Transistors:"):
        print(section.text)
And so on. However, when it gets to "Process Size:", the HTML is different:
<th>Process Size:</th>
<td>
Something
<br />
Something Else
</td>
</tr>
What I need is to be able to handle "Something" and "Something Else" separately (without the blank lines and spaces) and/or combine them into a single string such as "Something/Something Else".
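Sorry if my message is not clear enough; English is not my first language. Thanks, everyone!

You can find all of the text nodes inside the section (using find_all(text=True)) and join them with /: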
print('/'.join(item.strip() for item in section.find_all(text=True)))
For example:
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<th>GPU Name:</th>
<td>BLABLA</td>
</tr>
<tr>
<th>GPU Variant:</th>
<td>BLABLA</td>
</tr>
<tr>
<th>Process Size: </th>
<td>BLABLA</td>
</tr>
<tr>
<th>Transistors:</th>
<td>BLABLA</td>
</tr>
<tr>
<th>Process Size:</th>
<td>
Something
<br />
Something Else
</td>
</tr>
</table>
"""
soup = BeautifulSoup(data, "html.parser")
sections = soup.find_all("td")
for section in sections:
    if section.parent(text="GPU Name:"):
        print(section.text)
    elif section.parent(text="GPU Variant:"):
        print(section.text)
    elif section.parent(text="Process Size:"):
        # Join the stripped text nodes of the two-line cell with "/"
        print('/'.join(item.strip() for item in section.find_all(text=True)))
    elif section.parent(text="Transistors:"):
        print(section.text)
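As a variation on the same idea, the sketch below uses the `stripped_strings` generator, which yields each text node already stripped and skips nodes that are empty after stripping. The variable names (`html`, `cell`, `parts`, `joined`) and the shortened table are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two-line cell from the question (illustrative only)
html = """
<table>
<tr>
<th>Process Size:</th>
<td>
Something
<br />
Something Else
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Each text node inside the cell, whitespace removed, empty nodes skipped
parts = list(cell.stripped_strings)

# The same pieces combined into one string
joined = "/".join(parts)

print(parts)
print(joined)
```

Keeping `parts` as a list covers the "handle them separately" case, while `joined` covers the "make it one string" case.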
This is very specific to the sample HTML, since it relies on the newlines being present, but you could do this:
from bs4 import BeautifulSoup

with open("F:/gpu.txt") as f:
    soup = BeautifulSoup(f, "html.parser")

for section in soup.find_all("td"):
    # Split the cell text on newlines, drop blank lines, and join the rest with "/"
    print('/'.join([s.strip() for s in section.text.split('\n') if s.strip()]))
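To see why this approach depends on the newlines, here is a minimal sketch of the same split-and-join logic applied to plain strings, without BeautifulSoup. The helper name `join_lines` and the two sample strings are hypothetical and only serve the illustration.

```python
def join_lines(text, sep="/"):
    """Join the non-blank lines of text with sep, stripping each line."""
    return sep.join(line.strip() for line in text.split("\n") if line.strip())


# Cell text as it appears when the two values sit on separate lines
multi_line = "\n  Something\n\n  Something Else\n"

# The same values on one line: there is no newline to split on
single_line = "Something Something Else"

print(join_lines(multi_line))   # Something/Something Else
print(join_lines(single_line))  # Something Something Else
```

If the page ever puts both values on one line, the second call shows that nothing is split, which is the limitation the answer mentions.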
Thanks for your reply!
The example with the sample table prints:
BLABLA
BLABLA
BLABLA
Something/Something Else
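If the goal is to work with each value on its own later, one possible sketch is to collect every row into a dictionary keyed by the header text, instead of testing each label in an if/elif chain. This is an assumption about how the data might be used, not something stated in the thread; the names `specs`, `label`, and the shortened table are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample table (illustrative only)
html = """
<table>
<tr><th>GPU Name:</th><td>BLABLA</td></tr>
<tr><th>Process Size:</th><td>
Something
<br />
Something Else
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

specs = {}
for row in soup.find_all("tr"):
    header = row.find("th")
    cell = row.find("td")
    if header is None or cell is None:
        continue  # skip rows that do not have both a label and a value
    # Normalise the label: strip whitespace and the trailing colon
    label = header.get_text(strip=True).rstrip(":")
    # Store every non-empty text node of the cell as a list
    specs[label] = list(cell.stripped_strings)

print(specs["GPU Name"])
print("/".join(specs["Process Size"]))
```

Each value is a list, so single-line and two-line cells are handled the same way. Note that if two rows share a label, the later row overwrites the earlier one in this sketch.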