Python: How to handle different formatting inside a specific tag with BeautifulSoup

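I want to be able to process some of the tags in an HTML file individually. My code works for all of the tags except two (so far). Those two contain two lines of text instead of one. Here is my code: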

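from bs4 import BeautifulSoup

with open("F:/gpu.txt") as f:
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(f, "html.parser")
    sections = soup.find_all("td")
    #print(sections[2])
    for section in sections:
        # section.parent is the enclosing <tr>; calling it searches that row
        # for a text node that matches the label exactly.
        if section.parent(text="GPU Name:"):
            print(section.text)
        elif section.parent(text="GPU Variant:"):
            print(section.text)
        elif section.parent(text="Bus Interface:"):
            print(section.text)
        elif section.parent(text="Transistors:"):
            print(section.text)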

And so on. However, when we get to "Process Size:", the HTML is different:

        <th>Process Size:</th>
      <td>
        Something 
                <br />
                Something Else
              </td>
    </tr>

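What I need is to be able to handle "Something" and "Something Else" separately (without the blank lines and spaces), and/or to combine them into a single string such as "Something/Something Else".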

Sorry if my message is not clear enough; English is not my first language. Thank you all!

You can find all the text nodes inside the section (using find_all(text=True)) and join them with str.join():

print('/'.join(item.strip() for item in section.find_all(text=True)))

For example:

from bs4 import BeautifulSoup

data = """
<table>
    <tr>
      <th>GPU Name:</th>
      <td>BLABLA</td>
    </tr>
        <tr>
      <th>GPU Variant:</th>
      <td>BLABLA</td>
    </tr>
        <tr>
      <th>Process Size: </th>
      <td>BLABLA</td>
    </tr>
    <tr>
      <th>Transistors:</th>
      <td>BLABLA</td>
    </tr>
    <tr>
      <th>Process Size:</th>
      <td>
        Something
                <br />
                Something Else
              </td>
    </tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")
sections = soup.find_all("td")

for section in sections:
    if section.parent(text="GPU Name:"):
        print(section.text)
    elif section.parent(text="GPU Variant:"):
        print(section.text)
    elif section.parent(text="Process Size:"):
        # Strip every text node in the cell and join the pieces with "/".
        print('/'.join(item.strip() for item in section.find_all(text=True)))
    elif section.parent(text="Transistors:"):
        print(section.text)
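
As an alternative sketch (not part of the answer above), BeautifulSoup's stripped_strings generator yields each text node already stripped and skips whitespace-only nodes, so it gives both the separate pieces and the joined string. The html fragment and the variable names below are made up for illustration, and passing "html.parser" is an assumption about which parser is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the "Process Size:" row from the question.
html = """
<table>
    <tr>
      <th>Process Size:</th>
      <td>
        Something
                <br />
                Something Else
              </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Each text node, stripped; whitespace-only nodes are skipped.
parts = list(cell.stripped_strings)
print(parts)

# The same pieces joined into a single string.
joined = "/".join(parts)
print(joined)

# get_text() with a separator and strip=True builds the joined string directly.
print(cell.get_text("/", strip=True))
```

Keeping the list in parts covers the "handle them separately" case, while joined covers the "Something/Something Else" case.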

This is very specific to the example HTML and relies on the line breaks being present, but you could do this:

from bs4 import BeautifulSoup

with open("F:/gpu.txt") as f:
    soup = BeautifulSoup(f, "html.parser")
    for section in soup.find_all("td"):
        # Split the cell text on newlines, drop the blank pieces, and join with "/".
        print('/'.join(s.strip() for s in section.text.split('\n') if s.strip()))
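
To illustrate the caveat about line breaks: if the same cell arrives without newlines around the br tag, splitting the text on '\n' no longer separates the two values, while iterating over the text nodes still does. A small sketch, using a hypothetical minified fragment that is not from the original file:

```python
from bs4 import BeautifulSoup

# Hypothetical minified version of the cell: no newlines around <br />.
html = (
    "<table><tr><th>Process Size:</th>"
    "<td>Something<br />Something Else</td></tr></table>"
)

cell = BeautifulSoup(html, "html.parser").find("td")

# There is no newline to split on, so the two values run together.
by_newline = "/".join(s.strip() for s in cell.text.split("\n") if s.strip())
print(by_newline)

# Joining the individual text nodes does not depend on the source layout.
by_nodes = "/".join(cell.stripped_strings)
print(by_nodes)
```

Here the newline-based version prints "SomethingSomething Else", whereas the node-based version prints "Something/Something Else".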

Thanks for your reply!

The example with the inline data table prints:

BLABLA
BLABLA
BLABLA
Something/Something Else
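
Note that only three BLABLA lines appear for four such cells: the header written as "Process Size: " (with a trailing space) does not match the exact string "Process Size:", so that row is skipped. One possible way around both the exact-match issue and the long if/elif chain is to walk the rows and normalise each label and value. This is a sketch under assumptions: the html fragment is invented, and it presumes every row of interest has one th and one td.

```python
from bs4 import BeautifulSoup

# Hypothetical table; note the trailing space inside the second <th>.
html = """
<table>
    <tr><th>GPU Name:</th><td>BLABLA</td></tr>
    <tr><th>Process Size: </th><td>
        Something
        <br />
        Something Else
    </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.find_all("tr"):
    header = row.find("th")
    cell = row.find("td")
    if header is None or cell is None:
        continue  # skip rows that are not label/value pairs
    # Strip surrounding whitespace and the trailing colon from the label.
    label = header.get_text(strip=True).rstrip(":")
    # Join all non-empty text nodes of the cell with "/".
    value = "/".join(cell.stripped_strings)
    rows.append((label, value))

for label, value in rows:
    print(f"{label}: {value}")
```

A list of pairs is used rather than a dict so that repeated labels, as in the sample table, are not overwritten.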
