Python: Extract column headings from row[0] contained within <th> tags, and remove the unicode symbols and link numbers after the planet names

python python-3.x

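I am trying to iterate over all the column headings in row[0] and then remove the unicode symbols and the link numbers that follow the planet names. Right now my code looks like this: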

URL_solar_system = "https://en.wikipedia.org/wiki/List_of_gravitationally_rounded_objects_of_the_Solar_System"
from bs4 import BeautifulSoup
import requests
html_content = requests.get(URL_solar_system).text
soup_solar = BeautifulSoup(html_content, "lxml")
tables = soup_solar.find_all('table', attrs={'class': 'wikitable'})
planets = tables[2]
rows = planets.find_all('tr')
headers = [th.text.strip() for th in rows[0].find_all('th') if th.get_text().strip() != '']
print('headers: {}'.format(headers))

Output:

headers: ['*Mercury[6][7]', '*Venus[8][9]', '*Earth[10][11]', '*Mars[12][13]', '°Jupiter[14][15]', '°Saturn[16][17]', '‡Uranus[18][19]', '‡Neptune[20][21]']

I am trying to get the output shown below. Expected output:

headers: ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']

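Hints for writing the code:

headers = []

A for loop that loops over rows[0] and calls find_all on 'th'

first_variable = take the iterator (say we named it i in the for loop definition) and use the text method on it

second_variable = use the find method on the first variable to look for the value "["; then redefine the first variable by slicing it up to the index stored in the second variable

An if statement on the tag that appends the first variable to the headers list you created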
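The hints above can be sketched roughly as follows. To keep it runnable without a network call, the loop runs over the raw header strings from the output shown earlier rather than over live th tags. The helper name clean_header is hypothetical, and stripping leading non-alphanumeric characters is only one assumed way to drop the *, ° and ‡ symbols, since the hints do not say how to do that.

```python
# Raw header strings copied from the output above; in the real script these
# would come from th.text.strip() for each th in rows[0].find_all('th').
raw_headers = ['*Mercury[6][7]', '*Venus[8][9]', '*Earth[10][11]', '*Mars[12][13]',
               '°Jupiter[14][15]', '°Saturn[16][17]', '‡Uranus[18][19]', '‡Neptune[20][21]']


def clean_header(text):
    """Hypothetical helper: drop the citation markers and the leading symbol."""
    bracket = text.find('[')      # index of the first '[', or -1 if there is none
    if bracket != -1:
        text = text[:bracket]     # keep only what comes before the first '['
    # Assumption: the unwanted symbols are leading non-alphanumeric characters.
    while text and not text[0].isalnum():
        text = text[1:]
    return text.strip()


headers = []
for raw in raw_headers:
    name = clean_header(raw)
    if name != '':                # skip header cells that end up empty
        headers.append(name)

print('headers: {}'.format(headers))
```

In the real script, the for loop would iterate over rows[0].find_all('th') and pass th.text.strip() to the helper instead of reading from raw_headers.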

Add import re at the top of your script, then simply add the following line at the end of your code:

headers = re.findall(r'(\w+)\[\d+]',''.join(headers))

Your final code:

import re
import requests
from bs4 import BeautifulSoup

URL_solar_system = "https://en.wikipedia.org/wiki/List_of_gravitationally_rounded_objects_of_the_Solar_System"

html_content = requests.get(URL_solar_system).text
soup_solar = BeautifulSoup(html_content, "lxml")
tables = soup_solar.find_all('table', attrs={'class': 'wikitable'})
planets = tables[2]
rows = planets.find_all('tr')
headers = [th.text.strip() for th in rows[0].find_all('th') if th.get_text().strip() != '']
headers = re.findall(r'(\w+)\[\d+]', ''.join(headers))
print('headers: {}'.format(headers))
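To see why the join plus re.findall step works, here is a small offline sketch that uses the header strings from the question instead of a live request. The function names are hypothetical. The second one, strip_markers, is a per-header alternative built on re.sub; it is not part of the answer above, just one option to consider if a header could contain spaces, since \w+ would then capture only the last word before the bracket.

```python
import re

# Header strings as printed in the question.
raw_headers = ['*Mercury[6][7]', '*Venus[8][9]', '*Earth[10][11]', '*Mars[12][13]',
               '°Jupiter[14][15]', '°Saturn[16][17]', '‡Uranus[18][19]', '‡Neptune[20][21]']


def names_via_findall(items):
    # Join everything into one string, then capture each run of word
    # characters that is immediately followed by a citation such as [6].
    return re.findall(r'(\w+)\[\d+]', ''.join(items))


def strip_markers(text):
    # Hypothetical alternative: remove every [n] marker, then any leading
    # non-word characters such as *, ° or ‡.
    text = re.sub(r'\[\d+\]', '', text)
    return re.sub(r'^\W+', '', text).strip()


print(names_via_findall(raw_headers))
print([strip_markers(h) for h in raw_headers])
```

Both calls give the same eight planet names for this input; they would differ only for headers whose names contain non-word characters.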

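The header text sits inside the a (link) tags of the fourth table. Try using a CSS selector to pick the fourth table, table:nth-of-type(4), and then use th > a to select the a tags directly under the th tags (the headers):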

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_gravitationally_rounded_objects_of_the_Solar_System"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

print([tag.text for tag in soup.select("table:nth-of-type(4) th > a")])

Output:

['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
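As a rough illustration of why th > a leaves the link numbers out, here is a sketch against a small inline HTML sample rather than the live page. The markup is a hypothetical, heavily trimmed stand-in; it assumes the citation links are nested inside sup elements, so they are not direct children of th.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for the Wikipedia table markup.
SAMPLE_HTML = """
<table class="wikitable">
  <tr>
    <th></th>
    <th>*<a href="/wiki/Mercury_(planet)">Mercury</a><sup><a href="#cite_note-6">[6]</a></sup></th>
    <th>°<a href="/wiki/Jupiter">Jupiter</a><sup><a href="#cite_note-14">[14]</a></sup></th>
    <th>‡<a href="/wiki/Uranus">Uranus</a><sup><a href="#cite_note-18">[18]</a></sup></th>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "th > a" matches only links that are direct children of a th, so the
# citation links nested inside sup are skipped, as are the leading symbols,
# which are plain text nodes rather than tags.
names = [tag.text for tag in soup.select("table.wikitable th > a")]
print(names)
```

Because the selector returns only the link text, no regex or string slicing is needed afterwards.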

I have edited the answer: at the start of your script we need import re in order to use regular expressions.

['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']