Trying to scrape the text of all 'a' elements inside specific 'td's using Python and bs4

python html web-scraping


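I'm trying to extract the text contained in 'a' tags, specifically from the table with class "table-main", going through each row inside it. The first td in each row, which has the class "h-text-left", contains the text with the two team names. I'm not sure whether the problem is in my loop, but the error message I get seems to indicate that I'm using bs4 incorrectly on the last line of the loop.

I can scrape every tr in the table with class "table-main", and then every td with class "h-text-left". However, when I try to extract the 'a' elements on their own, or even just the 'a' text, I hit a dead end.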

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

r = requests.get('https://www.betexplorer.com/soccer/england/premier-league/fixtures/', headers=headers)

c = r.content

soup = BeautifulSoup(c)

fixture_table = soup.find('table', attrs = {'class': 'table-main'})

for tr in soup.find_all('tr'):
    match_tds = tr.find_all('td', attrs = {'class': 'h-text-left'})
    matches = match_tds.find_all('a')

When I try to find all the 'a' tags, the last line raises the following error:

...     matches = match_tds.find_all('a')
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Users\Glypt\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
>>>


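match_tds is a list, not a single element. You get it from tr.find_all(...), so you have to use a for loop to run another find_all() on each item.

If you use find() to get the first element, you can follow it with another find() or find_all().

But you can't call find() or find_all() directly on the result of find_all().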
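Here is a minimal sketch of that nested loop. It runs against a small inline HTML string rather than the live site; the markup in SAMPLE_HTML is an assumption that only imitates the classes named in the question, so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the fixtures table; not copied from the real page.
SAMPLE_HTML = """
<table class="table-main">
  <tr><th>Fixture</th></tr>
  <tr>
    <td class="h-text-left"><a href="/m/1">Arsenal - Chelsea</a></td>
    <td class="table-main__odds"><a data-odd="2.08">2.08</a></td>
  </tr>
  <tr>
    <td class="h-text-left"><a href="/m/2">Everton - Fulham</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
fixture_table = soup.find("table", attrs={"class": "table-main"})

matches = []
for tr in fixture_table.find_all("tr"):
    # find_all() returns a ResultSet (a list), so loop over it
    # instead of calling find_all() on the list itself.
    for td in tr.find_all("td", attrs={"class": "h-text-left"}):
        for a in td.find_all("a"):
            matches.append(a.get_text(strip=True))

# find() returns a single Tag (or None), so it can be chained directly.
first_link = fixture_table.find("td", attrs={"class": "h-text-left"}).find("a")

print(matches)
print(first_link.get_text(strip=True))
```

Note that the loop starts from fixture_table rather than soup, so rows from other tables on the page are not visited.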


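You should use the built-in support for finding nested structures. You can specify a CSS class with '.class_name', and find nested structures with 'first-selector > second-selector' (or more selectors). It would look like this: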
import requests
from bs4 import BeautifulSoup

s = requests.session()
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
res = s.get('https://www.betexplorer.com/soccer/england/premier-league/fixtures/')
soup = BeautifulSoup(res.text, 'html.parser')

matches = soup.select('.table-main  tr  td  a')
for match in matches:
    print(match.getText())

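The line matches = soup.select('.table-main tr td a') selects all a elements inside td elements inside tr elements inside the element with class=table-main. You can also use matches = soup.select('td > a') (the > operator) to specify that the a element must be directly inside the td element. I think this could simplify your code considerably.

Note: I couldn't test this on my machine because the SSL certificate could not be verified and requests.exceptions.SSLError was raised.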
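Since the snippet above could not be run against the live site, here is an offline sketch of the same selectors. The HTML in SAMPLE_HTML is made up for illustration (the nested span is only there to show the difference between the two selectors) and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second link is wrapped in a <span> on purpose.
SAMPLE_HTML = """
<table class="table-main">
  <tr>
    <td class="h-text-left"><a href="/m/1">Arsenal - Chelsea</a></td>
    <td><span><a href="/more">details</a></span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A space is the descendant combinator: it matches <a> at any depth below <td>.
descendant_texts = [a.get_text() for a in soup.select(".table-main tr td a")]

# '>' is the child combinator: it matches only <a> elements directly inside <td>.
child_texts = [a.get_text() for a in soup.select("td > a")]

print(descendant_texts)
print(child_texts)
```

With this sample, the descendant selector picks up both links, while 'td > a' skips the one wrapped in the span.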
To get the text, try:
for td in soup.find_all('td', attrs={'class': 'h-text-left'}):
    link = td.find('a')  # first <a> in the cell, or None if there is none
    if link is not None:
        print(link.text)


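You can simplify this to a faster selector approach that uses a single class. All the links share the same class name, so you can pass it to select inside a list comprehension to get all the links.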
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/soccer/england/premier-league/fixtures/')
soup = BeautifulSoup(r.content, 'lxml')
matches = [item['href'] for item in soup.select('.in-match')]


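Thanks, I found what I was looking for.

I'm trying this approach now, but what I actually want is the link text rather than the href attribute. I tried modifying your example to extract the odds text, but printing it gives no output, i.e. [].

Sorry, those links don't have the ".in-match" class, or rather they don't have any class at all. I tried parsing by the "title" attribute instead, which always seems to be "Add to My Selections", but printing the matches list afterwards still gives no output. Apologies for the poor formatting, but for the example element I gave above, I want to extract the text "2.08" from the a element.

You would change it to item.text for item in: matches = [item.text for item in soup.select('.in-match')] and odds = [item.text for item in soup.select('Add to My Selections')]

The first option works as you described, but the second one does not work when I try to target the title attribute of the a element. It still prints [].

Odds: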
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/soccer/england/premier-league/fixtures/')
soup = BeautifulSoup(r.content, 'lxml')
odds = [item['data-odd'] for item in soup.select('.table-main__odds [data-odd]')]
print(odds)
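The selector above relies on a CSS attribute selector, [data-odd]. As a rough offline illustration, the same idea also works for the title attribute mentioned in the comments: an attribute value has to be written as [title="..."], because a bare string is read as a sequence of tag names. The sample HTML below is an assumption built from the class and attribute names in this thread, not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the names mentioned in this thread.
SAMPLE_HTML = """
<table class="table-main">
  <tr>
    <td class="h-text-left"><a class="in-match" href="/m/1">Arsenal - Chelsea</a></td>
    <td class="table-main__odds"><a data-odd="2.08" title="Add to My Selections">2.08</a></td>
    <td class="table-main__odds"><a data-odd="3.40" title="Add to My Selections">3.40</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Link text instead of href: read .text on each matched element.
match_names = [item.text for item in soup.select(".in-match")]

# [data-odd] matches any element carrying that attribute, whatever its value.
odds_from_attr = [item["data-odd"] for item in soup.select(".table-main__odds [data-odd]")]

# [title="..."] matches on an exact attribute value.
odds_from_title = [item.text for item in soup.select('a[title="Add to My Selections"]')]

print(match_names)
print(odds_from_attr)
print(odds_from_title)
```

Reading the data-odd attribute and reading the link text give the same values in this sample; on the real page they may differ if the visible text is formatted or filled in by JavaScript.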