
Parsing a website with Python and BeautifulSoup 4


I'm trying to scrape some sports data from a website with BeautifulSoup 4, but I'm having some trouble figuring out how to proceed. I'm not great with HTML and can't seem to work out the last bit of syntax I need. Once the data is parsed I'll be inserting it into a dataframe. I'm trying to get the home team, the away team and the score. Here is the code I have so far:

from bs4 import BeautifulSoup
import urllib2
import csv

url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

def has_class_but_no_id(tag):
    return tag.has_attr('score')

writer = csv.writer(open("webScraper.csv", "w"))

for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)
Here is some sample output:

<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>


I need to store the home team (Newcastle), the score (0-3) and the away team (Sunderland) in three separate fields. Basically I've been trying to pull the "value" out of each tag, and I can't seem to figure out the syntax in bs4. I keep looking for a tag.value attribute, but all I can find in the documentation is tag.name and tag.attrs. Any help or pointers would be much appreciated.

You can use the tag.string attribute to get the value of a tag.

See the documentation for more details.
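
For example, a minimal sketch (parsing just the score snippet from the sample output above with the standard html.parser) shows what tag.string returns, and why it falls back to None when a tag wraps another tag:

from bs4 import BeautifulSoup

# Parse only the score snippet quoted in the question (illustration only).
snippet = '<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>'
snippet_soup = BeautifulSoup(snippet, 'html.parser')

abbr = snippet_soup.find('abbr')
print(abbr.string)   # ' 0-3 ' -- the abbr tag's only child is a plain string

span = snippet_soup.find('span', class_='score')
print(span.string)   # None -- the span contains another tag (plus whitespace), so .string is None

The None case is exactly what the comment at the bottom of the page runs into with the nested a and abbr tags.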

Each of the score cells is contained in a <td class="match-details"> element; loop over those to extract the match details.

From there you can use a generator to pull the text out of the child elements; just pass it to ''.join() to get all the strings contained in a tag. Select the team-home, score and team-away spans separately to make parsing easier:

# Each <td class="match-details"> cell holds one fixture; pull out the three spans.
for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    # ''.join(...stripped_strings) flattens any nested tags into plain text;
    # the "tag and ..." guard skips rows where find() returned None.
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)
With an added print this gives:

>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print home, score, away
... 
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
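
The question also mentions writing the results out with csv.writer and eventually loading them into a dataframe. A minimal sketch of that last step could look like the following, assuming soup has already been built as in the question; the pandas step is optional and my own assumption:

import csv
import pandas as pd  # only needed for the optional dataframe step

rows = []
for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        rows.append((home, score, away))

# Write the CSV the question set up (on Python 3 pass newline='' to open,
# on Python 2 open the file in 'wb' mode, to avoid blank rows on Windows).
with open('webScraper.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['home', 'score', 'away'])
    writer.writerows(rows)

# ...or load the same rows into a dataframe.
df = pd.DataFrame(rows, columns=['home', 'score', 'away'])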
Since another question redirects here:

This is an update to the accepted answer, which is still correct. If you edit your answer, ping me and I will delete this one.

# Updated BBC markup: each fixture is an <article class="sp-c-fixture">, with the
# team name nested two <span> levels deep inside each team span.
for match in soup.find_all('article', class_='sp-c-fixture'):
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home').find('span').find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away').find('span').find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        print(home, score, away)
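
The updated loop still assumes a soup object built from the results page; since urllib2 is Python 2 only, a minimal Python 3 fetch (a sketch that reuses the URL from the question, which may have changed since) would be:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# URL taken from the original question; the BBC page layout may have moved on.
url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urlopen(url).read()
soup = BeautifulSoup(page, 'html.parser')  # naming a parser avoids the bs4 warning

# The sp-c-fixture loop above can then be run against this soup object.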

I've been looking through the documentation but can't get the tag.string value to work; it returns "None" every time. If you look at the output, there is another "tag" inside the tag I'm getting back. Does BeautifulSoup only return the string when the tag isn't wrapping another tag? For example above, I need to search further for the a tag, but only when the a tag is inside the span tag.

Ah, I must have missed the match-details class, thanks! That's exactly what I needed. I didn't know I could use .find to search again inside a tag.