Python: getting the title inside a link tag in HTML using BeautifulSoup
I am extracting data from a set of pages on data.gov.au. I got the output I wanted, but now the problem is that the values I get are "Business Support and..." and "Reserve Bank of Australia...", not the complete text. I want to print the whole text, without the "...". I replaced lines 9 and 10 of the code in jezrael's answer with the following:
org = soup.find_all('a', {'class': 'nav-item active'})[0].get('title')
groups = soup.find_all('a', {'class': 'nav-item active'})[1].get('title')
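I ran this on its own and got the error: list index out of range. What should I use to extract the full sentence? I also tried:
org = soup.find_all('span', class_="filtered pill")
When I run it on its own it returns a result that I can read as a string, but it does not work inside the whole program.

I guess you are trying to do this. Every link has a title attribute, so here I just check whether a title attribute exists and, if it does, print it.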
Blank lines appear in the output because a few links have title="". You can use a conditional statement to skip those, and then collect all the titles.
>>> l = soup.find_all('a')
>>> for i in l:
...     if i.has_attr('title'):
...         print(i['title'])
...
Remove
Remove
Reserve Bank of Australia
Business Support and Regulation
Creative Commons Attribution 3.0 Australia
>>>
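The blank lines mentioned above come from links whose title is an empty string: has_attr('title') is still true for those. Below is a minimal sketch of filtering them out. The markup in SAMPLE_HTML and the helper name non_empty_titles are hypothetical, made up for illustration and not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the links described in the question.
SAMPLE_HTML = """
<a href="/x" title="">Remove</a>
<a href="/org" title="Reserve Bank of Australia">Reserve Bank of...</a>
<a href="/group" title="Business Support and Regulation">Business Support and...</a>
<a href="/plain">No title here</a>
"""


def non_empty_titles(html):
    """Return the title of every <a> tag whose title attribute is present and non-empty."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.find_all("a"):
        # .get() returns None when the attribute is missing, so a single
        # truthiness check covers both a missing title and title="".
        title = link.get("title")
        if title:
            titles.append(title)
    return titles


print(non_empty_titles(SAMPLE_HTML))
```

With this sample input, only the two links that carry a non-empty title are kept.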
All of the longer text is in the title attribute, and the shorter text is in the element text. So add two if checks:
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

prefix = "https://data.gov.au"
dfs = []  # one DataFrame per page

# webpage_urls is the list of page URLs built earlier in the question's code
for url in webpage_urls:
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "lxml")
    lobbying = {}

    # always only 2 active li, so select first by [0] and second by [1]
    l = soup.find_all('li', class_="nav-item active")
    org = l[0].a.get('title')
    if org == '':
        # title is empty, so fall back to the visible text
        org = l[0].span.get_text()
    groups = l[1].a.get('title')
    if groups == '':
        groups = l[1].span.get_text()

    # one entry per dataset heading on the page
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {
            "link": prefix + element.a["href"],
            "Organisation": org,
            "Group": groups,
        }

    df = pd.DataFrame.from_dict(lobbying, orient='index') \
        .rename_axis('Titles').reset_index()
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)
# strip the "(N)" counts; pass regex=True explicitly, since the default became False in pandas 2.0
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)
print(df1.head())
                                              Titles  \
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly

                                                link  \
0           https://data.gov.au/dataset/banks-assets
1  https://data.gov.au/dataset/consolidated-expos...
2  https://data.gov.au/dataset/foreign-exchange-t...
3  https://data.gov.au/dataset/finance-companies-...
4  https://data.gov.au/dataset/liabilities-and-as...

                Organisation                            Group
0  Reserve Bank of Australia  Business Support and Regulation
1  Reserve Bank of Australia  Business Support and Regulation
2  Reserve Bank of Australia  Business Support and Regulation
3  Reserve Bank of Australia  Business Support and Regulation
4  Reserve Bank of Australia  Business Support and Regulation

Thanks, it works for one URL. I have now run the program for all URLs, so let's see what the output is.
@shashank, when running over a large number of URLs I get the same result. I think it should be a loop inside a loop.
Can you elaborate on what you want to do? I mean, how do you plan to fetch the data?
@shashank, it is done now, thanks for your attention. There is a link in my question with the details of what I am trying to do. Thanks a lot.
Can you explain what the logic if org == '' is doing?
If you check the HTML, some title attributes are empty, which is a problem, so the if is needed. If you omit it, you do not get the text.
@jezrael, :), have a nice day!
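The if org == '' check discussed in the comments, together with the "(N)" cleanup, can be tried in isolation. The following is only a sketch under assumptions: the markup in SAMPLE_HTML and the helper names title_or_text and strip_count are hypothetical and not part of the original answer, and it uses the standard-library re module instead of pandas to keep the example small.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the first item carries the full name in title,
# the second has an empty title and only the visible <span> text.
SAMPLE_HTML = """
<ul>
  <li class="nav-item active">
    <a href="/org" title="Reserve Bank of Australia"><span>Reserve Bank of... (10)</span></a>
  </li>
  <li class="nav-item active">
    <a href="/group" title=""><span>Finance (3)</span></a>
  </li>
</ul>
"""


def title_or_text(li):
    """Prefer the link's title attribute; fall back to the visible <span> text."""
    title = li.a.get("title")
    if title:  # false for both a missing title (None) and title=""
        return title
    return li.span.get_text(strip=True)


def strip_count(label):
    """Remove a "(N)" count such as "(3)" and any surrounding whitespace."""
    return re.sub(r"\(\d+\)", "", label).strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
items = soup.find_all("li", class_="nav-item active")
labels = [strip_count(title_or_text(li)) for li in items]
print(labels)
```

On a DataFrame column, the same cleanup corresponds to .str.replace(r'\(\d+\)', '', regex=True), optionally followed by .str.strip().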