Python: parsing out the text after an href with BeautifulSoup
I'm not very good with BeautifulSoup and have run into a few problems. I just want to get these three columns into a DataFrame. Here is the code that fetches the soup data from the URL (some fields will be empty):
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
import re
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
with requests.Session() as s:
    url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
    r = s.get(url, headers=req_headers)
soup = BeautifulSoup(r.content, 'lxml')
soup
Below is the HTML I want to parse (every page has a bunch of these divs):
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Recife,
Pernambuco,
<br>
Brazil</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
250.620749M</div>
</div>
# sales revenue
sales_revenue = soup.find_all("div", {"class": "col-md-2 last"})

# location
country = soup.find_all("div", {"class": "col-md-4"})

# thought something like this would work for country, but it doesn't
classToIgnore = ["col-sm-4", "col-xs-4"]
classes = "col-md-4"
for a in soup:
    a = soup.find_all("div", class_=lambda c: classes in c and classToIgnore not in c)

# company name
for div in soup.find_all('div', class_="col-md-6"):
    x = div.find_all("a", href=re.compile("business-directory"))
    print(x)
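The result should look something like this:
revenue      location                     company
$250620749   Recife, Pernambuco, Brazil   S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL
Problems:
- The sales revenue sort of works, but not well: it picks up a lot of other information.
- The location doesn't work well.
- The company name is hard to get because it is the text after the href. I can get the hrefs, but I'm not sure how to get the text that follows the URL.
Any ideas?
To save the table on the page, you can use the following example: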
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# remove unnecessary information:
for t in soup.select('.show-mobile'):
    t.extract()

all_data = []
for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                      soup.select('#companyResults .col-md-4')[1:],
                      soup.select('#companyResults .col-md-2')[1:]):
    all_data.append({
        'Name': c1.get_text(strip=True),
        'Location': ' '.join(c2.get_text(strip=True).split()),
        'Revenue': c3.get_text(strip=True)
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
Name Location Revenue
0 S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL Recife, Pernambuco,Brazil 250.620749M
1 POINT SHOES EIRELI Franca, Sao Paulo,Brazil
2 Cooperativa Triticola Caçapavana Ltda Caçapava Do Sul, Rio Grande Do Sul,Brazil 142.786551M
3 CRT2 REPRESENTACOES EMPRESARIAIS LTDA Curitiba, Parana,Brazil
4 Mercantil Palmeirense Ltda Sao Paulo, Sao Paulo,Brazil
5 GVD IMPORTACAO E EXPORTACAO LTDA Campo Bom, Rio Grande Do Sul,Brazil
6 COOPERATIVA TRITICOLA DE GETULIO VARGAS LTDA Estacao, Rio Grande Do Sul,Brazil 75.176735M
7 Golden Distribuidora Ltda Vitoria, Espirito Santo,Brazil
8 JTF COMERCIO E REPRESENTACOES LTDA Colider, Mato Grosso,Brazil
9 MARINHO VESTUARIO LTDA Eusebio, Ceara,Brazil
10 COTIA FOODS COMERCIO E REPRESENTACAO LTDA - EM RECUPERACAO JUDICIAL Cotia, Sao Paulo,Brazil
11 FOKUS LOGISTICA LTDA Aparecida De Goiania, Goias,Brazil
12 R. SHIBUYA TENDENCIA MARKETING E REPRESENTACOES LTDA Rio De Janeiro, Rio De Janeiro,Brazil
13 TIM COMERCIO DE EMBALAGENS LTDA Belo Horizonte, Minas Gerais,Brazil
14 PEDRAFORT LTDA Sete Lagoas, Minas Gerais,Brazil
15 FARMA-RAPIDA MEDICAMENTOS E MATERIAIS ESPECIAIS SA Natal, Rio Grande Do Norte,Brazil 48.913861M
16 PORTOFINO REPRESENTACOES LTDA Botuvera, Santa Catarina,Brazil
17 NOROEST REPRESENTACOES COMERCIAIS LTDA Jaru, Rondonia,Brazil
18 LOGIMED DISTRIBUIDORA SOCIEDADE EMPRESARIA LIMITADA Sao Paulo, Sao Paulo,Brazil
19 Filon Confecções - EIRELI São Paulo, Sao Paulo,Brazil
20 LEMES & LIMA COMERCIO E LOGISTICA LTDA Goiania, Goias,Brazil
21 CERAMICA JACARANDA LTDA Ribeirao Das Neves, Minas Gerais,Brazil
22 NORDICAL REPRESENTANTE DE ALIMENTOS LTDA Jaboatao Dos Guararapes, Pernambuco,Brazil
23 QUESALON REPRESENTACAO DE PRODUTOS FARMACEUTICOS LTDA Alhandra, Paraiba,Brazil
24 ATACK REPRESENTACAO COMERCIAL LTDA Vitoria, Espirito Santo,Brazil
25 LESTE BRASILEIRA IMPORTADORA E EXPORTADORA LTDA Cariacica, Espirito Santo,Brazil
26 JUCELITO BORDIGNON & CIA LTDA Sao Sepe, Rio Grande Do Sul,Brazil
27 CASAS DA LAVOURA REPRESENTACOES COMERCIAIS LTDA Goiania, Goias,Brazil
28 UNISOAP COSMETICOS LTDA Praia Grande, Sao Paulo,Brazil
29 MOTIVA MAQUINAS LTDA Salvador, Bahia,Brazil
30 BC COSMETICOS LTDA Sao Paulo, Sao Paulo,Brazil
31 ORGANIZACOES ALMEIDA SOARES LTDA Belo Horizonte, Minas Gerais,Brazil
32 Refinitiv Brasil Servicos Economicos Ltda Sao Paulo, Sao Paulo,Brazil
33 JBC REPRESENTACOES LTDA Conchal, Sao Paulo,Brazil
34 P & P RIO DISTRIBUIDORA DE ALIMENTOS LTDA Rio De Janeiro, Rio De Janeiro,Brazil
35 FORMATTO TELHAS E TELHADOS REPRESENTACAO COMERCIAL EIRELI Jaguaruna, Santa Catarina,Brazil
36 MACLENY - DISTRIBUIDORA DE PRODUTOS DE BELEZA LTDA Sao Paulo, Sao Paulo,Brazil
37 ELG REPRESENTACAO E COMERCIO LTDA Jaragua Do Sul, Santa Catarina,Brazil
38 ELFA PRODUTOS FARMACEUTICOS E HOSPITALARES LTDA Cabedelo, Paraiba,Brazil
39 COMERCIO E EXPORTACAO DE CEREAIS MUNARETTO LTDA Bom Sucesso Do Sul, Parana,Brazil
40 RGE DISTRIBUIDORA DE BEBIDAS LTDA Montes Claros, Minas Gerais,Brazil
41 A.S. REPRESENTACAO DE EMBALAGENS LTDA Sao Paulo, Sao Paulo,Brazil
42 ON LINE TRADING SA Novo Hamburgo, Rio Grande Do Sul,Brazil 21.79624M
43 AMX COMERCIO E SERVICOS DE AUTOMOTORES LTDA Itaborai, Rio De Janeiro,Brazil
44 SOL EMBALAGENS PLASTICAS EIRELI Camacari, Bahia,Brazil
45 MJB COMERCIO DE EQUIPAMENTOS ELETRONICOS E GESTAO DE PESSOAL LTDA Cuiaba, Mato Grosso,Brazil
46 EBANOS REPRESENTACOES LTDA Estancia Velha, Rio Grande Do Sul,Brazil
47 TENXE SERVICOS DE REPRESENTACAO COMERCIAL E TELEATENDIMENTO LTDA Curitiba, Parana,Brazil
48 BRASILVEST REPRESENTACOES LTDA Gaspar, Santa Catarina,Brazil
49 EURO MED INDUSTRIA E COMERCIO LTDA Timbauba, Pernambuco,Brazil
and saves data.csv
[Screenshot: data.csv opened in LibreOffice]
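The Revenue column in the output holds strings such as 250.620749M, while the question asks for whole dollars ($250620749). Here is a minimal sketch of that conversion. It assumes the trailing M always means millions, as the "Sales Revenue ($M)" label in the HTML suggests, and that a blank cell should become None. The helper name revenue_to_dollars is my own and not part of the code above.

```python
from decimal import Decimal
from typing import Optional


def revenue_to_dollars(text: str) -> Optional[int]:
    """Convert a string such as '250.620749M' to whole dollars, or None if blank."""
    text = text.strip()
    if not text:
        return None  # many rows in the printed output have no revenue value
    if not text.endswith("M"):
        # Assumption: every non-empty value carries the 'M' (millions) suffix.
        raise ValueError(f"unexpected revenue format: {text!r}")
    # Decimal avoids binary floating-point rounding when scaling by one million.
    return int(Decimal(text[:-1]) * 1_000_000)


print(revenue_to_dollars("250.620749M"))
```

If it suits your data, this could be applied to the DataFrame with df['Revenue'].map(revenue_to_dollars).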
EDIT: To scrape multiple pages, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
params = {'page': 1}
url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html'

all_data = []
for params['page'] in range(1, 3):  # <-- increase number of pages here
    print('Page {}...'.format(params['page']))
    soup = BeautifulSoup(requests.get(url, headers=headers, params=params).content, 'html.parser')

    # remove unnecessary information:
    for t in soup.select('.show-mobile'):
        t.extract()

    for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                          soup.select('#companyResults .col-md-4')[1:],
                          soup.select('#companyResults .col-md-2')[1:]):
        all_data.append({
            'Name': c1.get_text(strip=True),
            'Location': ' '.join(c2.get_text(strip=True).split()),
            'Revenue': c3.get_text(strip=True)
        })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
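As an alternative to pairing three separate column lists with zip(), you could walk each result row and read its cells, which also shows that the company name is simply get_text() on the <a> tag. This is a rough sketch that runs offline on the HTML fragment from the question. It assumes every result row is a div with class "data", as in that fragment. I have not checked this against the live page, and parse_rows and cell_text are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# One result row, copied from the HTML in the question.
SAMPLE_HTML = """
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Recife,
Pernambuco,
<br>
Brazil</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
250.620749M</div>
</div>
"""


def cell_text(tag):
    """Return a cell's text with whitespace collapsed, or '' if the cell is missing."""
    if tag is None:
        return ""
    return " ".join(tag.get_text(" ", strip=True).split())


def parse_rows(html):
    """Return one dict per result row (hypothetical helper, not from the answer)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop the mobile-only labels such as "Country:".
    for label in soup.select(".show-mobile"):
        label.decompose()
    rows = []
    # Assumption: each result row is a <div> whose classes include "data".
    for row in soup.select("div.data"):
        rows.append({
            "Name": cell_text(row.select_one(".col-md-6 a")),  # the link text
            "Location": cell_text(row.select_one(".col-md-4")),
            "Revenue": cell_text(row.select_one(".col-md-2")),
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Because each row is read on its own, a missing cell yields an empty string for that row rather than shifting the other columns.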
You're awesome. Every time I have a BeautifulSoup question, you come up with an amazing solution. This works great. One more question: suppose I have many pages. Can I wrap all of this in a for loop that runs over each page? The URL ends with "page=x".
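On the question about looping over many pages: if you don't know the page count in advance, one option is to keep requesting page=1, 2, 3, ... until a page comes back with no rows. The sketch below shows only that loop. It assumes an empty result means you have gone past the last page, which may not hold for this site. scrape_pages, fetch_html and parse_page are hypothetical names, and the stand-in data is there only so the example runs offline.

```python
import time
from typing import Callable, Dict, List


def scrape_pages(fetch_html: Callable[[int], str],
                 parse_page: Callable[[str], List[Dict[str, str]]],
                 max_pages: int,
                 delay_seconds: float = 0.0) -> List[Dict[str, str]]:
    """Collect rows from pages 1..max_pages, stopping early at the first empty page."""
    all_rows: List[Dict[str, str]] = []
    for page in range(1, max_pages + 1):
        rows = parse_page(fetch_html(page))
        if not rows:
            break  # assumption: an empty page means we ran past the last one
        all_rows.extend(rows)
        if delay_seconds:
            time.sleep(delay_seconds)  # optional pause between requests
    return all_rows


# Stand-ins so the sketch runs offline; swap in a real fetcher and parser.
fake_pages = {1: "a,b", 2: "c", 3: ""}
rows = scrape_pages(
    fetch_html=lambda page: fake_pages.get(page, ""),
    parse_page=lambda html: [{"Name": name} for name in html.split(",") if name],
    max_pages=10,
)
print(rows)
```

With requests, fetch_html could be something like lambda page: requests.get(url, headers=headers, params={'page': page}).text, and parse_page would be whatever function turns one page's HTML into a list of row dicts.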