Python: parsing out the text after an href with BeautifulSoup

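I'm not great with BeautifulSoup, and I've run into a few issues with it.

I'm just trying to get these three columns into a dataframe.

Here is the code that gets the soup data from the URL (there will be blanks):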

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
import re

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
    r = s.get(url, headers=req_headers)
soup = BeautifulSoup(r.content, 'lxml')
soup
Here is the HTML I want to parse (there are a bunch of these divs on each page):

<div class="col-md-12 data">
                            <div class="col-md-6">
                                <a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
                                        S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
                                </div>
                            <div class="col-md-4">
                                <div class="show-mobile">Country:</div>
                                Recife,
                                Pernambuco,
                                <br>
                                Brazil</div>
                            <div class="col-md-2 last">
                                <div class="show-mobile">Sales Revenue ($M):</div>
                                250.620749M</div>
                        </div>
#sales rev
sales_revenue = soup.find_all("div", {"class": "col-md-2 last"})

#location
country = soup.find_all("div", {"class": "col-md-4"})

#thought something like this would work for country but it doesn't"
classToIgnore = ["col-sm-4", "col-xs-4"]
classes = "col-md-4"
for a in soup:
    a = soup.find_all("div", class_= lambda c: classes in c and classToIgnore not in c)

#company name
for div in soup.find_all('div',class_="col-md-6"):
    x = div.find_all("a", href=re.compile("business-directory"))
    print(x)
The result should look something like this:

revenue         location                     company
$250620749      Recife, Pernambuco, Brazil   S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL
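Problems

- Sales revenue works, but not very well: it picks up a lot of other information.

- Location doesn't really work.

- Company name is tricky because it is the text that follows the href. I can get the hrefs, but I'm not sure how to get the text after the URL.


Any ideas?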

To save the table from the page, you can use this example:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# remove unnecessary information:
for t in soup.select('.show-mobile'):
    t.extract()

all_data = []
for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                    soup.select('#companyResults .col-md-4')[1:],
                    soup.select('#companyResults .col-md-2')[1:]):
    all_data.append({
        'Name': c1.get_text(strip=True),
        'Location': ' '.join(c2.get_text(strip=True).split()),
        'Revenue': c3.get_text(strip=True)
    })
    
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:

                                                                   Name                                    Location      Revenue
0                      S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL                   Recife, Pernambuco,Brazil  250.620749M
1                                                    POINT SHOES EIRELI                    Franca, Sao Paulo,Brazil             
2                                 Cooperativa Triticola Caçapavana Ltda   Caçapava Do Sul, Rio Grande Do Sul,Brazil  142.786551M
3                                 CRT2 REPRESENTACOES EMPRESARIAIS LTDA                     Curitiba, Parana,Brazil             
4                                            Mercantil Palmeirense Ltda                 Sao Paulo, Sao Paulo,Brazil             
5                                      GVD IMPORTACAO E EXPORTACAO LTDA         Campo Bom, Rio Grande Do Sul,Brazil             
6                          COOPERATIVA TRITICOLA DE GETULIO VARGAS LTDA           Estacao, Rio Grande Do Sul,Brazil   75.176735M
7                                             Golden Distribuidora Ltda              Vitoria, Espirito Santo,Brazil             
8                                    JTF COMERCIO E REPRESENTACOES LTDA                 Colider, Mato Grosso,Brazil             
9                                                MARINHO VESTUARIO LTDA                       Eusebio, Ceara,Brazil             
10  COTIA FOODS COMERCIO E REPRESENTACAO LTDA - EM RECUPERACAO JUDICIAL                     Cotia, Sao Paulo,Brazil             
11                                                 FOKUS LOGISTICA LTDA          Aparecida De Goiania, Goias,Brazil             
12                 R. SHIBUYA TENDENCIA MARKETING E REPRESENTACOES LTDA       Rio De Janeiro, Rio De Janeiro,Brazil             
13                                      TIM COMERCIO DE EMBALAGENS LTDA         Belo Horizonte, Minas Gerais,Brazil             
14                                                       PEDRAFORT LTDA            Sete Lagoas, Minas Gerais,Brazil             
15                   FARMA-RAPIDA MEDICAMENTOS E MATERIAIS ESPECIAIS SA           Natal, Rio Grande Do Norte,Brazil   48.913861M
16                                        PORTOFINO REPRESENTACOES LTDA             Botuvera, Santa Catarina,Brazil             
17                               NOROEST REPRESENTACOES COMERCIAIS LTDA                       Jaru, Rondonia,Brazil             
18                  LOGIMED DISTRIBUIDORA SOCIEDADE EMPRESARIA LIMITADA                 Sao Paulo, Sao Paulo,Brazil             
19                                            Filon Confecções - EIRELI                 São Paulo, Sao Paulo,Brazil             
20                               LEMES & LIMA COMERCIO E LOGISTICA LTDA                       Goiania, Goias,Brazil             
21                                              CERAMICA JACARANDA LTDA     Ribeirao Das Neves, Minas Gerais,Brazil             
22                             NORDICAL REPRESENTANTE DE ALIMENTOS LTDA  Jaboatao Dos Guararapes, Pernambuco,Brazil             
23                QUESALON REPRESENTACAO DE PRODUTOS FARMACEUTICOS LTDA                    Alhandra, Paraiba,Brazil             
24                                   ATACK REPRESENTACAO COMERCIAL LTDA              Vitoria, Espirito Santo,Brazil             
25                      LESTE BRASILEIRA IMPORTADORA E EXPORTADORA LTDA            Cariacica, Espirito Santo,Brazil             
26                                        JUCELITO BORDIGNON & CIA LTDA          Sao Sepe, Rio Grande Do Sul,Brazil             
27                      CASAS DA LAVOURA REPRESENTACOES COMERCIAIS LTDA                       Goiania, Goias,Brazil             
28                                              UNISOAP COSMETICOS LTDA              Praia Grande, Sao Paulo,Brazil             
29                                                 MOTIVA MAQUINAS LTDA                      Salvador, Bahia,Brazil             
30                                                   BC COSMETICOS LTDA                 Sao Paulo, Sao Paulo,Brazil             
31                                     ORGANIZACOES ALMEIDA SOARES LTDA         Belo Horizonte, Minas Gerais,Brazil             
32                            Refinitiv Brasil Servicos Economicos Ltda                 Sao Paulo, Sao Paulo,Brazil             
33                                              JBC REPRESENTACOES LTDA                   Conchal, Sao Paulo,Brazil             
34                            P & P RIO DISTRIBUIDORA DE ALIMENTOS LTDA       Rio De Janeiro, Rio De Janeiro,Brazil             
35            FORMATTO TELHAS E TELHADOS REPRESENTACAO COMERCIAL EIRELI            Jaguaruna, Santa Catarina,Brazil             
36                   MACLENY - DISTRIBUIDORA DE PRODUTOS DE BELEZA LTDA                 Sao Paulo, Sao Paulo,Brazil             
37                                    ELG REPRESENTACAO E COMERCIO LTDA       Jaragua Do Sul, Santa Catarina,Brazil             
38                      ELFA PRODUTOS FARMACEUTICOS E HOSPITALARES LTDA                    Cabedelo, Paraiba,Brazil             
39                      COMERCIO E EXPORTACAO DE CEREAIS MUNARETTO LTDA           Bom Sucesso Do Sul, Parana,Brazil             
40                                    RGE DISTRIBUIDORA DE BEBIDAS LTDA          Montes Claros, Minas Gerais,Brazil             
41                                A.S. REPRESENTACAO DE EMBALAGENS LTDA                 Sao Paulo, Sao Paulo,Brazil             
42                                                   ON LINE TRADING SA     Novo Hamburgo, Rio Grande Do Sul,Brazil    21.79624M
43                          AMX COMERCIO E SERVICOS DE AUTOMOTORES LTDA             Itaborai, Rio De Janeiro,Brazil             
44                                      SOL EMBALAGENS PLASTICAS EIRELI                      Camacari, Bahia,Brazil             
45    MJB COMERCIO DE EQUIPAMENTOS ELETRONICOS E GESTAO DE PESSOAL LTDA                  Cuiaba, Mato Grosso,Brazil             
46                                           EBANOS REPRESENTACOES LTDA    Estancia Velha, Rio Grande Do Sul,Brazil             
47     TENXE SERVICOS DE REPRESENTACAO COMERCIAL E TELEATENDIMENTO LTDA                     Curitiba, Parana,Brazil             
48                                       BRASILVEST REPRESENTACOES LTDA               Gaspar, Santa Catarina,Brazil             
49                                   EURO MED INDUSTRIA E COMERCIO LTDA                 Timbauba, Pernambuco,Brazil             
and saves data.csv (screenshot from LibreOffice).
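As a complement to the column-wise zip above, here is a minimal sketch that walks each result row instead, run against the sample div from the question so it needs no network access. It shows that the company name is just the link's text (get_text on the a tag), and that passing a separator to get_text keeps the space before the country. The parse_rows helper and the Href key are hypothetical names introduced for illustration. The sketch also assumes every result row carries the data class, as in the sample, and that any header row has no link.

```python
from bs4 import BeautifulSoup

# Sample row copied from the HTML in the question.
html = """
<div class="col-md-12 data">
  <div class="col-md-6">
    <a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
        S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
  </div>
  <div class="col-md-4">
    <div class="show-mobile">Country:</div>
    Recife,
    Pernambuco,
    <br>
    Brazil</div>
  <div class="col-md-2 last">
    <div class="show-mobile">Sales Revenue ($M):</div>
    250.620749M</div>
</div>
"""


def parse_rows(markup):
    """Hypothetical helper: return one dict per result row in the markup."""
    soup = BeautifulSoup(markup, "html.parser")

    # Drop the mobile-only labels ("Country:", "Sales Revenue ($M):").
    for label in soup.select(".show-mobile"):
        label.decompose()

    rows = []
    for row in soup.select("div.data"):
        link = row.select_one(".col-md-6 a")
        location = row.select_one(".col-md-4")
        revenue = row.select_one(".col-md-2")
        # Assumption: rows missing any of the three cells (e.g. a header) are skipped.
        if link is None or location is None or revenue is None:
            continue
        rows.append({
            # The text "after the href" is simply the text inside the <a> tag.
            "Name": link.get_text(strip=True),
            # The " " separator keeps a space where the <br> splits the text.
            "Location": " ".join(location.get_text(" ", strip=True).split()),
            "Revenue": revenue.get_text(strip=True),
            "Href": link["href"],
        })
    return rows


rows = parse_rows(html)
for row in rows:
    print(row["Name"], "|", row["Location"], "|", row["Revenue"])
```

Because the three cells are looked up inside the same row, a missing cell cannot shift the columns out of alignment, which is a risk when zipping three independent lists.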


EDIT: To scrape multiple pages, use this example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
params = {'page': 1}
url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html'

all_data = []
for page in range(1, 3):  # <-- increase number of pages here
    params['page'] = page
    print('Page {}...'.format(page))
    soup = BeautifulSoup(requests.get(url, headers=headers, params=params).content, 'html.parser')

    # remove unnecessary information:
    for t in soup.select('.show-mobile'):
        t.extract()

    for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                        soup.select('#companyResults .col-md-4')[1:],
                        soup.select('#companyResults .col-md-2')[1:]):
        all_data.append({
            'Name': c1.get_text(strip=True),
            'Location': ' '.join(c2.get_text(strip=True).split()),
            'Revenue': c3.get_text(strip=True)
        })
    
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
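The question's desired output shows revenue as a whole-dollar figure (250620749), while the scraped Revenue column holds strings such as 250.620749M or an empty string. The following is a minimal sketch of one way to convert them. The revenue_to_dollars helper and the RevenueUSD column are hypothetical names, and the sketch assumes the only suffix that occurs is M (millions), which is all the sample output shows.

```python
from decimal import Decimal

import pandas as pd


def revenue_to_dollars(text):
    """Convert a string like '250.620749M' to whole dollars; blank input gives None."""
    text = text.strip()
    if not text:
        return None
    # Assumption: values always end in "M" (millions), as in the sample output.
    if not text.endswith("M"):
        raise ValueError("unexpected revenue format: {!r}".format(text))
    # Decimal avoids binary floating-point rounding; int() truncates any fraction.
    return int(Decimal(text[:-1]) * 1_000_000)


# The first three rows of the printed output above.
df = pd.DataFrame({
    "Name": [
        "S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL",
        "POINT SHOES EIRELI",
        "Cooperativa Triticola Caçapavana Ltda",
    ],
    "Revenue": ["250.620749M", "", "142.786551M"],
})

# The nullable Int64 dtype keeps whole numbers while still allowing missing values.
df["RevenueUSD"] = pd.array(
    [revenue_to_dollars(value) for value in df["Revenue"]],
    dtype="Int64",
)
print(df[["Name", "RevenueUSD"]])
```

Rows with no published revenue stay missing rather than becoming zero, so sums and averages are not skewed by companies that simply have no figure listed.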
You're awesome. Every time I have a BeautifulSoup question, you come up with an amazing solution. This works great. One more question: suppose I have a lot of pages. Can I write a for loop over the pages that collects all of this information? The URL ends with "page=x".