Python web scraping: handling multiple observations per ID

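Please excuse my English. I am new to Python and doing my best to learn by doing. I want to collect some data from a yellow pages website. I am facing two main problems:

The number of phone numbers per company is inconsistent (some companies have none, some have several). How can I collect the company names or IDs and match them correctly with their phone numbers?
I also cannot get the companies' email addresses or websites. Regards

Here is my code: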

import pandas as pd
from bs4 import BeautifulSoup as bs
import urllib.request

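# Placeholder URL; urlopen() needs a full URL including the scheme
url = 'http://www.somesite.com'
page = urllib.request.urlopen(url, timeout=5)
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(page, 'html.parser')
soup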

# Company name
entreprise = soup.find_all('a', {'class': 'some_class_address_link_1'})
entreprise
entreprises = []
for e in entreprise:
    # Keep only the text and strip newlines and double quotes
    e = e.text
    e = e.replace('\n', '')
    e = e.replace('"', '')
    entreprises.append(e)
entreprises

# Business sector
Domain = soup.find_all('div', {'class': 'some_class_address_link_2'})
Domain
Domaine = []
for e in Domain:
    e = e.text
    e = e.replace('\n', '')
    e = e.replace('"', '')
    Domaine.append(e)
Domaine

# Address
Addr = soup.find_all('address', {'class': 'some_class_address_link_3'})
Addr
Adresse = []
for e in Addr:
    e = e.text
    e = e.replace('\n', '')
    e = e.replace('"', '')
    Adresse.append(e)
Adresse

# Company profile (fiche)
Fich = soup.find_all('div', {'class': 'some_class_address_link_4'})
Fich
Fiche = []
for e in Fich:
    e = e.text
    e = e.replace('\n', '')
    e = e.replace('"', '')
    Fiche.append(e)
Fiche

# Telephone
tel = soup.find_all('a', {'class': 'some_class_address_link_5'})
Telephone = []
for e in tel:
    e = e.text
    e = e.replace('\n', '')
    e = e.replace('"', '')
    Telephone.append(e)
Telephone

# Save the results to a CSV file
df = pd.DataFrame({'Nom de l'entreprise':entreprises,'Domaine d'Activités':Domaine,'Adresse':Adresse,'Fiche':Fiche,'Telephone':Telephone})

df.to_csv('company_full.csv', index=False, encoding='utf-8')

data = pd.read_csv('company_full.csv')
data 


When I run it, I get the following error:

  File "", line 1
    df = pd.DataFrame({'Nom de l'entreprise':entreprises,'Domaine d'Activités':Domaine,'Adresse':Adresse,'Fiche':Fiche,'Telephone':Telephone})
    ^
SyntaxError: invalid syntax (11 lines skipped)

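My guess is that this happens because the vectors are not the same size, probably because some companies have no phone number, some have exactly one, and others have several phone lines.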
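
The SyntaxError is raised before any list is compared, so it is most likely unrelated to the list sizes: the apostrophes in 'Nom de l'entreprise' and 'Domaine d'Activités' end the single-quoted strings early. The following is a minimal sketch, using invented sample lists rather than data from the site, of writing such a key in double quotes and checking the column lengths. Lists of unequal length would make pd.DataFrame raise a ValueError, not a SyntaxError.

```python
# Sample values are invented for illustration; they are not from the site.
entreprises = ['Company A', 'Company B', 'Company C']
Telephone = ['01 23 45 67', '01 98 76 54', '01 11 22 33', '01 44 55 66']

# Double quotes let a key contain an apostrophe without ending the string.
columns = {
    "Nom de l'entreprise": entreprises,
    'Telephone': Telephone,
}

# Report each column's length; pd.DataFrame(columns) needs them to be equal.
lengths = {name: len(values) for name, values in columns.items()}
same_length = len(set(lengths.values())) == 1
print(lengths)
print('All columns have the same length:', same_length)
```

Checking the lengths first shows which field is out of step with the others before the DataFrame is built.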

My question is: how can I get two or more variables, for example the company and telephone variables, from the same soup with a single find_all() call?
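
One common approach, sketched here under assumptions, is to loop over the element that wraps each company listing and search inside that element, so every phone number stays attached to its own company. The wrapper class `some_class_listing` and the sample HTML are hypothetical; the real page structure is not shown in the question and would need to be inspected.

```python
from bs4 import BeautifulSoup

# Invented sample HTML; 'some_class_listing' is a hypothetical wrapper class.
html = """
<div class="some_class_listing">
  <a class="some_class_address_link_1">Company A</a>
  <a class="some_class_address_link_5">01 23 45 67</a>
  <a class="some_class_address_link_5">01 98 76 54</a>
</div>
<div class="some_class_listing">
  <a class="some_class_address_link_1">Company B</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for listing in soup.find_all('div', {'class': 'some_class_listing'}):
    # Search inside this listing only, not the whole page.
    name_tag = listing.find('a', {'class': 'some_class_address_link_1'})
    phone_tags = listing.find_all('a', {'class': 'some_class_address_link_5'})
    rows.append({
        "Nom de l'entreprise": name_tag.get_text(strip=True) if name_tag else '',
        # Join zero, one, or many numbers into a single cell.
        'Telephone': '; '.join(t.get_text(strip=True) for t in phone_tags),
    })

print(rows)
```

Because every row is a dict with the same keys, passing `rows` to `pd.DataFrame` yields columns of equal length. If one row per phone number is preferred, one dict per phone tag can be appended instead of joining the numbers.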

Your question needs more detail. Which modules are you using?