Fetching data from the BBB website using Python and BeautifulSoup


I'm using Python and BeautifulSoup to scrape business listings from the BBB website.

My code worked fine on Yelp and Yellow Pages, but when I switched to a BBB search URL I started getting errors:

from bs4 import BeautifulSoup
import requests
import sys
import csv
## Get the min and max page numbers
pagenum = 0
maxpage = 0
## Loop through the pages
while pagenum <= maxpage:

    page = 'https://www.bbb.org/search?find_country=USA&find_entity=60980-000&find_id=396_60980-000_alias&find_latlng=40.762801%2C-73.977818&find_loc=New%20York%2C%20NY&find_text=web%20development&find_type=Category&page=2'
    source = requests.get(page).text
    soup = BeautifulSoup(source, 'lxml')
    pagenum = pagenum + 10
    for PParentDiv in soup.find_all('div', class_="fbHYdT MuiPaper-rounded"):
        try:
            PName = PParentDiv.find('a', class_='Name-sc-1srnbh5-0').get_text()
            print(PName)
        except Exception as e:
            print('notworking')
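As an aside, the loop above requests the same hardcoded `page=2` URL on every iteration, so incrementing `pagenum` never changes what is fetched. A minimal sketch of building the paginated URL from the counter instead (parameter names copied from the URL in the question; only `page` is assumed to vary) could look like:

```python
from urllib.parse import urlencode

BASE = 'https://www.bbb.org/search'

def build_page_url(page):
    # Query parameters taken from the URL in the question;
    # only `page` varies between requests.
    params = {
        'find_country': 'USA',
        'find_text': 'web development',
        'find_loc': 'New York, NY',
        'find_type': 'Category',
        'page': page,
    }
    return BASE + '?' + urlencode(params)

print(build_page_url(1))
```

Looping `for page in range(1, maxpage + 1): requests.get(build_page_url(page))` then fetches each page once instead of refetching page 2.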
You can easily regex out the JSON from the script tag that contains this information, then parse it with the json library. The advantage is that in the data variable you actually have everything. I extract the name, address, and phone from it.

import requests, re, json

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.bbb.org/search?find_country=USA&find_entity=60980-000&find_id=396_60980-000_alias&find_latlng=40.762801%2C-73.977818&find_loc=New%20York%2C%20NY&find_text=web%20development&find_type=Category&page=2', headers=headers)
# Pull out the JSON blob assigned to __PRELOADED_STATE__ in the page source
p = re.compile(r'PRELOADED_STATE__ = (.*?);')
data = json.loads(p.findall(r.text)[0])
results = [(item['businessName'], ' '.join([item['address'], item['city'], item['state'], item['postalcode']]), item['phone']) for item in data['searchResult']['results']]
print(results)
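Since the question imports `csv` but never uses it, a minimal sketch of saving the `(name, address, phone)` tuples produced above to a file might look like this (the `results` list here is hypothetical sample data in the same shape as the answer's list comprehension):

```python
import csv

# Hypothetical sample rows in the (name, address, phone) shape
# that the answer's list comprehension produces.
results = [
    ('Template Studios/Jinx Studios',
     '115 East 57th St New York NY 10022',
     '(212) 555-0100'),
]

with open('bbb_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'address', 'phone'])  # header row
    writer.writerows(results)
```

`newline=''` is the documented way to open CSV files for writing on Windows, which matters here given the `E:\Python\...` path in the comments.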
Try this, without regex:

import json

# `soup` is the BeautifulSoup object built in the question's code
scr = soup.find_all('script', id="BbbDtmData")
scr2 = soup.find_all('div', class_="Details-sc-1vh1927-0 hHqWfJ")

companies = []
ids = []

# Each result div contributes its name (<a>) and address (<strong>)
for co in range(len(scr2)):
    companies.append(scr2[co].find('a').text)
    companies.append(scr2[co].find('strong').text)

# The business ids live in a JSON blob assigned to bbbDtmData
id_dat = scr[0].text
target = id_dat.split('var bbbDtmData = ')
data = json.loads(target[1])
final = data['search']['results']
for i in final:
    ids.append(i['businessId'])

for co, id in zip(companies, ids):
    print(co, id)
Output for the linked page:

Template Studios/Jinx Studios 94645
115 East 57th St, New York, NY 10022 144428
Roark Tech Services 120257
 New York, NY 10017-2452 85275

And so on.

What is your desired result? — The desired result is to get each company's name, phone number, and address. — I see, that's what you're after. — Works like a charm! Could you explain it a little, and point me to a reference so I can learn more? Thanks. Also, I'm now getting this error: Traceback (most recent call last): File "E:\Python\Python36\scraper\bb.py", line 23, in data = json.loads(p.findall(r.text)[0]) json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 81902 (char 81901) — This is exactly right and cleared up my thinking, thanks.