Python JSONDecodeError: Expecting value: line 1 column 1 (char 0) and json.loads(fragment)

Tags: python, json, web-scraping, beautifulsoup

As a beginner, I started practicing web scraping with "Practical Web Scraping for Data Science". While working through it, I ran into "JSONDecodeError: Expecting value: line 1 column 1 (char 0)" right from the start. It would be a great help if someone could assist me.

# Required packages
import requests
import json 
import re 
from bs4 import BeautifulSoup as bs
import dataset

# Creating Dataset into Mongodb / SQLite
db = dataset.connect('sqlite:/// reviews.db')

review_url = 'https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/'
product_id = '1449355730'
session = requests.Session()
session.headers.update({
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' +
    '(KHTML, like Gecko) Chrome/ 62.0.0.3202.62 Safari/537.36'
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))


def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = bs(json_fragment[2], 'html.parser')
        div = html_soup.find('div', class_='review')
        if not div:
            continue
        review_id = div.get('id')
        # find & clean the rating : 
        review_classes = ' '.join(html_soup.find(class_ = 'review-rating').get('class'))
        rating = re.search('a-star-(\d+)', review_classes).group(1)
        title = html_soup.find(class_='review-title').get_text(strip = True)
        review = html_soup.find(class_='review-text').get_text(strip = True)
        review.append({'review_id' : review_id,
                      'rating' : rating,
                      'title' : title,
                      'review' : review})
    return reviews


def get_reviews(product_id, page):
    data = {
        'sortBy' : '',
        'reveiwerType' : 'all_reviews',
        'formatType' : '',
        'mediaType' : '',
        'filterByStar' : 'all_stars',
        'pageNumber' : page,
        'filterByKeyword' : '',
        'shouldAppend' : 'undefined',
        'deviceType' : 'desktop',
        'reftag' : 'cm_cr_getr_d_paging_btm_{}'.format(page),
        'pageSize' : 15,
        'asin' : product_id,
        'scope' : 'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data = data)
    reviews = parse_reviews(r.text)
    return reviews

page = 1
while True:
    print("Scraping page", page)
    reviews = get_reviews(product_id, page)
    if not reviews:
        break
    for review in reviews:
        print(' -', review['rating'], review['title'])
        db['reviews'].upsert(review, ['review_id'])
    page += 1
It gives me the error message below:

JSONDecodeError                          Traceback (most recent call last)
<ipython-input-5-75cef79b98a4> in <module>
     60 while True:
     61     print("Scraping page", page)
---> 62     reviews = get_reviews(product_id, page)
     63     if not reviews:
     64         break

<ipython-input-5-75cef79b98a4> in get_reviews(product_id, page)
     54     }
     55     r = session.post(review_url + 'ref=' + data['reftag'], data = data)
---> 56     reviews = parse_reviews(r.text)
     57     return reviews
     58 

<ipython-input-5-75cef79b98a4> in parse_reviews(reply)
     17         if not fragment.strip():
     18             continue
---> 19         json_fragment = json.loads(fragment)
     20         if json_fragment[0] != 'append':
     21             continue

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Please help me; I have tried everything, but I am still stuck.
As mentioned in the comments, fragment may not be valid JSON (it wasn't when I checked). I suspect the book is a few years out of date, so the examples/code it uses may no longer work. I just took a quick look, and it seems Amazon did indeed change a few things.
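
If you want to see for yourself what the endpoint returns, here is a minimal debugging sketch (inspect_fragments is just an illustrative helper name; it assumes the same '&&&'-separated reply format as above) that wraps json.loads in a try/except and prints whatever fails to parse:

import json

def inspect_fragments(reply):
    # Print every fragment that is not valid JSON, so you can see what
    # Amazon actually sent back (often an HTML error page, or nothing at all)
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        try:
            json.loads(fragment)
        except json.JSONDecodeError as e:
            print('Bad fragment:', repr(fragment[:200]), '->', e)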

This did work for me; I noted the slight changes so you can compare. I also commented out the MongoDB stuff, since this is more a question about the web scrape; I don't know whether that part gives you any errors:

# Required packages
import requests
import json 
import re 
from bs4 import BeautifulSoup as bs
#import dataset

# Creating Dataset into Mongodb / SQLite
#db = dataset.connect('sqlite:/// reviews.db')


review_url = 'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/' #<-- slight change
product_id = '1449355730'
session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'
})
url = 'https://www.amazon.com/product-reviews/{}/'.format(product_id)
session.get(url)


def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = bs(json_fragment[2], 'html.parser')
        div = html_soup.find('div', {'data-hook':'review'}) #<-- changed
        if not div:
            continue
        review_id = div.get('id')
        # find & clean the rating : 
        review_classes = ' '.join(html_soup.find(class_ = 'review-rating').get('class'))
        rating = re.search(r'a-star-(\d+)', review_classes).group(1)  # raw string avoids the invalid \d escape warning
        title = html_soup.find(class_='review-title').get_text(strip = True)
        review = html_soup.find(class_='review-text').get_text(strip = True)
        reviews.append({'review_id' : review_id,             #<-- likely a typo in the book: you should be appending to reviews, not review
                      'rating' : rating,
                      'title' : title,
                      'review' : review})
    return reviews


def get_reviews(product_id, page):
    data = {
        'sortBy' : '',
        'reveiwerType' : 'all_reviews',
        'formatType' : '',
        'mediaType' : '',
        'filterByStar' : 'all_stars',
        'pageNumber' : page,
        'filterByKeyword' : '',
        'shouldAppend' : 'undefined',
        'deviceType' : 'desktop',
        'reftag' : 'cm_cr_getr_d_paging_btm_{}'.format(page),
        'pageSize' : 15,
        'asin' : product_id,
        'scope' : 'reviewsAjax2' #<-- changed
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data = data)
    reviews = parse_reviews(r.text)
    return reviews

page = 1
while True:
    print("Scrapping page", page)
    reviews = get_reviews(product_id, page)
    if not reviews:
        break
    for review in reviews:
        print(' -', review['rating'], review['title'])
        #db['reviews'].upsert(review, ['review_id'])
    page += 1

You need to check fragment: is it a valid JSON string? I think you mean "scraping"; scrapping means to discard. The error is a hint that fragment is an empty string when you try to json.loads it.
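
You can reproduce that exact message with the simplest failing input, an empty string:

>>> import json
>>> json.loads('')
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

For reference, the working code above prints output like this: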
Scrapping page 1
 - 5 Best Python book for a beginner
 - 2 Thorough but bloated
 - 5 let me try to explain why this 1600 page book may actually end up saving you a lot of time and making you a better Python progra
 - 3 Very dense. Too much apology for being dense. Very detailed, yet inefficient.
 - 5 The book is long because it's thorough, and it's a quality book
 - 4 The Python Bible - not for beginners
 - 1 Making Python, and programming, the most boring experience you can think of
 - 4 Not great for learning, good object oriented chapters
 - 5 Perfect for ... in-between noob and professional, and wanting a deep understanding
 - 3 I think there might be an excellent 300-page book somewhere in these 1500 pages
 - 5 A Mark Lutz Trifecta of Python Winners
 - 5 Perfect for self-learners of Python
 - 5 Excellent Reference (Probably not for beginners)
 - 3 I'm glad it's here but it needs to be two books.
 - 4 From Noob to Expert
Scrapping page 2
 - 5 This is the real deal.  The full Python experience
 - 1 Incredibly verbose and repetitve.
 - 5 Very good Python beginner to intermediate book for an experienced programmer
 - 1 Bloated and not very useful
 - 5 Yeah it's that long for a reason
 - 3 Not bad, but not recommended, especially not for beginners.
 - 2 Too much fluff
 - 5 This is most comprehensive for beginner to build solid foundation for python programming! Must buy! Believe me!
 - 3 Broad, but occasionally confusing and unfocused
 - 4 Really Good Overall, But Long-Winded
 - 5 Book is up-to-date despite publication date
 - 5 This is the BEST book on the Python programming language I have found.
 - 5 Highly recommend for the new user (avoid being put off by the length of the text)
 - 5 Terrific book
 - 5 Great start, and written for the novice
Scrapping page 3
 - 4 Great Book but, geez, 8-point type?
 - 5 Incredibly detailed, thorough, but not a quick read
 - 2 Very wordy beginning programming with Python.
 - 5 A great tool for achieving Python programming expertise
 - 3 Brief and honest review
....