Web scraping: can't understand ValueError: invalid literal for int() with base 10: 'تومان'


My crawler is not working properly and I can't find a solution.

Here is the relevant part of my spider:

def parse(self, response):
        original_price=0
        discounted_price=0
        star=0
        discounted_percent=0
        try:
            for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
                title= product.xpath(".//div/div[2]/div/div/a/text()").get()
                if product.xpath(".//div/div[2]/div[2]/div[1]/text()"):
                    star= float(str(product.xpath(".//div/div[2]/div[2]/div[1]/text()").get()))
                if product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()"):
                    discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
                if product.xpath(".//div/div[2]/div[3]/div/div/div/text()"):
                    discounted_price= int(str(product.xpath(".//div/div[2]/div[3]/div/div/div/text()").get().strip()).replace(',', ''))
                if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
                    original_price= int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
                    discounted_amount= original_price-discounted_price
                else:
                    original_price= print("not available")
                    discounted_amount= print("not available")
                url= response.urljoin(product.xpath(".//div/div[2]/div/div/a/@href").get())
Here is my log:

2020-10-21 16:49:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-21 16:49:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-21 16:49:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.digikala.com/search/category-book/> (referer: None)
Traceback (most recent call last):
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\shima\projects\digi_allbooks\digi_allbooks\spiders\allbooks.py", line 31, in parse
    discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
ValueError: invalid literal for int() with base 10: 'تومان'
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 16:49:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 939,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 90506,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 21, 13, 19, 57, 630044),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2020, 10, 21, 13, 19, 55, 914304)}
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Spider closed (finished)
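I guess it is saying that a string was passed to int(), which raises the ValueError, but the XPath I am using targets a number, not a string.
I can't make sense of the error, so I can't find a fix. Can someone help me?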

In at least one of the iterations, this line is scraping 'تومان' instead of an integer:

discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))

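From a Google search, that appears to be a unit of currency. You need to fix your XPath, or have the spider ignore this return value, since this item has no discount.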

For what you are trying to do, this XPath may be a better choice (although I have not checked it against all items):

product.xpath('.//div[@class="c-price__discount-oval"]/span/text()').get()
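
For the other option, having the spider ignore a value that is not a number, a small helper can be wrapped around the conversion. This is only a sketch: the name to_int is made up for illustration, and it assumes the only extra characters to strip are the '٪' sign and ',' separators that the question's code already removes.

```python
def to_int(text):
    """Return text as an int, or None when it is missing or not numeric."""
    if text is None:
        # .get() returns None when the XPath matches nothing
        return None
    # Drop surrounding whitespace, the percent sign and thousands separators
    cleaned = text.strip().replace('٪', '').replace(',', '')
    try:
        return int(cleaned)
    except ValueError:
        # A label such as 'تومان' ends up here instead of crashing the spider
        return None
```

In the spider it could be used as discounted_percent = to_int(product.xpath(...).get()), followed by a check for None before the value is used in any arithmetic.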