Removing whitespace between scraped data - Python

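I'm trying to scrape some data from a website and save it to a csv file. When I get the scraped data, there is a huge gap of whitespace between each line. I'd like to be able to remove this unnecessary whitespace. Here is my code: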

from bs4 import BeautifulSoup
import requests
import csv

#URL to be scraped
url_to_scrape = 'https://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/CategoryDisplay?langId=44&storeId=10151&catalogId=10241&categoryId=310864&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&beginIndex=0&promotionId=&listId=&searchTerm=&hasPreviousOrder=&previousOrderId=&categoryFacetId1=&categoryFacetId2=&ImportedProductsCount=&ImportedStoreName=&ImportedSupermarket=&bundleId=&parent_category_rn=13343&top_category=13343&pageSize=120#langId=44&storeId=10151&catalogId=10241&categoryId=310864&parent_category_rn=13343&top_category=13343&pageSize=120&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&searchTerm=&beginIndex=0&hideFilters=true'
#Load html's plain data into a variable
plain_html_text = requests.get(url_to_scrape)
#parse the data
soup = BeautifulSoup(plain_html_text.text, "lxml")
#
# #Get the name of the class

csv_file = open('sainsburys.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Description','Price'])

for name_of in soup.find_all('li',class_='gridItem'):
    name = name_of.h3.a.text
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text
        print(pricen)
        csv_writer.writerow([name, pricen])
    except:
        print('Sold Out')
        print()

csv_writer.writerow([name, pricen])
csv_file.close()
The result I get is:

                                       J. James Chicken Goujons 270g



        £1.25/unit


                                        Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)



        £1.90/kg


                                        Sainsbury's British Fresh Chicken Fajita Mini Fillets 320g



        £2.55/unit


                                        Sainsbury's Slow Cook Fire Cracker Chicken 573g



        £4.75/unit

Thanks

You can use .strip(). It removes leading and trailing whitespace.

>>> s = "      I'm a sentence     "
>>> s.strip()
"I'm a sentence"

Applied to your problem:

for name_of in soup.find_all('li', class_='gridItem'):
    name = name_of.h3.a.text.strip()
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text.strip()
        print(pricen)
        csv_writer.writerow([name, pricen])
    except:
        print('Sold Out')
        print()

csv_writer.writerow([name, pricen])
csv_file.close()
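I can't test this without being able to reproduce your code. However, if pricen and name are strings with significant leading and trailing whitespace, this should work.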



I hope this helps.
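
As a follow-up to the snippet above, here is a minimal sketch of the CSV-writing step on its own. It rests on a few assumptions: the `scraped` list and the `write_rows` helper are hypothetical stand-ins for the values pulled out of the page, and `None` is used to mark a product with no price. Two details may be worth a look in the original code: the `writerow` call after the loop writes one extra row, and the `csv` module documentation recommends opening the file with `newline=''`, because otherwise platforms that use `\r\n` line endings can end up with a blank line between rows.

```python
import csv
import io

# Hypothetical (name, price) pairs standing in for the scraped values;
# None marks a product with no price block ("Sold Out").
scraped = [
    ("\n      J. James Chicken Goujons 270g\n\n", "\n   £1.25/unit\n"),
    ("\n      Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)\n", None),
]


def write_rows(out, rows):
    """Write a header plus one stripped row per product that has a price."""
    writer = csv.writer(out)
    writer.writerow(["Description", "Price"])
    for name, price in rows:
        if price is None:
            continue  # skip sold-out items instead of reusing a stale price
        writer.writerow([name.strip(), price.strip()])


# An in-memory buffer keeps the sketch runnable anywhere. For a real file,
# use: with open("sainsburys.csv", "w", newline="", encoding="utf-8") as f:
buffer = io.StringIO(newline="")
write_rows(buffer, scraped)
print(buffer.getvalue())
```

Each product is written exactly once, inside the loop, so no row is repeated after the loop ends.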


.strip() will remove all whitespace characters from both sides:

>>> "              a             ".strip()
'a'

Just apply this to each value before you print it.
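
To make the behaviour concrete, here is a small sketch using strings modelled on the output in the question. The exact amount of whitespace in `raw_name` and `raw_price` is an assumption, and `clean` is a hypothetical helper, not something from the original code. It also shows `lstrip` and `rstrip`, and one way to collapse runs of whitespace inside a string, which `strip` alone does not touch.

```python
# Sample strings modelled on the scraped output in the question
# (the exact whitespace is assumed for illustration).
raw_name = "\n\n                    Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)\n\n\n"
raw_price = "\n        £1.90/kg\n\n"


def clean(text):
    """Strip both ends and collapse internal whitespace runs to single spaces."""
    return " ".join(text.split())


print(repr(raw_name.strip()))    # both ends removed
print(repr(raw_price.lstrip()))  # only leading whitespace removed
print(repr(raw_price.rstrip()))  # only trailing whitespace removed
print(clean("J. James   Chicken\n Goujons   270g"))
```

`strip` treats spaces, tabs and newlines alike, which is why it also removes the blank lines around each scraped value.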

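If you log your network traffic and filter it to show only XHR resources, you will find one that talks to an AJAX web application. It talks to the server, which generates HTML (unfortunately not pure JSON: it is HTML embedded in a JSON response). This isn't strictly necessary, since your code already seems to scrape the page. However, it is a neater way of getting the products, and you don't have to worry about things like pagination. As others have already pointed out, use str.strip to remove leading and trailing whitespace. In this example I only print the first ten products (of 114). Yes, I realize I could have appended the query string to the url instead of creating a params dict, but this way it is easier to read and change: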

import requests
from bs4 import BeautifulSoup


class Product:

    def __init__(self, html):
        # Parse the HTML fragment for a single product
        soup = BeautifulSoup(html, "html.parser")
        # The last word of the link text is treated as the weight, the rest as the name
        self.name, _, self.weight = soup.find("a").text.strip().rpartition(" ")
        self.price_per_unit = soup.find("p", {"class": "pricePerUnit"}).text.strip()
        self.price_per_measure = soup.find("p", {"class": "pricePerMeasure"}).text.strip()

    def __str__(self):
        return f"\"{self.name}\" ({self.weight}) - {self.price_per_unit}"

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

params = {
    "langId": "44",
    "storeId": "10151",
    "catalogId": "10241",
    "categoryId": "310864",
    "parent_category_rn": "13343",
    "top_category": "13343",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "searchTerm": "",
    "beginIndex": "0",
    "hideFilters": "true",
    "requesttype": "ajax"
}

response = requests.get(url, params=params)
response.raise_for_status()

product_info = response.json()[4]["productLists"][0]["products"]

products = [Product(p["result"]) for p in product_info[:10]]

for product in products:
    print(product)
Output:

"Sainsbury's Chicken Thigh Fillets" (640g) - £3.40/unit
"Sainsbury's Mini Chicken Breast Fillets" (320g) - £2.00/unit
"Sainsbury's Chicken Thighs" (1kg) - £1.95/unit
"Sainsbury's Chicken Breast Fillets" (300g) - £1.70/unit
"Sainsbury's Chicken Drumsticks" (1kg) - £1.70/unit
"Sainsbury's Chicken Thigh Fillets" (320g) - £1.85/unit
"Sainsbury's Chicken Breast Diced" (410g) - £2.40/unit
"Sainsbury's Chicken Small Whole Bird" (1.35kg) - £2.80/unit
"Sainsbury's Chicken Thighs & Drumsticks" (540g) - £1.00/unit
"Sainsbury's Chicken Breast Fillets" (640g) - £3.60/unit
>>> product.price_per_measure
'£5.63/kg'
>>> 
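
The name and weight split in `Product.__init__` relies on `str.rpartition`, which splits on the last occurrence of the separator. The sketch below isolates that step; `split_name_and_weight` is a hypothetical helper added for illustration, and the second and third calls show where the "last word is the weight" assumption breaks down.

```python
def split_name_and_weight(title):
    """Split a product title on its last space into (name, weight)."""
    name, _, weight = title.strip().rpartition(" ")
    return name, weight


# Typical title: the last word is the weight.
print(split_name_and_weight("  Sainsbury's Chicken Thigh Fillets 640g \n"))

# A title that ends in a parenthesised range is split in the middle of it.
print(split_name_and_weight("Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)"))

# With no space at all, rpartition leaves the name empty.
print(split_name_and_weight("Chicken"))
```

If titles like the second one matter, the split would probably need a more careful rule than "last space".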

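
For comparison with the `params` dict, this is a rough sketch of building the query string by hand using the standard library. Only a subset of the parameters is included to keep it short, and no request is sent. It illustrates why the dict is easier to read and change: the `|` characters in `orderBy` have to be percent-encoded as `%7C`, as they are in the long URL in the question.

```python
from urllib.parse import urlencode

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

# A subset of the parameters from the answer, kept short for illustration.
params = {
    "categoryId": "310864",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "beginIndex": "0",
}

# urlencode percent-encodes each value and joins the pairs with "&".
query = urlencode(params)
full_url = f"{url}?{query}"
print(full_url)
```

Changing a value such as `beginIndex` is then a one-line edit to the dict rather than a search through a long URL.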

Hey, this seems to work better, but I'm not sure how I can get the price as well now. I'd like it to show the price of each product too. Thanks!

@RamgithUnniJagajith I've updated the code in my answer, take a look. I basically just created a Product class which extracts the data it is interested in from the HTML. Each product has a name, weight, price_per_unit and price_per_measure.

Perfect! Thanks a lot.