Removing spaces between scraped data - Python

I am trying to scrape some data from a website and save it to a CSV file. When I get the scraped data, there is a huge space between each row. I want to remove this unnecessary space. Below is my code:
from bs4 import BeautifulSoup
import requests
import csv

# URL to be scraped
url_to_scrape = 'https://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/CategoryDisplay?langId=44&storeId=10151&catalogId=10241&categoryId=310864&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&beginIndex=0&promotionId=&listId=&searchTerm=&hasPreviousOrder=&previousOrderId=&categoryFacetId1=&categoryFacetId2=&ImportedProductsCount=&ImportedStoreName=&ImportedSupermarket=&bundleId=&parent_category_rn=13343&top_category=13343&pageSize=120#langId=44&storeId=10151&catalogId=10241&categoryId=310864&parent_category_rn=13343&top_category=13343&pageSize=120&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&searchTerm=&beginIndex=0&hideFilters=true'

# Load the page's HTML into a variable
plain_html_text = requests.get(url_to_scrape)
# Parse the data
soup = BeautifulSoup(plain_html_text.text, "lxml")

csv_file = open('sainsburys.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Description', 'Price'])

for name_of in soup.find_all('li', class_='gridItem'):
    name = name_of.h3.a.text
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text
        print(pricen)
        csv_writer.writerow([name, pricen])
    except:
        print('Sold Out')
        print()
        csv_writer.writerow([name, pricen])

csv_file.close()
The result I get is:
J. James Chicken Goujons 270g
£1.25/unit
Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)
£1.90/kg
Sainsbury's British Fresh Chicken Fajita Mini Fillets 320g
£2.55/unit
Sainsbury's Slow Cook Fire Cracker Chicken 573g
£4.75/unit
Thanks!

You can use .strip() ... it removes leading and trailing whitespace:
>>> s = " I'm a sentence "
>>> s.strip()
"I'm a sentence"
Applied to your problem:

for name_of in soup.find_all('li', class_='gridItem'):
    name = name_of.h3.a.text.strip()
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text.strip()
        print(pricen)
        csv_writer.writerow([name, pricen])
    except:
        print('Sold Out')
        print()
        csv_writer.writerow([name, pricen])

csv_file.close()
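As a side note: if the extra space shows up as whole blank lines between rows in the CSV file itself, the usual cause (on Windows) is opening the file without newline='', which the csv docs recommend for writer targets. A minimal sketch, using hypothetical stand-in rows for the scraped data:

```python
import csv

# Hypothetical stand-in rows for the scraped data
rows = [['Description', 'Price'],
        ['J. James Chicken Goujons 270g', '£1.25/unit']]

# newline='' stops Python translating the csv module's '\r\n' row
# endings into '\r\r\n' on Windows, which renders as blank rows.
with open('sainsburys_demo.csv', 'w', newline='') as csv_file:
    csv.writer(csv_file).writerows(rows)

with open('sainsburys_demo.csv', newline='') as csv_file:
    print(len(list(csv.reader(csv_file))))  # 2 rows, no blank lines
```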
Without being able to copy your code, I can't test it. However, if pricen and name are strings with significant leading and trailing whitespace, this should work. I hope this helps!

.strip() will strip all whitespace characters from both sides:
>>> " a ".strip()
'a'
Just apply this to each print statement.
If you log your network traffic and filter it to show only XHR resources, you'll find one that talks to an AJAX web application. It talks to a server that generates HTML (unfortunately not pure JSON; it's HTML embedded in a JSON response). This isn't strictly necessary, since your code seems to scrape the page fine, but it's a neater way to get the products, and you don't have to worry about things like pagination. As others have already pointed out, use str.strip to remove leading and trailing whitespace. In this example I only print the first ten products (out of 114). Yes, I realize I could append the query string to the URL instead of creating a params dict, but this is easier to read and change:
import requests
from bs4 import BeautifulSoup

class Product:
    def __init__(self, html):
        soup = BeautifulSoup(html, "html.parser")
        self.name, _, self.weight = soup.find("a").text.strip().rpartition(" ")
        self.price_per_unit = soup.find("p", {"class": "pricePerUnit"}).text.strip()
        self.price_per_measure = soup.find("p", {"class": "pricePerMeasure"}).text.strip()

    def __str__(self):
        return f"\"{self.name}\" ({self.weight}) - {self.price_per_unit}"

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

params = {
    "langId": "44",
    "storeId": "10151",
    "catalogId": "10241",
    "categoryId": "310864",
    "parent_category_rn": "13343",
    "top_category": "13343",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "searchTerm": "",
    "beginIndex": "0",
    "hideFilters": "true",
    "requesttype": "ajax"
}

response = requests.get(url, params=params)
response.raise_for_status()

product_info = response.json()[4]["productLists"][0]["products"]
products = [Product(p["result"]) for p in product_info[:10]]

for product in products:
    print(product)
Output:
"Sainsbury's Chicken Thigh Fillets" (640g) - £3.40/unit
"Sainsbury's Mini Chicken Breast Fillets" (320g) - £2.00/unit
"Sainsbury's Chicken Thighs" (1kg) - £1.95/unit
"Sainsbury's Chicken Breast Fillets" (300g) - £1.70/unit
"Sainsbury's Chicken Drumsticks" (1kg) - £1.70/unit
"Sainsbury's Chicken Thigh Fillets" (320g) - £1.85/unit
"Sainsbury's Chicken Breast Diced" (410g) - £2.40/unit
"Sainsbury's Chicken Small Whole Bird" (1.35kg) - £2.80/unit
"Sainsbury's Chicken Thighs & Drumsticks" (540g) - £1.00/unit
"Sainsbury's Chicken Breast Fillets" (640g) - £3.60/unit
>>> product.price_per_measure
'£5.63/kg'
>>>
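The name/weight split in the Product class relies on str.rpartition, which splits on the last occurrence of the separator, so multi-word product names stay intact:

```python
# rpartition splits at the LAST space, separating the final token
# (the weight) from the rest of the product name.
text = "Sainsbury's Chicken Thigh Fillets 640g"
name, _, weight = text.rpartition(" ")
print(name)    # Sainsbury's Chicken Thigh Fillets
print(weight)  # 640g
```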
Hey, this seems to work much better, but I'm not sure how I can get the price as well now. I'd like it to show the price of each product too. Thanks!

@RamgithUnniJagajith I've updated the code in the answer, take a look. I basically just created a Product class that extracts the data it's interested in from the HTML. Every product has a name, a weight, a price_per_unit and a price_per_measure.

Perfect! Thank you.