Web scraping with BeautifulSoup 4: finding text that has no identifier
I am helping a nonprofit organization clean up their eBay store items. So far, my code works fine:
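import requests
from bs4 import BeautifulSoup

testlink = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
# note: `headers` must be defined before this line; its value is not shown in the post
r = requests.get(testlink, headers=headers)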
soup = BeautifulSoup(r.content, 'lxml')
name = soup.find('h1', class_='it-ttl').text.strip("Details, about")
price = soup.find('span', class_='notranslate').text.strip("US, $")
ebayID = soup.find('div', class_='u-flL iti-act-num itm-num-txt').text
color = soup.find('h2', itemprop='color').text
brand = soup.find('h2', itemprop='brand').text
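A side note on the `.strip("Details, about")` and `.strip("US, $")` calls above: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix, so it can also remove characters that belong to the title. A minimal sketch of the difference, using a made-up heading string (the `raw_title` value and the prefix text are assumptions for illustration, not eBay's actual markup):

```python
# Hypothetical heading text; the real page may differ.
raw_title = "Details about  Used blazer, blue"

# str.strip(chars) removes any of the listed characters from both ends,
# so trailing characters of the title can be removed as well.
stripped = raw_title.strip("Details, about")

# Removing a known prefix explicitly leaves the rest of the title intact.
prefix = "Details about"
if raw_title.startswith(prefix):
    cleaned = raw_title[len(prefix):].strip()
else:
    cleaned = raw_title.strip()

print(repr(stripped))
print(repr(cleaned))
```

Comparing the two printed values shows whether the character-set behaviour matters for a given title.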
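However, I am unable to extract the information shown in the image below:
It would also be great to scrape the information shown in the image below: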
Thanks!
To extract the item number and the attributes, you can use this example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
    label = label.get_text(strip=True)
    value = value.get_text(strip=True)
    print('{:<30} {}'.format(label, value))

# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
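The pairing above relies on the CSS adjacent-sibling combinator: `td.attrLabels + td` matches the cell that immediately follows each label cell, so `zip` lines the labels up with their values. A small offline sketch of the same idea, run against a hand-written HTML fragment (the table below is made up for illustration and is not eBay's real markup):

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics a label/value table.
html = """
<table>
  <tr><td class="attrLabels">Brand:</td><td><span>Pal Zileri</span></td></tr>
  <tr><td class="attrLabels">Color:</td><td><span>Brown</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Label cells, and the cell directly after each label cell.
labels = soup.select('td.attrLabels')
values = soup.select('td.attrLabels + td')

pairs = [
    (label.get_text(strip=True), value.get_text(strip=True))
    for label, value in zip(labels, values)
]

for label, value in pairs:
    print('{:<30} {}'.format(label, value))
```

Because the lookup keys on the label cell's class and position rather than on an id, it works even though the value cells carry no identifier of their own.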
Edit:
To save the labels and values in a dictionary and write them to a CSV file, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = {}

# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
    label = label.get_text(strip=True)
    label = label.rstrip(':').lower()
    value = value.get_text(strip=True)
    print('{:<30} {}'.format(label, value))
    data[label] = value

# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
data['item number'] = number

df = pd.DataFrame([data])
df.to_csv('data.csv', index=False)
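If pandas is not available, the same one-row export can be done with the standard library. A minimal sketch, assuming a `data` dictionary shaped like the one built above (the sample keys and values here are placeholders, and the output goes to an in-memory buffer rather than `data.csv` so the example has no side effects):

```python
import csv
import io

# Placeholder record; in the script above this is filled from the page.
data = {'brand': 'Pal Zileri', 'color': 'Brown', 'item number': '0000'}

buffer = io.StringIO()

# The dictionary keys become the CSV header, the values become the row.
writer = csv.DictWriter(buffer, fieldnames=list(data), lineterminator='\n')
writer.writeheader()
writer.writerow(data)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, open it with `open('data.csv', 'w', newline='')` and pass that file object to `csv.DictWriter` instead of the buffer.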
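Hey, Andrej. First of all, thanks for your help. My tutorial seems to use 'lxml' instead of 'html.parser', which I guess is why I wasn't getting any values. Is there a way to store the values in variables that correspond to the labels, instead of printing them? My goal is to eventually use the labels as keys and the values as values of a dictionary, and export everything to a CSV file.
That's awesome!!!! If I'm not abusing your time, I still can't figure out how to get the image link. I'd like to have that as one of the columns as well.
@jonathan You can do, for example, image = soup.select_one('[itemprop=image]')['src'] or, if you want the high-resolution image, image = soup.select_one('[itemprop=image]')['src'].replace('l300', 'l1600'). Then data['image'] = image
I pasted your code below data['item number'] = number and I get this error: TypeError: 'NoneType' object is not subscriptable
It's working!!! Thank you so much. I hope one day I'll be that good. You're awesome.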
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# https://i.ebayimg.com/images/g/DWAAAOSwNZFfEHjF/s-l1600.jpg

data = {}

# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
    label = label.get_text(strip=True)
    label = label.rstrip(':').lower()
    value = value.get_text(strip=True)
    print('{:<30} {}'.format(label, value))
    data[label] = value

# extract the image
image = soup.select_one('[itemprop="image"]')['src'].replace('l300', 'l1600')
data['image'] = image

# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
data['item number'] = number

df = pd.DataFrame([data])
df.to_csv('data.csv', index=False)
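The `TypeError: 'NoneType' object is not subscriptable` reported in the comments is consistent with running the image lookup after `soup` has been rebound to the iframe document: `select_one` returns `None` when nothing matches, and indexing `None` raises that error. A hedged reproduction with two hand-written fragments (both fragments are assumptions made up for illustration, not the real page source):

```python
from bs4 import BeautifulSoup

# Made-up stand-ins for the item page and the iframe it embeds.
item_page = '<img itemprop="image" src="https://i.ebayimg.com/images/g/DWAAAOSwNZFfEHjF/s-l300.jpg">'
iframe_page = '<div>Item no.</div><div>224099569981</div>'

soup = BeautifulSoup(item_page, 'html.parser')

# Look up the image while `soup` still refers to the item page.
image = soup.select_one('[itemprop="image"]')['src'].replace('l300', 'l1600')

# After this line `soup` is the iframe document, which has no image tag.
soup = BeautifulSoup(iframe_page, 'html.parser')
missing = soup.select_one('[itemprop="image"]')

try:
    missing['src']
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(image)
print(missing, error_name)
```

Keeping the image lookup before the line that reassigns `soup`, or storing the iframe document under a different name, avoids the problem.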