Web scraping with Beautiful Soup 4: finding text without an identifier


I am helping a non-profit organization clean up the items in their eBay store. So far, this code works fine:

testlink = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'

r = requests.get(testlink, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')

name = soup.find('h1', class_='it-ttl').text.strip("Details, about")
price = soup.find('span', class_='notranslate').text.strip("US, $")
ebayID = soup.find('div', class_='u-flL iti-act-num itm-num-txt').text
color = soup.find('h2', itemprop='color').text
brand = soup.find('h2', itemprop='brand').text
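
However, I can't extract the following information from the image below:

It would also be great to scrape the information from the image below:

Thanks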

To extract the item number and the attributes, you can use this example:

import requests
from bs4 import BeautifulSoup


url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
    label = label.get_text(strip=True)
    value = value.get_text(strip=True)
    print('{:<30} {}'.format(label, value))
  
# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
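
For anyone wondering how the two selectors line up: `td.attrLabels + td` uses the CSS adjacent sibling combinator, so it picks the cell that immediately follows each label cell. Here is a minimal offline sketch of that pairing. The `SAMPLE_HTML` markup and the `extract_attributes` helper are made up for illustration and only assume the page uses a label cell followed by a value cell; the real eBay markup may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a label-cell / value-cell table layout.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="attrLabels">Brand:</td><td><span>Pal Zileri</span></td>
    <td class="attrLabels">Color:</td><td><span>Brown</span></td>
  </tr>
  <tr>
    <td class="attrLabels">Size:</td><td><span>44R</span></td>
  </tr>
</table>
"""


def extract_attributes(html):
    """Pair every label cell with the cell that directly follows it."""
    soup = BeautifulSoup(html, "html.parser")
    labels = soup.select("td.attrLabels")
    # '+' is the adjacent sibling combinator: the <td> right after a label.
    values = soup.select("td.attrLabels + td")
    return {
        label.get_text(strip=True).rstrip(":").lower(): value.get_text(strip=True)
        for label, value in zip(labels, values)
    }


attributes = extract_attributes(SAMPLE_HTML)
print(attributes)
```

Since `zip` pairs the two lists by position, this only stays correct as long as every label cell really has a value cell right after it.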
Edit:

To store the labels/values in a dictionary and save them to CSV, you can use this example:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = {}

# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
    label = label.get_text(strip=True)
    label = label.rstrip(':').lower()
    value = value.get_text(strip=True)
    print('{:<30} {}'.format(label, value))
    data[label] = value

# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
data['item number'] = number

df = pd.DataFrame([data])
df.to_csv('data.csv', index=False)
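
If you end up scraping several listings, the dictionaries will probably not all have the same keys, because each listing can show different attributes. As a rough sketch of one way to handle that without pandas, the standard library's `csv.DictWriter` can write rows with missing keys. The `rows` data and the `rows_to_csv` helper below are hypothetical and not part of the code above.

```python
import csv
import io

# Hypothetical rows shaped like the `data` dict; the second one has no "color".
rows = [
    {"brand": "Pal Zileri", "color": "Brown", "item number": "0001"},
    {"brand": "Example Brand", "item number": "0002"},
]


def rows_to_csv(rows):
    """Return CSV text with one column per key seen in any row."""
    # Collect the column names in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval is written for keys that a row does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To write a file instead of a string, open it with `open('data.csv', 'w', newline='')` and pass that file object to `csv.DictWriter`.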

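Comments:

Hey Andrej. First of all, thank you for your help. The tutorial I followed used 'lxml' instead of 'html.parser', and I think that is why I wasn't getting any values. Is there a way to store the values in variables that correspond to the labels instead of printing them? My goal is to end up with the labels as keys and the values as values in a dictionary, and to export everything to a CSV file.

That's awesome!!!! If I'm not abusing your time, I still don't know how to get the image link. I'd like to have that as one of the columns too.

@jonathan For example, you can do image = soup.select_one('[itemprop=image]')['src'] or, if you want the high-resolution image, image = soup.select_one('[itemprop=image]')['src'].replace('l300', 'l1600'). Then data['image'] = image

I put your code below data['item number'] = number and I get this error: TypeError: 'NoneType' object is not subscriptable

It's working!!! Thank you so much. I hope I'll be that good one day. You're awesome.
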
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# example of a high-resolution image URL:
# https://i.ebayimg.com/images/g/DWAAAOSwNZFfEHjF/s-l1600.jpg
data = {}

# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
    label = label.get_text(strip=True)
    label = label.rstrip(':').lower()
    value = value.get_text(strip=True)
    print('{:<30} {}'.format(label, value))
    data[label] = value

# extract the image
image = soup.select_one('[itemprop="image"]')['src'].replace('l300', 'l1600')
data['image'] = image

# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
data['item number'] = number

df = pd.DataFrame([data])
df.to_csv('data.csv', index=False)
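
A note on the `TypeError: 'NoneType' object is not subscriptable` mentioned in the comments: the most likely cause is that `soup` gets rebound to the iframe document before the image lookup runs, so `select_one` finds nothing and returns `None`. The sketch below reproduces that offline. `LISTING_HTML`, `IFRAME_HTML`, the example.com URLs, and the `image_url` helper are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the listing page and the iframe document.
LISTING_HTML = (
    '<img itemprop="image" src="https://example.com/s-l300.jpg">'
    '<iframe src="https://example.com/desc"></iframe>'
)
IFRAME_HTML = "<div>Item no.</div><div>12345</div>"

# Separate names keep the listing page available after the iframe is parsed.
listing_soup = BeautifulSoup(LISTING_HTML, "html.parser")
iframe_soup = BeautifulSoup(IFRAME_HTML, "html.parser")


def image_url(soup):
    """Return the high-resolution image URL, or None if there is no image tag."""
    tag = soup.select_one('[itemprop="image"]')
    if tag is None:
        # Indexing None with ['src'] is what raises the TypeError.
        return None
    return tag["src"].replace("l300", "l1600")


print(image_url(listing_soup))  # the image is on the listing page
print(image_url(iframe_soup))   # no image tag in the iframe document
```

So either run the image lookup before the `soup = ...` line that loads the iframe, as in the code above, or give the two parsed documents different variable names.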