Python: How to parse a string for specific words/numbers and display them if found

python python-3.x


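I'm sure I've written some rather questionable code, but it seems to do the job. The problem is that it writes the data to a spreadsheet, and in the column where I expect the vehicle's year, if the first word in the ad is not the year, it shows that first word instead, which may be the manufacturer.

Basically, I want to set up the if statement so that if the vehicle's year is not the first word but appears elsewhere in the string, it still finds it and writes it to my .csv.

Also, I've been struggling to parse multiple pages, and I'm hoping someone here can help with that too. The URL contains page=2 and so on, but I can't get it to go through all the URLs and collect the data from every page. So far, everything I've tried only handles the first page. As you may have guessed, I'm fairly new to Python.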

import csv
import requests
from bs4 import BeautifulSoup

outfile = open('carandclassic-new.csv','w', newline='', encoding='utf-8')
writer = csv.writer(outfile)
writer.writerow(["Link", "Title", "Year", "Make", "Model", "Variant", "Image"])

url = 'https://www.carandclassic.co.uk/cat/3/?page=2'

get_url = requests.get(url)

get_text = get_url.text

soup = BeautifulSoup(get_text, 'html.parser')


car_link = soup.find_all('div', 'titleAndText', 'image')


for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.carandclassic.co.uk" + a['href'])
        title = (a.text.strip())
        year = (title.split(' ', 1)[0])
        make = (title.split(' ', 2)[1])
        model = (title.split(' ', 3)[2])
        date = "\d"
        for line in title:
        yom = title.split()
        if yom[0] == "\d":
            yom[0] = (title.split(' ', 1)[0])
        else:
            yom = title.date

        writer.writerow([link, title, year, make, model])
        print(link, title, year, make, model)



outfile.close()

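Can anyone help me? I realise the if statement at the bottom is probably a bit off.

The code successfully grabs the first word from the string, but unfortunately the data is not always structured so that the first word is the vehicle's year of manufacture (yom).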
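
On the multi-page part of the question, one possible approach is to generate one URL per page number and run the existing single-page code once for each. The following is only a sketch: the helper name page_urls and the page range 1 to 3 are hypothetical, and the real number of pages has to be determined from the site itself.

```python
from urllib.parse import urlencode

# Listing URL from the question, without the ?page=N part
BASE_URL = 'https://www.carandclassic.co.uk/cat/3/'


def page_urls(first_page, last_page, base_url=BASE_URL):
    """Build one listing URL per page number, first_page..last_page inclusive."""
    return [
        '{}?{}'.format(base_url, urlencode({'page': number}))
        for number in range(first_page, last_page + 1)
    ]


# Assumption: pages 1 to 3 exist; adjust the range to the real page count
for url in page_urls(1, 3):
    print(url)
    # Per page, repeat what the question's code does for one URL:
    # requests.get(url), BeautifulSoup(...), the find_all loop, writer.writerow(...)
```

Opening the CSV file once before this loop and closing it once after it keeps the rows from every page in the same file.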

Comment: "1978 Full restored Datsun 280Z" becomes '1978' '1978' '280Z' instead of '1978' 'Datsun' '280Z'.

To improve the year validation, switch to the re module:

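import re

# Fall back to a regex search only when the first word is not a 4-digit year
if not (len(year) == 4 and year.isdigit()):
    # Raw string, so \d reaches the regex engine unchanged
    match = re.findall(r'\d{4}', title)
    for item in match:
        # Assume a year if the number falls in 1900-2009
        if int(item) in range(1900, 2010):
            year = item
            break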

The output becomes:

'1978 Full restored Datsun 280Z', '1978', 'Full', '280Z'
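
The regex check can also be wrapped in a small standalone function, which makes it easier to try on sample titles. This is a sketch rather than the answer's exact code: the name extract_year and its parameters are hypothetical, the 1900-2009 window mirrors the range used in the answer, and the \b word boundaries are an added assumption so that four digits inside a longer number are not taken as a year.

```python
import re

# Exactly four digits, not embedded in a longer run of digits or letters
YEAR_PATTERN = re.compile(r'\b\d{4}\b')


def extract_year(title, first_year=1900, last_year=2009):
    """Return the first plausible 4-digit year in title as a string, or None."""
    for candidate in YEAR_PATTERN.findall(title):
        if first_year <= int(candidate) <= last_year:
            return candidate
    return None


print(extract_year('1978 Full restored Datsun 280Z'))
print(extract_year('Mercedes Benz 280SL 1980 Cabriolet in beautiful condition'))
print(extract_year('Datsun 280Z, fully restored'))
```

Returning None when nothing matches lets the calling code decide what to write into the Year column for ads that mention no year.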

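Regarding the false result make='Full', you have two options:

Stop-word list
Build a stop-word list with terms such as ['full', 'restored', ...] and loop over the title items to find the first item that is not in the stop-word list.

Manufacturer list
Build a list of manufacturers such as ['Mercedes', 'Datsun', ...] and loop over the title items to find the first matching item.

Question: If the first word in the ad is not the year, find the vehicle's year.

This uses only built-in functions and string methods. Example titles used:

# Simulating html Element
class Element():
    def __init__(self, text):
        self.text = text

for a in [Element('Mercedes Benz 280SL 1980 Cabriolet in beautiful condition'),
          Element('1964 Mercedes Benz 220SEb Saloon Manual RHD')]:

    # Get the title from the element and split it into items
    title = a.text.strip()
    title_items = title.split()

    # Default
    year = title_items[0]
    make = title_items[1]
    model = title_items[2]

    # Verify 'year'
    if not (len(year) == 4 and year.isdigit()):
        # Test all items
        for item in title_items:
            if len(item) == 4 and item.isdigit():
                # Assume year
                year = item
                break
        # The first word was not the year, so make and model shift left by one
        make = title_items[0]
        model = title_items[1]

    # Condition: the model has to start with a digit
    if not model[0].isdigit():
        for item in title_items:
            if item[0].isdigit() and not item == year:
                model = item

    print('{}'.format([title, year, make, model]))

Output:

['Mercedes Benz 280SL 1980 Cabriolet in beautiful condition', '1980', 'Mercedes', '280SL']
['1964 Mercedes Benz 220SEb Saloon Manual RHD', '1964', 'Mercedes', '220SEb']

Comments:

This is a broader issue: soup.find_all('div', 'titleAndText', 'image') in your code is fetching inconsistent data types.

Hi stovfl, thanks very much for this. It seems to make sense, but I can't connect it with my code to get it working. Could you suggest how to add your code to mine so that it works?

@BenWillis: Replace everything inside the "for a in links:" block, except the "link = ..." line and the "writer.writerow(..." line.

Hi @stovfl, thanks, I've managed to get it working. The only problem now is that some makes and models use numbers: "1978 Full restored Datsun 280Z" becomes '1978' '1978' '280Z' instead of '1978' 'Datsun' '280Z'.

@BenWillis: Updated my answer.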
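
The manufacturer-list option from the answer could look roughly like this. It is a minimal sketch: find_make and KNOWN_MAKES are hypothetical names, the list holds only a few sample makes, and multi-word makes such as 'Aston Martin' are not handled.

```python
# Hypothetical, deliberately short list; a real one would need many more makes
KNOWN_MAKES = {'datsun', 'mercedes', 'ford', 'jaguar'}


def find_make(title, known_makes=KNOWN_MAKES):
    """Return the first word in title that is a known make, or None."""
    for word in title.split():
        # Drop surrounding punctuation before comparing case-insensitively
        cleaned = word.strip('.,()')
        if cleaned.lower() in known_makes:
            return cleaned
    return None


print(find_make('1978 Full restored Datsun 280Z'))
print(find_make('Mercedes Benz 280SL 1980 Cabriolet in beautiful condition'))
```

Compared with a stop-word list, this avoids having to anticipate every filler word sellers might put in front of the make, at the cost of maintaining the list of makes.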