Python: cleaning up inconsistent fields


python html selenium xpath


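I am scraping some data from a website. In the vehicle description it sometimes shows Mileage and sometimes shows MPG. The HTML is included further down.

I am using XPath and simply going by position.

Here is the relevant part: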

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait


    def init_driver():
        # Headless Chrome with a fixed window size
        options = webdriver.ChromeOptions()
        options.binary_location = '/usr/bin/google-chrome-stable'
        options.add_argument('headless')
        options.add_argument('window-size=1200x600')
        driver = webdriver.Chrome(chrome_options=options)
        driver.wait = WebDriverWait(driver, 5)
        return driver


    def scrape(driver):
        # ymm = year, make, model. All three attributes are in the header;
        # parse and separate them before inserting into SQL.
        ymm_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/h3')
        # The dd[1], dd[2], dd[3] indexes assume a fixed field order
        engine_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/div[3]/dl[1]/dd[1]')
        trans_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/div[3]/dl[1]/dd[2]')
        milage_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/div[3]/dl[1]/dd[3]')

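Because the order of the elements is not the same for every vehicle, I need to write this so that it retrieves the text that follows the label I want.

Here is the HTML, copied from Chrome's element inspector: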

  <div class="description">
    <dl> <dt>Engine:</dt> <dd>2.5L I-5 cyl<span class="separator">,</span>
    </dd> <dt>Transmission:</dt> <dd>Manual<span class="separator">,</span></dd> <dt>Mileage:</dt> <dd>37,171 miles<span class="separator">,</span></dd> <dt>MPG Range:</dt> <dd>22/31<span class="separator">,</span></dd></dl><dl class="last"> <dt>Exterior Color:</dt> <dd>Reflex Silver Metallic<span class="separator">,</span></dd> <dt>Interior Color:</dt> <dd>Titan Black<span class="separator">,</span></dd> <dt>Stock #:</dt> <dd>P3229</dd></dl> <span class="ddc-more">More<span class="hellip">…</span></span> 
<div class="calloutDetails">
<ul class="list-unstyled">
<li class="certified" style="margin-bottom: 10px;"><div class="badge "><img class="align-center" src="https://static.dealer.com/v8/global/images/franchise/white/en_US/logo-certified-volkswagen.gif?r=1356028132000" alt="Certified"></div></li><li class="carfax" style="margin-bottom: 10px;"><a href="http://www.carfax.com/cfm/ccc_displayhistoryrpt.cfm?partner=DLR_3&amp;vin=3VWHX7AT1EM600723" class="badge carfax-one-owner pointer" target="_blank"><img class="align-center" src="https://static.dealer.com/v8/global/images/franchise/white/logo-certified-carfax-one-owner-lrg.png?r=1405027620000" alt="Carfax One Owner"></a></li>
</ul>
</div>
<div class="hproductDynamicArea"></div>
</div>


Engine: 2.5L I-5 cyl,
Transmission: Manual, Mileage: 37,171 miles, MPG Range: 22/31, Exterior Color: Reflex Silver Metallic, Interior Color: Titan Black, Stock #: P3229 More…

Basically, I need to search for the text that follows a label instead of hard-coding positions in the XPath.

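My year, make and model are all in the same header element. Can you point me in the right direction, or suggest a library for splitting the header?

First of all, with XPath you can use contains(), like this: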

    driver.find_elements_by_xpath("//dt[contains(text(),'Engine')]")

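It looks cleaner, is easier to use, and is more robust.

Secondly, read up on the XPath axes following-sibling, preceding-sibling, parent and ancestor. They will help you build tidy XPath locators: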

    driver.find_elements_by_xpath("//dt[contains(text(),'Engine:')]/following-sibling::dd[1]")
    driver.find_elements_by_xpath("//dt[contains(text(),'Transmission:')]/following-sibling::dd[1]")
    driver.find_elements_by_xpath("//dt[contains(text(),'Mileage:')]/following-sibling::dd[1]")

The XPaths above work regardless of the order in which the HTML elements are arranged.
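
A minimal sketch of how such label-based locators might be wrapped so that a listing without a given field (say, one that shows MPG Range but no Mileage) yields None instead of failing. The names `field_xpath` and `read_field` are hypothetical and not from the original post. `root` is assumed to be a Selenium WebDriver or WebElement, and the locator strategy is passed as the plain string "xpath" (the value of `By.XPATH`) so the snippet needs no import. The `[1]` predicate restricts the match to the `dd` directly after the label, and labels are assumed not to contain double quotes.

```python
def field_xpath(label):
    """Build an XPath for the <dd> that directly follows the <dt> containing `label`."""
    # The leading "." keeps the search inside the element the XPath is run on.
    return './/dt[contains(text(), "{}")]/following-sibling::dd[1]'.format(label)


def read_field(root, label):
    """Return the text after `label`, or None when the field is absent."""
    # find_elements returns an empty list (no exception) when nothing matches.
    matches = root.find_elements("xpath", field_xpath(label))
    if not matches:
        return None
    # Each <dd> in the posted HTML ends with a separator comma, so trim it.
    return matches[0].text.strip().rstrip(",").strip()
```

Calling `read_field(driver, "Mileage")` and `read_field(driver, "MPG Range")` would then give either the cleaned text or None, which makes the inconsistent fields easy to handle before inserting into SQL.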

Thanks, I will do that. I had to switch to double quotes, but it works well. I will also loop over each car one at a time to avoid mismatches.

Sorry to bother you again, but I am having trouble looping through the web elements:

    def scrape(driver):
        cars = driver.find_element_by_xpath('//div[@class="description"]')
        for car in cars:
            engine = car.find_element_by_xpath("//dt[contains(text(),'Engine')]/following-sibling::dd")
            mileage = car.find_element_by_xpath("//dt[contains(text(),'Mileage')]/following-sibling::dd")
            print(mileage.text, engine.text)
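
Two likely causes of the looping trouble in the comment above, offered as a sketch rather than a confirmed diagnosis: `find_element_by_xpath` returns a single element, not a list, so it cannot be iterated, and an XPath that starts with `//` searches the whole page even when it is called on one `car`. The version below uses the plural `find_elements` and a leading `.` so each lookup stays inside its own listing. The name `scrape_cars` and the list-of-dicts return value are illustrative choices, and the generic `find_elements("xpath", ...)` call stands in for the `find_elements_by_xpath` helper.

```python
def scrape_cars(driver):
    """Collect engine and mileage per listing; missing fields become None."""
    rows = []
    # find_elements (plural) returns a list with one entry per description block.
    cars = driver.find_elements("xpath", '//div[@class="description"]')
    for car in cars:
        row = {}
        for label in ("Engine", "Mileage"):
            # ".//" searches only below this car; "//" would search the whole page.
            xpath = './/dt[contains(text(), "{}")]/following-sibling::dd[1]'.format(label)
            found = car.find_elements("xpath", xpath)
            row[label] = found[0].text if found else None
        rows.append(row)
    return rows
```

Returning one dict per car keeps the values of each listing together, so a car that lacks a Mileage entry cannot shift the values of the cars after it.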