Python BeautifulSoup: convert string values inside a nested for loop to int, then sort

python for-loop web-scraping

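I'm trying to figure out how to convert string values to int inside a for loop over scraped data, so that I can sort by that int (the "views" field in the script below).

Below is a concise summary of the problem: a working script that returns strings, my failed attempt at a fix, and the desired output.

Working script that returns strings: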

import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '')
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

With the script above, the output contains dictionaries like this:

 {
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site                ',
'views': '4.5K'
}

The desired output would be:

 {
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site                ',
'views': '4500'
}

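Below is my failed attempt at a fix. The script below returns a single value instead of a list of all applicable values, but honestly I'm not sure whether I'm going about this the right way.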

import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '').replace(' min  read', '')
# below is my unsuccessful attempt to change the strings to int
        for item in views:
            if views:
                multiplier = 1
                if views.endswith('K'):
                    multiplier = 1000
                    views = views[0:len(views)-1]
                return int(float(views) * multiplier)
            else:
                return views
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

Any help would be greatly appreciated.

Thanks.
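A likely reason the attempt above yields a single value: `for item in views` iterates over the characters of the string, and the `return` inside that loop exits `create_custom_sej` on the very first pass, before anything is appended to `sej`. The sketch below reproduces that pattern next to a version that collects every value. It runs on made-up sample strings rather than the live page, and `to_int`, `first_only`, and `all_values` are hypothetical names used only for illustration.

```python
def to_int(views):
    """Convert a string such as '4.5K' or '870' to an int."""
    multiplier = 1
    if views.endswith('K'):
        multiplier = 1000
        views = views[:-1]  # drop the trailing 'K'
    return int(float(views) * multiplier)


def first_only(raw_views):
    # Mirrors the failing pattern: `return` inside the loop
    # ends the function after the first item.
    for raw in raw_views:
        return to_int(raw)


def all_values(raw_views):
    # Collect each converted value and return once, after the loop.
    converted = []
    for raw in raw_views:
        converted.append(to_int(raw))
    return converted


sample = ['4.5K', '870', '11K']  # made-up values, not scraped data
print(first_only(sample))  # 4500
print(all_values(sample))  # [4500, 870, 11000]
```

Moving the conversion into its own function keeps the `return` out of the scraping loop, so the loop can keep appending to the list.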

You can try the following code to convert the views to integers:

import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def convert(views):
    # '4.5K' -> 4500; a plain digit string such as '870' -> 870
    if 'K' in views:
        return int(float(views.split('K')[0]) * 1000)
    else:
        return int(views)


def sort_stories_by_views(sejlist):
    # 'views' is an int here, so the sort is numeric, highest first
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        # Find the eye icon next to this headline; the view count is the text after it
        views = item.parent.find('i', class_='sej-meta-icon fa fa-eye')
        # Fall back to '0' when there is no view counter
        # (newer Beautiful Soup releases call the text= argument string=)
        views = views.find_next(text=True).split()[0] if views else '0'
        sej.append({'title': title, 'link': href, 'views': convert(views)})
    return sort_stories_by_views(sej)


pprint.pprint(create_custom_sej(links, subtext))

Prints:

[{'link': 'https://www.searchenginejournal.com/microsoft-clarity-analytics/385867/',
  'title': 'Microsoft Announces Clarity – Free Website '
           'Analytics                ',
  'views': 11000},
 {'link': 'https://www.searchenginejournal.com/wordpress-5-6-feature-removed-for-subpar-experience/385414/',
  'title': 'WordPress 5.6 Feature Removed For Subpar '
           'Experience                ',
  'views': 7000},
 {'link': 'https://www.searchenginejournal.com/whatsapp-shopping-payment-customer-service/385362/',
  'title': 'WhatsApp Announces Shopping and Payment Tools for '
           'Businesses                ',
  'views': 6800},
 {'link': 'https://www.searchenginejournal.com/google-noindex-meta-tag-proper-use/385538/',
  'title': 'Google Shares How Noindex Meta Tag Can Cause '
           'Issues                ',
  'views': 6500},

...and so on.
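To see why the conversion matters for ordering, here is a small offline sketch that compares sorting the raw strings with sorting by their integer values. It is a variant of `convert`, not the code above: `parse_views` and the `SUFFIXES` table are hypothetical, the `'M'` suffix and comma handling are assumptions (only `'K'` appears in the output shown), and the sample strings are made up.

```python
# Hypothetical suffix table; only 'K' is confirmed by the page output.
SUFFIXES = {'K': 1_000, 'M': 1_000_000}


def parse_views(text):
    """Return an int for strings like '4.5K', '1,204' or '870 Reads'; 0 if unparseable."""
    parts = (text or '').split()
    token = parts[0].replace(',', '') if parts else ''
    multiplier = 1
    if token[-1:].upper() in SUFFIXES:
        multiplier = SUFFIXES[token[-1].upper()]
        token = token[:-1]
    try:
        # round() avoids truncating values such as 1000.9999999999999
        return round(float(token) * multiplier)
    except (ValueError, OverflowError):
        return 0


sample = ['870', '4.5K', '11K', '6.8K']  # made-up values

# Plain string sort compares character by character, so '870' lands first.
print(sorted(sample, reverse=True))

# Sorting with the numeric key gives the intended order.
print(sorted(sample, key=parse_views, reverse=True))
```

Returning 0 for unparseable text is one possible choice; raising an error instead would make unexpected page changes easier to notice.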

Perfect, thank you for your help!