Python BeautifulSoup: looping through items
I have a page with the following structure:
<div class="cloud-grid margin-bottom-40">
<div class="cloud-grid__col is-6">
<a href="https://cloud.google.com/bigquery/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="bigQuery" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
BigQuery
</a>
<div class="cloud-product-card__sub-headline">
A fully managed, highly scalable data warehouse with built-in ML.
</div>
<a href="https://cloud.google.com/dataflow/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataflow" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Dataflow
</a>
<div class="cloud-product-card__sub-headline">
Real-time batch and stream data processing.
</div>
<a href="https://cloud.google.com/dataproc/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataproc" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Dataproc
</a>
<div class="cloud-product-card__sub-headline">
Managed Spark and Hadoop service.
</div>
<a href="https://cloud.google.com/datalab/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDatalab" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Datalab
</a>
<div class="cloud-product-card__sub-headline">
Explore, analyze, and visualize large datasets.
</div>
<a href="https://cloud.google.com/dataprep/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataprep" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Dataprep
</a>
<div class="cloud-product-card__sub-headline">
Cloud data service to explore, clean, and prepare data for analysis.
</div>
<a href="https://cloud.google.com/pubsub/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudPubSub" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Pub/Sub
</a>
<div class="cloud-product-card__sub-headline">
Ingest event streams from anywhere, at any scale.
</div>
</div>
<div class="cloud-grid__col is-6">
<a href="https://cloud.google.com/composer/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudComposer" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Composer
</a>
<div class="cloud-product-card__sub-headline">
A fully managed workflow orchestration service built on Apache Airflow.
</div>
<a href="https://cloud.google.com/data-fusion/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataFusion" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Data Fusion
</a>
<div class="cloud-product-card__sub-headline">
Fully managed, code-free data integration.
</div>
<a href="https://cloud.google.com/data-catalog/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="dataCatalog" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Data Catalog
</a>
<div class="cloud-product-card__sub-headline">
A fully managed and highly scalable data discovery and metadata
management service.
</div>
<a href="https://cloud.google.com/genomics/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="genomics" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Genomics
</a>
<div class="cloud-product-card__sub-headline">
Power your science with Google Genomics.
</div>
<a href="https://marketingplatform.google.com/about/enterprise/#?modal_active=none" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="googleMarketingPlatform" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Google Marketing Platform*
</a>
<div class="cloud-product-card__sub-headline">
Enterprise analytics for better customer experiences.
</div>
<a href="https://marketingplatform.google.com/about/data-studio/" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="googleDataStudio" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Google Data Studio*
</a>
<div class="cloud-product-card__sub-headline">
Tell great data stories to support better business decisions.
</div>
<a href="https://firebase.google.com/products/performance/" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="firebasePerformanceMonitoring" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Firebase Performance Monitoring
</a>
<div class="cloud-product-card__sub-headline">
Gain insight into your app's performance.
</div>
</div>
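I managed to get all of the results back, but they come out very disorganized.
I am trying to scrape the data from this page into CSV or JSON format.
A slightly different approach: there is an equal number of each kind of item and the structure is regular, so you can combine the three items (title, link, description) into a list inside a list comprehension. The title and the link both come from the elements with class cloud-product-card__headline, and the description is then next_sibling.next_sibling of that element. You can apply some string cleaning to the description before writing it out.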
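To see why two next_sibling hops are needed, here is a small offline sketch. It assumes a trimmed-down copy of the question's markup (one product, tracking attributes dropped) and uses the built-in html.parser instead of lxml so nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup: one product only.
html = """
<div class="cloud-grid__col is-6">
<a href="https://cloud.google.com/bigquery/" class="cloud-product-card__headline">
BigQuery
</a>
<div class="cloud-product-card__sub-headline">
A fully managed, highly scalable data warehouse with built-in ML.
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
headline = soup.select_one(".cloud-product-card__headline")

# The first sibling is the whitespace text node between </a> and <div>.
first = headline.next_sibling
# The second sibling is the description <div> itself.
second = first.next_sibling

print(repr(str(first)))
print(second.name, "|", second.get_text(strip=True))
```

The first hop lands on the line break between the two tags, which is why a single next_sibling returns no description.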
import requests, re, csv
from bs4 import BeautifulSoup as bs

r = requests.get('https://cloud.google.com/products/')
soup = bs(r.content, 'lxml')

# One [title, link, description] row per headline link. The description <div> is
# two siblings on, because the first sibling is a whitespace text node.
products = [
    [i.text.strip(), i['href'], re.sub(r'\n\s+', ' ', i.next_sibling.next_sibling.text.strip())]
    for i in soup.select('.cloud-product-card__headline')
]

# utf-8-sig adds a BOM for spreadsheet tools; newline='' lets csv control line endings.
with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Title', 'Link', 'Description'])
    for product in products:
        w.writerow(product)
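Since the question also mentions JSON, here is a hedged variant of the same idea that builds a list of dicts instead of CSV rows. This is a sketch rather than the answer's own method: extract_products is a hypothetical helper name, the sample markup is a shortened copy of two cards from the question, and find_next_sibling is swapped in for the double next_sibling hop because it skips text nodes and goes straight to the next matching tag.

```python
import json
from bs4 import BeautifulSoup


def extract_products(html):
    """Return one dict per product card found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for link in soup.select("a.cloud-product-card__headline"):
        # Next sibling *tag* with the description class, ignoring whitespace nodes.
        desc = link.find_next_sibling("div", class_="cloud-product-card__sub-headline")
        products.append({
            "title": link.get_text(strip=True),
            "link": link.get("href"),
            # Collapse newlines and runs of spaces into single spaces.
            "description": " ".join(desc.get_text().split()) if desc else "",
        })
    return products


# Shortened sample taken from the question's markup (assumption: attributes trimmed).
sample = """
<div class="cloud-grid__col is-6">
<a href="https://cloud.google.com/bigquery/" class="cloud-product-card__headline">
BigQuery
</a>
<div class="cloud-product-card__sub-headline">
A fully managed, highly scalable data warehouse with built-in ML.
</div>
<a href="https://cloud.google.com/data-catalog/" class="cloud-product-card__headline">
Data Catalog
</a>
<div class="cloud-product-card__sub-headline">
A fully managed and highly scalable data discovery and metadata
management service.
</div>
</div>
"""

products = extract_products(sample)
print(json.dumps(products, indent=2))
```

To save the result to a file, json.dump(products, f, indent=2) on a file opened with encoding="utf-8" should work in place of the print call.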
What do you mean by "disorganized"? Is the order of the results different from the order in the HTML code? Have you tried this?