Python scraping: a table with multiple text elements

python python-3.x web-scraping

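Beginner here. This is my first time writing Python code. I am trying to scrape a list of Instagram accounts and their follower counts from this website. I can extract the data, but I am having trouble getting it into a CSV file in the correct format. I want to extract the data with the headers Instagram handle, followers, and posts, for all pages of the site. This is my code; any help would be greatly appreciated.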

import requests
from bs4 import BeautifulSoup

url = 'https://www.trackalytics.com/the-most-followed-instagram-profiles/page/1/'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url)
r = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
table2 = soup.find_all('table', recursive=True)
table = table2[0]
with open("instagram.txt", "w") as file:
    for row in table.find_all('tr'):
        for cell in row.find_all('td'):
            container = cell.text.strip()
            file.write(container)

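First: you should use the csv module to create a correct CSV file. With only plain open() and write(), you have to convert every row of data to a string yourself, with the values separated by a comma (,) and a newline (\n) at the end. It may also need other, more complex changes: for example, if a text contains a comma or a newline, that text has to be wrapped in double quotes (" ").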

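Second: before saving, you may need more complex code to clean the data. For example, you can remove spaces, tabs, newlines (\n), commas (,), and parentheses (), split some texts into two columns, and so on.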

Third: you may need a loop to read the other pages.
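To illustrate the first point, here is a minimal sketch that compares a manual join with csv.writer. The row values are made up for the demonstration and do not come from the site; the data is written to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import csv
import io

# Hypothetical row: the first field contains a comma, the second a newline.
row = ["Doe, Jane", "line one\nline two", "42"]

# Manual approach: joining with "," gives text that can no longer be
# split back into the original three fields.
manual = ",".join(row) + "\n"

# csv.writer wraps fields containing the delimiter or a line break in quotes.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
written = buffer.getvalue()

# Read both versions back to see the difference.
manual_rows = list(csv.reader(io.StringIO(manual, newline="")))
csv_rows = list(csv.reader(io.StringIO(written, newline="")))

print(manual_rows)  # two broken rows instead of one
print(csv_rows)     # one row, identical to the original list
```

The manually joined text comes back as two rows with the wrong number of fields, while the text produced by csv.writer round-trips to the original list.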

import requests
from bs4 import BeautifulSoup
import csv

# --- functions ---

def get_page(number):
    url = 'https://www.trackalytics.com/the-most-followed-instagram-profiles/page/{}/'.format(number)
    headers = {'User-Agent': 'Mozilla/5.0'}

    # pass the headers, otherwise the User-Agent is never sent
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')

    return soup

def get_data(soup):
    table = soup.find('table')

    results = []

    for row in table.find_all('tr'):
        all_cells = row.find_all('td')

        # skip empty rows
        if all_cells:
            a = all_cells[0].find('span').text.strip()

            b = all_cells[1].text.strip()

            c = all_cells[2].text.strip().split('\n')
            c = [clean(item) for item in c]

            d = all_cells[3].text.strip().split('\n')
            d = [clean(item) for item in d]

            # each column has its own cell: 2 followers, 3 following, 4 posts, 5 influence
            e = all_cells[4].text.strip().split('\n')
            e = [clean(item) for item in e]

            f = all_cells[5].text.strip().split('\n')
            f = [clean(item) for item in f]

            results.append([a,b,c[0],c[1],d[0],d[1],e[0],e[1],f[0],f[1]])

    return results

def clean(text):
    return text.strip().replace(' ', '').replace(',', '').replace('(', '').replace(')', '')

def write_data(data):

    # newline='' lets the csv module control line endings (avoids blank lines on Windows)
    with open("instagram.txt", 'w', newline='') as writer:
        csv_writer = csv.writer(writer)

        # write header
        csv_writer.writerow([
            'Rank',
            'Profile',
            'Total Followers',
            'Total Followers today',
            'Total Following',
            'Total Following today',
            'Total Posts',
            'Total Posts today',
            'Total Influence',
            'Total Influence today'
        ])

        csv_writer.writerows(data)

# --- main ---

all_data = []

for number in range(1, 10):
    print('page:', number)
    soup = get_page(number)
    data = get_data(soup)
    all_data.extend(data)

write_data(all_data)
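The code above indexes the second part of each split cell directly, which raises IndexError when a cell contains only one value. As a hedged sketch of a more defensive variant: split_cell is a hypothetical helper (it is not part of the answer), and the sample strings are invented, since the real cell contents on the site may look different.

```python
def clean(text):
    """Remove spaces, commas and parentheses, as the answer's clean() does."""
    for char in " ,()":
        text = text.replace(char, "")
    return text.strip()


def split_cell(text, parts=2):
    """Split cell text on newlines, clean each piece, and pad to `parts` items.

    Hypothetical helper: padding with empty strings keeps every CSV row
    the same length even when a cell holds fewer values than expected.
    """
    pieces = [clean(piece) for piece in text.strip().split("\n")]
    pieces = [piece for piece in pieces if piece]  # drop empty fragments
    return (pieces + [""] * parts)[:parts]


# Invented sample texts: a total with a daily change, and a total alone.
print(split_cell("1,234,567\n   (+1,234)"))  # ['1234567', '+1234']
print(split_cell("987"))                     # ['987', '']
```

In get_data, a call such as split_cell(all_cells[2].text) could stand in for the split-and-clean pair of lines, assuming each cell holds at most two values.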

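What is the problem here? If you want to create a CSV file, it is better to use the csv module to create it. Using only plain open() may produce an incorrect CSV file.

You can't use file.write(container) to create a correct row in a CSV file. You would have to join all elements of the list with "," into a single string and then write that. With the csv module you can use csv_writer.writerow(list), which converts the list to a correct string automatically. BTW: to get better data in the CSV you may need more complex code that takes only some parts of each cell and skips other elements (i.e. skips useless buttons).

Simple explanation and beautiful code. It works great and I learned a lot from this answer. Thank you for looking at this. You are a gentleman and a scholar!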