Parsing: combining indexes (BeautifulSoup)

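So I am trying to parse the links to the genres and sub-genres from the IMDb genre page.

So far I have managed to parse the main genre tags into something usable with the following code: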

table = soup.find_all("table", {"class": "genre-table"})

for item in table:
    for x in range(100):
        try:
            print(item.contents[x].find_all("h3"))
            print(len(item.contents[x].find_all("h3")))
        except (IndexError, AttributeError):
            # no child at this index, or the child is not a tag
            pass

My output is 11 lists, each containing two tags, like this:

[<h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">Action <span class="normal">»</span></a></h3>, <h3><a href="http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp">Adventure <span class="normal">»</span></a></h3>]
2

I tried

for y in range(2):

nested inside the for x loop, of course (not on its own), but it doesn't seem to work.

Any ideas?

First, there is no need to find all the tables, because you only need the first one:

table = soup.find("table", {'class': 'genre-table'})

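Every other child of the table is redundant, starting with the first one (with html.parser these are the whitespace text nodes between the rows), so you can iterate over the table like this: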

for item in list(table)[1::2]:
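To see why the [1::2] slice picks out the rows, here is a minimal offline sketch. The HTML string is a hypothetical, simplified stand-in for the IMDb table (the real page is not fetched), and the find_all("tr") line is an alternative I am suggesting rather than part of the original answer; it does not depend on where the whitespace nodes fall.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified stand-in for the IMDb genre table (assumption)
html = """<table class="genre-table">
<tr><td><h3><a href="/genre/action/">Action</a></h3></td></tr>
<tr><td><h3><a href="/genre/adventure/">Adventure</a></h3></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "genre-table"})

# Direct children alternate between newline text nodes and <tr> tags
children = list(table)
kinds = [child.name if isinstance(child, Tag) else "text" for child in children]

# The slice from the answer: start at index 1, take every second child
rows_by_slice = children[1::2]

# Alternative: ask for the <tr> tags directly
rows_by_tag = table.find_all("tr")

print(kinds)
print([row.a["href"] for row in rows_by_slice])
```

If the markup ever arrives without the newlines between rows, the slice would skip real rows, whereas find_all("tr") would still return all of them.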

After that, we can get the "h3" tags in each row and loop over them:

    row = item.find_all("h3")

    for col in row:

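The text of each "h3" element comes back in the form "Somegenre \xc2\xbb" (the trailing bytes are the "»" character inside the span), so I remove the span element before getting the text: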

        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()
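The effect of removing the span can be checked on a single tag. This is a small sketch using one of the h3 elements from the question's output as input; the None check is an extra safeguard I added for tags that might lack a span, not something the answer includes.

```python
from bs4 import BeautifulSoup

# One <h3> element copied from the output shown in the question
html = (
    '<h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">'
    'Action <span class="normal">»</span></a></h3>'
)
col = BeautifulSoup(html, "html.parser").h3

text_before = col.text  # still contains the "»" from the span

# extract() detaches the <span> from the tree; guard in case it is missing
if col.span is not None:
    col.span.extract()

link = col.a["href"]
genre = col.text.strip()

print(repr(text_before), repr(genre), link)
```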

After that, just add the values to the DataFrame as a new row at the next index:

        df.loc[len(df)]=[genre, None, link]

Full code:

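import pandas as pd
import requests
from bs4 import BeautifulSoup

# One row per genre; 'Sub-Genre' is left empty for now
df = pd.DataFrame(columns=['Genre', 'Sub-Genre', 'Link'])

req = requests.get('http://www.imdb.com/genre/?ref_=nv_ch_gr_3')
soup = BeautifulSoup(req.content, 'html.parser')

# Only the first table with class "genre-table" is needed
table = soup.find("table", {'class': 'genre-table'})

# Skip the redundant children at the even positions
for item in list(table)[1::2]:
    row = item.find_all("h3")

    for col in row:
        # Drop the <span> holding "»" so it does not end up in the genre text
        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()

        # Append at the next free integer index
        df.loc[len(df)] = [genre, None, link]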
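As a variation on the row-by-row append with df.loc[len(df)], the rows can be collected in a plain list and turned into a DataFrame once at the end. This is only a sketch of that alternative, not the answer's method; the two rows below reuse the links from the question's output in place of live scraped values.

```python
import pandas as pd

# Stand-ins for (genre, link) pairs produced by the scraping loop
scraped = [
    ("Action", "http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp"),
    ("Adventure", "http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp"),
]

rows = []
for genre, link in scraped:
    # Same column order as in the answer; the sub-genre is not known yet
    rows.append((genre, None, link))

# Build the DataFrame in one step instead of growing it inside the loop
df = pd.DataFrame(rows, columns=["Genre", "Sub-Genre", "Link"])

print(df)
```

For a handful of genres either approach works; collecting first mainly avoids repeatedly resizing the DataFrame when there are many rows.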

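Perfect, man, that sorted me out. Off to research some functions, haha. One question: when you slice the list, how are you iterating? I'm a bit thrown by the two colons. Edit: I just don't understand how you are slicing over the positions of the elements in list(table)[elements go here].

@entercaspa I'm not sure what you mean, but list(table) returns a list of the table's elements so that I can slice them. The syntax works like this: [start:end] or [start:end:step]. start defaults to 0 and end defaults to len(list), so you can omit them like this: [start:], [:end] or [:]. The last option, step, lets you skip elements; here, with [1:(end):2], we start at index 1 and take every second element. The syntax [::3] would take every third element, and so on.

Yes, I see, that makes sense. Sorry, I'm new and had almost forgotten how slicing works, haha. Thanks, you helped me a lot.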
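The slicing rules described in the comment above can be tried on a plain list. A short sketch with made-up values:

```python
items = list(range(10))  # [0, 1, 2, ..., 9]

from_two = items[2:]       # start given, end defaults to len(items)
first_three = items[:3]    # start defaults to 0
full_copy = items[:]       # both defaults: a shallow copy of the list
every_other = items[1::2]  # start at index 1, then every second element
every_third = items[::3]   # start at index 0, then every third element

print(every_other)
print(every_third)
```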
