Python text spacing and alignment when scraping a web page with BeautifulSoup


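I'm trying to implement a basic Python web-crawling script using beautifulsoup4 (Python 3.4). It fetches the current "league standings" of the National Basketball Association (NBA regular season).

I'm trying to make the text look more "tabular", but I can't get it to work. For example:

Golden State Warriors  67  7  0.905  40-5
San Antonio Spurs      62 12  0.838  39-6

Instead, it comes out looking like a mess (sort of).

I tried using string.format(), but to no avail.

Here is the snippet of my code that extracts the data from the web page: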

for row in tableStats.find_all('tr')[2:]:
    print("\n")
    row_team = row.find_all("td")

    try:
        for stat in row_team:
            print("{0:>5} {1:>5} ".format(stat.text," "), end=" ")
            f.write("{0:^2} {1:^3} ".format(stat.text," "))
        if(i == 16 and flag == 0):
            i = int("0")
            flag = int('1')
            print("\n\n\n\n")
            print("Western Conference".center(10),"\n\n\n")
            f.write("Western Conference\n\n")

        i = i + 1
        f.write("\n")
    except Exception as e:   #In Case a none object gets returned
        pass

Any suggestions on how to make this work?

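Since you didn't give a reproducible example, I'll just offer some suggestions. All the code below is untested, so take it as algorithmic ideas rather than something to copy/paste.

There are two strategies for solving this:

  • either you fix the width of each column up front,
  • or you size each column according to its largest cell.

Strategy ①: one pass, but with a fixed header column width

For the first strategy, you can do everything in a single loop (as you are doing now), but you need a way to treat the first cell of the row differently from the others, so that you can give it a larger width. That is: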

    ### within your try/except block:
    # take the text of the first cell to show the team name on 20 columns,
    # and cut it if it's longer than 20 columns. I like to add an
    # ellipsis to strings I'm cutting, so here it goes:
    name = row_team[0].text.strip()
    if len(name) > 20:
        out_l = ['{}…'.format(name[:19])]
    else:
        # the ljust() method pads the right side of your string
        # with spaces
        out_l = [name.ljust(20)]
    for stat in row_team[1:]:
        # for each stat, parse its text as a float and render it as a
        # right-aligned '00.00' string; you might want 6.2f if some
        # values are in the 100s
        out_l.append("{:5.2f}".format(float(stat.text)))

    # printing out the line, by making a string out of the list
    # using the ' '.join() method, adding a single space between
    # elements
    out = ' '.join(out_l)
    print(out)
    # write the line followed by a newline
    f.write('{}\n'.format(out))

    if i == 16 and flag == 0:
        # here I'm centering the string's middle at 40 columns
        # considering a full width of 80 columns. If you set 10
        # columns for a string that's 18 characters, it's going
        # to have no effect!
        out = "Western Conference".center(80)
        print() # empty line
        print(out)
        print() # empty line
        # write the string surrounded by empty lines
        f.write("\n{}\n\n".format(out))
    
    By the way, to avoid managing i by hand like this:

    i = 0
    for whatever:
        something
        i = i + 1
    
    you can do:

    for i, row in enumerate(tableStats.find_all('tr')[2:]):
    
    and i will be incremented for you on each iteration. This would give you an output like:

    Golden State Warrio… 67.00  7.00  0.91 40-05
    San Antonio Spurs    62.00 12.00  0.84 39-06
                                           ^^^^^- this is not handled with
                                                  the code above, cf the end
                                                  of my post.
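
For reference, strategy ① can be sketched as a self-contained snippet. This is only an illustration under assumptions: the cell texts are hard-coded in `rows` instead of being scraped, `format_row_fixed`, `name_width` and `stat_width` are hypothetical names, and the stats are kept as text and right-aligned rather than converted to floats, which sidesteps the `NN-NN` column.

```python
def format_row_fixed(cells, name_width=20, stat_width=6):
    """Format one row in a single pass with fixed column widths.

    `cells` is assumed to be a list of strings: team name first, stats after.
    """
    name = cells[0].strip()
    if len(name) > name_width:
        # cut the name and mark the cut with an ellipsis, keeping the width
        name = name[:name_width - 1] + "…"
    # pad the name on the right so that every row starts its stats at the same column
    parts = [name.ljust(name_width)]
    # right-align every stat in a fixed-width field (text is kept as-is)
    parts.extend(cell.strip().rjust(stat_width) for cell in cells[1:])
    return " ".join(parts)


# sample data standing in for the scraped cells (assumption)
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

for cells in rows:
    print(format_row_fixed(cells))
```

Every line comes out with the same length, so the columns line up in any monospaced display; the price is that names longer than `name_width` are cut.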
    
    
    Strategy ②: two passes

    For the second strategy, you first need to build a matrix (so basically a list of lists):

    # init the matrix as an empty list, and the header width at zero
    stats_matrix = []
    max_header_size = 0
    for row in tableStats.find_all('tr')[2:]:
        row_team = row.find_all("td")
        # build a list, starting with the text of the first cell:
        name = row_team[0].text.strip()
        line = [name]
        # find out what's the largest string for the first column
        max_header_size = max(max_header_size, len(name))
        for stat in row_team[1:]:
            # then all the other cells as floats
            line.append(float(stat.text))
        # add it to the matrix:
        stats_matrix.append(line)
    
    Once that's done, you can use max_header_size to format the first column:

    for line in stats_matrix:
        # show the first cell with a padding on the right of size "max_header_size"
        out = [line[0].ljust(max_header_size)]
        for stat in line[1:]:
            # render each stat, which was stored as a float, as a right-aligned '00.00' string
            out.append("{:5.2f}".format(stat))
        # show on standard output
        print(' '.join(out))
        # and write to file (with extra \n at the end)
        f.write('{}\n'.format(' '.join(out)))
    
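    Then you'll see everything nicely formatted.

    Note: as it stands, though, this code won't work with your dataset, because the last value is not a float but a score (NN-NN). The last element therefore cannot be parsed as a float, and it's up to you to fix that.

    If I were you, I'd consider this option (for the second strategy):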
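
One way to do this in the first loop, sketched under assumptions: the score cell is always two integers separated by a hyphen, the cells are plain strings rather than BeautifulSoup tags, and `parse_record` and `parse_row` are hypothetical helper names.

```python
def parse_record(text):
    """Turn a 'NN-NN' string such as '40-5' into a (wins, losses) pair of ints."""
    wins, losses = text.strip().split("-")
    return int(wins), int(losses)


def parse_row(cells):
    """Build one matrix line: name, floats for the middle cells, score pair last."""
    line = [cells[0].strip()]
    for cell in cells[1:-1]:
        # every cell except the first and the last is a plain number
        line.append(float(cell))
    # the last cell is stored as a pair so it can be re-formatted later
    line.append(parse_record(cells[-1]))
    return line


line = parse_row(["Golden State Warriors", "67", "7", "0.905", "40-5"])
score = line[-1]
print("{:02d}-{:02d}".format(score[0], score[1]))
```

Storing the score as a pair of integers keeps the parsing in the first pass and leaves only formatting for the second one.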

    Then in the second loop:

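    …
    for stat in line[1:-1]:
        …
    # assumption: the first loop stored the score as a (wins, losses)
    # pair of ints in the last element of the line
    score = line[-1]
    out.append('{:02d}-{:02d}'.format(score[0], score[1]))
    # show on standard output
    print(' '.join(out))
    …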
    
    Then you should have output like:

    Golden State Warriors 67.00  7.00  0.91 40-05
    San Antonio Spurs     62.00 12.00  0.84 39-06
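
Putting strategy ② together as a self-contained sketch. The assumptions: the cell texts are hard-coded in `rows` and already stripped, every row has the same number of cells, and `format_table` is a hypothetical name. Unlike the code above, it measures every column (not just the first) and keeps each cell as text, so the `NN-NN` column needs no special handling.

```python
def format_table(rows, separator="  "):
    """Two-pass formatting: measure every column, then pad every cell."""
    # first pass: zip(*rows) yields the columns; keep the widest cell of each
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]

    lines = []
    # second pass: left-align the name, right-align everything else
    for row in rows:
        parts = [row[0].ljust(widths[0])]
        parts.extend(
            cell.rjust(width) for cell, width in zip(row[1:], widths[1:])
        )
        lines.append(separator.join(parts))
    return lines


# sample data standing in for the scraped cells (assumption)
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

for text_line in format_table(rows):
    print(text_line)
```

Because the widths come from the data, no name is ever cut, at the cost of holding the whole table in memory before printing the first line.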
    

    HTH

    Maybe those elements have extra whitespace at the end? Could you try calling strip() on each element you retrieve?
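
A small illustration of why stray whitespace matters, assuming the scraped cell texts carry leading or trailing spaces and newlines (the `raw_cells` values are made up for the example). With BeautifulSoup, `stat.text.strip()` or `stat.get_text(strip=True)` should do the same job on a simple table cell.

```python
# made-up cell texts, as they might come out of an indented HTML table
raw_cells = ["\n  Golden State Warriors \n", " 67", "7 \n"]

# strip() removes leading and trailing whitespace, including newlines
clean_cells = [cell.strip() for cell in raw_cells]

# the raw text is already longer than the field, so no padding is added
# and the embedded newlines break the line
padded_raw = "{:<22}|".format(raw_cells[0])
# the stripped text fits in the field and is padded to exactly 22 columns
padded_clean = "{:<22}|".format(clean_cells[0])

print(repr(padded_raw))
print(repr(padded_clean))
```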