Python text spacing and alignment when fetching a web page with BeautifulSoup

python web-crawler

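I'm trying to implement a basic Python web-crawling script with beautifulsoup4 (Python 3.4). It fetches the current "league standings" of the National Basketball Association (NBA regular season).

I'm trying to make the text look more "tabular", but I can't get it to work. For example: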

Golden State Warriors  67  7  0.905  40-5
San Antonio Spurs      62 12  0.838  39-6

Instead, the output comes out misaligned and messy.

I tried using string.format(), but to no avail.

Here is the snippet of my code that extracts the data from the web page:

for row in tableStats.find_all('tr')[2:]:
    print("\n")
    row_team = row.find_all("td")

    try:
        for stat in row_team:
            print("{0:>5} {1:>5} ".format(stat.text," "), end=" ")
            f.write("{0:^2} {1:^3} ".format(stat.text," "))
        if(i == 16 and flag == 0):
            i = int("0")
            flag = int('1')
            print("\n\n\n\n")
            print("Western Conference".center(10),"\n\n\n")
            f.write("Western Conference\n\n")

        i = i + 1
        f.write("\n")
    except Exception as e:   #In Case a none object gets returned
        pass

Any suggestions on how to make this work?

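Since you didn't give a reproducible example, I'll go ahead with a few suggestions. All the code below is untested, so take it for the algorithmic idea rather than copy/pasting it directly.

There are two strategies to tackle this:

either you fix the width of each column up front,

or you size each column based on its largest cell.

Strategy ①: single pass, with a fixed header column width

For the first strategy, you can do everything in a single loop (as you are doing now), but you need a way to treat the first cell of each row differently from the others, so that you can give it a larger width. That is: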

### within your try/except block:
# take the text of the first cell to show the team name on 20 columns,
# and cut it if it's longer than 20 columns. I like to add an
# ellipsis to strings I'm cutting, so here it goes:
name = row_team[0].text.strip()
if len(name) > 20:
    out_l = ['{}…'.format(name[:19])]
else:
    # the ljust() method pads the right side of your string
    # with spaces
    out_l = [name.ljust(20)]
for stat in row_team[1:]:
    # for each stat, parse the cell's text as float and render it as a
    # fixed-width '  0.00' string (width 6 leaves room for values up
    # to 999.99). Note that float() raises ValueError on the last
    # NN-NN cell, cf the end of my post.
    out_l.append("{:6.2f}".format(float(stat.text)))

# printing out the line, by making a string out of the list
# using the ' '.join() method, adding a single space between
# elements
out = ' '.join(out_l)
print(out)
# write the line followed by a newline
f.write('{}\n'.format(out))

if i == 16 and flag == 0:
    # here I'm centering the string's middle at 40 columns
    # considering a full width of 80 columns. If you set 10
    # columns for a string that's 18 characters, it's going
    # to have no effect!
    out = "Western Conference".center(80)
    print() # empty line
    print(out)
    print() # empty line
    # write the string surrounded by empty lines
    f.write("\n{}\n\n".format(out))

By the way, to avoid managing i as:

i = 0
for whatever:
    something
    i = i + 1

you can do:

for i, row in enumerate(tableStats.find_all('tr')[2:]):

and i will be incremented on each iteration. The formatting code above should give you output like this:

Golden State Warrio… 67.00  7.00 0.90 40-05
San Antonio Spurs    62.00 12.00 0.83 39-06
                                      ^^^^^- this is not handled with
                                             the code above, cf the end
                                             of my post.
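
As a self-contained illustration of the single-pass idea, here is a minimal sketch that runs without the web page. ROWS is hypothetical sample data standing in for the scraped cell texts, and format_row and NAME_WIDTH are names made up for this example. Unlike the snippet above, it keeps the stats as text and right-aligns them, which is a simplification that lets the NN-NN column through without special handling.

```python
# Hypothetical sample rows standing in for the text of the scraped <td> cells.
ROWS = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

NAME_WIDTH = 20  # fixed width chosen up front for the team-name column
STAT_WIDTH = 5   # fixed width chosen up front for every other column


def format_row(cells, name_width=NAME_WIDTH, stat_width=STAT_WIDTH):
    """Format one row: fixed-width team name, then right-aligned stats."""
    name = cells[0].strip()
    if len(name) > name_width:
        # cut the name and mark the cut with an ellipsis
        name = name[:name_width - 1] + "…"
    out = [name.ljust(name_width)]
    for stat in cells[1:]:
        # stats stay as text here, so "40-5" needs no special case
        out.append(stat.strip().rjust(stat_width))
    return " ".join(out)


lines = [format_row(cells) for cells in ROWS]
for i, line in enumerate(lines):
    print(i, line)
```

The trade-off of fixed widths is visible in the first row: any name longer than NAME_WIDTH gets cut.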

Strategy ②: two passes

For the second strategy, you first need to build a matrix (so basically a list of lists):

# init the matrix as an empty list
stats_matrix = []
# width of the largest team name seen so far
max_header_size = 0
for row in tableStats.find_all('tr')[2:]:
    row_team = row.find_all("td")
    # build a list, starting with the text of the first cell:
    name = row_team[0].text.strip()
    line = [name]
    # find out what's the largest string for the first column
    max_header_size = max(max_header_size, len(name))
    for stat in row_team[1:]:
        # then all the other cells as floats
        line.append(float(stat.text))
    # add it to the matrix:
    stats_matrix.append(line)

Once that is done, you can use max_header_size to format the first column:

for line in stats_matrix:
    # show the first cell with a padding on the right of size "max_header_size"
    out = [line[0].ljust(max_header_size)]
    for stat in line[1:]:
        # render each stat, which was stored as float, as a fixed-width '  0.00' string
        out.append("{:6.2f}".format(stat))
    # show on standard output
    print(' '.join(out))
    # and write to file (with extra \n at the end)
    f.write('{}\n'.format(' '.join(out)))

and then everything should come out nicely formatted.
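
The two-pass idea can also be sketched in a self-contained way. The version below is only an illustration under stated assumptions: ROWS is hypothetical sample data standing in for the scraped cell texts, format_table is a name made up for this example, and it measures every column (not just the first) while keeping all cells as text rather than converting them to float.

```python
# Hypothetical sample rows standing in for the text of the scraped <td> cells.
ROWS = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]


def format_table(rows, sep="  "):
    """Two passes: measure the widest cell per column, then pad every cell."""
    # first pass: clean the cells and record the widest cell of each column
    cleaned = [[cell.strip() for cell in row] for row in rows]
    widths = [max(len(cell) for cell in column) for column in zip(*cleaned)]

    # second pass: left-align the team name, right-align everything else
    lines = []
    for row in cleaned:
        out = [row[0].ljust(widths[0])]
        out.extend(cell.rjust(width) for cell, width in zip(row[1:], widths[1:]))
        lines.append(sep.join(out))
    return lines


table = format_table(ROWS)
print("\n".join(table))
```

Because the widths come from the data, no team name is cut and no column is wider than it needs to be, at the cost of holding all rows in memory before printing.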

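Note: as written, the float() conversions above won't work on your dataset, because the last value of each row is not a float but a score (NN-NN). That last element can't be parsed as a float, and it's up to you to fix that.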

If I were you, I would consider this option (for the second strategy): treat the last cell separately in the first loop.
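
One possible shape for that first-loop handling, as a hedged sketch: parse_row is a hypothetical helper, and it assumes the cells are already plain strings (for example the .text of each td) with the NN-NN score always in the last position.

```python
def parse_row(cells):
    """Split a row into team name, float stats, and an (int, int) score."""
    name = cells[0].strip()
    # every cell between the name and the score is parsed as a float
    stats = [float(cell) for cell in cells[1:-1]]
    # the last cell looks like "40-5": split it and store it as two ints
    first, second = cells[-1].strip().split("-")
    score = (int(first), int(second))
    return [name] + stats + [score]


line = parse_row(["Golden State Warriors", "67", "7", "0.905", "40-5"])
print(line)
```

Storing the score as a pair of ints keeps it available for zero-padded formatting later on.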

Then, in the second loop:

…
for stat in line[1:-1]:
    …
# assuming the first loop stored the last cell as a pair of ints
score = line[-1]
out.append('{:02d}-{:02d}'.format(score[0], score[1]))
# show on standard output
print(' '.join(out))
…

You should then get output like this:

Golden State Warriors 67.00  7.00 0.90 40-05
San Antonio Spurs     62.00 12.00 0.83 39-06

HTH

Maybe these elements have extra whitespace at the end? Could you try calling strip() on each element you retrieve?
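
To show why stray whitespace matters for alignment, here is a small sketch using plain strings. raw_cells is hypothetical data imitating cell text that carries surrounding newlines and spaces; with BeautifulSoup itself, get_text(strip=True) on each tag should give the same cleaned text.

```python
# Hypothetical cell texts with the kind of stray whitespace HTML often carries.
raw_cells = ["\n Golden State Warriors \n", " 67 ", "7\n"]

# Formatting the raw text keeps the embedded newlines and spaces,
# so the padding is computed on the wrong lengths.
unstripped = " ".join("{:>5}".format(cell) for cell in raw_cells)

# Stripping first means the width in the format spec applies to the real text.
stripped = " ".join("{:>5}".format(cell.strip()) for cell in raw_cells)

print(repr(unstripped))
print(repr(stripped))
```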