Python text spacing and alignment when scraping a web page with BeautifulSoup
I'm trying to implement a basic Python web-crawling script using beautifulsoup4 (Python 3.4). It fetches the current "league standings" of the National Basketball Association (NBA regular season).
I'm trying to make the text look more "tabular", but I can't get it to work. For example:
    Golden State Warriors  67   7  0.905  40-5
    San Antonio Spurs      62  12  0.838  39-6
Instead, the output comes out messed up (sort of), with the columns misaligned.
I tried using string.format(), but to no effect.
Here is the snippet of my code that extracts the data from the web page:
    for row in tableStats.find_all('tr')[2:]:
        print("\n")
        row_team = row.find_all("td")
        try:
            for stat in row_team:
                print("{0:>5} {1:>5} ".format(stat.text," "), end=" ")
                f.write("{0:^2} {1:^3} ".format(stat.text," "))
            if(i == 16 and flag == 0):
                i = int("0")
                flag = int('1')
                print("\n\n\n\n")
                print("Western Conference".center(10),"\n\n\n")
                f.write("Western Conference\n\n")
            i = i + 1
            f.write("\n")
        except Exception as e: #In Case a none object gets returned
            pass
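Any suggestions on how to get this working?
Since you didn't give a reproducible example, I'll just offer some suggestions. All the code below is untested, so take it as algorithmic ideas rather than something to copy/paste directly. There are two strategies for solving this.
Strategy ①: one pass, fixed column widths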
    ### within your try/except block:
    # take the first cell to show the team name on 20 columns,
    # and cut it if it's longer than 20 columns. I like to add
    # an ellipsis to strings I'm cutting, so here it goes:
    team = row_team[0].text.strip()
    if len(team) > 20:
        out_l = ['{}…'.format(team[:19])]
    else:
        # the ljust() method pads the right side of your string
        # with spaces
        out_l = [team.ljust(20)]
    for stat in row_team[1:]:
        # for each stat, parse it as float, and reformat it in a
        # ' 0.00' format, you might want to do 5.2f if some
        # values are in the 100s
        out_l.append("{: 4.2f}".format(float(stat.text)))
    # print out the line, by making a string out of the list
    # using the ' '.join() method, adding a single space between
    # elements
    out = ' '.join(out_l)
    print(out)
    # write the line followed by a newline
    f.write('{}\n'.format(out))
    if i == 16 and flag == 0:
        # here I'm centering the string's middle at column 40,
        # considering a full width of 80 columns. If you set a
        # width of 10 for a string that's 18 characters long,
        # it has no effect!
        out = "Western Conference".center(80)
        print()  # empty line
        print(out)
        print()  # empty line
        # write the string surrounded by empty lines
        f.write("\n{}\n\n".format(out))
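To try the fixed-width idea without the scraper, here is a minimal, self-contained sketch. The rows list is hypothetical sample data standing in for the stripped text of each table cell, and format_row and NAME_WIDTH are illustrative names of mine, not part of the code above. The win-loss record is kept as plain text here, since it is not a float.

```python
# Hypothetical sample rows standing in for the scraped cell text.
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

NAME_WIDTH = 20  # assumed fixed width for the team-name column


def format_row(cells, name_width=NAME_WIDTH):
    """Format one row: fixed-width name, right-aligned stats."""
    name = cells[0]
    if len(name) > name_width:
        # cut the name and mark the cut with an ellipsis
        name = name[:name_width - 1] + "…"
    out = [name.ljust(name_width)]
    for stat in cells[1:-1]:
        # numeric cells: two decimals, right-aligned on 6 columns
        out.append("{:6.2f}".format(float(stat)))
    # the last cell (e.g. "40-5") is not a float, keep it as text
    out.append(cells[-1].rjust(5))
    return " ".join(out)


for cells in rows:
    print(format_row(cells))
```

Because every cell gets an explicit width, each line comes out with the same length, which is what makes the columns line up.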
By the way, to avoid managing i by hand like this:
    i = 0
    for whatever:
        something
        i = i + 1
you can do:
    for i, row in enumerate(tableStats.find_all('tr')[2:]):
and i will be incremented for each row. This will give you output like:
    Golden State Warrio… 67.00  7.00 0.90 40-05
    San Antonio Spurs    62.00 12.00 0.83 39-06
                                          ^^^^^- this is not handled with
                                                 the code above, cf the end
                                                 of my post.
Strategy ②: two passes
For the second strategy, you first need to build a matrix (so basically a list of lists):
    # init the matrix as an empty list
    stats_matrix = []
    # length of the longest team name seen so far
    max_header_size = 0
    for row in tableStats.find_all('tr')[2:]:
        row_team = row.find_all("td")
        team = row_team[0].text.strip()
        # build a list, starting with the first cell:
        line = [team]
        # find out what's the largest string for the first column
        max_header_size = max(max_header_size, len(team))
        for stat in row_team[1:]:
            # then all the other cells as floats
            line.append(float(stat.text))
        # add it to the matrix:
        stats_matrix.append(line)
Once that's done, you can use max_header_size to format the first column:
    for line in stats_matrix:
        # show the first cell with a padding on the right of size "max_header_size"
        out = [line[0].ljust(max_header_size)]
        for stat in line[1:]:
            # format each stat, which was stored as float, as a ' 0.00' string
            out.append("{: 4.2f}".format(stat))
        # show on standard output
        print(' '.join(out))
        # and write to file (with extra \n at the end)
        f.write('{}\n'.format(' '.join(out)))
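Then you'll see everything nicely formatted.
Note: as it stands, though, this code won't work with your dataset, because the last value is not a float but a score (NN-NN). So the last element can't be parsed as a float, and it's up to you to fix that.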
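As a minimal sketch of the two passes, the following runs on hard-coded sample rows instead of the scraped table. The rows list is hypothetical data, and it only includes the numeric columns, so the NN-NN problem does not come up here. The first pass measures the widest team name, the second pads every name to that width.

```python
# Hypothetical sample rows: team name followed by numeric cells only.
rows = [
    ["Golden State Warriors", "67", "7", "0.905"],
    ["San Antonio Spurs", "62", "12", "0.838"],
]

# Pass 1: build the matrix and track the longest team name.
stats_matrix = []
max_header_size = 0
for cells in rows:
    name = cells[0].strip()
    max_header_size = max(max_header_size, len(name))
    stats_matrix.append([name] + [float(c) for c in cells[1:]])

# Pass 2: format every line using the width found in pass 1.
lines = []
for line in stats_matrix:
    out = [line[0].ljust(max_header_size)]
    for stat in line[1:]:
        # two decimals, right-aligned on 6 columns
        out.append("{:6.2f}".format(stat))
    lines.append(" ".join(out))

print("\n".join(lines))
```

Compared with a hard-coded width, nothing gets truncated, at the cost of holding all rows in memory before printing the first one.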
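If I were you, I would consider this option (for the second strategy): parse the last cell separately, as two integers, in the first loop.
Then, in the second loop:
    …
    for stat in line[1:-1]:
        …
    out.append('{:02d}-{:02d}'.format(score[0], score[1]))
    # show on standard output
    print(' '.join(out))
    …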
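One way the NN-NN column could be handled, sketched under the assumption that the cell text always has the form "wins-losses" with two integers. parse_record and format_record are hypothetical helper names, not something from the code above.

```python
def parse_record(text):
    """Split a 'wins-losses' string such as '40-5' into two integers."""
    wins, losses = text.strip().split("-")
    return int(wins), int(losses)


def format_record(score):
    """Render a (wins, losses) pair zero-padded to two digits each."""
    return "{:02d}-{:02d}".format(score[0], score[1])


score = parse_record("40-5")
print(format_record(score))  # prints 40-05
```

Storing the pair as integers in the first loop keeps the matrix free of display concerns, and the zero padding gives the column a constant width of five characters as long as both numbers stay below 100.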
Then you should have output like this:
    Golden State Warriors 67.00  7.00 0.90 40-05
    San Antonio Spurs     62.00 12.00 0.83 39-06
HTH
Could these elements have extra whitespace at the end? Can you try calling strip() on each element you retrieve?
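For instance, str.strip() removes leading and trailing whitespace (newlines included) from a string, so it can be applied to each cell's text before padding. The raw_cells list below is made-up sample data imitating text pulled from table cells:

```python
# Made-up cell text with stray whitespace, as scraped HTML often has.
raw_cells = ["\n Golden State Warriors ", " 67\n", "7 ", " 0.905"]

# strip() drops leading and trailing whitespace from each string
cells = [text.strip() for text in raw_cells]
print(cells)

# padding now depends only on the visible text
print(cells[0].ljust(25) + "|")
```

If stray whitespace is the culprit, the widths passed to format() or ljust() were being counted against invisible characters, which would explain the misaligned columns.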