Python script terminates without any error
I am running a script that downloads xls files containing HTML tags and strips the tags to create a clean csv file.

The code works fine for a 75 KB file, but for a 75 MB file the process gets killed without any error.

I am very new to Beautiful Soup and Python, so please help me identify the problem. The script runs on a machine with 3 GB of RAM.

The output for the small file is:
table found
row list created
soup decomposed
file closed
writer started
types | # objects | total size
===================================== | =========== | ============
dict | 5615 | 4.56 MB
str | 8457 | 713.23 KB
list | 3525 | 375.51 KB
<class 'bs4.element.NavigableString | 1810 | 335.76 KB
code | 1874 | 234.25 KB
<class 'bs4.element.Tag | 3097 | 193.56 KB
unicode | 3102 | 182.65 KB
type | 137 | 120.95 KB
wrapper_descriptor | 1060 | 82.81 KB
builtin_function_or_method | 718 | 50.48 KB
method_descriptor | 580 | 40.78 KB
weakref | 416 | 35.75 KB
set | 137 | 35.04 KB
tuple | 431 | 31.56 KB
<class 'abc.ABCMeta | 20 | 17.66 KB
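It is hard to say without an actual file to work with, but what you can do is avoid creating the intermediate list of rows and write each row directly to the open csv file.

In addition, you can let BeautifulSoup use lxml under the hood (lxml has to be installed).

Improved code: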
#!/usr/bin/env python
# Python 2 script: urllib2, and the csv file is opened in binary mode.
from urllib2 import urlopen
import csv

from bs4 import BeautifulSoup

f = urlopen('http://localhost/Classes/sample.xls')
soup = BeautifulSoup(f, 'lxml')  # requires lxml to be installed

with open('output_file.csv', 'wb') as out:
    writer = csv.writer(out)
    for row in soup.select('table tr'):
        # One csv row per <tr>: header (<th>) and data (<td>) cells in document order.
        # writerow() takes a list of cells; writerows() would treat each string as a row.
        writer.writerow([cell.text.encode('utf8') for cell in row.find_all(['th', 'td'])])

soup.decompose()
f.close()
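
If the parse tree itself is what exhausts the memory, one option that is not part of the answer above is to skip building a tree altogether. The following is only a minimal sketch of that idea, written for Python 3 with the standard library's html.parser instead of BeautifulSoup. The names TableToCsv and html_table_to_csv, the chunk_size parameter, and the sample HTML are all made up for illustration, and the sketch assumes a text-mode source and tables that are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableToCsv(HTMLParser):
    """Hypothetical helper: writes each <tr> to a csv writer as soon as it closes."""

    def __init__(self, writer):
        super().__init__(convert_charrefs=True)
        self.writer = writer
        self.row = None   # cells of the row being read, or None outside a <tr>
        self.cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag in ('td', 'th') and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect fragments per cell.
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self.cell is not None:
            self.row.append(''.join(self.cell).strip())
            self.cell = None
        elif tag == 'tr' and self.row is not None:
            self.writer.writerow(self.row)  # written immediately, not stored
            self.row = None


def html_table_to_csv(source, target, chunk_size=64 * 1024):
    """Feed a text-mode source to the parser chunk by chunk."""
    parser = TableToCsv(csv.writer(target))
    for chunk in iter(lambda: source.read(chunk_size), ''):
        parser.feed(chunk)
    parser.close()


# Illustrative data only; a tiny chunk size forces tags to be split across chunks.
sample = io.StringIO(
    '<table><tr><th>a</th><th>b</th></tr>'
    '<tr><td>1</td><td><b>2</b></td></tr></table>'
)
out = io.StringIO()
html_table_to_csv(sample, out, chunk_size=16)
result = out.getvalue()
print(result)
```

With this approach only the current row is held in memory at any time. For a real download the response is a byte stream, so it would first have to be decoded, for example by wrapping it in io.TextIOWrapper with the file's actual encoding, which is not known here.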