Python: extracting text from an HTML file with bs4

python python-2.7 beautifulsoup html-parsing

I want to extract the text from my HTML file. If I use the following with a hard-coded file path:

import bs4, sys
from urllib import urlopen
#filin = open(sys.argv[1], 'r')
filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')

it works. But when I try to handle an arbitrary file by passing open(sys.argv[1], 'r') as filin instead, I get the following error:

Traceback (most recent call last):
  File "/home/iykeln/Desktop/py/clean.py", line 5, in <module>
    webpage = urlopen(filin).read().decode('utf-8')
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 180, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap
    url = url.strip()
AttributeError: 'file' object has no attribute 'strip'

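You shouldn't call open here. urlopen expects a string (the traceback shows it calling url.strip()), so just pass the file name to urlopen: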

import bs4, sys
from urllib import urlopen

webpage = urlopen(sys.argv[1]).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
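The snippet above relies on Python 2.7, where urllib.urlopen accepts a bare local path. As a minimal sketch for readers on Python 3 (an assumption, since the question is about 2.7): urlopen lives in urllib.request there and rejects a bare path, so the path has to be turned into a file:// URL first. The helper name read_local_html is hypothetical.

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_html(path):
    """Read a local HTML file through urlopen and return it as text."""
    # Python 3's urlopen needs a URL with a scheme, so convert the
    # (possibly relative) path into an absolute file:// URL.
    url = Path(path).resolve().as_uri()
    with urlopen(url) as response:
        # The response yields bytes; decode them, assuming UTF-8 as in the question.
        return response.read().decode("utf-8")
```

The returned string can then be handed to bs4.BeautifulSoup in the same way as webpage above.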

FYI, you don't need urllib to open a local file:

import bs4, sys

with open(sys.argv[1], 'r') as f:
    webpage = f.read().decode('utf-8')

soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
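The same idea can be sketched in a form that also runs on Python 3, where the str returned by f.read() has no .decode() method. This is only a sketch under stated assumptions: the helper names read_html and extract_text are hypothetical, the file is assumed to be UTF-8 as in the question, and bs4's get_text() and the built-in "html.parser" are used in place of joining findAll(text=True).

```python
import io


def read_html(path, encoding="utf-8"):
    """Return the file's contents as text, decoded with the given encoding."""
    # io.open decodes while reading, so no separate .decode() call is needed.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def extract_text(html):
    """Return the text content of an HTML string, with the tags removed."""
    # bs4 is a third-party package; it is imported here so that read_html
    # can be used on its own.
    import bs4

    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup.get_text()
```

With these in place, a script could call print(extract_text(read_html(sys.argv[1]))) after importing sys.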

Hope that helps.

Yes! That helped. Thanks.
