Python: Retrieving content from broken <a> tags with Beautiful Soup

python web-scraping

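I am trying to parse a website and retrieve the text contained in its hyperlinks. For example, I need to retrieve "This is an Example", which I am able to do for pages that have no broken tags. In the following case, I am unable to retrieve the text: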

<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>

Please note that sol.html itself contains the HTML code given above.

Thanks, -AJ

Try the following code:

from BeautifulSoup import BeautifulSoup

text = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>
'''

soup = BeautifulSoup(text)

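for link in soup.findAll('a'):
    # .string is None when a tag has more than one child,
    # as the first link does here (text plus <br>)
    if link.string is not None:
        print(link.string)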

Here is the output when I run the code:

Example

Just replace the text assignment with text = open('sol.html').read(), or whatever source you need.

Remove text=True from your code and it works fine:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <html>
... <body>
... <a href = "http:\\www.google.com">Google<br>
... <a href = "http:\\www.example.com">Example</a>
... </body>
... </html>
... ''')
>>> [a.get_text().strip() for a in soup.find_all('a')]
[u'Google', u'Example']
>>> [a.get_text().strip() for a in soup.find_all('a', text=True)]
[u'Example']
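
Whether both links come back can depend on the parser: when none is named, Beautiful Soup uses the best one installed, and parsers repair invalid markup differently. The sketch below is one hedged way to sidestep that, assuming Python 3 and bs4. It names html.parser explicitly and joins only the strings that are direct children of each <a>, so a link nested inside an unclosed one does not leak into its parent's text. The helper own_text is a hypothetical name, not part of bs4.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>
'''

# Name the parser explicitly so the tree does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Join only the strings that are direct children of `tag` (hypothetical helper)."""
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    ).strip()


# Text of nested tags is skipped, so an unclosed <a> keeps only its own text.
link_texts = [own_text(a) for a in soup.find_all("a")]
print(link_texts)
```

With the markup from the question this should yield both Google and Example. The trade-off is that text inside child tags such as <b> within a link is dropped as well.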

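Hi, I want to find both Google and Example. Currently my code also gives only Example.

Oh, sorry, I see now. Well, I think you just need to fix the broken links, or use ...; I don't know how else you could do it.

Thanks, almost there. When I tried the code as you suggested, I got the following output: [u'Google\nExample', u'Example']. I am just trying to see why Example shows up there. Here is my modified code:

from bs4 import BeautifulSoup, SoupStrainer
f = open("sol.html", "r")
soup = BeautifulSoup(f, parse_only=SoupStrainer('a'))
print [a.get_text().strip() for a in soup.find_all('a')]

No, the extra text "\nExample" is still appended.

@AjayNair: Using the exact code I used? What versions of Python and BeautifulSoup are you using?

Rewriting the above comment (now deleted) for readability: Yes, exactly the same code. I am using Python 2.7 and BS4. My code, tested in the Python interpreter:

>>> soup = BeautifulSoup('''...''')
>>> [a.get_text().strip() for a in soup.find_all('a')]
[u'Google\nExample', u'Example']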

@AjayNair: That is really strange. The only thing I can think of is to have you try installing html5lib.
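
If installing another parser is not an option, the link text can also be collected without Beautiful Soup. The following is a rough sketch, assuming Python 3 and using only the standard library's html.parser module. It treats a new <a> start tag as implicitly closing any <a> that is still open, which is an assumption about what the broken markup intended. The class name LinkTextParser is made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of each <a>, treating a new <a> as closing an open one."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished link texts, in document order
        self._parts = None   # text pieces of the currently open <a>, or None

    def _close_link(self):
        # Store the text gathered so far for the open link, if there is one.
        if self._parts is not None:
            self.links.append("".join(self._parts).strip())
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._close_link()   # an unclosed <a> ends where the next one starts
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_link()

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._close_link()       # flush a link left open at end of input


html = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>
'''

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.links)
```

This keeps the text of both links for the markup in the question, but it is not a general HTML repair strategy: it only covers the specific case of a missing </a>.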
