Web scraping for a specific word and a number of words before and after it

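I am looking for a specific word, "risk", in a paragraph. Could someone help me write code that checks whether the word appears in the paragraph and, if it does, extracts just the 6 words before and after the keyword? Thanks in advance.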

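Here is a quick example. Note, though, that I have not accounted for the case where there are fewer than 6 words before or after the keyword. But it gives you a general starting point.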

import bs4 as bs
import urllib.request
import re

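# 'url' is a placeholder; substitute the page you want to scrape.
sauce = urllib.request.urlopen('url').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.text)

# Text nodes that contain 'risk' anywhere (substring match).
test = soup.find_all(string=re.compile('risk'))
print(test)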
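
Because `re.compile('risk')` is an unanchored, case-sensitive pattern, a search with it also accepts text containing "risks" or "risky" and skips "Risk". The sketch below uses only the standard `re` module on made-up sample strings to show the difference; the `\brisk\b` pattern with `re.IGNORECASE` is an assumed stricter alternative, not something the snippet above uses.

```python
import re

# Made-up sample strings, for illustration only.
samples = ["the risk of losing", "a risky bet", "high risks ahead", "Risk Warning"]

substring = re.compile('risk')                       # unanchored, case-sensitive
whole_word = re.compile(r'\brisk\b', re.IGNORECASE)  # whole word, any case

substring_hits = [s for s in samples if substring.search(s)]
whole_word_hits = [s for s in samples if whole_word.search(s)]

print(substring_hits)   # ['the risk of losing', 'a risky bet', 'high risks ahead']
print(whole_word_hits)  # ['the risk of losing', 'Risk Warning']
```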

from bs4 import BeautifulSoup
import requests
import re

key_word = 'risk'
url = 'https://www.investing.com/analysis/2-reasons-merck-200373488'

with requests.Session() as s: 
    s.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    # Send the request while the session is still open.
    response = s.get(url)

soup = BeautifulSoup(response.text, "html.parser")

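# Text nodes containing the keyword (substring match, so "risks" also qualifies here).
paragraphs = soup.find_all(string=re.compile(key_word))

if len(paragraphs) == 0:
    print('"%s" not found.' % key_word)
else:
    for paragraph in paragraphs:
        alpha = paragraph.strip().split(' ')

        try:
            # Exact, case-sensitive token match; raises ValueError if absent.
            idx = alpha.index(key_word)
        except ValueError:
            continue

        # Six words before, the keyword itself, and six words after.
        # Not handled: idx < 6, where the start index goes negative.
        six_words = alpha[idx-6: idx] + alpha[idx: idx+7]
        print(' '.join(six_words) + '\n')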

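Output:

cryptocurrencies involves high risks including the risk of losing some, or all, of

investment objectives, level of experience, and risk appetite, and seek professional advice where

Do you have the URL? I just wanted generic code that works for any keyword on any URL. Thanks.

I think this solution should work. It also gives you output when there are fewer than 6 words before or after the keyword in the string. In addition, it matches "risk" correctly and will not match something like "risky".

You will have to make some modifications to fit your use case.
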
from bs4 import BeautifulSoup
import urllib.request
import re

url = 'https://www.investing.com/analysis/2-reasons-merck-200373488'
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
sauce = urllib.request.urlopen(req).read()
soup = BeautifulSoup(sauce, 'html.parser')

# Matches 'Risk', 'risk.' and 'risk', but NOT 'risky' or 'risks'.
pattern = re.compile(r'risk[. ]', re.IGNORECASE)
no_of_words = 6

for elem in soup.find_all(string=pattern):
    text = elem.parent.text
    words = text.split(' ')
    # Append ' ' to each word so that a bare 'risk' conforms to the pattern.
    indices = [i for i, w in enumerate(words) if pattern.match(w.strip() + ' ')]
    for index in indices:
        start = max(index - no_of_words, 0)  # clamp at the start of the text
        end = index + no_of_words + 1        # slicing past len(words) is safe
        print(' '.join(words[start:end]).strip())
        print("List of Word Before: ", words[start:index])   # words before
        print("List of Words After: ", words[index+1:end])   # words after
        print()

Output:

Risk Warning
List of Word Before:  []
List of Words After:  ['Warning']

Risk Disclosure:
List of Word Before:  []
List of Words After:  ['Disclosure:']

Risk Disclosure: Trading in financial instruments and/or
List of Word Before:  []
List of Words After:  ['Disclosure:', 'Trading', 'in', 'financial', 'instruments', 'and/or']

cryptocurrencies involves high risks including the risk of losing some, or all, of
List of Word Before:  ['cryptocurrencies', 'involves', 'high', 'risks', 'including', 'the']
List of Words After:  ['of', 'losing', 'some,', 'or', 'all,', 'of']

investment objectives, level of experience, and risk appetite, and seek professional advice where
List of Word Before:  ['investment', 'objectives,', 'level', 'of', 'experience,', 'and']
List of Words After:  ['appetite,', 'and', 'seek', 'professional', 'advice', 'where']

investment objectives, level of experience, and risk appetite, and seek professional advice where
List of Word Before:  ['investment', 'objectives,', 'level', 'of', 'experience,', 'and']
List of Words After:  ['appetite,', 'and', 'seek', 'professional', 'advice', 'where']
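
For the "generic code for any keyword" request, the extraction step can be separated from the download step. The following is a minimal sketch under a few assumptions: the function name `keyword_context` and the sample sentence are hypothetical, the page text has already been obtained (for instance via `soup.get_text()`), and whole-word matching with `\b` is an assumed design choice that differs slightly from the `risk[. ]` pattern above (it also accepts "risk," and "high-risk").

```python
import re

def keyword_context(text, keyword, n=6):
    """Return (before, hit, after) for every whole-word match of keyword in text."""
    words = text.split()
    # \b stops 'risk' from matching inside 'risky' or 'risks';
    # tokens with punctuation attached, such as 'risk,' or 'risk.', still match.
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
    results = []
    for i, word in enumerate(words):
        if pattern.search(word):
            start = max(i - n, 0)  # clamp so a negative index never wraps around
            results.append((words[start:i], word, words[i + 1:i + 1 + n]))
    return results

# Hypothetical sample text, modelled on the output shown above.
sample = ("Trading cryptocurrencies involves high risks including the risk "
          "of losing some, or all, of your investment")

for before, hit, after in keyword_context(sample, "risk"):
    print(' '.join(before + [hit] + after))
# prints: cryptocurrencies involves high risks including the risk of losing some, or all, of
```

Keeping the function free of any network code means it can be tried on plain strings first and then fed the text of whichever page is scraped.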