Python: newspaper module - is there a way to get articles straight from a URL?


I am using the newspaper module for Python.

In the tutorial, it describes how you can pool the building of different newspapers, s.t. it generates them at the same time. (See "Multi-threading article downloads" in the link above.)

Is there a way to pull articles straight from a list of URLs? That is, is there some way I can feed multiple URLs into the setup below and have it download and parse them concurrently?

from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])

I am not familiar with the newspaper module, but the following code uses a list of URLs and should be equivalent to the one provided in the linked page:

import newspaper
from newspaper import news_pool

urls = ['http://slate.com','http://techcrunch.com','http://espn.com']
papers = [newspaper.build(i) for i in urls]
news_pool.set(papers, threads_per_source=2)
news_pool.join()
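
Note that news_pool only downloads each source's articles; parsing still happens per article. A minimal sketch of reading the text afterwards, assuming the papers list built above:

for paper in papers:
    # the HTML was already downloaded by news_pool; parse() extracts title/text
    for article in paper.articles[:5]:
        try:
            article.parse()
            print(article.title)
            print(article.text[:150])
        except Exception:
            pass  # skip articles whose download or parse failed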

I was able to accomplish this by creating a Source for each article URL. (Disclaimer: not a python developer)


I know this question is really old, but it's one of the first links that came up when I googled how to do multithreaded downloads with newspaper. While Kyle's answer is very helpful, it is not complete and I think it has some typos.

import newspaper

urls = [
'http://www.baltimorenews.net/index.php/sid/234363921',
'http://www.baltimorenews.net/index.php/sid/234323971',
'http://www.atlantanews.net/index.php/sid/234323891',
'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()
I changed Stubsource to SingleSource and changed one of the URLs to articleURL. Of course, this only downloads the web pages; you still need to parse them to get the text:

multi=[]
i=0
for s in sources:
    i+=1
    try:
        (s.articles[0]).parse()
        txt = (s.articles[0]).text
        multi.append(txt)
    except:
        pass
With my sample of 100 URLs, this took half the time compared to simply processing each URL in sequence. (Edit: after increasing the sample size to 2000, the reduction is about a quarter.)

(Edit: got the whole thing working with multithreading!) I used this very good explanation for my implementation. With a sample size of 100 URLs, using 4 threads takes a comparable time to the code above, but increasing the thread count to 10 reduces it by about half again. A larger sample size needs more threads to give a comparable difference.

import newspaper
from newspaper import Article
from multiprocessing.dummy import Pool as ThreadPool

def getTxt(url):
    article = Article(url)
    article.download()
    try:
        article.parse()
        txt=article.text
        return txt
    except:
        return ""

pool = ThreadPool(10)

# open the urls in their own threads
# and return the results
results = pool.map(getTxt, urls)

# close the pool and wait for the work to finish 
pool.close() 
pool.join()
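
Since pool.map preserves the order of its input, each entry in results lines up with the corresponding entry in urls, so the extracted texts can be paired back up with their source URLs:

# results[i] is the text extracted from urls[i] (or "" if parsing failed)
for url, txt in zip(urls, results):
    print(url, len(txt))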

Building on Joseph's answer. I assume the original poster wanted to use multithreading to extract a set of data and store it somewhere properly. After many tries I think I have found a solution; it may not be the most efficient, but it works. I tried to make it better, however, I think the newspaper3k plugin may be slightly flawed. However, this works for extracting the desired elements into a DataFrame.

import newspaper
from newspaper import Article
from newspaper import Source
from newspaper import news_pool
import pandas as pd

gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc_paper = newspaper.build("https://www.bbc.com/news", memoize_articles=False)
papers = [gamespot_paper, bbc_paper]
news_pool.set(papers, threads_per_source=4) 
news_pool.join()

#Create our final dataframe
df_articles = pd.DataFrame()

#Create a download limit per sources
limit = 100

for source in papers:
    #temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source =[]

    count = 0

    for article_extract in source.articles:
        article_extract.parse()

        if count > limit:
            break

        #appending the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        #Update count
        count +=1


    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append to the final DataFrame
    df_articles = df_articles.append(df_temp, ignore_index = True)
    print('source extracted')
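
One small caveat, not from the original answer: DataFrame.append was deprecated and removed in pandas 2.0, so on a recent pandas the same result can be obtained by collecting the per-source frames in a list and concatenating once at the end. A sketch of that pattern, reusing the loop above:

#Collect the per-source frames in a list instead of appending repeatedly
frames = []

for source in papers:
    # ... build list_title, list_text and list_source exactly as above ...
    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    frames.append(df_temp)

#One concatenation at the end replaces df_articles.append(...)
df_articles = pd.concat(frames, ignore_index=True)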

Please suggest any improvements.

What do you mean by "pull articles straight from a URL"? Do you want to crawl a given URL and download all the linked articles? I just want to grab the URLs given for the articles on the page. I want to be able to supply a set of URLs so that they can be downloaded concurrently. That is what I meant by "In the tutorial, it describes how you can pool the building of different newspapers, s.t. it generates them at the same time." I don't think that is what I want. To be specific, I am also trying to do the same thing, but with the URLs of specific articles, and I cannot figure out how to extract the article's text... if it has been downloaded.