Python: filter links in a DataFrame by specific domain names only

python pandas dataframe filter

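I have a pandas DataFrame with 5 columns. I need to filter the DataFrame on the link column by the domain names in a list, and count the matching rows for each domain. Suppose I have the following DataFrame: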

url_id | link
------------------------------------------------------------------------
1      | http://www.example.com/somepath
2      | http://www.somelink.net/example
3      | http://other.someotherurls.ac.uk/thisissomelink.net&part/sample 
4      | http://part.example.com/directory/files

I want to filter the DataFrame by the domain names in the following list and count the number of results for each:

domains = ['example.com', 'other.com', 'somelink.net' , 'sample.com']

The expected output is:

domain       | no_of_links
--------------------------
example.com  |  2
other.com    |  0
somelink.net |  1
sample.com   |  0

Here is my code:

from tld import get_tld 
import pandas as pd

def urlparsing(row):
    url = row['link']
    res = get_tld(url,as_object=True)
    return (res.fld)

link = ({"url_id":[1,2,3,4],"link":["http://www.example.com/somepath",
            "http://www.somelink.net/example",
            "http://other.someotherurls.ac.uk/thisissomelink.net&part/sample",
            "http://part.example.com/directory/files"]})

domains = ['example.com', 'other.com', 'somelink.net' , 'sample.com']
df_link = pd.DataFrame(link)

ref_dom = []
for dom in domains:   
    ddd = df_link[(df_link.apply(lambda row: urlparsing(row), axis=1)).str.contains(dom, regex=False)]     
    ref_dom.append([dom, len(ddd)])

pd.DataFrame(ref_dom, columns=['domain','no_of_links'])

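Basically, my code works. However, when the DataFrame is very large (more than 5 million rows) and the list contains more than 100,000 domain names, the process took me a full day.

If you know another way to make it faster, please let me know. Any help would be appreciated. Thank you.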
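
A note on where the time goes: the loop above re-runs the row-wise apply (and so re-parses every URL) once per domain, so the cost grows roughly with rows × domains. The following is only a minimal sketch of parsing each URL once and looking its hostname suffixes up in a set. It swaps in the standard library's urllib.parse for the tld package, so it matches on hostname suffixes rather than on tld's registered domain, and matched_domain is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlsplit

import pandas as pd


def matched_domain(url, domain_set):
    """Return the listed domain that the URL's hostname equals or is a
    subdomain of, or None if no listed domain matches (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Try the longest suffix first: www.example.com, example.com, com.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in domain_set:
            return candidate
    return None


df_link = pd.DataFrame({
    "url_id": [1, 2, 3, 4],
    "link": ["http://www.example.com/somepath",
             "http://www.somelink.net/example",
             "http://other.someotherurls.ac.uk/thisissomelink.net&part/sample",
             "http://part.example.com/directory/files"],
})
domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
domain_set = set(domains)  # set membership tests do not slow down as the list grows

# Parse every URL exactly once, then count how often each listed domain was hit.
matched = df_link["link"].map(lambda u: matched_domain(u, domain_set))
counts = matched.value_counts().reindex(domains, fill_value=0)
print(counts)
```

Because each row is handled once and each lookup is a set membership test, the work no longer multiplies by the length of the domain list. Whether suffix matching is equivalent to the tld-based logic for a given data set is an assumption worth checking on a sample first.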

You can do this with a regex and the findall function of the pandas .str accessor:

# df is the DataFrame that holds the link column (df_link in the question)
domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
pat = "|".join([rf"http[s]?://(?:\w*\.)?({domain})"
                for domain in map(lambda x: x.replace(".", r"\."), domains)])
match = df["link"].str.findall(pat).explode().explode()
match = match[match.str.len() > 0]
match.groupby(match).count()

Result

link
example.com     2
somelink.net    1
Name: link, dtype: int64
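
Why two explode calls and a length filter are needed may not be obvious. The pattern contains one capture group per domain, so each match comes back as a tuple with one slot per group, and only the slot for the domain that matched is non-empty. A small sketch with the standard re module, whose findall is what Series.str.findall applies to each element (the two-domain list here is shortened for illustration):

```python
import re

domains = ['example.com', 'other.com']
# One alternative, and therefore one capture group, per domain.
pat = "|".join(rf"http[s]?://(?:\w*\.)?({re.escape(d)})" for d in domains)

hits = re.findall(pat, "http://www.example.com/somepath")
print(hits)  # one tuple per match, one slot per capture group

# Flatten the tuples and drop the empty slots, which is what the
# double explode followed by the str.len() > 0 filter does in pandas.
flat = [group for tup in hits for group in tup if group]
print(flat)
```

The first explode turns each list of matches into rows of tuples, the second turns each tuple into rows of strings, and the filter removes the empty strings left by the groups that did not take part in the match.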

For pandas versions before 0.25:

# df is the DataFrame that holds the link column (df_link in the question)
domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
pat = "|".join([rf"http[s]?://(?:\w*\.)?({domain})"
                for domain in map(lambda x: x.replace(".", r"\."), domains)])
match = df["link"].str.findall(pat)\
          .apply(lambda x: "".join([domain for match in x for domain in match]).strip())
match = match[match.str.len() > 0]
match.groupby(match).count()

To also get the domains with 0 links, you can join the result with a DataFrame that contains all the domains.
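
A minimal sketch of that join, assuming the counts arrive as a Series indexed by domain, as the groupby above returns. The counts Series is hard-coded here as a stand-in for that result, and the column names domain and no_of_links are taken from the expected output in the question.

```python
import pandas as pd

domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']

# Stand-in for the groupby result: only domains with at least one match appear.
counts = pd.Series({'example.com': 2, 'somelink.net': 1}, name='no_of_links')

# One row per requested domain, in the order of the original list.
all_domains = pd.DataFrame({'domain': domains})
found = counts.rename_axis('domain').reset_index()

# A left join keeps every requested domain; unmatched ones get NaN, then 0.
result = all_domains.merge(found, on='domain', how='left')
result['no_of_links'] = result['no_of_links'].fillna(0).astype(int)
print(result)
```

The left join keeps the order of the domain list, so the output lines up with the expected table in the question.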