
Python: extracting URL information from a column


I need to keep certain parts of the links:

Link             
www.xxx.co.uk/path1
www.asx_win.com/path2
www.asdfe.aer.com
...
Desired output:

Link2
xxx.co.uk
asx_win.com
asdfe.aer.com
...
I used urlparse and tldextract, but both return something like:

Netloc
www.xxx.co.uk
www.asx_win.com
www.asdfe.aer.com
...

Using plain string operations, some problems can come from URLs like these:

9     https://www.facebook.com/login/?next=https%3A%...
10    https://pt-br.facebook.com/114546123419/pos...
11    https://www.facebook.com/login/?next=https%3A%...
20    http://fsareq.media/?pg=article&id=s...
22    https://www.wq-wq.com/lrq-rqwrq-...
24    https://faseqrq.it/2020/05/28/...
My attempt is to consider the difference between what urlparse's netloc returns and what tldextract returns, i.e. the ending part. For example, I get www.xxx.co.uk from netloc and xxx from tldextract. This means that if I subtract the tldextract result from the netloc, I am left with www and co.uk. I would use the common part as the cut point and keep what comes after it, i.e. co.uk, which is what I am looking for.

The difference would be given by something like df['Link2'] = [a.replace(b, '').strip() for a, b in zip(df['Netloc'], df['TLDEXTRACT'])]. This leaves only the ending part, i.e. the suffix, that I need to consider.
Now I need to understand how to take only that ending part to get the expected output. You can use the Netloc and TLDEXTRACT columns from the example above.
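
For reference, a minimal sketch of this subtraction idea, assuming sample data and the Netloc/TLDEXTRACT columns described above (urlparse only fills netloc when the URL has a scheme, so one is prepended where missing):

import pandas as pd
import tldextract
from urllib.parse import urlparse

df = pd.DataFrame({'Link': ['www.xxx.co.uk/path1',
                            'www.asx_win.com/path2',
                            'https://www.facebook.com/login/']})

# prepend '//' so urlparse treats scheme-less links as network locations
df['Netloc'] = df['Link'].apply(lambda x: urlparse(x if '//' in x else '//' + x).netloc)
df['TLDEXTRACT'] = df['Link'].apply(lambda x: tldextract.extract(x).domain)

# subtracting the tldextract domain from the netloc leaves e.g. 'www..co.uk',
# i.e. the www prefix plus the co.uk suffix the question mentions
df['Diff'] = [a.replace(b, '') for a, b in zip(df['Netloc'], df['TLDEXTRACT'])]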

First remove http/https:

from urllib.parse import urlparse
def remove(row):
    # inside apply(axis=1), row['urls'] is a plain string, so test membership
    # with `in` rather than the Series .str accessor
    if 'https' in row['urls'] or 'http' in row['urls']:
        return urlparse(row['urls']).netloc

withouthttp = df.apply(lambda x: remove(x), axis=1)
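For instance, on a small assumed frame, this yields the netloc for rows that contain a scheme and None for the rest:

import pandas as pd

df = pd.DataFrame({'urls': ['https://www.facebook.com/login/', 'www.xxx.co.uk/path1']})
withouthttp = df.apply(lambda x: remove(x), axis=1)
# 0    www.facebook.com
# 1                None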
Then:

cut the first 4 characters (the www.)

and everything after the / (see the sketch right below)
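
A minimal sketch of those two cuts on assumed sample data:

import pandas as pd

s = pd.Series(['www.xxx.co.uk/path1', 'www.asx_win.com/path2'])
s = s.str[4:]                # drop the first 4 characters, i.e. 'www.'
s = s.str.split('/').str[0]  # keep only what comes before the first '/'
# 0      xxx.co.uk
# 1    asx_win.com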

You can also split all the records into those with https and those without:

onlyHttps = df.loc[df['urls'].str.contains("https", case=False)]
allWithoutHttps = df[~df["urls"].str.contains("https", case=False)]
After all the operations (dropping www and http/https), concat the corrected records:

pd.concat([https, http, www])
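
Putting it together, a rough end-to-end version of this approach; the https, http, and www names in pd.concat above are the answer's shorthand for the partial results, and the helper below is an assumed reconstruction:

import pandas as pd
from urllib.parse import urlparse

df = pd.DataFrame({'urls': ['www.xxx.co.uk/path1',
                            'https://www.facebook.com/login/',
                            'http://fsareq.media/?pg=article']})

def hostname(url):
    # urlparse only fills netloc when a scheme is present
    netloc = urlparse(url).netloc if '//' in url else url.split('/')[0]
    return netloc[4:] if netloc.startswith('www.') else netloc

df['Link2'] = df['urls'].apply(hostname)
# ['xxx.co.uk', 'facebook.com', 'fsareq.media']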

tldextract.extract returns a named tuple of (subdomain, domain, suffix):

tldextract.extract('www.xxx.co.uk')
ExtractResult(subdomain='www', domain='xxx', suffix='co.uk')

So you can just join the parts from index [1:]:

import tldextract

df['Extracted'] = df.Link.apply(lambda x: '.'.join(tldextract.extract(x)[1:]))

                                                Link     Extracted
0                                www.xxx.co.uk/path1     xxx.co.uk
1                              www.asx_win.com/path2   asx_win.com
2                                  www.asdfe.aer.com       aer.com
3  https://www.facebook.com/login/?next=https%3A%...  facebook.com
4     https://pt-br.facebook.com/114546123419/pos...  facebook.com
5  https://www.facebook.com/login/?next=https%3A%...  facebook.com
6            http://fsareq.media/?pg=article&id=s...  fsareq.media
7                https://www.wq-wq.com/lrq-rqwrq-...     wq-wq.com
8                  https://faseqrq.it/2020/05/28/...    faseqrq.it
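
A slightly more defensive variant of the same idea (an assumed variation, not part of the original answer), using the named fields instead of tuple slicing and skipping the suffix when tldextract does not recognize one:

import tldextract

def registered_domain(url):
    ext = tldextract.extract(url)
    # join domain and suffix, skipping an empty suffix (e.g. bare hostnames)
    return '.'.join(part for part in (ext.domain, ext.suffix) if part)

registered_domain('https://pt-br.facebook.com/114546123419')  # 'facebook.com'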

Ok - since it's a dataframe and not a series - I'm trying to rework it.
@LdM - because of df['url'].str[4:] - it first cuts 4 characters - that is h-t-t-p.
Thank you, tdy! Yes, this is exactly what I was looking for!