Ignoring substrings at the end of a string with Python

python string pandas

I have the following data:

213.87.137.33 - - [14/Apr/2016:17:23:36],"CONNECT api-glb-ams.smoot.apple.com:443",200 0,"SafariShared/601.1.46.42 (iPhone4,1; iPhone OS 13C75) Safari/601.1",9443 api-glb-ams.smoot.apple.com 443 1856
213.87.137.33 - - [14/Apr/2016:17:23:36],"CONNECT init.itunes.apple.com:443",200 0,"MobileSafari/601.1 CFNetwork/758.2.8 Darwin/15.0.0",9443 init.itunes.apple.com 443 50073
213.87.137.33 - - [14/Apr/2016:17:23:54],"GET http://www.rbc.ru/ajax/getnewsfeed/?",304 292,"Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13C75 Safari/601.1",9443 www.rbc.ru 80 9547
213.87.137.33 - - [14/Apr/2016:17:23:56],"GET http://www.rbc.ru/ajax/mainjson/?",200 99535,"Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13C75 Safari/601.1",9443 www.rbc.ru 80 0
213.87.137.33 - - [14/Apr/2016:17:23:58],"CONNECT api-glb-ams.smoot.apple.com:443",200 0,"SafariShared/601.1.46.42 (iPhone4,1; iPhone OS 13C75) Safari/601.1",9443 api-glb-ams.smoot.apple.com 443 40633
213.87.137.33 - - [14/Apr/2016:17:23:58],"GET https://api-glb-ams.smoot.apple.com.js",200 381,"SafariShared/601.1.46.42 (iPhone4,1; iPhone OS 13C75) Safari/601.1",9443 - 443 40633
213.87.137.33 - - [14/Apr/2016:17:24:02],"CONNECT init.itunes.apple.com:443",200 0,"MobileSafari/601.1 CFNetwork/758.2.8 Darwin/15.0.0",9443 init.itunes.apple.com 443 57391

I need to skip the URLs that end with certain words. I tried:

import pandas as pd

colnames = ["used_at", "url", "smth", "browser", "smth2"]
df = pd.read_csv('urls.csv', names=colnames, header=None, sep='""', engine="python")
df['url'] = df['url'].str.strip(',')
urls = df['url']
ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json', '.css', '.swf', 'svg', 'ico', '.cur')
for url in urls:
    if not url.startswith('GET'):
        continue
    elif url.endswith(word for word in ignore):
        continue
    else:
        print(url)

But it raises:

TypeError: endswith first arg must be str, unicode, or tuple, not generator

The simplest fix is to pass ignore directly:

ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json', '.css', '.swf', 'svg', 'ico', '.cur')
for url in urls:
    if not url.startswith('GET'):
        continue
    elif url.endswith(ignore):  # pass the tuple directly
        continue
    else:
        print(url)

This works because endswith accepts a tuple of suffixes and returns True if the string ends with any of them.
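To make the difference concrete, here is a minimal sketch comparing the three forms discussed in this thread: the tuple passed directly, any() over a generator, and the generator passed to endswith (which reproduces the error). The urls list and the keep helper are hypothetical, modelled on the log lines in the question.

```python
# Suffixes copied from the question.
ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml',
          '.json', '.css', '.swf', 'svg', 'ico', '.cur')

# Illustrative request strings (hypothetical, modelled on the log above).
urls = [
    'GET http://www.rbc.ru/ajax/mainjson/?',
    'GET https://api-glb-ams.smoot.apple.com.js',
    'CONNECT init.itunes.apple.com:443',
]


def keep(url, suffixes=ignore):
    """Return True for GET requests that do not end with an ignored suffix."""
    return url.startswith('GET') and not url.endswith(suffixes)


kept = [url for url in urls if keep(url)]
print(kept)

# any() over a generator gives the same answer, testing one suffix at a time.
assert all(
    url.endswith(ignore) == any(url.endswith(s) for s in ignore)
    for url in urls
)

# Passing the generator itself to endswith reproduces the error.
try:
    urls[0].endswith(s for s in ignore)
except TypeError as exc:
    print('TypeError:', exc)
```

The tuple form is usually preferable: it is shorter and the suffix loop runs inside the string method rather than in Python code.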

Alternatively, change the condition to

elif any(url.endswith(word) for word in ignore):

It reads quite nicely: if the URL ends with any word from ignore, then do something.

You can use the vectorized .str.startswith and .str.contains to do this.

endswith accepts a tuple, so you can just use url.endswith(ignore).

@hellmoore Yep. Python is great for this kind of processing. :)

You can first create ignore_li by joining the entries of ignore with | (regex "or"), then filter the DataFrame with boolean indexing: take the last 5 characters with str[] and test them with str.contains. Finally, select only the url column with loc:

ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml',
          '.json', '.css', '.swf', 'svg', 'ico', '.cur')
ignore_li = '|'.join(ignore)

# Keep GET requests whose last 5 characters match none of the ignored patterns
print(df.loc[df.url.str.startswith('GET') & ~(df.url.str[-5:].str.contains(ignore_li)), 'url'])

0                        GET http://www.livejournal.com/
1      GET http://pagead2.googlesyndication.com/activ...
2      GET http://pagead2.googlesyndication.com/activ...
3      GET http://rtax.criteo.com/delivery/rta/rta.js...
4      GET http://l-stat.livejournal.net/tmpl/??Widge...
5      GET http://xc3.services.livejournal.com/ljcoun...
7                     GET http://montblanc.rambler.ru/mb
8      GET http://awaps.yandex.ru/0/9999/001001.gif?0...
9      GET http://www.tns-counter.ru/V13a***R%3E*sup_...
10     GET http://b.scorecardresearch.com/b?c1=2&c2=1...
11     GET http://l-api.livejournal.com/__api/?callba...
12     GET http://l-api.livejournal.com/__api/?callba...
13     GET http://www.tns-counter.ru/V13a****rambler_...
15     GET http://www.tns-counter.ru/V13a****rambler_...
16     GET http://www.tns-counter.ru/V13a****rambler_...
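One caveat about the pattern above: '|'.join(ignore) leaves each . unescaped, so it matches any character, and str.contains matches anywhere inside the last five characters rather than only at the very end. A stricter variant, sketched here under the assumption that a true suffix match is wanted, escapes each entry with re.escape and anchors the alternation with $. The sample DataFrame and the example.com URL are hypothetical.

```python
import re

import pandas as pd

# Suffixes copied from the question.
ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml',
          '.json', '.css', '.swf', 'svg', 'ico', '.cur')

# Hypothetical sample; in the question this column comes from urls.csv.
df = pd.DataFrame({'url': [
    'GET http://www.rbc.ru/ajax/mainjson/?',
    'GET https://api-glb-ams.smoot.apple.com.js',
    'CONNECT init.itunes.apple.com:443',
    'GET http://example.com/icons',  # contains "ico" but does not end with it
]})

# Escape each suffix so "." is literal, and anchor the alternation at the end.
pattern = '(?:' + '|'.join(re.escape(s) for s in ignore) + ')$'

is_get = df['url'].str.startswith('GET')
is_ignored = df['url'].str.contains(pattern)

result = df.loc[is_get & ~is_ignored, 'url']
print(result.tolist())
```

With the unanchored five-character check, the last row would be dropped because "icons" contains "ico"; the anchored pattern keeps it.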