Ignoring substrings at the end of a string using Python
I have data:
213.87.137.33 - - [14/Apr/2016:17:23:36],"CONNECT api-glb-ams.smoot.apple.com:443",200 0,"SafariShared/601.1.46.42 (iPhone4,1; iPhone OS 13C75) Safari/601.1",9443 api-glb-ams.smoot.apple.com 443 1856
213.87.137.33 - - [14/Apr/2016:17:23:36],"CONNECT init.itunes.apple.com:443",200 0,"MobileSafari/601.1 CFNetwork/758.2.8 Darwin/15.0.0",9443 init.itunes.apple.com 443 50073
213.87.137.33 - - [14/Apr/2016:17:23:54],"GET http://www.rbc.ru/ajax/getnewsfeed/?",304 292,"Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13C75 Safari/601.1",9443 www.rbc.ru 80 9547
213.87.137.33 - - [14/Apr/2016:17:23:56],"GET http://www.rbc.ru/ajax/mainjson/?",200 99535,"Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13C75 Safari/601.1",9443 www.rbc.ru 80 0
213.87.137.33 - - [14/Apr/2016:17:23:58],"CONNECT api-glb-ams.smoot.apple.com:443",200 0,"SafariShared/601.1.46.42 (iPhone4,1; iPhone OS 13C75) Safari/601.1",9443 api-glb-ams.smoot.apple.com 443 40633
213.87.137.33 - - [14/Apr/2016:17:23:58],"GET https://api-glb-ams.smoot.apple.com.js",200 381,"SafariShared/601.1.46.42 (iPhone4,1; iPhone OS 13C75) Safari/601.1",9443 - 443 40633
213.87.137.33 - - [14/Apr/2016:17:24:02],"CONNECT init.itunes.apple.com:443",200 0,"MobileSafari/601.1 CFNetwork/758.2.8 Darwin/15.0.0",9443 init.itunes.apple.com 443 57391
I need to ignore some of the urls: those that end with certain words.
I tried:
    import pandas as pd

    colnames = ["used_at", "url", "smth", "browser", "smth2"]
    df = pd.read_csv('urls.csv', names=colnames, header=None, sep='""', engine="python")
    df['url'] = df['url'].str.strip(',')
    urls = df['url']
    ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json', '.css', '.swf', 'svg', 'ico', '.cur')
    for url in urls:
        if not url.startswith('GET'):
            continue
        elif url.endswith(word for word in ignore):  # this line raises the TypeError
            continue
        else:
            print(url)
But it returns:
TypeError: endswith first arg must be str, unicode, or tuple, not generator
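The error comes from str.endswith itself: it accepts a single string or a tuple of strings, and a generator expression is neither. A minimal sketch that reproduces this, using a made-up sample value and a shortened ignore tuple (the exact wording of the message differs between Python versions):

```python
# Hypothetical sample value, shaped like the entries of the url column.
url = "GET https://api-glb-ams.smoot.apple.com.js"
ignore = ('.jpg', '.js', '.png')

# A tuple of suffixes is accepted: True if url ends with any of them.
tuple_result = url.endswith(ignore)

# A generator expression is rejected with TypeError.
try:
    url.endswith(word for word in ignore)
    generator_error = None
except TypeError as exc:
    generator_error = exc

print(tuple_result)
print(type(generator_error).__name__)
```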
The simplest way is to use ignore directly:
    ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json', '.css', '.swf', 'svg', 'ico', '.cur')
    for url in urls:
        if not url.startswith('GET'):
            continue
        elif url.endswith(ignore):  # use ignore directly here
            continue
        else:
            print(url)
This works because endswith accepts a tuple of suffixes.

Change url.endswith(word for word in ignore) to any(url.endswith(word) for word in ignore). It reads quite nicely: if the url ends with any word from ignore, then do something.
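Both suggestions can be checked side by side. The sketch below is only an illustration: sample_urls, keep_with_tuple and keep_with_any are made-up names, and the sample values are modeled on the log lines in the question rather than read from urls.csv.

```python
ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json',
          '.css', '.swf', 'svg', 'ico', '.cur')

# Illustrative values only, modeled on the log lines in the question.
sample_urls = [
    "CONNECT init.itunes.apple.com:443",
    "GET http://www.rbc.ru/ajax/getnewsfeed/?",
    "GET https://api-glb-ams.smoot.apple.com.js",
    "GET http://www.rbc.ru/ajax/mainjson/?",
]


def keep_with_tuple(urls):
    """Keep GET requests that do not end with an ignored suffix (tuple form)."""
    return [u for u in urls
            if u.startswith('GET') and not u.endswith(ignore)]


def keep_with_any(urls):
    """Same filter, testing each suffix separately with any()."""
    return [u for u in urls
            if u.startswith('GET')
            and not any(u.endswith(word) for word in ignore)]


kept = keep_with_tuple(sample_urls)
print(kept)
```

The two helpers return the same list; the tuple form is shorter, while the any() form also works when the suffixes live in a list or another iterable.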
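
You can first create ignore_li by joining the entries of ignore with | (regex "or"), then filter the DataFrame with a boolean mask: take the last 5 characters of each url with str[-5:] and test them with str.contains. Finally, return only the url column with loc.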
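One caveat with joining the suffixes by |: str.contains treats the pattern as a regular expression by default, so each '.' matches any character, and a match anywhere inside the five-character slice counts. The sketch below uses the standard re module to show the difference once the suffixes are escaped and anchored to the end of the string. The names loose, strict and tail, and the example.com urls, are made up for illustration.

```python
import re

ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json',
          '.css', '.swf', 'svg', 'ico', '.cur')

# Plain join: '.' is a regex wildcard and nothing ties the match to the end.
loose = re.compile('|'.join(ignore))

# Escaped and anchored: literal suffixes, matched only at the end of the string.
strict = re.compile('(?:' + '|'.join(re.escape(s) for s in ignore) + ')$')


def tail(url, n=5):
    """Return the last n characters, like str[-5:] on a pandas column."""
    return url[-n:]


# Hypothetical urls chosen to show where the two patterns disagree.
examples = [
    "GET http://example.com/app.js",     # really ends with '.js'
    "GET http://example.com/index.jsp",  # contains '.js' but ends with '.jsp'
    "GET http://example.com/icons",      # contains 'ico' but does not end with it
]

for u in examples:
    print(u, bool(loose.search(tail(u))), bool(strict.search(u)))
```

With the anchored pattern the str[-5:] slice is no longer needed, because `$` already restricts the match to the end of the string.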

You can use the vectorized .str.startswith and str.contains to do this.
endswith accepts a tuple, so you can just use url.endswith(ignore).
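A vectorized version along these lines might look as follows. This is a sketch under assumptions, not the code from the answers: the small DataFrame is built inline as a stand-in for urls.csv, and the suffixes are escaped and anchored with `$` instead of slicing the last five characters.

```python
import re

import pandas as pd

ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml', '.json',
          '.css', '.swf', 'svg', 'ico', '.cur')

# Stand-in for the url column that would normally be read from urls.csv.
df = pd.DataFrame({'url': [
    "CONNECT init.itunes.apple.com:443",
    "GET http://www.rbc.ru/ajax/getnewsfeed/?",
    "GET https://api-glb-ams.smoot.apple.com.js",
    "GET http://www.rbc.ru/ajax/mainjson/?",
]})

# Literal suffixes in a non-capturing group, anchored to the end of the string.
pattern = '(?:' + '|'.join(re.escape(s) for s in ignore) + ')$'

is_get = df['url'].str.startswith('GET')                  # boolean mask: GET requests
is_ignored = df['url'].str.contains(pattern, regex=True)  # boolean mask: ignored suffix

kept = df.loc[is_get & ~is_ignored, 'url']
print(kept.tolist())
```

The two masks are combined with & and ~, so no Python-level loop over the rows is needed.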
@hellmoore - yep. Python is great for this kind of processing. :)

The any() form of the condition, which replaces the endswith line in the loop:
    elif any(url.endswith(word) for word in ignore):

The pandas version, followed by its output:
    ignore = ('.jpg', '.js', '.jpeg', '.gif', '.png', '.xml',
              '.json', '.css', '.swf', 'svg', 'ico', '.cur')
    ignore_li = '|'.join(ignore)
    print(df.loc[df.url.str.startswith('GET') & ~(df.url.str[-5:].str.contains(ignore_li)), 'url'])
0 GET http://www.livejournal.com/
1 GET http://pagead2.googlesyndication.com/activ...
2 GET http://pagead2.googlesyndication.com/activ...
3 GET http://rtax.criteo.com/delivery/rta/rta.js...
4 GET http://l-stat.livejournal.net/tmpl/??Widge...
5 GET http://xc3.services.livejournal.com/ljcoun...
7 GET http://montblanc.rambler.ru/mb
8 GET http://awaps.yandex.ru/0/9999/001001.gif?0...
9 GET http://www.tns-counter.ru/V13a***R%3E*sup_...
10 GET http://b.scorecardresearch.com/b?c1=2&c2=1...
11 GET http://l-api.livejournal.com/__api/?callba...
12 GET http://l-api.livejournal.com/__api/?callba...
13 GET http://www.tns-counter.ru/V13a****rambler_...
15 GET http://www.tns-counter.ru/V13a****rambler_...
16 GET http://www.tns-counter.ru/V13a****rambler_...