Is there a generator version of string.split() in Python?
str.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?
No, but it should be easy enough to write one.
Edit: a very simple, incomplete implementation:
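import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        # collect characters up to (and consuming) the next whitespace character
        for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
            r.append(c)
        else:
            if r:
                yield ''.join(r)
                continue
            else:
                # Incomplete: an empty run (end of input, or two whitespace
                # characters in a row) ends the iteration.
                # `return` replaces `raise StopIteration()`, which becomes a
                # RuntimeError inside a generator since Python 3.7 (PEP 479).
                return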
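I don't see any obvious benefit to a generator version of split(). The generator object has to hold the whole string in order to iterate over it, so using a generator does not save any memory.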
If you wanted to write one, it would be fairly easy though:
import string

def gsplit(s, sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)
This is a generator version of split() implemented via re.search():
import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["
assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
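Edit: corrected the handling of surrounding whitespace when no separator characters are given.
It is highly probable that the following uses fairly minimal memory overhead: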
import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
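As a small illustration of how the generator is consumed lazily, the sketch below repeats split_iter and pulls one word with next() before draining the rest. The sample sentence is an assumption chosen for this example.

```python
import re

def split_iter(string):
    # one match object at a time; no list of all words is built
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

words = split_iter("A programmer's RegEx test.")  # sample input (assumption)
first = next(words)   # only the first word has been scanned so far
rest = list(words)    # drain the remaining words
print(first)          # A
print(rest)           # ["programmer's", 'RegEx', 'test']
```

Note that the character class only keeps letters and apostrophes, so the trailing period is dropped rather than attached to the last word.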
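Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1 GB or so), then iterated through the iterable with a for loop (not a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far less than the 1 GB string).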
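A rough way to repeat this kind of measurement on a smaller scale is sketched below. It is not the author's test: it assumes the standard tracemalloc module as the measuring tool, a much smaller string (about 1 MB), and a hypothetical helper named peak_bytes. It compares the peak allocation of iterating over str.split() with iterating over re.finditer().

```python
import re
import tracemalloc

def peak_bytes(consume):
    """Return the peak number of bytes allocated while consume() runs."""
    tracemalloc.start()
    try:
        consume()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

text = "word " * 200_000  # about 1 MB; created before tracing starts

def with_list():
    # str.split() materialises every piece before the loop begins
    for piece in text.split():
        pass

def with_generator():
    # re.finditer() hands out one match object at a time
    for match in re.finditer(r"\S+", text):
        pass

list_peak = peak_bytes(with_list)
gen_peak = peak_bytes(with_generator)
print(list_peak, gen_peak)
```

The exact numbers depend on the interpreter version and platform, so only the relative difference between the two peaks is meaningful.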
More general version:
In reply to the comment "I fail to see the connection with str.split", here is a more general version:
def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
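The idea is that ((?!pat).)* "negates" a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (beginning-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit the end of the string).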
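The "negated group" idiom can be tried in isolation. The sketch below is only an illustration: the helper name up_to_separator and the sample inputs are assumptions, and the separator is passed through re.escape() so that it is treated literally.

```python
import re

def up_to_separator(text, sep):
    """Return the longest prefix of text that stops where sep would start."""
    # (?:(?!sep).)* consumes one character at a time, but only while the
    # separator does not begin at the current position
    pattern = re.compile(f"(?:(?!{re.escape(sep)}).)*")
    return pattern.match(text).group(0)

print(up_to_separator("abc,def", ","))   # abc
print(up_to_separator("a-b--c", "--"))   # a-b
```

A single "-" does not stop the match in the second call because the lookahead only fails where the full two-character separator begins.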
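Demo:
>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>
>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']
>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']
>>> list(splitStr('.......A...b...c....', '\.\.\.'))
['', '', '.A', 'b', 'c', '.']
>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']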
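(One should note that str.split has an ugly behavior: it special-cases sep=None by first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example, where sep="\s+".)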
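(I ran into various bugs (including an internal error) when trying to implement this... Lookbehind would restrict you to fixed-length delimiters, so we don't use it. Almost anything other than the regex above seemed to produce errors in the beginning-of-string and end-of-string edge cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' adds an extraneous empty string at the end; you can look at the edit history for another seemingly correct regex that actually has subtle bugs.)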
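(If you want to implement this yourself for higher performance (although regexes are heavyweight, they most importantly run in C), you would write some code (with ctypes? not sure how to get generators working with it?) following this pseudocode for fixed-length delimiters: hash your delimiter of length L. Keep a running hash of length L as you scan the string, using a running hash algorithm with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem like overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, and so on.)
The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids heavy memory use and avoids relying on the overhead of a regexp when it is not needed.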
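The running-hash pseudocode for fixed-length delimiters described earlier could be sketched in pure Python roughly as follows. This is an illustration only and not code from any of the answers: the function name rolling_hash_split and the base and mod constants are assumptions, and a pure-Python loop like this will usually be slower than str.find() or the re module, both of which run in C.

```python
def rolling_hash_split(text, sep, base=257, mod=(1 << 61) - 1):
    """Yield the pieces of text between occurrences of the string sep."""
    length = len(sep)
    if length == 0:
        raise ValueError("empty separator")
    target = 0
    for ch in sep:                      # hash of the separator itself
        target = (target * base + ord(ch)) % mod
    high = pow(base, length - 1, mod)   # weight of the oldest window character
    h = 0        # hash of the current window
    width = 0    # number of characters currently in the window
    start = 0    # start of the piece that has not been yielded yet
    for i, ch in enumerate(text):
        if width == length:
            # slide the window: drop its oldest character in O(1)
            h = (h - ord(text[i - length]) * high) % mod
            width -= 1
        h = (h * base + ord(ch)) % mod
        width += 1
        # a hash hit is only a candidate; confirm it by comparing the slice
        if width == length and h == target and text[i - length + 1:i + 1] == sep:
            yield text[start:i - length + 1]
            start = i + 1
            h = 0
            width = 0                   # restart so matches never overlap
    yield text[start:]

print(list(rolling_hash_split("abcb", "b")))  # ['a', 'c', '']
```

Resetting the window after each hit keeps matches from overlapping, which mirrors how str.split() scans from left to right.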
[edit 2016-8-2: updated this to optionally support regex separators]
This can be used however you want:
>>> print(list(isplit("abcb", "b")))
['a', 'c', '']
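There is a small cost for seeking within the string each time find() is called or a slice is taken, but it should be minimal because strings are represented as contiguous arrays in memory.
Here is my implementation, which is faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.
I'll just copy the docstring of the main str_split function: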
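Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.
When only one delimiter is supplied, the string is simply split by it.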
import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()
    :param source:
        source string (unicode or bytes)
    :param sep:
        separator to split on.
    :param regex:
        if True, will treat sep as regular expression.
    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
str_split(s, *delims, empty=None)
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''
    str_split('aaa\\t bb c \\n')
        -> 'aaa', 'bb', 'c'
import string

def _str_split_chars(s, delims):
    """Split the string `s` by characters contained in `delims`, including
    the empty parts between two consecutive delimiters"""
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    """Split the string `s` by longest possible sequences of characters
    contained in `delims`"""
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]

def _str_split_word(s, delim):
    """Split the string `s` by the string `delim`"""
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    """Split the string `s` by the string `delim`, not including empty parts
    between two consecutive delimiters"""
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start != i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start < len(s):
        yield s[start:]

def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims) == 1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims) == 0:
        delims = string.whitespace
    # Validate before joining: once the delimiters are joined into a single
    # string, every element is one character and the check could never fail.
    if any(len(d) > 1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims) >= 4 else ''.join(delims)
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    # Python 2 has no keyword-only arguments, so `empty` is read from **kwargs
    empty = kwargs.get('empty')
def split_generator(f, s):
    """
    f is a string, s is the substring we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i+1
        else:
            # yield the piece itself (the original wrapped it in a list)
            yield f[j:i]
            j = i+1
            i = i+1
import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """
    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"
    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)
    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)
    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]
# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]
# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]
# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]
# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
the_text = "100 " * 9999 + "100"

def test_function(method):
    def fn():
        total = 0
        for x in method(the_text):
            total += int(x)
        return total
    return fn
from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])
>>> import more_itertools as mit
>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]
>>> "abcdcba".split("b")
['a', 'cdc', 'a']
def gen_str(some_string, sep):
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]
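def isplit(text, split='\n'):
    if not split:
        raise ValueError("empty separator")
    while text != '':
        end = text.find(split)
        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            # skip the whole separator, not just one character
            text = text[end + len(split):]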
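from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    if not separator:
        raise ValueError("empty separator")
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator, not just one character
        i = j + len(separator)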