Python: how to extract words from a string without using the split function?
How can I extract words from a string when the words are separated by punctuation, spaces, digits, and so on, without using split, replace, or a library such as re? I am still learning Python, and the book I am following suggests finding a solution that does not rely on list and string methods.
Example Input : The@Tt11end
Example Output: ["The", "Tt", "end"]
Here is my attempt so far:
def extract_words(sentence):
    words_list = []
    separator = [",",".",";","'","?","/","<",">","@","!","#","$","%","^","&","*","(",")","-","_","1","2","3","4","5","6","7","8","9"]
    counter= 0
    for i in range(len(sentence)):
        i=counter
        while(is_letter(sentence[i])):
            words+= sentence[i]
            i = i+1
            counter=counter+1
        words_list.append(words)
        words=""
    return words_list
Edit: here is my is_letter() function:
def is_letter(char):
    return ("A" <= char and char <= "Z") or \
           ("a" <= char and char <= "z")
It would be best to use a regular expression here, but if you want something exotic, here it is:
text = "The@Tt11end444sooqa"
delims = [0] + [i + 1 for i, s in enumerate(text) if not s.isalpha()] + [len(text) + 1]
parts = [text[delims[i]: delims[i + 1] - 1] for i in range(len(delims) - 1) if delims[i + 1] - delims[i] != 1]
An expanded version, to make it clearer what is happening:
text = "The@Tt11end444sooqa"
# delims will contain the index just past each non-alphabetic character
delims = [0]  # index 0 acts as the first delimiter (start of string)
for i, s in enumerate(text):  # iterate through "text"
    if not s.isalpha():  # if the character is non-alphabetic, store its index
        delims.append(i + 1)  # add 1 so the delimiter itself is not included in the word
delims += [len(text) + 1]  # add an end-of-string delimiter so the last part is not missed
# parts will contain the words extracted from "text"
parts = []
for i in range(len(delims) - 1):  # iterate over "delims" by index
    # skip the part when two delimiters are adjacent (the word would be empty)
    if delims[i + 1] - delims[i] != 1:
        substr = text[delims[i]: delims[i + 1] - 1]  # copy the substring between delimiters
        parts.append(substr)
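For comparison, the regular-expression route that this answer recommends might look like the sketch below, assuming the no-library constraint is lifted. The function name extract_words_re is made up for this illustration.

```python
import re

def extract_words_re(sentence):
    # [A-Za-z]+ matches each maximal run of ASCII letters
    return re.findall(r"[A-Za-z]+", sentence)

print(extract_words_re("The@Tt11end444sooqa"))  # ['The', 'Tt', 'end', 'sooqa']
```

Note that this pattern only matches ASCII letters, whereas str.isalpha() in the code above also accepts non-ASCII letters, so the two can differ on accented input.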
This code works:
def extract_words(sentence):
    sentence = list(sentence)
    words_list = []
    separator = [",",".",";","'","?","/","<",">","@","!","#","$","%","^","&","*","(",")","-","_","1","2","3","4","5","6","7","8","9"]
    bufferS = []
    for i in range(len(sentence)):
        if sentence[i] not in separator:
            bufferS.append(sentence[i])  # collect the characters of the current word
        else:
            words_list.append(''.join(bufferS))  # a separator ends the current word
            bufferS = []
    words_list.append(''.join(bufferS))  # flush the last word
    words_list = [x for x in words_list if x != '']  # drop empty strings left by adjacent separators
    return words_list
It returns
['aaaaaaa', 'bbbbbbb', 'ccccc', 'dddd']
without using any libraries.
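One limitation worth checking: the separator list contains the digits 1 to 9 but neither "0" nor a space, so those characters are treated as part of a word. The sketch below uses the same buffering logic; the separators parameter and the SEPARATORS name are hypothetical additions, introduced only so the list can be extended for the comparison.

```python
SEPARATORS = [",", ".", ";", "'", "?", "/", "<", ">", "@", "!", "#", "$", "%",
              "^", "&", "*", "(", ")", "-", "_",
              "1", "2", "3", "4", "5", "6", "7", "8", "9"]

def extract_words(sentence, separators=SEPARATORS):
    words_list = []
    buffer = []
    for ch in sentence:
        if ch not in separators:
            buffer.append(ch)
        else:
            words_list.append(''.join(buffer))
            buffer = []
    words_list.append(''.join(buffer))
    # drop the empty strings produced by adjacent separators
    return [w for w in words_list if w != '']

print(extract_words("The@Tt11end"))                         # ['The', 'Tt', 'end']
print(extract_words("ab0cd ef"))                            # ['ab0cd ef']
print(extract_words("ab0cd ef", SEPARATORS + ["0", " "]))   # ['ab', 'cd', 'ef']
```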
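Your problem is that you set i to counter on every iteration, and counter never advances past the first non-letter.
The for loop keeps running until range(len(sentence)) is exhausted, but at the start of each pass i is reset to the position where is_letter first failed, which is i=3 in this case.
To make progress, i would need to become 4, but counter is still 3, because counter is only incremented inside the while(is_letter) block. An if/else is a better fit here, like this: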
def extract_words(sentence):
    words_list = []
    words = ""
    for i in range(len(sentence)):
        if is_letter(sentence[i]):
            words += sentence[i]
        else:
            if words != "":
                words_list.append(words)
                words = ""
    if words != "":
        words_list.append(words)
    return words_list

def is_letter(char):
    return ("A" <= char and char <= "Z") or \
           ("a" <= char and char <= "z")

if __name__ == '__main__':
    print(extract_words("The@Tt11end"))
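In this setup the loop uses only i as its incrementing variable, since it is already a for loop, and reassigning i independently of the for loop can cause problems, as you have seen.
Each time the current character is a letter, it is added to the words variable. Then, when the next character is a symbol, the word is appended to the list and the symbol or digit is ignored.
Finally, if two or more symbols are adjacent (which is what gave you a list of empty strings, ''), the code checks whether words already contains any characters; if it does not, it simply moves on to the next character.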
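To see the stuck counter concretely, here is a sketch of the loop from the question with one assumption added: words is initialized to "" before the loop, because the posted code never defines it and would otherwise raise UnboundLocalError on the first letter. The name extract_words_buggy is made up for this illustration.

```python
def is_letter(char):
    return ("A" <= char <= "Z") or ("a" <= char <= "z")

def extract_words_buggy(sentence):
    words_list = []
    words = ""                          # assumption: added so the function runs at all
    counter = 0
    for i in range(len(sentence)):
        i = counter                     # discards the for loop's own value of i
        while is_letter(sentence[i]):
            words += sentence[i]
            i = i + 1
            counter = counter + 1       # only advances while letters are being read
        words_list.append(words)        # appended even when words is empty
        words = ""
    return words_list

result = extract_words_buggy("The@Tt11end")
print(result)  # 'The' followed by ten empty strings
```

After the first pass counter stays at 3 (the position of "@"), so every later pass appends an empty string. With an input made only of letters, the while loop would instead run past the end of the string and raise IndexError.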
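With only minimal changes to your current code, you can iterate over the string one character at a time and turn the separator list you already have into a set for O(1) lookups. This saves you from having to worry about incrementing multiple counter variables: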
def extract_words(sentence):
    separator_set = set([",",".",";","'","?","/","<",">","@","!","#","$","%","^","&","*","(",")","-","_","1","2","3","4","5","6","7","8","9"])
    words_list = []
    word = []
    for c in sentence:
        if c not in separator_set:
            word.append(c)
        else:
            if len(word) > 0:
                words_list.append(''.join(word))
                word = []
    if len(word) > 0:
        words_list.append(''.join(word))
    return words_list

def is_letter(char):
    return ("A" <= char and char <= "Z") or ("a" <= char and char <= "z")

def main():
    print(extract_words("The@Tt11end"))

if __name__ == '__main__':
    main()
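If importing from the standard library is acceptable, the hand-written separator list could be replaced by the constants in the string module. This is only a sketch under that assumption; it treats every ASCII punctuation mark, digit, and whitespace character as a separator, which is broader than the original list, and the SEPARATORS name is introduced here for illustration.

```python
import string

# Assumption: all ASCII punctuation, digits, and whitespace separate words
SEPARATORS = set(string.punctuation + string.digits + string.whitespace)

def extract_words(sentence):
    words_list = []
    word = []
    for c in sentence:
        if c not in SEPARATORS:
            word.append(c)
        elif word:
            # a separator ends the current word, if there is one
            words_list.append(''.join(word))
            word = []
    if word:
        words_list.append(''.join(word))
    return words_list

print(extract_words("The@Tt11end"))        # ['The', 'Tt', 'end']
print(extract_words("one, two; 30 four"))  # ['one', 'two', 'four']
```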
Your code got into a muddle with indexing into the given sentence.
You only need to iterate over the characters of the sentence:
def is_letter(char):
    return ("A" <= char <= "Z") or ("a" <= char <= "z")

def extract_words(sentence):
    word = ""
    words_list = []
    for ch in sentence:
        if is_letter(ch):
            word += ch
        else:
            if word:
                words_list.append(word)
                word = ""
    if word:
        words_list.append(word)
    return words_list

print(extract_words('The@,Tt11end'))
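The code iterates over each character in the sentence. If the character is a letter, it is added to the current word. If it is not, the current word (if any) is added to the output list. Finally, if the last character was a letter, there is a leftover word, which is also added to the output.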
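Since the question mentions avoiding list and string methods, the same loop can be sketched with plain + concatenation in place of append. The function name extract_words_no_methods is made up for this illustration, and this is one possible reading of the book's constraint, not the only one.

```python
def is_letter(char):
    return ("A" <= char <= "Z") or ("a" <= char <= "z")

def extract_words_no_methods(sentence):
    word = ""
    words_list = []
    for ch in sentence:
        if is_letter(ch):
            word = word + ch
        elif word != "":
            words_list = words_list + [word]  # list concatenation instead of append
            word = ""
    if word != "":
        words_list = words_list + [word]      # flush the final word
    return words_list

print(extract_words_no_methods("The@,Tt11end"))  # ['The', 'Tt', 'end']
print(extract_words_no_methods("@@11"))          # []
```

Concatenation builds a new list each time, so append is the better choice once the exercise's restriction no longer applies.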
Your code as posted doesn't run.
OK, this is more useful, but it still doesn't explain what is wrong with the OP's code. Does this code work with the test data in the OP?
@quamrana It works now.
@ggorlen Is it better now?