Python: using regular expressions to "clean" a list of names
I'm using regular expressions to clean up a list of names so that they look normal. Suppose the list is:
000000AAAAAARob Alsod ## Notice multiple 0's and A's?
AAAPerson Person ## Here, too
Jeff the awesome Guy ## Four words...
Jenna DEeath ## A name like this can exist.
GEOFFERY EVERDEEN ## All caps
shy guy ## All lowercase
Theone Normalperson ## Example name. This one is fine.
Guywith Whitespace ## Trailing or leading whitespace is a nono.
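As you can see, people's names are not formatted properly, so I need a program that strips out everything that doesn't belong. This includes:
Digits at the start of a name
Uppercase letters that are not followed by lowercase letters, e.g. AAAAAAAJosh
Names that are entirely uppercase
Anything that doesn't start with an uppercase letter, e.g. josh
Trailing and leading whitespace
I think I just need to filter these out. The final product should look like this: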
Rob Alsod ## No more 0's and A's.
Person Person ## No more leading A's (or other letters).
Jeff Guy ## No lowercase words in his name.
Jenna DEeath ## HASN'T removed the D in the middle.
## Name removed as it was all uppercase.
## Name removed as it was all lowercase.
Theone Normalperson ## Nothing changed.
Guywith Whitespace ## Removed whitespace.
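The rules and expected output above can be sketched as a single function. This is only one possible reading of the requirements, not code from the thread: the name `clean_name` and the choice to return `None` for a rejected name are assumptions made for illustration. It also assumes that junk capitals are stripped only from the very start of the name, which is what the `Jenna DEeath` example suggests.

```python
import re

def clean_name(raw):
    """Return a cleaned name, or None if nothing valid remains (hypothetical helper)."""
    name = raw.strip()                                  # leading/trailing whitespace
    name = re.sub(r'^[0-9]+', '', name)                 # digits at the start of the name
    # Leading capitals that sit in front of a properly capitalized word,
    # e.g. the A's in "AAAPerson". Only the start of the name is touched.
    name = re.sub(r'^[A-Z]+(?=[A-Z][a-z])', '', name)
    # Keep words that start with a capital and are not entirely uppercase.
    words = [w for w in name.split() if w[0].isupper() and not w.isupper()]
    return ' '.join(words) or None

for raw in ['000000AAAAAARob Alsod', 'Jeff the awesome Guy', 'GEOFFERY EVERDEEN']:
    print(repr(clean_name(raw)))
```

Under these assumptions the loop prints 'Rob Alsod', 'Jeff Guy' and None, matching the expected output listed above.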
EDIT: Sorry about that. Here is my code so far:
# Enter your code for "Name Cleaning" here.
import re
namenum = []
num = 0
for sen in open('file.txt'):
    namenum += [sen.split(',')]
    namenum[num][0] = re.sub(r'\s[a-z]+', '', namenum[num][0])
    namenum[num][0] = re.sub(r'^([0-9]*)', '', namenum[num][0])
    namenum[num][0] = re.sub(r'^[A-Z]*?\s[A-Z]*?$', '', namenum[num][0])
    namenum[num][0] = re.sub(r'[^a-zA-Z ][A-Z]*(?=[A-Z])', '', namenum[num][0])
    namenum[num][0] = re.sub(r'\b[a-z]+\b', '', namenum[num][0])
    namenum[num][0] = re.sub(r'^\s*', '', namenum[num][0])
    namenum[num][0] = re.sub(r'\s*$', '', namenum[num][0])
    if namenum[num][0] == '':
        namenum[num][0] = 'Invalid Name'
    num += 1
for i in range(len(namenum)):
    namenum[i][1] = int(namenum[i][1].strip())
namenum = sorted(namenum, key=lambda item: (-item[1], item[0]))
for i in range(0, len(namenum)):
    print(namenum[i][0]+','+str(namenum[i][1]))
It does half the job, but for some reason it misses some things.
Here is the output:
AAAAAARob Alsod
AAAPerson Person
Guywith Whitespace
Invalid Name
Invalid Name
Jeff Guy
Jenna DEeath
Theone Normalperson
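I also tried entering a name like harry hamilton, and it returned harry when it should have removed the name entirely.

This regex removes all of the invalid examples. None of your examples actually needs the for loop that filters banned words, but I think you will want it. This code simply drops every invalid name from the list, but it should be easy to modify it to ask the user for new input instead. It also won't tell you why a name is invalid, but you could just display all of the rules.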
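from re import match

# Words that may not appear anywhere in a name (compared case-insensitively).
bannedWords = ('really', 'awesome')

def rules(name):
    # Reject any name that contains a banned word.
    for badWord in bannedWords:
        if name.lower().find(badWord) >= 0:
            return False
    # One or more capitalized words; each may contain inner capitals (e.g. MacIntyre).
    return match(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$', name)

names = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy', 'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy', 'Theone Normalperson', ' Guywith Whitespace', 'Someone Middlename MacIntyre', '', 'Jack Really Awesome']
results = filter(rules, names)
print(list(results))  # filter() returns an iterator in Python 3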
Result:
['Theone Normalperson', 'Someone Middlename MacIntyre']
Without the for loop:
from re import match

def rules(name):
    # One or more capitalized words; each may contain inner capitals (e.g. MacIntyre).
    return match(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$', name)

names = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy', 'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy', 'Theone Normalperson', ' Guywith Whitespace', 'Someone Middlename MacIntyre', '', 'Jack Really Awesome']
results = filter(rules, names)
print(list(results))  # filter() returns an iterator in Python 3
Result:
['Theone Normalperson', 'Someone Middlename MacIntyre', 'Jack Really Awesome']
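The answer notes that its filter does not say why a name was rejected. A rough sketch of how that could be reported follows. The function name `problems`, the reason strings, and the individual checks are assumptions made for illustration; only the pattern and the banned-word tuple are taken from the answer.

```python
import re

# Pattern and banned words copied from the answer above.
NAME_PATTERN = re.compile(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$')
BANNED_WORDS = ('really', 'awesome')

def problems(name):
    """Return a list of reasons why `name` is rejected (empty if it is accepted)."""
    reasons = []
    if name != name.strip():
        reasons.append('leading or trailing whitespace')
    if re.match(r'\s*[0-9]', name):
        reasons.append('starts with digits')
    if name.isupper():
        reasons.append('all uppercase')
    for word in BANNED_WORDS:
        if word in name.lower():
            reasons.append('contains banned word: ' + word)
    # Fall back to the general pattern only when no specific rule fired.
    if not reasons and not NAME_PATTERN.match(name):
        reasons.append('does not match the name pattern')
    return reasons

for name in ['Theone Normalperson', 'GEOFFERY EVERDEEN', 'shy guy']:
    print(name, '->', problems(name))
```

Checking the specific rules first keeps the messages informative; the regex is used only as a catch-all, so a name such as shy guy is reported as not matching the pattern rather than silently dropped.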
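Actually, you have to try something yourself first. Where is your code so far? We're not a free code factory. -1, close.
Sorry. I've edited the post.
What does it miss?
Just out of curiosity, why are you doing this? Names are complicated.
I'm creating a database. It contains the names of the people in our system. Unfortunately, the names were entered by hand, which led some people to joke around and write things like Robert LeAwesome Alsod instead of their normal name.
By the way, @Michelle, the post has been edited.
Thank you very much. This should work.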