Python Shmython - "Y" is it so hard?

Tags: python, regex, nlp

I have written a program to implement "shm" reduplication. The rule is basically this: if a word starts with a consonant (or a cluster of consonants), you strip it and add "shm"; if it starts with a vowel, you just add "shm". You also put the whole thing after the existing word.

The problem is the letter Y, because sometimes it is a consonant and sometimes it is a vowel. I want you to become you-shmou, but I want Python to become Python-Shmython. How can I do this?

Here is my code so far:
import re


def word_shmord(word):
    orig = word
    # Match the casing of the original word in the "shm" prefix.
    if word.isupper():
        prefix = "SHM"
    elif word.istitle():
        word = word.lower()
        prefix = "Shm"
    else:
        prefix = "shm"
    # Everything before the first vowel is replaced by the prefix.
    match = re.search("[aeiou]", word, re.IGNORECASE)
    if match is None:
        # No a/e/i/o/u at all (e.g. "by" or a number): leave the token alone.
        return orig
    new = prefix + word[match.start():]
    return "{}-{}".format(orig, new)


text = """
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
"""

text_shmext = re.sub(r"\w+", lambda m: word_shmord(m.group(0)), text)
print(text_shmext)
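To make the Y problem concrete, here is a minimal sketch that isolates the vowel search used above. The helper name `first_vowel_index` is hypothetical and exists only for this illustration.

```python
import re


def first_vowel_index(word):
    """Return the index of the first a/e/i/o/u in word, or None if there is none."""
    match = re.search("[aeiou]", word, re.IGNORECASE)
    return match.start() if match else None


for word in ("you", "python", "rhythm"):
    index = first_vowel_index(word)
    # The part of the word that would be kept after the "shm" prefix.
    tail = word[index:] if index is not None else None
    print(word, index, tail)
```

With this search, `you` keeps `ou` as wanted, but `python` only keeps `on`, so the result would be Python-Shmon rather than Python-Shmython, and a word such as `rhythm` has no match at all.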
I found this problem interesting, so I coded up some linguistic rules for it (or should I say shmoblem).
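Input: The quick brown fox jumps over the lazy dog
Output: shmuick shmown shmox shmumps shmover shmazy shmog

Not really. I want Skye Shmye. Maybe only the consonants at the start of a word matter. But do we care about the position of the y in backyard? It just becomes shmackyard,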
import re
import string

# Requires the NLTK stopwords corpus and the tokenizer models used by
# word_tokenize (install them with nltk.download).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.sonority_sequencing import SyllableTokenizer

stop = stopwords.words('english')
tk = SyllableTokenizer()


def word_shmord(word):
    # Leave short stop words, non-alphanumeric tokens and existing "shm" words alone.
    if (len(word) < 4 and word.lower() in stop) or not word.isalnum() or word.lower().startswith('shm'):
        return word

    # Choose the prefix casing before any 'y' is masked as '#':
    # str.istitle() is False for a string such as "P#thon".
    if word.isupper():
        prefix = "SHM"
    elif word.istitle():
        word = word.lower()
        prefix = "Shm"
    else:
        prefix = "shm"

    # A 'y' that acts as a vowel is temporarily replaced by '#'.
    if 'y' in word:
        y = word.find('y')
        # Y is considered to be a vowel if the word has no other vowel
        if len(re.findall("[aeiou]", word, re.IGNORECASE)) == 0 and word.count('y') == 1:
            word = word[:y] + '#' + word[y + 1:]
        # or if the letter is at the end of a word
        if word[-1] == 'y':
            word = word[:-1] + '#'
        # or middle/end of syllable
        if word.find('y') != -1:
            syll = tk.tokenize(word)
            for i, s in enumerate(syll):
                snew = s[:-1] + '#' if s[-1] == 'y' else s
                y = snew.find('y')
                if len(snew) // 2 == y:
                    snew = snew[:y] + '#' + snew[y + 1:]
                syll[i] = snew
            word = ''.join(syll)

    # Replace everything before the first vowel (or vowel-like 'y') with the prefix.
    vowels = re.search("[aeiou#]", word, re.IGNORECASE)
    if not vowels:
        return word
    position = vowels.start()
    new = prefix + word[position:].replace('#', 'y')
    return new


text = "The quick brown fox jumps over the lazy dog"
text_shmext = [word_shmord(x) for x in word_tokenize(text)]
# join strings, attaching punctuation to the preceding word
text_shmext = "".join([" " + i if i not in string.punctuation else i for i in text_shmext]).strip()
print(text_shmext)
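For comparison, the idea raised in the discussion that perhaps only the consonants at the start of the word matter can be sketched with a regex alone, without NLTK. This is not the syllable-based method above. It rests on the assumption that `y` acts as a vowel whenever it directly follows a consonant and as a consonant at the start of a word, which will misjudge some words. The names `FIRST_VOWEL` and `shm_simple` are made up for this illustration.

```python
import re

# Assumption: 'y' counts as a vowel only when it directly follows a
# consonant letter; a word-initial 'y' is treated as a consonant.
FIRST_VOWEL = re.compile(r"[aeiou]|(?<=[b-df-hj-np-tv-z])y", re.IGNORECASE)


def shm_simple(word):
    """Return 'word-shmord', or the word unchanged if no vowel is found."""
    # Mirror the casing of the original word in the prefix.
    if word.isupper():
        prefix, base = "SHM", word
    elif word.istitle():
        prefix, base = "Shm", word.lower()
    else:
        prefix, base = "shm", word
    match = FIRST_VOWEL.search(base)
    if match is None:
        return word
    # Drop the leading consonants and put the prefix in their place.
    return "{}-{}{}".format(word, prefix, base[match.start():])


for w in ("you", "Python", "backyard", "Skye"):
    print(shm_simple(w))
```

Under that assumption, `you` gives you-shmou, `Python` gives Python-Shmython, `backyard` gives backyard-shmackyard and `Skye` gives Skye-Shmye. A `y` that follows a vowel, as in `backyard`, is never reached because an earlier vowel matches first.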