Python: generating unigrams that start with a digit from Google Books using a regular expression


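The code below builds pickled dictionaries of Google Books unigrams. It produces 26 dictionaries, one for the words starting with each of the letters a, b, c, …, z.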

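import pickle
import re

import numpy as np
# readline_google_store is assumed to come from the google_ngram_downloader package
from google_ngram_downloader import readline_google_store

# Match words made up only of ASCII letters
p = re.compile(r'^[a-z]*$', re.IGNORECASE)
el = 'abcdefghijklmnopqrstuvwxyz'

for l in el:
    fname, url, records = next(readline_google_store(ngram_len=1, indices=l))
    unigrams = {}
    for r in records:
        if r.year >= 2000:
            w = r.ngram.lower()
            if p.match(w):
                # Accumulate [match_count, volume_count] across years
                if w in unigrams:
                    unigrams[w] += np.array([r.match_count, r.volume_count])
                else:
                    unigrams[w] = np.array([r.match_count, r.volume_count])
    # pickle needs a binary file handle ('wb') on Python 3
    with open(str(l) + '_unigram_dict.pickle', 'wb') as f:
        pickle.dump(unigrams, f)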

The output looks like this:

{'word': [total_match_count, total_volume_count]}

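I want to change it so that I get only a dictionary of words that start with a digit. The regular expression should capture any pattern that starts with a digit (0 through 9) followed by arbitrary characters. I tried

re.compile(r'^(?:\d*\.)?\d+$', re.IGNORECASE)

but it only captures purely numeric words. It does not capture words such as '00161_VERB', '002200_VERB' or '01-73'.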

Edit: the input (records) has the following format:

ngram TAB year TAB match_count TAB page_count NEWLINE
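As a rough illustration of that record layout, here is a minimal sketch that splits one TAB-separated line into its four fields. The `Record` type, the `parse_line` helper and the sample values are hypothetical and only serve the example; they are not part of the original code or of any library.

```python
from collections import namedtuple

# Hypothetical record type mirroring the four fields named in the format line.
Record = namedtuple("Record", ["ngram", "year", "match_count", "page_count"])


def parse_line(line):
    """Split one TAB-separated line into a Record with integer counts."""
    ngram, year, match_count, page_count = line.rstrip("\n").split("\t")
    return Record(ngram, int(year), int(match_count), int(page_count))


# Made-up sample line, for illustration only.
sample = "002200_NUM\t2005\t3\t2\n"
print(parse_line(sample))
```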

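I want the output to be a dictionary whose keys are the ngrams that start with a digit and whose values are lists of the form ['sum of match_count over the years', 'sum of page_count over the years'], like this: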

{'ngrams':['sum of match_count over the years', 'sum of page_count over the years']}

Problem solved:

re.compile(r'^\d\S+')

worked.
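To see how that pattern behaves in the aggregation step, here is a small self-contained sketch. It is not the original script: the records are made-up tuples instead of data from the Google Books store, plain lists stand in for the NumPy arrays, and the names `starts_with_digit` and `sum_counts` are hypothetical.

```python
import re

# The pattern from the solution: a digit followed by one or more non-space characters.
starts_with_digit = re.compile(r'^\d\S+')


def sum_counts(records, min_year=2000):
    """Sum match and page counts per lowercased ngram that starts with a digit."""
    totals = {}
    for ngram, year, match_count, page_count in records:
        if year < min_year:
            continue  # same year filter as the original loop
        w = ngram.lower()
        if starts_with_digit.match(w):
            entry = totals.setdefault(w, [0, 0])
            entry[0] += match_count
            entry[1] += page_count
    return totals


# Made-up records in (ngram, year, match_count, page_count) order.
records = [
    ("002200_NUM", 2001, 3, 2),
    ("002200_NUM", 2005, 4, 1),
    ("01-73", 2003, 1, 1),
    ("word", 2004, 9, 9),        # does not start with a digit
    ("00161_VERB", 1999, 5, 5),  # dropped by the year filter
]
result = sum_counts(records)
print(result)
```

Because the keys are lowercased first, as in the original loop, '002200_NUM' ends up under the key '002200_num'. Note also that `\S+` requires at least one character after the leading digit, so a single-character ngram such as '7' would not be matched by this pattern.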

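Please give sample input and the desired output, including the outliers that are not being captured. That will help us answer the question better =)

Try

re.compile(r'^\d')

re.compile(r'^\d') did not capture ngrams like '002200_NUM'. I have just added sample input and the desired output.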