Frequency Distribution

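During text processing we often need to count how often each word occurs in a body of text. One way to do this is to apply the word_tokenize() function, append the resulting tokens to a list, and count each token's occurrences in that list, as the following program shows.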

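from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

# Requires the Gutenberg corpus and the tokenizer models to have been
# fetched beforehand with nltk.download().
sample = gutenberg.raw("blake-poems.txt")

token = word_tokenize(sample)
wlist = []

# Keep only the first 50 tokens so the output stays short.
for i in range(50):
    wlist.append(token[i])

# For each token, count how many times it occurs among those 50 tokens.
wordfreq = [wlist.count(w) for w in wlist]
# zip() is lazy in Python 3, so wrap it in list() before printing.
print("Pairs\n" + str(list(zip(wlist, wordfreq))))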

Running the program above produces the following output:

[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1), ('BOOK', 1), ('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1), ('Piping', 2), ('down', 1), ('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2), ('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1), ('a', 2), ('child', 1), (',', 3), ('And', 1), ('he', 1), ('laughing', 1), ('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]
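Calling list.count() once per token rescans the whole list every time, which gets slow on longer texts. As a minimal sketch of a single-pass alternative, the example below uses collections.Counter from the standard library. The token list is hand-written for illustration and only stands in for the output of word_tokenize(). nltk.FreqDist is a subclass of Counter, so the same lookups should carry over if you prefer to stay within NLTK.

```python
from collections import Counter

# Hypothetical, hand-written tokens standing in for word_tokenize() output.
tokens = ["Piping", "down", "the", "valleys", "wild", ",",
          "Piping", "songs", "of", "pleasant", "glee", ","]

# One pass over the tokens; each distinct token gets a single entry.
freq = Counter(tokens)

print(freq["Piping"])        # count for one token
print(freq["the"])           # tokens seen once have a count of 1
print(freq.most_common(3))   # the three most frequent tokens with their counts
```

Unlike the list of pairs above, a Counter stores each distinct token only once, so repeated tokens such as "OF" do not show up several times in the result.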

Conditional Frequency Distribution

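A conditional frequency distribution is used when we want to count words that meet specific criteria across a set of texts. It keeps a separate frequency distribution for each condition; in the example below, the condition is the genre of a text in the Brown corpus.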

import nltk
from nltk.corpus import brown

# Build (condition, sample) pairs: the condition is the genre,
# the sample is each word that occurs in that genre.
cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))

# Restrict the table to three genres and four modal verbs.
categories = ['hobbies', 'romance', 'humor']
searchwords = ['may', 'might', 'must', 'will']
cfd.tabulate(conditions=categories, samples=searchwords)

Running the program above produces the following output:

          may might  must  will
hobbies   131    22    83   264
romance    11    51    45    43
  humor     8     8     9    13
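To make the idea behind the table concrete, here is a minimal sketch of a conditional frequency count built with the standard library only. It is not how NLTK implements ConditionalFreqDist; it only illustrates the "one counter per condition" structure. The (condition, word) pairs are hypothetical and hand-written, so the counts do not match the Brown corpus figures above.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, word) pairs; in the example above the
# condition is a Brown corpus genre.
pairs = [
    ("hobbies", "will"), ("hobbies", "may"), ("hobbies", "will"),
    ("romance", "might"), ("romance", "will"),
    ("humor", "must"),
]

# One Counter per condition, created the first time a condition is seen.
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

# Print a small table: one row per condition, one column per sample word.
samples = ["may", "might", "must", "will"]
print(" " * 8 + "".join(f"{s:>6}" for s in samples))
for condition in ["hobbies", "romance", "humor"]:
    row = "".join(f"{cfd[condition][s]:>6}" for s in samples)
    print(f"{condition:>8}{row}")
```

Looking up a word that never occurred under a condition returns 0 rather than raising an error, which is why the table can be filled in for every genre and word combination.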

  

About the author: 洁姐我爱你
