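Python Natural Language Processing (4): Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases together with associated information, such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. Below are several of the lexical resources included in NLTK.

1. Wordlist Corpora

NLTK includes some corpora that are nothing more than wordlists. The Words Corpus is the /usr/dict/words file from Unix, which is used by some spell checkers. We can use it to find unusual or misspelled words in a text corpus.

1) Filtering words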

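 >>> import nltk

 >>> def unusual_words(text):

 ...     # vocabulary of the text: lowercased alphabetic tokens only

 ...     text_vocab=set(w.lower() for w in text if w.isalpha())

 ...     # vocabulary of the Words Corpus, lowercased for comparison

 ...     english_vocab=set(w.lower() for w in nltk.corpus.words.words())

 ...     # words that occur in the text but not in the wordlist

 ...     unusual=text_vocab.difference(english_vocab)

 ...     return sorted(unusual)

 ...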

 >>> dif1=unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))

 >>> dif1[:20]

 ['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused',

 'abuses', 'accents', 'accepting', 'accommodations', 'accompanied', 'accounted',

 'accounts', 'accustomary', 'aches', 'acknowledging', 'acknowledgment',

 'acknowledgments', 'acquaintances', 'acquiesced']

 >>> dif2=unusual_words(nltk.corpus.nps_chat.words())

 >>> dif2[:20]

 ['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack',

 'acros', 'actualy', 'adams', 'adds', 'adduser', 'adjusts', 'adoted',

 'adreniline', 'ads', 'adults', 'afe', 'affairs', 'affari']

 >>>

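Here text_vocab and english_vocab are sets, and text_vocab.difference(english_vocab) is the set difference text_vocab - english_vocab, that is, every word in text_vocab that does not appear in english_vocab.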
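The set difference used above can be tried without downloading any corpus. The following is a minimal sketch on made-up data: `text_tokens` and `reference_words` are hypothetical stand-ins for a tokenized text and for the Words Corpus.

```python
# Toy data standing in for a tokenized text and a reference wordlist
# (both are hypothetical; the real example uses NLTK corpora).
text_tokens = ["The", "quick", "brwn", "fox", "jumps", "over", "teh", "dog", "42"]
reference_words = ["the", "quick", "brown", "fox", "jumps", "over", "dog"]

# Keep alphabetic tokens only and lowercase them, as unusual_words() does.
text_vocab = {w.lower() for w in text_tokens if w.isalpha()}
english_vocab = {w.lower() for w in reference_words}

# difference() and the - operator are equivalent for sets.
unusual = text_vocab.difference(english_vocab)
same_result = text_vocab - english_vocab

unusual_sorted = sorted(unusual)
print(unusual_sorted)
```

The token "42" is dropped by `isalpha()`, so only the two misspellings remain in `unusual_sorted`.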

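2. Stopwords Corpus

This corpus contains high-frequency words such as the, to, and and, which we sometimes want to filter out of a document before further processing. Stopwords usually carry little lexical content, and their presence makes it harder to tell texts apart.

1) The stopword list in NLTK: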

 >>> from nltk.corpus import stopwords

 >>> stopwords.words('english')

 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',

 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',

 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',

 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',

 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',

 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',

 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',

 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',

 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',

 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',

 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',

 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',

 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm',

 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn',

 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn',

 'weren', 'won', 'wouldn']

 >>>

2) Filtering out stopwords

 >>>

 >>> def content_fraction(text):

 ...     stopwords=set(nltk.corpus.stopwords.words('english'))  # a set makes the membership test below fast

 ...     content=[w for w in text if w.lower() not in stopwords]

 ...     print (content[:50])

 ...     return len(content)/len(text)

 ...

 >>> content_fraction(nltk.corpus.reuters.words())

 ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'U', '.', '.-', 'JAPAN', 'RIFT',

 'Mounting', 'trade', 'friction', 'U', '.', '.', 'Japan', 'raised', 'fears',

 'among', 'many', 'Asia', "'", 'exporting', 'nations', 'row', 'could', 'inflict',

 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'officials',

 'said', '.', 'told', 'Reuter', 'correspondents', 'Asian', 'capitals', 'U', '.',

 '.', 'Move', 'Japan', 'might', 'boost', 'protectionist']

 0.735240435097661

 >>>

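Wordlists are also useful for solving word puzzles: a program iterates over every word in the list and checks whether it meets the puzzle's conditions.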
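As an illustration of the puzzle idea, here is a minimal sketch that looks for words which can be spelled from a given bag of letters, use no letter more often than it is available, and contain one obligatory letter. The function name `puzzle_words`, the puzzle letters, and the tiny `sample_words` list are all assumptions made for this example; with NLTK installed, `nltk.corpus.words.words()` could be passed as the wordlist instead.

```python
from collections import Counter

def puzzle_words(letters, obligatory, wordlist, min_len=6):
    """Return the words that can be spelled from `letters` (each letter used
    no more often than it occurs) and that contain `obligatory`."""
    available = Counter(letters)
    found = []
    for w in wordlist:
        if len(w) < min_len or obligatory not in w:
            continue
        # Counter subtraction keeps only positive counts, so an empty result
        # means the word never needs more copies of a letter than we have.
        if not (Counter(w) - available):
            found.append(w)
    return found

# Hypothetical mini wordlist used in place of a real corpus.
sample_words = ["govern", "governor", "grovel", "ignore", "revolving",
                "green", "ring", "linger", "eleven"]

solutions = puzzle_words("egivrvonl", "r", sample_words)
print(solutions)
```

"governor" is rejected because it needs two o's and two r's, while "green" and "ring" are too short and "eleven" lacks the obligatory letter.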

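3. Names Corpus

This corpus contains about 8,000 first names categorized by gender. The male and female names are stored in separate files.

1) The following example finds the names that appear in both files, that is, names that are ambiguous for gender.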

 >>> names=nltk.corpus.names

 >>> names.fileids()

 ['female.txt', 'male.txt']

 >>> male_name=names.words('male.txt')

 >>> female_name=names.words('female.txt')

 >>> [w for w in male_name if w in female_name]

 ['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis'

 , 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',

 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', 'Barrie',

  'Barry', 'Beau', 'Bennie', 'Benny', 'Bernie', 'Bert', 'Bertie', 'Bill', 'Billie

 ', 'Billy', 'Blair', 'Blake', 'Bo', 'Bobbie', 'Bobby', 'Brandy', 'Brett', 'Britt

 ', 'Brook', 'Brooke', 'Brooks', 'Bryn', 'Cal', 'Cam', 'Cammy', 'Carey', 'Carlie'

 , 'Carlin', 'Carmine', 'Carroll', 'Cary', 'Caryl', 'Casey', 'Cass', 'Cat', 'Ceci

 l', 'Chad', 'Chris', 'Chrissy', 'Christian', 'Christie', 'Christy', 'Clair', 'Cl

 aire', 'Clare', 'Claude', 'Clem', 'Clemmie', 'Cody', 'Connie', 'Constantine', 'C

 orey', 'Corrie', 'Cory', 'Courtney', 'Cris', 'Daffy', 'Dale', 'Dallas', 'Dana',

 'Dani', 'Daniel', 'Dannie', 'Danny', 'Darby', 'Darcy', 'Darryl', 'Daryl', 'Deane

 ', 'Del', 'Dell', 'Demetris', 'Dennie', 'Denny', 'Devin', 'Devon', 'Dion', 'Dion

 is', 'Dominique', 'Donnie', 'Donny', 'Dorian', 'Dory', 'Drew', 'Eddie', 'Eddy',

 'Edie', 'Elisha', 'Emmy', 'Erin', 'Esme', 'Evelyn', 'Felice', 'Fran', 'Francis',

  'Frank', 'Frankie', 'Franky', 'Fred', 'Freddie', 'Freddy', 'Gabriel', 'Gabriell

 ', 'Gail', 'Gale', 'Gay', 'Gayle', 'Gene', 'George', 'Georgia', 'Georgie', 'Geri

 ', 'Germaine', 'Gerri', 'Gerry', 'Gill', 'Ginger', 'Glen', 'Glenn', 'Grace', 'Gr

 etchen', 'Gus', 'Haleigh', 'Haley', 'Hannibal', 'Harley', 'Hazel', 'Heath', 'Hen

 rie', 'Hilary', 'Hillary', 'Holly', 'Ike', 'Ikey', 'Ira', 'Isa', 'Isador', 'Isad

 ore', 'Jackie', 'Jaime', 'Jamie', 'Jan', 'Jean', 'Jere', 'Jermaine', 'Jerrie', '

 Jerry', 'Jess', 'Jesse', 'Jessie', 'Jo', 'Jodi', 'Jodie', 'Jody', 'Joey', 'Jorda

 n', 'Juanita', 'Jude', 'Judith', 'Judy', 'Julie', 'Justin', 'Karel', 'Kellen', '

 Kelley', 'Kelly', 'Kelsey', 'Kerry', 'Kim', 'Kip', 'Kirby', 'Kit', 'Kris', 'Kyle

 ', 'Lane', 'Lanny', 'Lauren', 'Laurie', 'Lee', 'Leigh', 'Leland', 'Lesley', 'Les

 lie', 'Lin', 'Lind', 'Lindsay', 'Lindsey', 'Lindy', 'Lonnie', 'Loren', 'Lorne',

 'Lorrie', 'Lou', 'Luce', 'Lyn', 'Lynn', 'Maddie', 'Maddy', 'Marietta', 'Marion',

  'Marlo', 'Martie', 'Marty', 'Mattie', 'Matty', 'Maurise', 'Max', 'Maxie', 'Mead

 ', 'Meade', 'Mel', 'Meredith', 'Merle', 'Merrill', 'Merry', 'Meryl', 'Michal', '

 Michel', 'Michele', 'Mickie', 'Micky', 'Millicent', 'Morgan', 'Morlee', 'Muffin'

 , 'Nat', 'Nichole', 'Nickie', 'Nicky', 'Niki', 'Nikki', 'Noel', 'Ollie', 'Page',

  'Paige', 'Pat', 'Patrice', 'Patsy', 'Pattie', 'Patty', 'Pen', 'Pennie', 'Penny'

 , 'Perry', 'Phil', 'Pooh', 'Quentin', 'Quinn', 'Randi', 'Randie', 'Randy', 'Ray'

 , 'Regan', 'Reggie', 'Rene', 'Rey', 'Ricki', 'Rickie', 'Ricky', 'Rikki', 'Robbie

 ', 'Robin', 'Ronnie', 'Ronny', 'Rory', 'Ruby', 'Sal', 'Sam', 'Sammy', 'Sandy', '

 Sascha', 'Sasha', 'Saundra', 'Sayre', 'Scotty', 'Sean', 'Shaine', 'Shane', 'Shan

 non', 'Shaun', 'Shawn', 'Shay', 'Shayne', 'Shea', 'Shelby', 'Shell', 'Shelley',

 'Sibyl', 'Simone', 'Sonnie', 'Sonny', 'Stacy', 'Sunny', 'Sydney', 'Tabbie', 'Tab

 by', 'Tallie', 'Tally', 'Tammie', 'Tammy', 'Tate', 'Ted', 'Teddie', 'Teddy', 'Te

 rri', 'Terry', 'Theo', 'Tim', 'Timmie', 'Timmy', 'Tobe', 'Tobie', 'Toby', 'Tommi

 e', 'Tommy', 'Tony', 'Torey', 'Trace', 'Tracey', 'Tracie', 'Tracy', 'Val', 'Vale

 ', 'Valentine', 'Van', 'Vin', 'Vinnie', 'Vinny', 'Virgie', 'Wallie', 'Wallis', '

 Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']

 >>>
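The list comprehension above tests each male name against the whole female list. A common alternative, sketched here on hypothetical sample lists rather than the real corpus, is to convert one list to a set first so that each membership test is fast.

```python
# Hypothetical sample data standing in for names.words('male.txt') and
# names.words('female.txt').
male_names = ["Adrian", "Alex", "Bob", "Chris", "David", "Lee"]
female_names = ["Alex", "Alice", "Chris", "Emma", "Lee", "Adrian"]

# Build the set once; `name in female_set` is then a hash lookup.
female_set = set(female_names)
ambiguous = [name for name in male_names if name in female_set]

# The same names, obtained as a set intersection and then sorted.
ambiguous_sorted = sorted(set(male_names) & female_set)

print(ambiguous)
```

The list comprehension preserves the order of the male list, whereas the set intersection has no inherent order and is sorted explicitly.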

2) Examining the final letters of male and female names

 >>> cfd=nltk.ConditionalFreqDist(

 ... (fileid, name[-1])

 ... for fileid in names.fileids()

 ... for name in names.words(fileid))

 >>> cfd.tabulate()

                    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    r    s    t    u    v    w    x    y    z

 female.txt    1 1773    9    0   39 1432    2   10  105  317    1    3  179   13  386   33    2   47   93   68    6    2    5   10  461    4

   male.txt    0   29   21   25  228  468   25   32   93   50    3   69  187   70  478  165   18  190  230  164   12   16   17   10  332   11

 >>> cfd.plot()

 >>>

 >>>

Evidently, most names ending in a, e, or i are female, while names ending in h or l are about equally likely to be male or female.
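What ConditionalFreqDist computes here is a separate frequency count per condition (the file) over the events (the final letters). A rough standard-library sketch of the same idea, using a handful of made-up names instead of the corpus, could look like this; the `sample` dictionary and the variable names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample standing in for the two files of the Names Corpus.
sample = {
    "female.txt": ["Anna", "Maria", "Sophie", "Emily", "Sarah"],
    "male.txt": ["John", "Mark", "Peter", "Leo", "Noah"],
}

# One Counter of final letters per condition (file id).
last_letter_counts = defaultdict(Counter)
for fileid, names in sample.items():
    for name in names:
        last_letter_counts[fileid][name[-1]] += 1

for fileid in sample:
    print(fileid, dict(last_letter_counts[fileid]))
```

A missing letter simply reads as 0 from a `Counter`, which mirrors the zero cells in the tabulated output above.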

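4. Tabular Lexicons

A table (or spreadsheet) is a slightly richer kind of lexical resource, with a word and some of its properties in each row. NLTK includes the CMU Pronouncing Dictionary for US English.

1) A pronouncing dictionary

The CMU Pronouncing Dictionary was designed for use by speech synthesizers.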

 >>>

 >>> entries=nltk.corpus.cmudict.entries()

 >>> len(entries)

 133737

 >>> for entry in entries[39943:39951]:

 ...     print (entry)

 ...

 ('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])

 ('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])

 ('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])

 ('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])

 ('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])

 ('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])

 ('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])

 ('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])

 >>>

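For each word, this lexicon provides a list of phonetic codes, with a distinct label for each contrastive sound, known as phones. The symbols used in the CMU Pronouncing Dictionary come from Arpabet.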
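In the entries shown above, vowel phones end in a digit. In the CMU dictionary this digit marks lexical stress: 1 for primary stress, 2 for secondary stress, and 0 for no stress. The sketch below extracts the stress pattern from a few entries copied from the output above; the helper name `stress` is chosen for this example, and the hard-coded `entries` list stands in for `nltk.corpus.cmudict.entries()`.

```python
# A few (word, pronunciation) pairs copied from the REPL output above.
entries = [
    ("explorer", ["IH0", "K", "S", "P", "L", "AO1", "R", "ER0"]),
    ("explosion", ["IH0", "K", "S", "P", "L", "OW1", "ZH", "AH0", "N"]),
    ("explosively", ["EH2", "K", "S", "P", "L", "OW1", "S", "IH0", "V", "L", "IY0"]),
]

def stress(pron):
    """Collect the stress digits of a pronunciation, in order."""
    return [char for phone in pron for char in phone if char.isdigit()]

stress_patterns = {word: stress(pron) for word, pron in entries}

# Words whose primary stress falls on the second vowel.
second_syllable_stress = [w for w, p in stress_patterns.items()
                          if len(p) > 1 and p[1] == "1"]

for word, pattern in stress_patterns.items():
    print(word, pattern)
```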

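2) Comparative wordlists

Another example of a tabular lexicon is the comparative wordlist. NLTK includes the so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified by two-letter ISO 639 codes.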

 >>> from nltk.corpus import swadesh

 >>> swadesh.fileids()

 ['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la',

 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']

 >>> swadesh.words('en')

 ['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'n

 ot', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short'

 , 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bir

 d', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rop

 e', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth',

  'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat',

  'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'di

 e', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', '

 turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'flo

 at', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'c

 loud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black'

 , 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smo

 oth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']

 >>>

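swadesh.fileids() returns the available language codes.

swadesh.words('en') returns the English wordlist.
With these wordlists it is easy to build a simple word translator (from French, German, and Spanish into English), as the following example shows: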

 >>> fr2en=swadesh.entries(['fr', 'en'])

 >>> fr2en

 [('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci',

  'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('commen

 t', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ('quelques', 'some'), ('peu', 'few'), ('autre', 'other'), ('un', 'o

 ne'), ('deux', 'two'), ('trois', 'three'), ('quatre', 'four'), ('cinq', 'five'), ('grand', 'big'), ('long', 'long'), ('large', 'wide'), ('ép

 ais', 'thick'), ('lourd', 'heavy'), ('petit', 'small'), ('court', 'short'), ('étroit', 'narrow'), ('mince', 'thin'), ('femme', 'woman'), ('h

 omme', 'man (adult male)'), ('homme', 'man (human being)'), ('enfant', 'child'), ('femme, épouse', 'wife'), ('mari, époux', 'husband'), ('mè

 re', 'mother'), ('père', 'father'), ('animal', 'animal'), ('poisson', 'fish'), ('oiseau', 'bird'), ('chien', 'dog'), ('pou', 'louse'), ('ser

 pent', 'snake'), ('ver', 'worm'), ('arbre', 'tree'), ('forêt', 'forest'), ('b\xe2ton', 'stick'), ('fruit', 'fruit'), ('graine', 'seed'), ('f

 euille', 'leaf'), ('racine', 'root'), ('écorce', 'bark (from tree)'), ('fleur', 'flower'), ('herbe', 'grass'), ('corde', 'rope'), ('peau', '

 skin'), ('viande', 'meat'), ('sang', 'blood'), ('os', 'bone'), ('graisse', 'fat (noun)'), ('\u0153uf', 'egg'), ('corne', 'horn'), ('queue',

 'tail'), ('plume', 'feather'), ('cheveu', 'hair'), ('tête', 'head'), ('oreille', 'ear'), ('\u0153il', 'eye'), ('nez', 'nose'), ('bouche', 'm

 outh'), ('dent', 'tooth'), ('langue', 'tongue'), ('ongle', 'fingernail'), ('pied', 'foot'), ('jambe', 'leg'), ('genou', 'knee'), ('main', 'h

 and'), ('aile', 'wing'), ('ventre', 'belly'), ('entrailles', 'guts'), ('cou', 'neck'), ('dos', 'back'), ('sein, poitrine', 'breast'), ('c\u0

 153ur', 'heart'), ('foie', 'liver'), ('boire', 'drink'), ('manger', 'eat'), ('mordre', 'bite'), ('sucer', 'suck'), ('cracher', 'spit'), ('vo

 mir', 'vomit'), ('souffler', 'blow'), ('respirer', 'breathe'), ('rire', 'laugh'), ('voir', 'see'), ('entendre', 'hear'), ('savoir', 'know (a

  fact)'), ('penser', 'think'), ('sentir', 'smell'), ('craindre, avoir peur', 'fear'), ('dormir', 'sleep'), ('vivre', 'live'), ('mourir', 'di

 e'), ('tuer', 'kill'), ('se battre', 'fight'), ('chasser', 'hunt'), ('frapper', 'hit'), ('couper', 'cut'), ('fendre', 'split'), ('poignarder

 ', 'stab'), ('gratter', 'scratch'), ('creuser', 'dig'), ('nager', 'swim'), ('voler', 'fly (verb)'), ('marcher', 'walk'), ('venir', 'come'),

 ("s'étendre", 'lie'), ("s'asseoir", 'sit'), ('se lever', 'stand'), ('tourner', 'turn'), ('tomber', 'fall'), ('donner', 'give'), ('tenir', 'h

 old'), ('serrer', 'squeeze'), ('frotter', 'rub'), ('laver', 'wash'), ('essuyer', 'wipe'), ('tirer', 'pull'), ('pousser', 'push'), ('jeter',

 'throw'), ('lier', 'tie'), ('coudre', 'sew'), ('compter', 'count'), ('dire', 'say'), ('chanter', 'sing'), ('jouer', 'play'), ('flotter', 'fl

 oat'), ('couler', 'flow'), ('geler', 'freeze'), ('gonfler', 'swell'), ('soleil', 'sun'), ('lune', 'moon'), ('étoile', 'star'), ('eau', 'wate

 r'), ('pluie', 'rain'), ('rivière', 'river'), ('lac', 'lake'), ('mer', 'sea'), ('sel', 'salt'), ('pierre', 'stone'), ('sable', 'sand'), ('po

 ussière', 'dust'), ('terre', 'earth'), ('nuage', 'cloud'), ('brouillard', 'fog'), ('ciel', 'sky'), ('vent', 'wind'), ('neige', 'snow'), ('gl

 ace', 'ice'), ('fumée', 'smoke'), ('feu', 'fire'), ('cendres', 'ashes'), ('br\xfbler', 'burn'), ('route', 'road'), ('montagne', 'mountain'),

  ('rouge', 'red'), ('vert', 'green'), ('jaune', 'yellow'), ('blanc', 'white'), ('noir', 'black'), ('nuit', 'night'), ('jour', 'day'), ('an,

 année', 'year'), ('chaud', 'warm'), ('froid', 'cold'), ('plein', 'full'), ('nouveau', 'new'), ('vieux', 'old'), ('bon', 'good'), ('mauvais',

  'bad'), ('pourri', 'rotten'), ('sale', 'dirty'), ('droit', 'straight'), ('rond', 'round'), ('tranchant, pointu, aigu', 'sharp'), ('émoussé'

 , 'dull'), ('lisse', 'smooth'), ('mouillé', 'wet'), ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à

 droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'

 ), ('nom', 'name')]

 >>> translate=dict(fr2en)

 >>> translate['chien']

 'dog'

 >>> translate['jeter']

 'throw'

 >>>

 >>> de2en=swadesh.entries(['de', 'en'])

 >>> es2en=swadesh.entries(['es', 'en'])

 >>> translate.update(dict(de2en))

 >>> translate.update(dict(es2en))

 >>> translate['Hund']

 'dog'

 >>> translate['perro']

 'dog'

 >>> translate['jeter']

 'throw'

 >>>
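One detail of the dict-based translator is worth spelling out: when a source word occurs more than once, as 'homme' does in the French list above, dict() keeps only the last pair, and update() likewise overwrites any key shared between languages. The sketch below shows this on a few pairs taken from the output above (the German pairs other than 'Hund' are assumptions for illustration), together with one possible way to keep every translation.

```python
from collections import defaultdict

# A few pairs in the shape returned by swadesh.entries([...]).
fr2en = [("chien", "dog"),
         ("homme", "man (adult male)"),
         ("homme", "man (human being)")]
de2en = [("Hund", "dog"), ("Mann", "man (adult male)")]  # 'Mann' is illustrative

# Plain dict: later pairs silently replace earlier ones with the same key.
translate = dict(fr2en)
translate.update(dict(de2en))

# Alternative: collect every translation for each source word.
translate_all = defaultdict(list)
for pairs in (fr2en, de2en):
    for source, english in pairs:
        translate_all[source].append(english)

print(translate["homme"])
print(translate_all["homme"])
# .get() avoids a KeyError for words missing from the list.
print(translate.get("chat", "<unknown>"))
```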

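5. Lexical Tools: Toolbox and Shoebox

Perhaps the most popular tool that linguists currently use to manage data is Toolbox, previously known as Shoebox. A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet. Below is a dictionary for the Rotokas language.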

 >>> from nltk.corpus import toolbox

 >>> dic1=toolbox.entries('rotokas.dic')

 >>> dic1[:20]

 [('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'), ('dcsv', 'true'), ('vx', ''), ('sc', '???'), ('dt', '29/Oct/2005')

 , ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'), ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'), ('xe', 'Apoka

 is gagging from food while talking.')]), ('kaa', [('ps', 'V'), ('pt', 'B'), ('ge', 'strangle'), ('tkp', 'pasim nek'), ('arg', 'O'), ('vx', '

 2'), ('dt', '07/Oct/2006'), ('ex', 'Rera rauroro rera kaarevoi.'), ('xp', 'Em i holim pas em na nekim em.'), ('xe', 'He is holding him and s

 trangling him.'), ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'), ('xp', 'Ol i pasim nek bilong man long rop bik

 os em i save bikhet tumas.'), ('xe', "They strangled the man's neck with rope because he was very stubborn and arrogant."), ('ex', 'Oirato o

 koearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'), ('xp', 'Ol i pasim nek bilong man long rop. Olsem na em i no pu

 lim win olsem na em i dai.'), ('xe', "They strangled the man's neck with a rope. And he couldn't breathe and he died.")]), ('kaa', [('ps', '

 N'), ('pt', 'MASC'), ('cl', 'isi'), ('ge', 'cooking banana'), ('tkp', 'banana bilong kukim'), ('pt', 'itoo'), ('sf', 'FLORA'), ('dt', '12/Au

 g/2005'), ('ex', 'Taeavi iria kaa isi kovopaueva kaparapasia.'), ('xp', 'Taeavi i bin planim gaden banana bilong kukim tasol long paia.'), (

 'xe', 'Taeavi planted banana in order to cook it.')]), ('kaakaaro', [('ps', 'N'), ('pt', 'NT'), ('ge', 'mixture'), ('tkp', '???'), ('eng', '

 mixtures'), ('eng', 'charm used to keep married men and women youthful and attractive'), ('cmt', 'Check vowel length. Is it kaakaaro or kaak

 aro? Does lexeme have suffix, -aro or -ro?'), ('dt', '20/Nov/2006'), ('ex', 'Kaakaroto ira purapaiveira aue iava opita, voeao-pa airepa orao

 uirara, ra va aiopaive.'), ('xp', 'Kokonas ol i save wokim long ol kain samting bilong ol nupela marit, bai ol i ken kaikai.'), ('xe', 'Mixt

 ures are made from coconut for newlyweds, who eat them.')]), ('kaakaaviko', [('ps', 'N'), ('pt', 'FEM'), ('ge', 'type of beetle'), ('tkp', '

 ???'), ('nt', 'round beetle like Mexican bean beetle'), ('dt', '10/Feb/2005'), ('sf', 'FAUNA.INSECT'), ('ex', 'Kaakaaviko kare oea binara to

 uaveira vara tapo piupaiveira.'), ('xp', 'Kaakaaviko em i wanpela kain insect em i save istap long ol bin or na long kain lip.'), ('xe', '??

 ?'), ('ex', 'Kaakaaviko kare oea raviriro kouro piupaiveira.'), ('xp', 'Em i wanpela kain weevil i save bagarapim ol bin.'), ('xe', '??? dam

 ages up beans.')]), ('kaakaavo', [('rt', 'kaavo'), ('ps', '???'), ('rdp', 'partial'), ('ge', 'white'), ('tkp', 'wait'), ('sc', '???'), ('cmt

 ', "What's the part of speech?"), ('dt', '29/Oct/2005'), ('ex', 'Kaakaaro oa purapaiveira varauraro tokipasia aue iava opita ora vegoara iav

 a oirara iava ora riakova kaakaaro.'), ('xp', 'Ol i save wokim out long kokonas coconut na ol lip na skin blong ol diwai.'), ('xe', '???'),

 ('ex', 'Varoa kaakaavopa popotepa ragai varo.'), ('xp', 'Em white lap lap blong mi.'), ('xe', "That's my white laplap."), ('ex', 'Vaoia evao

 va kaakaavopaova.'), ('xp', 'Dispela diwai em i waitpela.'), ('xe', 'This tree is white.'), ('ex', 'Rarasoria kaakaavoto ira Amerika iava ur

 ioroera vo kovosia rupairara voaro.'), ('xp', 'Rarason em i wait man em i bin kam long Amerika na kam wok long hap bilong ol bilak man.'), (

 'xe', 'Rarason is a white man who came from America ???.')]), ('kaakaoko', [('ps', 'N'), ('pt', '???'), ('ge', 'type of beetle'), ('tkp', 'b

 inatang'), ('sf', 'FAUNA.INSECT'), ('cmt', 'Is it kaakaoko or kaakauko?'), ('dt', '08/Feb/2005'), ('ex', 'Kaakaoko vuri gesito./Kaakauko vur

 isi gesiva.'), ('xp', '???'), ('xe', 'Kaakauko em i wanpela binatang.')]), ('kaakasi', [('rt', '???'), ('ps', 'V'), ('pt', 'A'), ('ge', 'hot

 '), ('tkp', 'hot'), ('vx', '1'), ('sc', '???'), ('cmt', "Vowel length can't possibly be right. Or is the vowel of kaasi long?"), ('dt', '29/

 Oct/2005'), ('ex', 'Upiriko pitoka kaakasipai.'), ('xp', 'Sospen kaukau em i hot tru.'), ('xe', 'The saucepan of sweet potatos is really hot

 .'), ('ex', 'Kaukau pitoka rirovira rutu kaakasipai uvare riro kasia tuitui kasi oripiro.'), ('xp', 'Sospen kaukau em i hot tru bikos em i t

 an long bikpela paia.'), ('xe', '???')]), ('kaakau', [('ps', 'N'), ('pt', 'FEM'), ('ge', 'dog'), ('tkp', 'dok'), ('dt', '17/Jul/2005'), ('ex

 ', 'Kaakau voresiurava toupa aue kokoto ora kokopi.'), ('xp', 'Dog i gat fopela lek bilong em na em i teleblonge.'), ('xe', 'Dogs are four-f

 ooted ???.'), ('ex', 'Revisa riro kaakau raguito.'), ('xp', 'Revisa em i man bilong lukautim dok.'), ('xe', 'Revisa is a big dog lover.'), (

 'ex', 'Rake ora Jon kaakau kare ousia avasie.'), ('xp', 'Rake wantaim Jon ol i go kisim ol wail dok.'), ('xe', 'Rake and John went to get wi

 ld dogs.')]), ('kaakauko', [('ps', 'N'), ('pt', 'MASC'), ('ge', 'gray weevil'), ('tkp', 'wanpela kain binatang'), ('sf', 'FAUNA.INSECT'), ('

 nt', 'pictured on PNG postage stamp'), ('dt', '29/Oct/2005'), ('ex', 'Kaakauko ira toupareveira aue-ia niugini stemp.'), ('xp', 'Kaakauko em

  insect em i istap long niugini.'), ('xe', 'The gray weevil is found on the PNG stamp.'), ('ex', 'Kaakauko iria toupaeveira niugini stamia.'

 ), ('xp', 'Weevil i stap long niguini stamp.'), ('xe', 'The gray weevil is on the New Guinea stamp.'), ('ex', 'Kaakauko korekare iava oira i

 ria iava varaua vurivurivira ora kaapovira toupaiveira.'), ('xp', 'Kaakavuko em i wanpela kain binatang skin bilong em i braun na wait.'), (

 'xe', 'Kaakavuko is an insect whose body is brown and white.')]), ('kaakito', [('rt', 'kaaki'), ('ps', 'N'), ('pt', 'HUM'), ('ge', 'person b

 lind with cataracts'), ('tkp', 'man i gat wanpela ei'), ('nt', 'nickname when used to describe one-eyed person'), ('dt', '11/Feb/2005'), ('e

 x', 'Rarasirea kakito eisiva rera Tavusiva uruiia.'), ('xp', 'Rarasirea em i wan ai man bilong ples Tavusiova.'), ('xe', 'Rarasirea is a one

 -eyed man from Tavusiova village.'), ('ex', 'Kaakito kataitoa iava osireito vurapare.'), ('xp', 'Man i gat wanpela ei na i lukluk.'), ('xe',

  'A one-eyed man looks out of one eye.')]), ('kaakuupato', [('ps', 'N'), ('pt', 'PN'), ('ge', 'spring of hot mineral water near Togarao.'),

 ('tkp', '???'), ('nt', 'It is located in gulley above the shorter waterfall and is most likely ???.'), ('dt', '08/Feb/2005'), ('ex', 'Kasira

 opato kaakuupato uicoto ira vusivusipareveira vova rasito vo toupare togarao-ia sisiupaveira vosa upiapave ora ruvapasa.'), ('xp', 'Kaakuupa

 to em i spirins hot water em i stap long Togavao taim husat i sik bai ol waswas long bai sick br pinis .'), ('xe', '???'), ('ex', 'Kaakuupat

 o kasiraopato ukoto ira toupare eisi Rureva Togaraoia ruvaraia.'), ('xp', 'Hot wara kaakuupato i stap long Rureva klostu long Togarao.'), ('

 xe', 'The hot spring Kaakuupato is in Rureva near Togarao.')]), ('kaaova', [('ps', 'N'), ('pt', 'FEM'), ('ge', 'aunt'), ('tkp', '???'), ('nt

 ', 'FaSi'), ('sf', 'KIN'), ('dt', '19/Jul/2004')]), ('kaapa', [('ps', 'N'), ('pt', '???'), ('ge', 'copper metal'), ('tkp', 'retpela ain'), (

 'dt', '12/Feb/2005'), ('cmt', 'What is paupara doing in the second example?'), ('ex', 'Kaapa vao oa-ia kepa paupaviei.'), ('xp', 'Kaapa em i

  roof yumi save wokim haus long em.'), ('xe', 'Copper we make houses from.'), ('ex', 'Kaapara kepa paupara oara purapaiveira eisi Astararia.

 '), ('xp', 'Kapa bilong wokim haus ol i save wokim long Australia.'), ('xe', 'Copper rooves, they make them in Australia.')]), ('kaapea', [(

 'ps', '???'), ('ge', 'weak'), ('ge', 'loose'), ('ge', 'easy'), ('tkp', '???'), ('cmt', 'Check spelling. Is it kaapea or kapea?'), ('dt', '

 /Jun/2005'), ('ex', 'Kaapeta virago vao paupa.'), ('xp', 'Dispela chair i no strong bilong sindaun.'), ('xe', '???')]), ('kaapie', [('rt', '

 kaa'), ('ps', 'N'), ('pt', 'MASC'), ('ge', 'hook'), ('ge', 'fishhook'), ('tkp', 'huk'), ('dt', '15/Feb/2004')]), ('kaapie', [('rt', 'kaa'),

 ('ps', 'V'), ('pt', 'B'), ('ge', 'hook'), ('eng', 'choke'), ('eng', 'snag'), ('eng', 'hook'), ('ge', 'capture'), ('tkp', 'hukim'), ('tkp', '

 pasim long huk'), ('arg', 'O'), ('vx', '2'), ('dt', '15/Nov/2005'), ('cmt', "Double-check vowel length of kaa. First example doesn't make se

 nse. Is it two sentences?"), ('ex', 'Aiopaoro karoi kakaeto kaapierivoi aioa-ia.'), ('xp', 'Kaikai pas long nek bilong mi kaikai i pas long

 pikinini.'), ('xe', '???'), ('ex', 'Koie kaapierevo Ririre ovare oira gisipoaro iare karuveraisi vikirevo.'), ('xp', 'Ririre i tromoem singa

 po insait long maus bilong pik na i pas.'), ('xe', 'Ririre ???.'), ('ex', 'Aakova kakaeto kapieevoi aioa-ia.'), ('xp', 'Mama i givim kaikai

 long pikinini na hap kaikai i pas long nek bilong em.'), ('xe', 'Mother made the boy choke with some food.'), ('ex', 'Aakova kakaeto kaapiev

 oi aioa-ia uvare viapau vearovira va orievo.'), ('xp', 'Mama em i mekim pas kaikai long pikinini bikos em i no kukim gut.'), ('xe', "Mother

 made the boy choke from the food because she didn't cook it well."), ('ex', 'Avuka kakaeto aiopiepaoro rera kaapieevo.'), ('xp', 'Lapun wok

 long givim kaikai long bebe na kaikai i pas long nek.'), ('xe', 'The old person fed the boy and made it choke.')]), ('kaapiepato', [('rt', '

 kaapie'), ('ps', 'N'), ('pt', 'HUM'), ('ge', 'fisher'), ('tkp', 'man bilong hukim pis'), ('dt', '12/Feb/2005'), ('ex', 'Aveatoa atari kapiep

 ato vokiara rutu.'), ('xp', 'Aveato em i man bilong hukim pis olgeta de.'), ('xe', 'Aveato works as a fisherman every day.')]), ('kaapisi',

 [('ps', 'V'), ('pt', 'B'), ('ge', 'pinch together'), ('ge', 'grip with pincers'), ('tkp', 'holim'), ('arg', 'O'), ('vx', ''), ('dt', '08/Ju

 n/2005'), ('ex', 'Kaapisi ava eva ra avekeara kasiraopa ra kaekaepiea. Ra varao vera oara kasiraopai.'), ('xp', 'Yu mas kam wantam sisis pin

 vh bar mi ya rausim ol dispela pela stow ol bai mi rausim ol dispela i hot.'), ('xe', '???'), ('ex', 'Avekeara kaapisi evara kasiraopara.'),

  ('xp', 'Yu rausim ol ton i hot long pansa.'), ('xe', '???')]), ('kaapisivira', [('rt', 'kaapisi'), ('ps', 'ADV'), ('pt', 'MANNER'), ('ge',

 'linked'), ('ge', 'pinched'), ('tkp', '???'), ('dt', '29/Oct/2005'), ('ex', 'Auea eva oa kaapisivira toupaivoi.'), ('xp', 'Samting i stap ol

 sem pansa.'), ('xe', '???'), ('ex', 'Pariearei tapokovira toupai uva kaapisivira kekepapiroi.'), ('xp', 'Hap mambu i pas wantaim na i luk ol

 sem sises.'), ('xe', '???')])]

 >>>

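Looking only at the first entry, the word kaa means "to gag". An entry consists of a series of attribute-value pairs, such as ('ps', 'V'), which indicates that the part of speech is 'V' (verb), and ('ge', 'gag'), which indicates that the English gloss is 'gag'. The last three pairs contain an example sentence in Rotokas and its translations into Tok Pisin and English.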
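Because fields can repeat within one entry, converting the list of pairs with dict() would silently drop values. A minimal sketch of one way to handle this groups the values of each field into a list. The entry below is an abbreviated copy of the 'kaapea' entry from the output above (only some of its fields are kept), and the name `fields` is chosen for this example.

```python
from collections import defaultdict

# Abbreviated copy of one (headword, fields) entry from the output above.
entry = ("kaapea", [("ps", "???"),
                    ("ge", "weak"), ("ge", "loose"), ("ge", "easy"),
                    ("tkp", "???"),
                    ("cmt", "Check spelling. Is it kaapea or kapea?")])

headword, pairs = entry

# Group values by field marker so repeated fields are all preserved.
fields = defaultdict(list)
for marker, value in pairs:
    fields[marker].append(value)

# dict() would keep only the last 'ge' value.
lossy = dict(pairs)

print(headword, fields["ge"])
print(lossy["ge"])
```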

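Rotokas is a language spoken on the island of Bougainville in Papua New Guinea. This lexicon was contributed to NLTK by Stuart Robinson. Rotokas is notable for having an inventory of just 12 phonemes (contrastive sounds).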
