Python 2.5

Wednesday, June 3, 2009

Sorting male and female names

One more wordlist corpus is the Names corpus, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

>>> import nltk
>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]
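
The list comprehension above scans the whole female list once for every male name. A minimal sketch of the same check using a set is shown here; the two short lists are hypothetical stand-ins for the corpus files, and the code is written for Python 3.

```python
# Hypothetical sample lists standing in for names.words('male.txt')
# and names.words('female.txt'); the real corpus is much larger.
male_names = ['Abbey', 'Adam', 'Alex', 'Andy', 'Brian']
female_names = ['Abbey', 'Alex', 'Alice', 'Andy', 'Zoe']

# Build the set once so each membership test is a hash lookup
# instead of a scan through the whole female list.
female_set = set(female_names)
ambiguous = [w for w in male_names if w in female_set]
print(ambiguous)  # ['Abbey', 'Alex', 'Andy']
```

The result keeps the order of the male list, just as the original comprehension does.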

Solving a word puzzle

Suppose we are given the scrambled letters egivrvonl and asked to list the words that can be formed from them, with the condition that every word must contain the letter r.
Figure 2.9: A Word Puzzle: a grid of randomly chosen letters with rules for creating words out of the letters; this puzzle is known as "Target."

A wordlist is useful for solving word puzzles, such as the one in Figure 2.9. Our program iterates through every word and, for each one, checks whether it meets the conditions. It is easy to check the obligatory letter and length constraints (and we'll only look for words with six or more letters here). It is trickier to check that candidate solutions only use combinations of the supplied letters, especially since some of the supplied letters appear twice (here, the letter v). The FreqDist comparison method permits us to check that the frequency of each letter in the candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.

>>> import nltk
>>> puzzle_letters = nltk.FreqDist('egivrvonl')
>>> obligatory = 'r'
>>> wordlist = nltk.corpus.words.words()
>>> [w for w in wordlist if len(w) >= 6
... and obligatory in w
... and nltk.FreqDist(w) <= puzzle_letters]
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor',
'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi',
'revolving', 'ringle', 'roving', 'violer', 'virole']
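
The letter-frequency check can also be sketched without NLTK, using collections.Counter from the standard library. The function name fits_puzzle and the short candidate list are assumptions made for illustration; this is Python 3 code, not the FreqDist implementation.

```python
from collections import Counter

def fits_puzzle(word, letters, obligatory, min_len=6):
    """Return True if word can be built from letters and meets the rules."""
    available = Counter(letters)
    needed = Counter(word)
    return (len(word) >= min_len
            and obligatory in word
            # No letter may be used more often than the puzzle supplies it.
            and all(needed[ch] <= available[ch] for ch in needed))

# Hypothetical candidates; a real run would loop over a full wordlist.
candidates = ['govern', 'revolving', 'lover', 'giving', 'groove']
solutions = [w for w in candidates if fits_puzzle(w, 'egivrvonl', 'r')]
print(solutions)  # ['govern', 'revolving']
```

Here 'lover' is too short, 'giving' needs two g's and has no r, and 'groove' needs two o's, so only two candidates pass.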

Thursday, May 28, 2009

Building a grammar

Here we discuss how to build a grammar. One part of this, which we cover now, is how to form plural words.

Enter the function:

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

Then test it with a few words:

>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'
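
These rules are rough heuristics: the function as written would turn 'boy' into 'boies' and 'fan' into 'fen'. The sketch below is one possible tightening, not part of the original function. The name plural_checked, the vowel test, and the narrower 'man' suffix are all assumptions, and words such as 'human' would still come out wrong.

```python
def plural_checked(word):
    """Variant of plural() that leaves vowel + 'y' words regular."""
    vowels = 'aeiou'
    # Only consonant + 'y' becomes 'ies' (fairy -> fairies, boy -> boys).
    if word.endswith('y') and len(word) > 1 and word[-2] not in vowels:
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    # Match 'man' rather than 'an' so that 'fan' stays regular.
    elif word.endswith('man'):
        return word[:-2] + 'en'
    else:
        return word + 's'

for w in ['fairy', 'woman', 'boy', 'fan', 'box']:
    print(w, plural_checked(w))
```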

Tuesday, May 26, 2009

Finding words paired with a genre

Say the genre is news. Which words are paired with news? Bad news? Good news? Sad news? News week?
Below we show this for more than one genre.
First, import the corpus and enter this rather long command:
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))

Once this command has been entered, let's try it with just two genres.

>>> genre_word = [(genre, word)
... for genre in ['news', 'romance']
... for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

Next, display some of the pairs. Here only four pairs are shown from each end of the list. You can show more if you like, for example 6, 7, or 8.
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
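
To see what the (genre, word) pairs are used for, here is a minimal sketch that builds the same kind of pair list from a tiny in-memory corpus and counts words per genre. The toy_corpus dictionary is hypothetical, and a defaultdict of Counter objects stands in for nltk.ConditionalFreqDist; it is an approximation of the idea, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the real words come from brown.words(categories=genre).
toy_corpus = {
    'news': ['The', 'jury', 'said', 'the', 'election', 'was', 'fair'],
    'romance': ['She', 'said', 'she', 'was', 'not', 'afraid'],
}

genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in toy_corpus[genre]]

# One Counter per condition (genre), filled from the pairs.
cfd = defaultdict(Counter)
for genre, word in genre_word:
    cfd[genre][word] += 1

print(len(genre_word))         # 13
print(genre_word[:2])          # [('news', 'The'), ('news', 'jury')]
print(cfd['romance']['said'])  # 1
```

All the news pairs come first and all the romance pairs last, which is why the first and last slices of the real list show different genres.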

Monday, May 25, 2009

Counting the words in each genre

>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))

>>> genre_word = [(genre, word)
... for genre in ['news', 'romance']
... for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

Sunday, May 24, 2009

Loading our own corpus

We can load our own texts.

>>> from nltk.corpus import PlaintextCorpusReader
This imports the reader used to load the corpus.

>>> corpus_root = '/usr/share/dict'
Next to corpus_root, type the path of the folder on your hard disk. The quickest way to find it is to look at the file's properties. Don't forget the quotation marks.

>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
Once you have the path, give a pattern for the file type; otherwise the reader picks up every file in that folder, as the '.*' pattern does here. Plain .txt files are the friendliest.

>>> wordlists.fileids()
This lists the files that match the pattern:
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']

>>> wordlists.words('connectives')
This displays the words:
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

Here is another example:
>>> corpus_root = r"\Documents and Settings\User\My Documents"
>>> wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')
>>> wordlists.fileids()
['ELK.txt']
>>> wordlists.words('ELK.txt')
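
The second argument to PlaintextCorpusReader is a regular expression that selects file names. The sketch below shows, with the standard library only, how a pattern such as '.*' differs from one that ends in '\.txt'. The helper matching_files and the file names are hypothetical, the code is written for Python 3, and it only approximates how the reader selects files.

```python
import os
import re
import tempfile

def matching_files(root, pattern):
    """Return the file names under root that fully match the regex pattern."""
    return sorted(name for name in os.listdir(root)
                  if re.fullmatch(pattern, name))

# Build a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    for name in ['ELK.txt', 'notes.doc', 'README']:
        with open(os.path.join(root, name), 'w') as f:
            f.write('sample text\n')
    all_files = matching_files(root, r'.*')
    txt_files = matching_files(root, r'.*\.txt')

print(all_files)  # ['ELK.txt', 'README', 'notes.doc']
print(txt_files)  # ['ELK.txt']
```

Escaping the dot matters: an unescaped '.' matches any character, so a looser pattern could also pick up names that merely end in 'txt'.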

Corpora in other languages

It turns out that the Python shell is easier to use than the Python command line, because you can copy and paste. The downside is that you can't edit a previous command or step back through earlier commands as you can on the command line,
but it is much less stressful for people who can't type well, like me.

>>> import nltk
>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3',
'\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]
>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',
'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',
'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...]
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]

It can also plot the word lengths for each language.
Import the corpus you want, in this case udhr:
>>> from nltk.corpus import udhr
Enter the list of languages:
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
... 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

Here are the commands:
>>> cfd = nltk.ConditionalFreqDist(
... (lang, len(word))
... for lang in languages
... for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
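
The cumulative plot shows, for each language, how many words have a given length or less. A minimal sketch of that calculation without NLTK or plotting follows; the word samples and the helper cumulative_lengths are hypothetical, and the code is written for Python 3.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical word samples; the real ones come from udhr.words(lang + '-Latin1').
samples = {
    'English': ['all', 'human', 'beings', 'are', 'born', 'free'],
    'German_Deutsch': ['alle', 'Menschen', 'sind', 'frei', 'geboren'],
}

def cumulative_lengths(words):
    """Map each word length to the count of words of that length or shorter."""
    counts = Counter(len(w) for w in words)
    lengths = sorted(counts)
    running = accumulate(counts[n] for n in lengths)
    return dict(zip(lengths, running))

for lang, words in samples.items():
    print(lang, cumulative_lengths(words))
```

For the English sample this gives {3: 2, 4: 4, 5: 5, 6: 6}: two words of length 3, four of length 4 or less, and so on up to all six words.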

Corpora are also available in other languages, from European languages all the way to JAVANESE... impressive, I wonder who collected them. Here is an example that finds the frequency distribution of the letters in the Javanese udhr text:
>>> languages = ['Javanese-Latin1']
>>> raw_text = udhr.raw('Javanese-Latin1')
>>> nltk.FreqDist(raw_text).plot()
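
Because the raw text is a string, the frequency distribution is built over individual characters, which includes spaces and punctuation. A minimal sketch of the same idea with collections.Counter, restricted to letters, is shown here. The sample string reuses the Javanese words shown earlier in this post, and filtering to letters is an assumption about what one would want to count.

```python
from collections import Counter

# Sample words taken from the Javanese output shown earlier in this post.
raw_text = 'Saben umat manungsa lair kanthi'

# Count letters only, ignoring spaces and case.
letter_counts = Counter(ch.lower() for ch in raw_text if ch.isalpha())
print(letter_counts.most_common(2))  # [('a', 6), ('n', 4)]
```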

Here is a list of the commands:

Example                     Description
fileids()                   the files of the corpus
fileids([categories])       the files of the corpus corresponding to these categories
categories()                the categories of the corpus
categories([fileids])       the categories of the corpus corresponding to these files
raw()                       the raw content of the corpus
raw(fileids=[f1,f2,f3])     the raw content of the specified files
raw(categories=[c1,c2])     the raw content of the specified categories
words()                     the words of the whole corpus
words(fileids=[f1,f2,f3])   the words of the specified fileids
words(categories=[c1,c2])   the words of the specified categories
sents()                     the sentences of the whole corpus
sents(fileids=[f1,f2,f3])   the sentences of the specified fileids
sents(categories=[c1,c2])   the sentences of the specified categories
abspath(fileid)             the location of the given file on disk
encoding(fileid)            the encoding of the file (if known)
open(fileid)                open a stream for reading the given corpus file
root()                      the path to the root of locally installed corpus
readme()                    the contents of the README file of the corpus

One small addition:
>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw("burgess-busterbrown.txt")
This displays the title of the text:
>>> raw[1:20]
'The Adventures of B'

Enter the file name:
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
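
The same slice [1:20] means different things for the two objects: on the raw string it selects characters, while on the word list it selects tokens. Both start at index 1 because the text opens with a bracket. The sketch below illustrates this on a single line. The sample string is an assumed reconstruction of the opening of the file based on the outputs above, and the regular expression is a rough stand-in tokenizer, not the one NLTK uses.

```python
import re

# Assumed opening of the file, reconstructed from the outputs shown above.
sample = '[The Adventures of Buster Bear by Thornton W. Burgess 1920]'

# A rough stand-in tokenizer (not the one NLTK uses): words or single symbols.
tokens = re.findall(r'\w+|[^\w\s]', sample)

print(sample[1:20])  # characters 1 to 19: 'The Adventures of B'
print(tokens[1:5])   # tokens 1 to 4: ['The', 'Adventures', 'of', 'Buster']
```
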
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...],