Sunday, May 24, 2009

Corpora in other languages

Hmm, it turns out that instead of using the Python command line, it is easier to use the Python shell, because you can copy and paste. You just can't edit or trace back through previous commands the way you can on the command line, but that is not much of a hassle for someone who can't type well, like me.

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3',
'\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]
>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',
'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',
'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...]
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]
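The Hindi words above print as raw \xe0... escapes because the old Python 2 corpus readers return UTF-8 byte strings. A minimal sketch (assuming that Python 2 setup) of decoding them so the Devanagari is readable:

>>> hindi = nltk.corpus.indian.words('hindi.pos')
>>> [w.decode('utf-8') for w in hindi[:3]]

On a newer NLTK under Python 3, words() already returns unicode strings, so no decoding should be needed.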

NLTK can also plot a graph of word length in each language. Import the corpus you want, in this case udhr, give it the list of languages, then build the conditional frequency distribution and plot it:

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
... 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
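If a plot window is not handy, the same ConditionalFreqDist can print the counts as a text table instead; a small sketch with the same cfd:

>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)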

Corpora are also available in other languages, from European languages all the way to JAVANESE... impressive, isn't it... I wonder who collects them all. Here is an example of finding the frequency distribution of the letters in the udhr text for Javanese:
>>> languages = ['Javanese-Latin1']
>>> raw_text = udhr.raw('Javanese-Latin1')
>>> nltk.FreqDist(raw_text).plot()
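The same ConditionalFreqDist trick works for comparing letters across several translations at once. A sketch of my own (the language list here is just an example of valid udhr fileids):

>>> languages = ['Javanese-Latin1', 'English-Latin1', 'German_Deutsch-Latin1']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, char.lower())
...     for lang in languages
...     for char in udhr.raw(lang))
>>> cfd.plot()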


Below is a list of the basic corpus reader commands:

Example - Description
fileids() - the files of the corpus
fileids([categories]) - the files of the corpus corresponding to these categories
categories() - the categories of the corpus
categories([fileids]) - the categories of the corpus corresponding to these files
raw() - the raw content of the corpus
raw(fileids=[f1,f2,f3]) - the raw content of the specified files
raw(categories=[c1,c2]) - the raw content of the specified categories
words() - the words of the whole corpus
words(fileids=[f1,f2,f3]) - the words of the specified fileids
words(categories=[c1,c2]) - the words of the specified categories
sents() - the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) - the sentences of the specified fileids
sents(categories=[c1,c2]) - the sentences of the specified categories
abspath(fileid) - the location of the given file on disk
encoding(fileid) - the encoding of the file (if known)
open(fileid) - open a stream for reading the given corpus file
root() - the path to the root of the locally installed corpus
readme() - the contents of the README file of the corpus
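To get a quick feel for these commands, here is a small sketch that exercises a few of them on the built-in Gutenberg corpus (any installed corpus would do):

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()[:3]
>>> gutenberg.abspath('austen-emma.txt')
>>> gutenberg.encoding('austen-emma.txt')
>>> len(gutenberg.raw('austen-emma.txt'))
>>> len(gutenberg.words('austen-emma.txt'))
>>> len(gutenberg.sents('austen-emma.txt'))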

A small addition. Pass the name of the file you want (here burgess-busterbrown.txt); raw[1:20] displays the title of the text:

>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'

>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]
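Combining raw(), words(), and sents() from the table above also gives quick statistics on a text. A small sketch: average word length and average sentence length (integer division on Python 2):

>>> fileid = 'burgess-busterbrown.txt'
>>> num_chars = len(gutenberg.raw(fileid))
>>> num_words = len(gutenberg.words(fileid))
>>> num_sents = len(gutenberg.sents(fileid))
>>> num_chars / num_words    # average word length
>>> num_words / num_sents    # average sentence length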
