Vietnamese Language Resources
Data
Vietnamese machine readable dictionary
- Contains 35,000 contemporary Vietnamese words with morphological,
syntactic, and semantic information;
- Follows an international standard format for computer dictionaries.
Vietnamese treebank
- 70,000 word-segmented sentences;
- 10,000 POS-tagged sentences;
- 10,000 syntactic trees with both constituent and functional labels;
- Formatted in bracketed structure (similar to the Penn English Treebank).
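As an illustration of what a Penn-style bracketed structure looks like in practice, here is a minimal Python sketch that reads one bracketed tree and recovers its words. The example sentence, the labels (`NP-SUB`, `NP-DOB`), the underscore joining the syllables of a segmented word, and the helper names `parse_bracketed`, `leaves`, and `split_label` are all assumptions made for illustration; the actual treebank files may differ in label set and layout.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either sub-trees (tuples) or word strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


def split_label(label):
    """Split a label such as 'NP-SUB' into (constituent, function).

    Assumes the functional label, if any, follows the first hyphen.
    """
    constituent, _, function = label.partition("-")
    return constituent, function


# Hypothetical example sentence; labels are illustrative only.
example = "(S (NP-SUB (N Sinh_viên)) (VP (V đọc) (NP-DOB (N sách))))"
tree = parse_bracketed(example)
print(leaves(tree))
print(split_label(tree[1][0][0]))
```

The parser is deliberately small: it handles one well-formed tree per string and does no recovery on unbalanced brackets.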
English-Vietnamese bilingual corpus
- 80,000 sentence pairs on socio-economic topics;
- 20,000 sentence pairs on information technology topics.
If you want to use the above data for research or
educational purposes, please fill out the form
Data usage agreement
and send it to the e-mail: .
Tools
Vietnamese word segmentation program
- Combines dictionary and n-gram models;
- Trained using 70,000 word-segmented sentences from Vietnamese treebank;
- Accuracy is around 97%.
Download: vnTokenizer 4.1.1c (04-Aug-2010) ~6.5 MB / Authors' page
Vietnamese part-of-speech tagger
- Based on maximum entropy and conditional random field models;
- Trained using 20,000 POS-tagged sentences from Vietnamese treebank;
- Accuracy is around 93%.
Download: VietTagger (16-Aug-2010) ~10 MB
Vietnamese chunker
- Based on conditional random field model;
- Trained using 10,000 syntactic trees from Vietnamese treebank;
- Accuracy is around 81%.
Download: VietChunker (16-Aug-2010) ~132 MB
Vietnamese syntactic parser
- Based on lexicalized probabilistic context free grammars;
- Developed from multilingual parsing engine of Dan Bikel;
- Trained using 10,000 syntactic trees from Vietnamese treebank;
- F-score is around 78%.
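The page does not say how the parser's score is computed. A common convention for constituency parsers is the harmonic mean of labeled bracket precision and recall, and the following Python sketch shows that calculation under that assumption. The function name `f1_from_brackets` and the sample brackets are hypothetical, and brackets are treated as sets of `(label, start, end)` triples, which is a simplification of standard evaluation tools.

```python
def f1_from_brackets(gold, predicted):
    """Labeled bracket F1 for one sentence.

    gold, predicted: iterables of (label, start, end) triples.
    Duplicated brackets are collapsed, a simplification of full scoring.
    """
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical brackets for a three-word sentence.
gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]
predicted = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 1, 3)]
print(f1_from_brackets(gold, predicted))
```

Here three of the four predicted brackets match the gold tree, so precision and recall are both 3/4.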