Vietnamese Language Resources

Data

  • Vietnamese machine readable dictionary

    • Contains 35,000 contemporary Vietnamese words with morphological, syntactic, and semantic information;
    • Follows an international standard format for computer dictionaries.
  • Vietnamese treebank

    • 70,000 word-segmented sentences;
    • 10,000 POS-tagged sentences;
    • 10,000 syntactic trees with both constituent and functional labels;
    • Stored in a bracketed format (similar to the Penn English Treebank).
  • English-Vietnamese bilingual corpus

    • 80,000 sentence pairs on socio-economic topics;
    • 20,000 sentence pairs on information technology topics.

If you want to use the above data for research or educational purposes, please fill out the Data usage agreement form and send it to the e-mail: .
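
The treebank listed above is distributed in a bracketed format similar to the Penn English Treebank. As a rough illustration of how such a file could be read, the sketch below parses one bracketed tree into nested tuples. The function names `parse_bracketed` and `leaves` are hypothetical, and the example sentence and its labels are invented for illustration rather than taken from the treebank; the real files may use conventions this sketch does not handle.

```python
def parse_bracketed(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples."""
    # Put spaces around parentheses so that a plain split() yields tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]           # constituent or POS label
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # word (leaf)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1                      # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Invented example; the labels are assumptions, not the treebank's tag set.
example = "(S (NP-SUB (P Tôi)) (VP (V đọc) (NP-DOB (N sách))))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

Nested tuples keep the sketch free of dependencies; a toolkit tree class would be a natural replacement in real use.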

Tools

  • Vietnamese word segmentation program

    • Combines dictionary and n-gram models;
    • Trained using 70,000 word-segmented sentences from the Vietnamese treebank;
    • Accuracy is around 97%.

    Download: vnTokenizer 4.1.1c (04-Aug-2010) ~6.5 MB / Authors' page

  • Vietnamese part-of-speech tagger

    • Based on maximum entropy and conditional random field models;
    • Trained using 20,000 POS-tagged sentences from the Vietnamese treebank;
    • Accuracy is around 93%.

    Download: VietTagger (16-Aug-2010) ~10 MB

  • Vietnamese chunker

    • Based on a conditional random field model;
    • Trained using 10,000 syntactic trees from the Vietnamese treebank;
    • Accuracy is around 81%.

    Download: VietChunker (16-Aug-2010) ~132 MB

  • Vietnamese syntactic parser

    • Based on lexicalized probabilistic context-free grammars;
    • Developed from Dan Bikel's multilingual parsing engine;
    • Trained using 10,000 syntactic trees from the Vietnamese treebank;
    • F-score is around 78%.
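
The parser above is reported with an F-score, the harmonic mean of precision and recall. This page does not say exactly how the score was computed; the sketch below assumes the common labelled-bracket scheme, where each constituent is a (label, start, end) span and a predicted span counts as correct only if it matches a gold span exactly. The function name `bracket_f_score` and the example spans are hypothetical.

```python
from collections import Counter


def bracket_f_score(gold_spans, predicted_spans):
    """Return (precision, recall, F-score) over labelled (label, start, end) spans."""
    gold = Counter(gold_spans)
    pred = Counter(predicted_spans)
    # Multiset intersection: a span is matched at most as often as it occurs in both.
    matched = sum((gold & pred).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented spans for a three-word sentence (assumption, not treebank data).
gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]
predicted = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)]
precision, recall, f_score = bracket_f_score(gold, predicted)
print(precision, recall, round(f_score, 4))
```

Here the parser finds three of the four gold constituents and proposes nothing wrong, so precision is 1.0, recall is 0.75, and the F-score is 6/7.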

© The KC01/06-10 project "Building Basic Resources and Tools for Vietnamese Language and Speech Processing" (VLSP)
This project belongs to the National Key Science and Technology Tasks for the 5-Year Period of 2006-2010.
The leader of the "Vietnamese Language Processing" branch: Prof. Ho Tu Bao.
© Please cite "VLSP Project" when you use the information from this website.