Vietnamese Syllable #1: Deconstructing Mother Tongue
Overview of Vietnamese Syllables and the Creation of an Open Syllable Database for the Community.
Read the Vietnamese version here.
The ever-living Vietnamese language
“Between the mountain ridges and sea strands, bridging two major cultures of India and China, and survived the occupations of three powers: France, the USA, and Japan. Trailing the technology from USSR and went through script transformations twice. Vietnamese now has its own flexibility to express every concept, idea, movement, and feeling that it has been through.”
The formation of the Vietnamese language has long been a topic of extensive research, comparison, discussion, and systematization by both domestic and international scholars. One cannot overlook the 16th-century An Nam Dịch Ngữ (Annamite Translation), which juxtaposed 716 Sino-Vietnamese words, or Alexandre de Rhodes’ Vietnamese-Portuguese-Latin Dictionary, which formalized the use of the Quốc ngữ script (the modern Vietnamese script). The contributions of André-Georges Haudricourt, who explored the etymology of Vietnamese, should not be disregarded, particularly his engagement with the work of the Orientalist Henri Maspero to demonstrate that Vietnamese belongs to the Mon-Khmer branch of the Austroasiatic language family. Similarly indispensable is the work of Professor Nguyễn Tài Cẩn on the tonal richness of Vietnamese, from its Austroasiatic origins (specifically the Mon-Khmer branch) to its adoption of tonal elements from the Tai-Kadai language family. Equally significant is the work of Professor Trần Trí Dõi, who meticulously systematized the development of Vietnamese within the broader cultural and historical contexts of turbulent eras.
Today, the Vietnamese lexicon has been systematically organized thanks to the Vietnamese Dictionary by the late professor and lexicographer Hoàng Phê. Digital archives have also expanded into the realm of Chữ Nôm and languages within the Mon-Khmer group, such as the Vietnamese-Muong Dictionary edited by Nguyễn Văn Khang, the Katu vocabulary compiled by Nancy Costello, and the K'Ho-French Dictionary by Jacques Dournes, published in Saigon. There are numerous connections between Vietnamese and other languages, as well as between the forms of Vietnamese spoken in different historical periods. However, to truly grasp the essence of the Vietnamese language, we must delve into its fundamental unit: the syllable.
Meditation on Vietnamese syllables
Because Vietnamese is monosyllabic in structure, each syllable is a word, whether or not it carries a meaning, whether its meaning is still unknown to us, or whether its meaning has been lost in time. And when each of these syllables rings out, it conveys a sensation or evokes an emotion in the listener.


Inspired by the Maluma-Takete experiment conducted by Wolfgang Köhler, a founding figure of Gestalt psychology, this project cross-references the existing syllable chart to further understand the imagery that each syllable can evoke.
Words like "to", "lớn", "bự", and "khủng" all indicate a scale larger than what one considers sufficient, but the feeling each word brings is different. "To" appears neutral, imbued with a touch of surprise, and its articulation seems to stretch temporally. "Lớn" exudes a sense of urgency and impact, while "bự" manifests a degree of discomfort, implying heaviness. "Khủng" discloses a degree of the speaker's introspection, feeling more remote than the preceding sounds. To me, this is the beauty of the Vietnamese language: the richness in tones, the generosity in vocabulary, and the simplicity in syllables.


Deconstruction and reconstruction of syllables
With a systematic chart of all syllables, broken down into their constituent phonemes, this project explores the feelings evoked by each phoneme. It aims to encompass even those sounds that Vietnamese speakers can pronounce but that are not found in Vietnamese orthography.
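To make the idea of breaking a syllable into smaller parts concrete, here is a minimal Python sketch that splits a written syllable into an initial consonant, a rhyme, and a tone. It is not the project's own code: the names decompose, TONE_MARKS, and ONSETS are hypothetical, and the onset list is a simplified orthographic one, so it works on spelling rather than on true phonemes and will not handle every edge case (for example, syllables beginning with "gi" or "qu").

```python
import unicodedata

# Combining marks that encode the five marked tones (assumed mapping for
# this sketch; a syllable with no tone mark carries the "ngang" tone).
TONE_MARKS = {
    "\u0300": "huyền",  # grave
    "\u0301": "sắc",    # acute
    "\u0309": "hỏi",    # hook above
    "\u0303": "ngã",    # tilde
    "\u0323": "nặng",   # dot below
}

# Simplified list of written initial consonants, longest first so that
# "ngh" is tried before "ng" and "ng" before "n".
ONSETS = sorted(
    ["ngh", "ng", "nh", "gh", "gi", "kh", "ph", "th", "tr", "ch", "qu",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
     "t", "v", "x"],
    key=len,
    reverse=True,
)


def decompose(syllable):
    """Return (onset, rhyme, tone) for one written Vietnamese syllable."""
    # NFD separates tone marks from their base letters.
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "ngang"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC re-attaches vowel-quality marks (circumflex, breve, horn).
    base = unicodedata.normalize("NFC", "".join(kept))
    for onset in ONSETS:
        # Accept an onset only if something is left over for the rhyme.
        if base.startswith(onset) and len(base) > len(onset):
            return onset, base[len(onset):], tone
    return "", base, tone


for word in ["to", "lớn", "bự", "khủng"]:
    print(word, decompose(word))
```

Applied to the four words discussed earlier, the sketch separates "lớn" into the onset "l", the rhyme "ơn", and the tone sắc, and "khủng" into "kh", "ung", and hỏi. A fuller treatment would map these written parts onto actual phonemes, which this sketch does not attempt.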
A Vietnamese dataset is a must-have to kickstart this research, but one is hard to find.
In this research, we use the open dataset from the Vietnamese NLP project underthesea. We cross-checked it against soha.vn’s dictionary and “Từ Điển Tiếng Việt” (Prof. Hoàng Phê). The full dataset consists of about 30,000 vocabulary entries.


From this corpus, we filtered out phonetically transcribed words, then split all the vocabulary entries into syllables. In the end, we arrived at a dataset of 6,000 syllables.
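The filtering and splitting step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual pipeline: the function name extract_syllables and the sample entries are hypothetical, and the sketch assumes that phonetically transcribed words can be recognized by a hyphen (as in "a-xít"), which may not match the criterion the project used.

```python
import unicodedata


def extract_syllables(vocabulary):
    """Collect the distinct syllables found in a list of vocabulary entries."""
    syllables = set()
    for entry in vocabulary:
        # Normalize to NFC so the same syllable typed two ways is counted once.
        word = unicodedata.normalize("NFC", entry.strip().lower())
        # Assumption: hyphenated entries are phonetic transcriptions; skip them.
        if not word or "-" in word:
            continue
        # Written Vietnamese separates syllables with spaces.
        syllables.update(word.split())
    return sorted(syllables)


# Hypothetical sample entries, not taken from the real dataset.
sample = ["tiếng Việt", "âm tiết", "tiếng nói", "a-xít", "Việt Nam"]
result = extract_syllables(sample)
print(len(result), result)
```

Using a set means that a syllable shared by several entries, such as "tiếng" here, is stored only once, which is why a vocabulary of tens of thousands of entries can reduce to a far smaller list of syllables.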
All the data is open for public use at: github.com/luotcode/amtiettiengviet


Vietnamese Syllable is the initial research trilogy that fueled the “Our Vietnamese Project,” hosted by CodeSurfing.
Other parts of Vietnamese Syllable:
For full content about “Our Vietnamese Project” by CodeSurfing, please visit here.