Dr. Volkan Tunalı's Personal Blog

Computer, Technology, Science, Art

Archive for the ‘text mining’ tag

Turkish Deasciification

Deasciification is the process of converting text written with only ASCII letters into its correct form by restoring the corresponding letters of the Turkish alphabet (or of any other language whose alphabet contains non-ASCII letters). For example, a Turkish reader can still understand the text “Cok yogun bir calisma ve emegin urunu”, because human intelligence is able to resolve the ambiguities (if any). The text, however, should be written as “Çok yoğun bir çalışma ve emeğin ürünü” in Turkish. Producing this corrected form is what a deasciifier is supposed to do.
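To make the idea concrete, here is a minimal sketch in Python. It is not how any of the tools reviewed in this post work; it only assumes that each ambiguous ASCII letter (c, g, i, o, s, u) may stand for itself or for its Turkish counterpart, and it picks the candidate found in a tiny, hypothetical word list (`LEXICON`). The input is lowercased to keep the example short.

```python
from itertools import product

# Each ambiguous ASCII letter may stand for itself or its Turkish counterpart.
AMBIGUOUS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Toy word list for illustration only (hypothetical, not a real lexicon).
LEXICON = {"çok", "yoğun", "bir", "çalışma", "ve", "emeğin", "ürünü"}


def candidates(word):
    """Return every spelling obtained by optionally restoring Turkish letters."""
    options = [AMBIGUOUS.get(ch, ch) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def deasciify_word(word, lexicon=LEXICON):
    """Pick the candidate found in the lexicon; keep the word if unsure."""
    matches = [c for c in candidates(word) if c in lexicon]
    # Zero or several matches means we cannot decide, so leave the word alone.
    return matches[0] if len(matches) == 1 else word


def deasciify(text, lexicon=LEXICON):
    """Deasciify a whitespace-separated text (lowercased for simplicity)."""
    return " ".join(deasciify_word(w, lexicon) for w in text.lower().split())


print(deasciify("Cok yogun bir calisma ve emegin urunu"))
```

The number of candidates doubles with every ambiguous letter, and many words have more than one valid reading, so a practical deasciifier needs context or statistics rather than a plain dictionary lookup like this one.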

Well, why do we need deasciification? We may not have Turkish letters on the keyboard (or the OS we are using may lack a Turkish keyboard layout), yet we still need to end up with text in correct Turkish form. It is also possible that we are accustomed to typing with only ASCII letters for some reason.

In addition, we may need to analyze a large collection of Turkish documents, and this collection can be contaminated with text written in ASCII, which will degrade the quality of our analysis. In that case, the only option is to deasciify the text. This is the most important reason for me, as I often perform text mining on Turkish document collections and I always need deasciification.
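As a rough illustration of how such contaminated documents might be spotted before deasciifying them, the sketch below flags a text that is long enough but contains no Turkish-specific letters at all. This is a crude heuristic of my own for illustration; the function names and the `min_letters` and `threshold` parameters are assumptions, not part of any of the tools mentioned in this post.

```python
# Letters that exist in the Turkish alphabet but not in ASCII.
TURKISH_ONLY = set("çğıöşüÇĞİÖŞÜ")


def turkish_letter_ratio(text):
    """Share of alphabetic characters that are Turkish-specific letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in TURKISH_ONLY for ch in letters) / len(letters)


def looks_asciified(text, min_letters=20, threshold=0.0):
    """Flag texts that are long enough yet contain (almost) no Turkish letters.

    min_letters and threshold are hypothetical defaults; tune them on real data.
    """
    n_letters = sum(ch.isalpha() for ch in text)
    return n_letters >= min_letters and turkish_letter_ratio(text) <= threshold


docs = [
    "Cok yogun bir calisma ve emegin urunu",
    "Çok yoğun bir çalışma ve emeğin ürünü",
]
for doc in docs:
    print(looks_asciified(doc), doc)
```

Very short texts are never flagged, because a short Turkish sentence can legitimately contain no special letters.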

In this post, I’ll briefly review a few deasciification tools developed in several programming languages.

The first deasciifier is the one that is part of the Zemberek project. Written completely in Java, Zemberek is an open-source, general-purpose Natural Language Processing library and toolset designed for Turkic languages, especially Turkish. A web-based demo of Zemberek is available at http://zemberek-web.appspot.com/. I usually use the Zemberek deasciifier in my text mining research when I work with Turkish text datasets.

The next deasciifier was developed by Gökhan Tür at Sabancı University. More information and a demo are available at http://www.hlst.sabanciuniv.edu/TL/deascii.html. This system is currently not open-source and is not available for download.

Another deasciifier comes from Deniz Yüret at Koç University. It was actually developed for Emacs, for real-time correction of words typed in ASCII form. More information and the download are available at http://denizyuret.blogspot.com/2006/11/emacs-turkish-mode.html.

Yüret’s deasciifier was ported to JavaScript by Mustafa Emre Acer, and it is available at http://turkce-karakter.appspot.com/.

The last deasciifier, recently published by Emre Sevinç, is a port of Yüret’s work to Python. More information and the download are available at http://github.com/emres/turkish-deasciifier.

None of these deasciifiers is perfect, but they all perform pretty well in most situations. I’m sure we’ll see much-improved deasciifiers as NLP research for Turkish advances.

Written by Volkan TUNALI

July 23rd, 2010 at 7:52 pm

DataMiningResearch.com Web Site

My PhD research is on Data Mining, and particularly on Text Mining. During my research and project development, I encounter many interesting and useful things such as articles, papers, tools, and source code. In order to share all of those, and also to share my thoughts related to my research area, I’ve founded a web site, DataMiningResearch.com. I hope it will be helpful to those who are interested in data mining, machine learning, and knowledge discovery.

Written by Volkan TUNALI

June 30th, 2010 at 6:56 pm