Dr. Volkan Tunalı's Personal Blog

Computer, Technology, Science, Art

Archive for the ‘Data Mining’ Category

Hottest Majors for Future Career According to Microsoft

A blog article published recently at Microsoft JobsBlog discusses the hottest fields of study for a career in technology. All three of the majors it names are related to data mining. The list is as follows:

  • Data Mining/Machine Learning/AI/Natural Language Processing
  • Business Intelligence/Competitive Intelligence
  • Analytics/Statistics – specifically Web Analytics, A/B Testing and statistical analysis

The article concludes:

These fields are very HOT and looking long term, the demand will be just that much greater in these areas.

You can read the full article at http://jobsblog.com/blog/top-three-new-tech-majors/.

Written by Volkan TUNALI

August 26th, 2010 at 8:48 pm

Blekko – A Brand New Search Engine

I guess you very likely use Google every time you need to search for something, and occasionally other engines such as Yahoo or Bing. That is what I do: Google has been my first choice for a long time, and I very rarely use Yahoo or Bing. Now there is a newcomer to the field: blekko.

Blekko is in private beta and is not yet open to everyone. You need to get a join link from blekko through Twitter or Facebook.

I've just joined blekko and have not yet tried it thoroughly. Its most distinctive feature is search with slashtags. Slashtags are keywords that appear after a slash and narrow the search results down according to the tag. Blekko describes slashtags as follows:

Slashtag search lets users slash in what they want and slash out what they don’t. Knock out the spam sites and search only the sites you want to search.

The use of slashtags is very similar to passing command-line parameters to programs in DOS.
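To make the command-line analogy concrete, here is a minimal Python sketch that splits a query into plain search terms and slashtags, much like a program separating its arguments from its switches. The function name parse_query and the sample query are my own assumptions for illustration. This is not blekko's actual parser, and it only handles the simple "/tag" form described above.

```python
def parse_query(query):
    """Split a query into plain terms and slashtags (hypothetical helper)."""
    terms, slashtags = [], []
    for token in query.split():
        # A token such as "/science" is treated as a slashtag;
        # a lone "/" is kept as an ordinary term.
        if token.startswith("/") and len(token) > 1:
            slashtags.append(token[1:])
        else:
            terms.append(token)
    return terms, slashtags


terms, tags = parse_query("global warming /science")
print(terms, tags)
```

Everything before the slash stays a normal search term, while the tags could then be used to restrict which sites are searched.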

I think I need more time and more trials to get the idea. Who knows, blekko might become my first choice search engine soon.

Written by Volkan TUNALI

August 9th, 2010 at 10:08 am

Turkish Deasciification

Deasciification is the process of converting text written with only ASCII letters to its correct form by restoring the corresponding letters of the Turkish alphabet (or of any other language that contains non-ASCII letters). For example, the text “Cok yogun bir calisma ve emegin urunu” still conveys its meaning; that is, human intelligence is able to resolve the ambiguities (if any) and understand text like this. The text, however, should be written as “Çok yoğun bir çalışma ve emeğin ürünü” (in Turkish). This is what a deasciifier is supposed to do.
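To illustrate the idea, here is a minimal Python sketch of a naive dictionary-based deasciifier. It generates every Turkish spelling an ASCII word could stand for and accepts one only if exactly one candidate is found in a word list. The tiny LEXICON and all function names are assumptions made up for this example; this is not necessarily how any real deasciifier works, and a usable tool needs far more than a lookup table.

```python
from itertools import product

# Each ambiguous ASCII letter and the Turkish letters it may stand for.
CANDIDATES = {
    "c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü",
    "C": "CÇ", "G": "GĞ", "I": "Iİ", "O": "OÖ", "S": "SŞ", "U": "UÜ",
}

# Toy word list (assumption): just the words of the example sentence.
LEXICON = {"çok", "yoğun", "bir", "çalışma", "ve", "emeğin", "ürünü"}


def turkish_lower(word):
    """Lowercase with Turkish rules: I -> ı and İ -> i."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def variants(word):
    """Yield every spelling obtained by restoring Turkish letters."""
    options = [CANDIDATES.get(ch, ch) for ch in word]
    return ("".join(chars) for chars in product(*options))


def deasciify_word(word, lexicon=LEXICON):
    """Return the single variant found in the lexicon, else the word as-is."""
    matches = [v for v in variants(word) if turkish_lower(v) in lexicon]
    return matches[0] if len(matches) == 1 else word


def deasciify(text, lexicon=LEXICON):
    return " ".join(deasciify_word(w, lexicon) for w in text.split())


print(deasciify("Cok yogun bir calisma ve emegin urunu"))
```

The weakness of this approach shows up as soon as two candidates are both valid words: “aci” could be “acı” or “açı”, and a lookup alone cannot choose between them, so the sketch leaves such words untouched. Resolving these cases requires looking at the surrounding context.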

Well, why do we need deasciification? We may not have Turkish letters on the keyboard (or the OS we are using may lack a Turkish keyboard layout) and still need to end up with text in correct Turkish form. It is also possible that we are accustomed to typing with only ASCII letters for some reason.

In addition, we may need to analyze a large collection of Turkish documents, and this collection can be contaminated with text written in ASCII, which will degrade the performance of our analysis. In that case, deasciification is the only practical remedy. This is the most important reason for me, as I often perform text mining on Turkish document collections and I always need deasciification.

In this post, I'll briefly review a few deasciification tools developed in several languages.

The first deasciifier is the one that is part of the Zemberek project. Written completely in Java, Zemberek is an open-source, general-purpose Natural Language Processing library and toolset designed for Turkic languages, especially Turkish. A web-based demo of Zemberek is available at http://zemberek-web.appspot.com/. I usually use the Zemberek deasciifier in my text mining research when I work with Turkish text datasets.

The next deasciifier was developed by Gökhan Tür at Sabancı University. More information and a demo are available at http://www.hlst.sabanciuniv.edu/TL/deascii.html. This system is currently not open-source and not available for download.

Another deasciifier comes from Deniz Yüret at Koç University. It was actually developed for Emacs, for real-time correction of words written in ASCII form. More information and a download are available at http://denizyuret.blogspot.com/2006/11/emacs-turkish-mode.html.

Yüret’s deasciifier has been ported to JavaScript by Mustafa Emre Acer and is available at http://turkce-karakter.appspot.com/.

The last deasciifier, recently published by Emre Sevinç, is a port of Yüret’s work to Python. More information and a download are available at http://github.com/emres/turkish-deasciifier.

None of these deasciifiers is perfect, but they all perform pretty well in most situations. I’m sure we’ll see much improved deasciifiers as NLP research for Turkish advances.

Written by Volkan TUNALI

July 23rd, 2010 at 7:52 pm