A new feature extraction approach for document language identification: Binary Patterns

Yılmaz KAYA, Ömer Faruk Ertuğrul

Abstract


Language identification (LI), an important subtopic of natural language processing, is the task of determining the language a document is written in from its content. This study proposes a new language identification approach, one-dimensional local binary patterns (1D-LBP), which uses binary patterns obtained by comparing the UTF-8 values of characters with one another. The proposed method was tested on four datasets containing texts from different numbers of languages. The features extracted from the documents with 1D-LBP were classified using different machine learning methods. The classification accuracies for the four datasets were 86.20%, 92.75%, 100%, and 89.77%, respectively. These results indicate that the proposed feature extraction method yields informative patterns for language identification.
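The abstract does not give the details of the pattern computation, so the following is only a minimal sketch of how a one-dimensional local binary pattern feature vector could be built from character values. The function names, the window of four neighbors on each side, the "neighbor >= center" comparison, the bit ordering, and the use of Unicode code points via ord() in place of raw UTF-8 bytes are all assumptions made for illustration, not the authors' stated procedure.

```python
def one_d_lbp_codes(text, half_window=4):
    """Return one binary-pattern code per character that has a full neighborhood.

    Assumption: each character is compared with `half_window` neighbors on
    each side, and a bit is set when the neighbor's value is >= the center's.
    """
    # Assumption: Unicode code points stand in for the characters' "UTF-8 values".
    values = [ord(ch) for ch in text]
    codes = []
    for i in range(half_window, len(values) - half_window):
        center = values[i]
        # Neighbors to the left and right of the center, excluding the center.
        neighbors = values[i - half_window:i] + values[i + 1:i + 1 + half_window]
        code = 0
        for bit, value in enumerate(neighbors):
            if value >= center:
                code |= 1 << bit
        codes.append(code)
    return codes


def one_d_lbp_histogram(text, half_window=4):
    """Return the normalized histogram of pattern codes as a feature vector."""
    n_bins = 1 << (2 * half_window)  # 256 bins for the default window
    hist = [0] * n_bins
    codes = one_d_lbp_codes(text, half_window)
    for code in codes:
        hist[code] += 1
    total = len(codes)
    # Texts too short for a full neighborhood yield an all-zero vector.
    return [count / total for count in hist] if total else hist
```

Under these assumptions every document maps to a fixed-length vector (256 values for the default window) regardless of its length, which is the kind of input a standard classifier expects.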


Keywords


text-based language identification, local binary patterns, natural language processing







Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.