A FILTER MODEL FOR TEXT CATEGORIZATION AGAINST ONLINE HATE SPEECHES

Georgina Nkolika Obunadike, Emeka Ogbuju, Mukhtar Abubakar

Abstract


Text classification is the task of assigning a document's text to predefined categories. It has been applied in areas such as the classification of scientific articles, spam filtering, and document genre classification. Text classification is a popular task in data mining because of its accuracy and ease of application. The Internet is a common medium for transmitting messages: billions of messages move across it daily through platforms such as e-mail, Facebook, and Twitter. Some of these messages are sent with wrong motives, so it has become imperative to design a model that uses data mining algorithms to filter unwanted messages out of circulation. In light of this, this paper applies three data mining techniques, namely Support Vector Machine (SVM), Naïve Bayes, and K-Nearest Neighbour (KNN), to develop models that can filter messages from Facebook and e-mail and so counter the circulation of online hate speech on these platforms. It also compares the performance of these models on collected data to identify the best-performing text classifier. The Naïve Bayes algorithm performed better than the other two, with an accuracy of 61.5% and an ROC of 0.66.
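As a rough illustration of the kind of filter the abstract describes, the following minimal sketch trains a multinomial Naïve Bayes classifier with add-one smoothing and scores it by accuracy and ROC area. It is not the authors' implementation: the abstract does not state their tools, features, or data, so the whitespace tokenizer, the function names, the labels "hate" and "neutral", and the four toy messages are all assumptions made for this example. An SVM or KNN model could be evaluated with the same two metric functions.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()

def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model from labeled messages."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "n_docs": len(docs)}

def log_scores(model, text):
    """Return the unnormalized log posterior of each class for one message."""
    v = len(model["vocab"])
    scores = {}
    for c, n_c in model["class_counts"].items():
        total = sum(model["word_counts"][c].values())
        score = math.log(n_c / model["n_docs"])  # class prior
        for w in tokenize(text):
            if w in model["vocab"]:  # ignore words never seen in training
                # add-one (Laplace) smoothing avoids zero probabilities
                score += math.log((model["word_counts"][c][w] + 1) / (total + v))
        scores[c] = score
    return scores

def predict(model, text):
    """Pick the class with the highest log posterior."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)

def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores, positive):
    """ROC area as the chance a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, made-up messages for illustration only; not the paper's data.
train_docs = ["we despise those people", "those people should be attacked",
              "lovely weather today", "see you at the meeting today"]
train_labels = ["hate", "hate", "neutral", "neutral"]
test_docs = ["attack those people", "meeting moved to today"]
test_labels = ["hate", "neutral"]

model = train_naive_bayes(train_docs, train_labels)
predictions = [predict(model, d) for d in test_docs]
# Ranking score: how much more likely "hate" is than "neutral" (log scale).
margins = [log_scores(model, d)["hate"] - log_scores(model, d)["neutral"]
           for d in test_docs]
print(accuracy(test_labels, predictions), roc_auc(test_labels, margins, "hate"))
```

On real data, the same two functions would give figures comparable in kind to the accuracy and ROC values reported above, though a held-out test set far larger than this toy one would be needed for the numbers to mean anything.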








FEDERAL UNIVERSITY DUTSIN-MA, KATSINA STATE - Copyright 2020