Computational linguistics resources for the Sindhi language

Language is a fundamental means of communication and of sharing views, thoughts, love, knowledge and experiences. Computational linguistics enables machines to understand and analyse human language, generally called natural language. It is derived from linguistics and applies techniques from computer science to the analysis and synthesis of language and speech.

According to Wikipedia, “Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modelling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.”

Computational linguistics is thus the application of computer science to the analysis, synthesis and comprehension of written and spoken language. It underpins machine translation, speech recognition, text-to-speech synthesizers, interactive voice response, search engines, text editors and language instruction materials. Generally, it focuses on the nature of a language, its morphology, syntax and dynamic use, and on constructing useful models that help machines handle language for different purposes.

The Sindhi language is one of the oldest languages of the world, with ample room to adopt words from several other languages, a complex grammar and a rich morphology. The grammar of Sindhi differs from that of English and other languages. Even the meaning and sense of Sindhi words differ from one context to another. The diacritics used in Sindhi text change the meaning, number and gender of Sindhi words. The language is written, read and spoken all over the world. It is the official language of the province of Sindh in Pakistan. In India, Sindhi is one of the scheduled languages officially recognized by the central government, though it is not an official language of any Indian state.
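The point about diacritics has a direct consequence for text processing: two strings that differ only in their diacritic marks are different strings to a computer, and removing the marks discards the distinction. The following minimal Python sketch illustrates this with a single Arabic-script letter and two standard Unicode vowel marks. The helper name strip_diacritics is hypothetical, and the letter and marks are chosen only for illustration; they are not taken from any Sindhi NLP tool or word list.

```python
import unicodedata

def strip_diacritics(text):
    """Remove nonspacing combining marks (Unicode category Mn) from text.

    Hypothetical helper for illustration only; a real Sindhi pipeline
    may need to keep these marks because they carry meaning.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Illustrative code points (assumed for the demo, not real Sindhi words):
BEH = "\u0628"    # ARABIC LETTER BEH
FATHA = "\u064E"  # ARABIC FATHA (vowel mark)
KASRA = "\u0650"  # ARABIC KASRA (vowel mark)

with_fatha = BEH + FATHA
with_kasra = BEH + KASRA

# The two marked forms are distinct strings...
print(with_fatha == with_kasra)
# ...but they collapse to the same base letter once the marks are removed.
print(strip_diacritics(with_fatha) == strip_diacritics(with_kasra))
```

Whether to keep or strip the marks is a design choice: stripping helps when matching loosely typed text, while keeping them preserves the distinctions of meaning, number and gender described above.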

Sindhi has the status, capacity and capability of being a modern language of the world

Sindhi computational linguistics deals with the computational linguistics and natural language processing (NLP) problems of the Sindhi language. Most research in computational linguistics concentrates on English text for tasks such as text segmentation, word tagging, syntactic parsing, sentiment analysis, and lemma and stem identification. A variety of reliable resources, including texts and tools, is available for English computational linguistics and sentiment analysis. Online resources for languages other than English remain limited even in this digital era. The style, grammatical structure and domain of Sindhi differ from those of other languages. Therefore, working on Sindhi text for text segmentation, part-of-speech (POS) tagging, syntactic parsing, morphological analysis, WordNet, text corpus development and analysis, sentiment analysis, semantic analysis and structural data is not the same as working on English or other languages.

The capability of understanding, analysing and producing natural language has deep roots in computer science, especially in artificial intelligence. The significance of Sindhi computational linguistics and NLP has increased with the availability of a large amount of textual data on Web 2.0. Sindhi-language websites and social media continuously produce text and other content, while printed literature and other textual material are being digitised. This has increased the demand for new applications that address the open challenges of understanding and analysing such data. Linguistics is a multidisciplinary area that covers research dimensions such as computational linguistics, sociolinguistics, psycholinguistics, pragmatics, machine learning, deep learning and many more. Sindhi computational linguistics and Sindhi NLP contribute to developing linguistic tools and filling the research gap by working on Sindhi. A set of Sindhi NLP and computational linguistics tools and resources has been created to help research scholars and linguists work more on Sindhi. New techniques are to be developed and existing ones enhanced to push the boundaries for novel tasks and services with a specific focus on the Sindhi language. The work comprises a Sindhi treebank, Sindhi dialectology, machine learning models, deep learning models, text corpora, a linguistics dataset, a sentiment analysis dataset, WordNet, a sentiment analysis system, aspect-based sentiment analysis, a Sindhi lemma and stem identification system, morphological analysis, syntactic parsing, a universal part-of-speech text tagging system and a Sindhi part-of-speech text tagging system.
All the resources are available online on the Sindhi NLP website. Providing these important resources is significant: it would elevate the status of Sindhi from a resource-limited to a resourceful language, because these developments open opportunities for the research community to create software for text-to-speech, speech-to-text and image-to-text conversion, machine translation, etc. It would also be a significant contribution to the field of natural language processing for future research and development. NLP is an interdisciplinary field that uses artificial intelligence techniques to help computers read, decipher and understand human languages in a manner that is valuable and beneficial to the community and linguists. Sindhi computational linguistics and NLP would open up new paths and possibilities for research scholars and linguists worldwide, and the work would help expand research on the analysis of Sindhi text and its dialects.

Sindhi NLP is a research project and resource-based website comprising computational linguistics and NLP resources. It is a significant collection, because no project-based website of this kind had previously been developed to address the linguistic problems of the Sindhi language. The project provides the following resources:

Online text Parser using UPOS tagset;

Online text Parser using SPOS tagset;

Online Statistical Analysis of morphology and part of speech tagsets;

Online Lemma and stemming evaluation and analysis;

Online Sentiment Analysis;

Online Aspect Based Sentiment Analysis;

Online Text Corpus Development;

Online Sindhi WordNet;

Sindhi text corpora and research articles;

Machine learning models and

Deep learning models.
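As an illustration of what a statistical analysis of part-of-speech tagsets can involve, the sketch below counts UPOS tags in a small annotated sample. It assumes the data is stored in the CoNLL-U format used by Universal Dependencies treebanks (ten tab-separated columns, with the UPOS tag in the fourth); the article does not state which format the project uses. The function name upos_counts and the placeholder tokens w1, w2 and w3 are hypothetical and stand in for real Sindhi words.

```python
from collections import Counter

def upos_counts(conllu_text):
    """Count UPOS tags in CoNLL-U formatted text (format is an assumption)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, upos = cols[0], cols[3]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in token_id or "." in token_id:
            continue
        counts[upos] += 1
    return counts

# Placeholder sentence: tokens are dummies, tags are standard UPOS labels.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "\t".join(["1", "w1", "w1", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "w2", "w2", "NOUN", "_", "_", "3", "obj", "_", "_"]),
    "\t".join(["3", "w3", "w3", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
])

for tag, n in upos_counts(SAMPLE).most_common():
    print(tag, n)
```

The same counting approach would apply to a language-specific tagset by reading the fifth column (XPOS) instead of the fourth.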

A good number of newspapers, magazines, social websites, blogs, and general and specialised websites are available online in the Sindhi language. They provide big data in the form of Sindhi text. Text corpora and linguistic datasets are available in different categories, including a text corpus of the poetry of Shah Abdul Latif Bhitai, and are beneficial for research scholars, language engineers, linguists, social scientists and artificial intelligence engineers. Machine learning and deep learning models and libraries have been developed to evaluate and analyse Sindhi text of all types for text analysis, opinion mining, aspect-based sentiment analysis, semantic analysis, text parsing, machine translation, etc.

Research is in progress on several languages of the world in the field of computational linguistics and natural language processing (NLP). In view of the computational linguistics problems of the Sindhi language, scientific methodologies and models were developed so that proper, scientific work could be done on Sindhi. Though work of this type has been carried out since 2010 by different computer scientists and language engineers, the practical, state-of-the-art work started in 2016. The work may be helpful for search engines, information retrieval systems, machine translation, information extraction, automatic question answering, building knowledge-based models as ground truth for factual information, sentiment analysis, spam classification, transcription, speech processing and building other linguistic resources for Sindhi, such as parsers, corpora and grammar correction tools. Several research articles on computational linguistics problems and their solutions have been published in reputed national and international research journals. Moreover, additional work may be done on upgrading the models, methods, algorithms and tools for universal dependencies, semantic analysis, syntactic analysis and sentiment analysis.

Owing to current developments and scientific work on Sindhi NLP and computational linguistics, the Sindhi language stands with the developed languages of the world. The majority of the linguistic and NLP resources are available online and offline. On 24 June 2019, Wikitongues included the Sindhi language in its database. Before this, only four hundred of the seven thousand languages of the world were available in the database. The video of the Sindhi language is available online on the Wikitongues account. Furthermore, several research scholars are pursuing MPhil and PhD degrees on computational linguistics and NLP problems of the Sindhi language. Sindhi may no longer be counted among the less-resourced languages of the world; it stands together with developed languages such as English, Arabic and Urdu. Sindhi is computationally resourced with natural language processing (NLP) and computational tools and techniques. It is a fully Unicode-based language and has the status, capacity and capability of being a modern language of the world.

The Sindh government is working on the development of the Sindhi language from several aspects, especially in the field of language engineering, but unfortunately it has not been able to assemble a proper team of computational linguists and language experts who can design and develop proper, scientific plans and methods for the novel development of the Sindhi language. The Sindhi Language Authority is bound to upgrade the Sindhi language according to current requirements and trends, but unfortunately it is unaware of the latest developments in NLP, language engineering and computational linguistics for Sindhi. The above-mentioned work is historic for the Sindhi language, but the Sindhi Language Authority has not included it even in the encyclopaedia of the Sindhi language, which is published in different versions and volumes.

