Harnessing New Technologies for Learning and Research in the Languages and Cultures of the Middle East
Panel IV-10, 2020 Annual Meeting
Tuesday, October 6, 1:30 pm
Panel Description
Recent advances in technology are re-shaping education and research and pushing the boundaries of what is possible. This panel reports on four research projects pursued by different teams using new technologies to advance language and culture learning and research with a focus on Arabic and Persian.
The first presentation focuses on Optical Character Recognition (OCR), a technology that identifies and extracts textual information from images, allowing machines to read scanned documents. OCR enables the digitization and preservation of vast and precious holdings of historical documents stored on non-digital media. Non-Latin scripts, such as those based on Arabic, present a technological challenge: because they are cursive, recognition accuracy remains low. A team of researchers is creating new, more accurate image-to-text conversion software capable of building a large-scale, open-source, global language and culture data bank for Pashto, to be extended to other languages written in the Arabic script.
The second paper reports on a study of Arabic automatic text summarization, the process of creating a concise and coherent summary of a longer text while preserving its meaning and important information. L2 learners struggle to read authentic texts, especially in the early stages of learning and particularly in languages such as Arabic. This study contributes to research on automatic text summarization for L2 learning and its applicability to microlearning in Arabic.
The third paper reports on the development of an adaptive language learning system designed to provide a cost-effective and personalized language learning experience. The system leverages artificial intelligence algorithms that recognize patterns in student performance, diagnose deficiencies in learning, and recommend personalized content to meet individual learning needs. The project first investigates the feasibility of applying the technology to Arabic. It then investigates the scalability of the technology through the use of pre-existing training materials and explores its efficiency compared to conventional instruction.
The fourth and final paper presents the design and delivery of a blended course that helps students learn an Arabic dialect and prepare for studying and living abroad. It demonstrates how the innovative use of a well-established online learning platform, combined with an interactive content development platform and authentic materials, can facilitate students’ development of essential linguistic and regional expertise. The paper shares insights on applicability to other varieties of Arabic and to other languages, and on the role the instructor plays in such a technology-mediated course.
One of the challenges faced by learners of languages with multiple dialects and accents, such as Arabic, is developing the region-specific communicative skills that are essential for gaining advanced proficiency in the language and preparing for a study abroad experience or fieldwork. Yet opportunities for gaining these essential skills at sufficiently advanced levels are often difficult to come by in educational settings, especially with limited time to spare. This session first presents the design of a blended, technology-enhanced Moroccan Arabic course intended to prepare undergraduate students of Arabic for advanced learning and working in Morocco during the semester prior to departure. Presenters will demonstrate how the combination of a well-established learning management system, a web-based interactive content creation application, and multimodal authentic and non-authentic materials is leveraged to create an immersive, interactive, and supportive learning environment that addresses specific learner needs and increases intercultural competence. Presenters will then share their reflections on effective pedagogical strategies gained from designing and teaching this course, along with practical ideas for enhancing motivation and self-paced learning as well as for effective learner support and facilitation in a blended learning context.
Text summarization is the process of creating a concise and coherent summary of a longer text while preserving the meaning and the important information in the text (Allahyari et al., 2017). Automatic summaries reduce reading time, improve the effectiveness of indexing, and help in question-answering systems (Torres-Moreno, 2014). Automatic text summarization has been adopted in several studies (e.g., Douzidia & Lapalme, 2004; Froud et al., 2013; Ba-Alwi, 2015; Azmi & Al-Thanyyan, 2012; Haboush et al., 2012; Azmi & Al-Thanyyan, 2009; Belkebir & Guessoum, 2015, among others) that have used different algorithms (e.g., Rhetorical Structure Theory (RST), clustering techniques, AdaBoost, and Latent Semantic Analysis (LSA), among others) for various purposes (e.g., machine translation and information retrieval, among others).
This paper contributes to this line of research on automatic text summarization for L2 microlearning, where summaries serve as small learning pieces that L2 learners read instead of larger documents. This paper’s automatic summarization is based on Probabilistic Topic Modeling, specifically the Latent Dirichlet Allocation (LDA) algorithm, combined with a sentence extraction approach. Topic modeling is used to discover the underlying topics in one or more text documents. The basic assumption is that each document can be described as a mixture of latent topics, where each topic is a multinomial distribution over words (Chang et al., 2009). Each document is thus associated with a set of topics and their probability distributions, and each topic with a set of words and their probabilities of occurrence given that document and topic; in other words, topic models build bags of words for each topic, from which information can be extracted. The extractive method then selects the sentences most relevant to those topics from the longer text (Das & Martins, 2007).
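The pipeline described above (infer topic distributions with LDA, then extract the sentences that best express the document's dominant topic) can be sketched in a few lines. This is a minimal illustration assuming scikit-learn is available; the function name, scoring rule, and parameter values are illustrative and not the authors' actual implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def extractive_summary(sentences, n_topics=2, n_sentences=2):
    """Score each sentence by its weight on the document's dominant LDA
    topic, then extract the top-scoring sentences in original order."""
    # Bag-of-words representation, one row per sentence
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    sent_topics = lda.fit_transform(counts)      # sentence-by-topic probabilities
    dominant = sent_topics.mean(axis=0).argmax() # dominant topic of the document
    scores = sent_topics[:, dominant]            # each sentence's weight on it
    top = sorted(np.argsort(scores)[-n_sentences:])  # keep original order
    return [sentences[i] for i in top]
```

An abstractive system, as noted below, would instead generate new sentences; this extractive sketch only selects existing ones.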
Because these summaries are extracted for L2 microlearning tasks, the results of the system were evaluated using a text quality evaluation method. This method provides four criteria for analyzing several linguistic aspects of the text: grammaticality, non-redundancy, reference, and coherence (Steinberger & Jezek, 2012). The presentation will report on these results. In future research, further improvements will be introduced to the summarization algorithm, and an abstractive method will be adopted to generate machine-written summaries instead of extracting sentences.
Cursive scripts, e.g., those based on the Arabic abjad, have traditionally presented a technological challenge for image-to-text conversion (otherwise known as OCR). This has contributed to low accuracy for several decades. For instance, typical documents in the ACKU Pashto and Persian collection (http://afghandata.org) may yield accuracy below 70% when processed with leading OCR software, while the threshold of practicality requires that at least 90% of characters be recognized correctly. Accuracy strongly depends on the quality of the documents, which is often related to their age. Some classes of documents, such as handwritten documents, remain out of reach.
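The 70% and 90% figures above refer to character-level accuracy, which is commonly computed from the edit distance between the OCR output and a ground-truth transcription. A minimal sketch of that metric (the function names are illustrative, not part of any OCR package):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_accuracy(reference, ocr_output):
    """Fraction of characters recognized correctly (1.0 = perfect)."""
    if not reference:
        return 1.0
    return max(0.0, 1 - edit_distance(reference, ocr_output) / len(reference))
```

For example, an OCR output that misreads one character out of four scores 0.75, below the 90% practicality threshold mentioned above.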
The most important product of OCR is full text, which can be searched, edited, annotated, and otherwise processed with software. This greatly enhances the research environment and facilitates the learning of languages and the exploration of cultures.
The current presentation surveys new developments in machine learning (ML) and artificial intelligence (AI) that have led to significant progress toward accurate OCR of non-Latin scripts. In particular, it discusses the limitations of today's technologies and possible ways to address them in the future.