Introduction
With the advancement of society and technology, mobile robots have gained the ability to sense, decide, and move, much like humans, by utilizing sensors integrated into their design. The applications of mobile robots have demonstrated their extensive value across various domains, ranging from industrial automation to household services, as new frontiers in robotic technology continue to emerge (Biswal & Mohanty, 2021). In this omnipresent wave of robotics, robotic vacuum cleaners, as members of the mobile robot family, have garnered widespread popularity. These intelligent machines are equipped with autonomous navigation and cleaning capabilities, and users can direct them verbally to a designated place for cleaning, improving people's quality of life and making daily cleaning tasks more convenient. However, as these devices are widely adopted in real-life scenarios, autonomous navigation demands grounding language in visual observations, which remains a formidable challenge for robotic vacuum cleaners (Li et al., 2023).
Vision-and-language navigation (VLN) is a fundamental task for achieving general-purpose robots, with practical applications in industries such as household robotics and autonomous driving (Anderson et al., 2018). The challenge of this task lies in the unstructured nature of navigation instructions and the complexity of the navigation environment (Wang et al., 2021). Navigation instructions are derived from natural language descriptions, which take diverse forms and involve intricate expressions. Additionally, real-world navigation environments vary considerably in their details. Without highly precise maps, an intelligent agent can only access partial environmental information at each moment. Consequently, deducing its current location, making informed decisions, and navigating correctly require the agent to possess robust environmental modeling and reasoning capabilities (Wen et al., 2023). Therefore, researching vision-and-language navigation methods is essential for robotic vacuum cleaners (Xiao & Fu, 2023).
Various scientific and technological advancements, coupled with the emergence of deep learning, have made artificial intelligence algorithms one of the most prominent and extensively studied areas in computer science (Ning et al., 2024; Varma & James, 2021; Yang et al., 2021; Zeng et al., 2020; Zhang et al., 2023). These algorithms have achieved significant success in the domains of computer vision and natural language processing (Li et al., 2022; Wang et al., 2018; Zheng et al., 2020). For example, in image feature extraction, Ding et al. (2022) introduced a scheme that reduces computing-power requirements by extracting multi-scale pixel-wise local features for hyperspectral image (HSI) classification and spectral features for superpixels. Verdú et al. (2023) combined laser scattering imaging with pre-designed convolutional neural networks (CNNs) to model the textural variations of pre-processed banana tissue. Using three different neural network models and four types of red–green–blue (RGB) images, the study revealed observable textural differences as storage time increased, accompanied by a reduction in firmness, and effectively captured and modeled the texture changes resulting from variations in storage and fruit zones. Lai et al. (2022) utilized images captured by an underwater video system equipped with a camera and an infrared LED illuminator; by training a YOLOv4-tiny CNN model, they recognized and measured shrimp body lengths at the bottoms of aquaculture ponds. These studies all highlight the success of deep learning in image recognition, but exploration in the context of natural language processing has been limited.