Introduction
With the advancement of society and technology, mobile robots have gained the ability to sense, decide, and move, much like humans, by utilizing sensors integrated into their design. The applications of mobile robots have demonstrated their extensive value across various domains, ranging from industrial automation to household services, as new frontiers in robotic technology continue to emerge (Biswal & Mohanty, 2021). In this omnipresent wave of robotics, robotic vacuum cleaners, as members of the mobile robot family, have garnered widespread popularity. These intelligent machines are equipped with autonomous navigation and cleaning capabilities, and users can direct them verbally to a designated place for cleaning, improving people's quality of life and making daily cleaning tasks more convenient. However, as these devices are widely adopted in real-life scenarios, autonomous navigation demands grounding language in visual observations, which remains a formidable challenge for robotic vacuum cleaners (Li et al., 2023).
Vision-and-language navigation (VLN) is a fundamental task for achieving general-purpose robots, with practical applications in industries such as household robotics and autonomous driving (Anderson et al., 2018). The challenge of this task lies in the unstructured nature of navigation instructions and the complexity of the navigation environment (Wang et al., 2021). Navigation instructions are derived from natural language descriptions, which take diverse forms and involve intricate expressions. Additionally, real-world navigation environments vary considerably in their details. Without highly precise maps, an intelligent agent can only access partial environmental information at each moment. Consequently, deducing its current location, making informed decisions, and navigating correctly require the agent to possess robust environmental modeling and reasoning capabilities (Wen et al., 2023). Therefore, researching vision-and-language navigation methods is essential for robotic vacuum cleaners (Xiao & Fu, 2023).
Various scientific and technological advancements, coupled with the emergence of deep learning, have made artificial intelligence algorithms one of the most prominent and extensively studied areas in computer science (Ning et al., 2024; Varma & James, 2021; Yang et al., 2021; Zeng et al., 2020; Zhang et al., 2023). These algorithms have achieved significant success in the domains of computer vision and natural language processing (Li et al., 2022; Wang et al., 2018; Zheng et al., 2020). For example, in image feature extraction, Ding et al. (2022) introduced a scheme that reduces computing-power requirements by extracting multi-scale pixel-wise local features for hyperspectral image (HSI) classification and spectral features for superpixels. Verdú et al. (2023) combined laser scattering imaging with pre-designed convolutional neural networks (CNNs) to model the textural variations of pre-processed banana tissue. Using three different neural network models and four types of red–green–blue (RGB) images, the study revealed observable textural differences as storage time increased, accompanied by a reduction in firmness, and effectively captured and modeled the texture changes resulting from variations in storage and fruit zones. Lai et al. (2022) utilized images captured by an underwater video system equipped with a camera and an infrared LED illuminator; by training a YOLOv4-tiny CNN model, they recognized and measured shrimp body lengths at the bottoms of aquaculture ponds. These studies all highlight the success of deep learning in image recognition, but exploration in the context of natural language processing has been limited.