1. Introduction
With rapid changes occurring in the global economy and ways of doing business, the fortunes of companies and industries are also changing rapidly. Researchers, investors, and policy-makers are keen to face these changes proactively. They invest a great deal of resources to collect and analyse data to understand business performance and, more importantly, to predict the future of a company. One important measurement of a company's performance and its potential is its popularity with the general public. In particular, if a company's trademark appears frequently, it can indicate that the company is highly popular. Consequently, retrieving trademark images efficiently and accurately is becoming increasingly important.
Image retrieval technology has gone through three stages of development: text-based image retrieval (TBIR), content-based image retrieval (CBIR), and semantic-based image retrieval. TBIR is known as “searching images by tags”. This method is simple but time-consuming and labour-intensive because tags and indices such as titles, authors, and other metadata attributes must be added by manual annotation. An enormous number of trademarks are registered worldwide (World Intellectual Property Organization, 2018). Because the volume of digital image data on the internet, including trademark images, has increased rapidly, TBIR is unsuitable for retrieving trademarks from internet images that lack annotation.
In contrast to TBIR, CBIR retrieves images using features that can be extracted automatically, which avoids the subjectivity of manual description and improves retrieval efficiency. Low-level visual features include colour, texture, and shape, and different feature representations require different similarity measurement methods. Colour is the most intuitive physical feature of colour images; the methods available to describe colour include colour histograms (Swain & Ballard, 1991), colour correlograms (Huang et al., 1997), and colour coherence vectors (Pass et al., 1997). Texture is a measurement of the relationship between pixels in a local area; its purpose is to describe the spatial distribution of grey levels in the neighbourhood of pixels. Shape descriptors are even more important than colour or texture descriptors and can be grouped into contour-based and region-based approaches. The former use image boundary information, while the latter use information on the grey-level distribution in a certain area. The Fourier descriptor (Del Vecchio & Salvini, 2000) is one of the most commonly studied and used contour-based shape descriptors. It is characterized by good computational performance and is easy to normalize. However, it is unable to capture the local representation of shapes and is sensitive to boundary noise and variations, leading to the Gibbs phenomenon when used to reconstruct complex trademarks.
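To make the remark that the Fourier descriptor is easy to normalize more concrete, the following is a minimal sketch of one common normalization scheme. It is not taken from Del Vecchio & Salvini (2000). The function names `dft` and `fourier_descriptor` and the parameter `n_coeffs` are hypothetical, and the sketch assumes a closed contour sampled in counter-clockwise order. It uses a naive transform from the standard library only; a real system would more likely use an FFT routine.

```python
import cmath
import math


def dft(z):
    """Naive O(n^2) discrete Fourier transform of a complex sequence."""
    n = len(z)
    return [
        sum(z[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def fourier_descriptor(contour, n_coeffs=8):
    """Return a normalized shape signature for a closed contour.

    contour: list of (x, y) boundary points, assumed to be sampled in
    counter-clockwise order (an assumption of this sketch).
    n_coeffs: number of low-frequency coefficients to keep (hypothetical
    parameter; must be smaller than the number of contour points).
    """
    if n_coeffs >= len(contour):
        raise ValueError("n_coeffs must be smaller than the contour length")
    # Treat each boundary point as a complex number x + iy.
    z = [complex(x, y) for x, y in contour]
    coeffs = dft(z)
    # coeffs[0] encodes the centroid, so skipping it removes translation.
    # Taking magnitudes discards phase, which carries rotation and the
    # choice of starting point.
    mags = [abs(c) for c in coeffs[1:n_coeffs + 1]]
    # Dividing by the first harmonic removes overall scale.
    scale = mags[0]
    if scale == 0:
        raise ValueError("degenerate contour: first harmonic is zero")
    return [m / scale for m in mags]
```

Under these assumptions, a contour and a translated, rotated, rescaled copy of it yield the same signature up to floating-point error, so two shapes can be compared with an ordinary vector distance. Because only a few low-frequency coefficients are kept, fine local detail is lost, which matches the limitation noted above.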
In addition to low-level features, images can be analysed according to their high-level semantic content, i.e., what they conceptually represent. Machine learning and neural network models such as AlexNet (Krizhevsky et al., 2017), VGGNet (Simonyan & Zisserman, 2014), Inception V4 (Szegedy et al., 2017), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017) have been widely used due to their strength in extracting highly semantic and abstract features and realizing nonlinear feature mapping (Perez et al., 2018). Some methods achieve improved performance through deep learning. An end-to-end model (Mafla et al., 2021) combines text and visual features to achieve fine-grained classification and image retrieval through a multimodal inference module. More recent deep learning models have also been proposed. CVNet (Lee et al., 2022) adopts geometric verification after a global search with global descriptor matching and local feature matching. The global search quickly performs a rough search across the entire database, and geometric verification then reorders the results by precisely assessing only the candidates identified by the global search. ViT-Slim (Chavan et al., 2022) replaces the convolutional neural network in network slimming with a transformer to realize more flexible and efficient visual retrieval and classification. Zhao et al. (2023) drew on the idea of dense retrieval, discretized images and texts into tokens, and aligned them across modalities, greatly improving the efficiency of large-scale graphic retrieval.
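The coarse search followed by re-ranking described above for CVNet can be illustrated with a minimal sketch. This is not the CVNet implementation: the names `cosine`, `retrieve_then_rerank`, and `shared_local_features` are hypothetical, the dictionary layout of each image record is an assumption, and counting shared local feature labels is only a toy stand-in for a real verification step.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length global descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_then_rerank(query, database, verify, k=3):
    """Two-stage retrieval: global search, then re-rank the top k.

    query and each database item are dicts with keys "id", "global"
    (a list of floats) and "local" (whatever `verify` understands).
    This record layout is an assumption made for the sketch.
    """
    # Stage 1: cheap global search across the entire database.
    ranked = sorted(
        database,
        key=lambda item: cosine(query["global"], item["global"]),
        reverse=True,
    )
    candidates, rest = ranked[:k], ranked[k:]
    # Stage 2: the costlier check runs only on the shortlisted candidates.
    candidates.sort(
        key=lambda item: verify(query["local"], item["local"]),
        reverse=True,
    )
    return [item["id"] for item in candidates + rest]


def shared_local_features(q_local, d_local):
    """Toy stand-in for verification: count shared local feature labels."""
    return len(set(q_local) & set(d_local))
```

The design point is that the expensive comparison is evaluated k times rather than once per database image. The cost of that saving is that an image ranked below position k by the global search is never promoted, however well its local features match.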