1. Introduction
Malware can readily target Android apps for malicious purposes and compromise device security (Lu & Da Xu, 2018). The expansion of mobile networks has increased the number of portable devices in use, and with it the exposure of mobile users to financial malware apps. Despite massive prevention and mitigation efforts, malware remains a major cyber security threat: Symantec discovered 357,019,453 new malware variants in 2016, 669,974,865 in 2017, and 246,002,762 in 2018. In addition, more and more malware variants attempt to bypass anti-virus tools and evade malware detection systems.
The rapid expansion of mobile interaction places a significant burden on smartphone security management. According to a recent study¹, the number of apps in the Google Play Store increased from 16K in Dec. 2009 to over 2 million in Feb. 2016. As a result, mobile traffic has topped 3.7 exabytes. Malicious apps seriously compromise the growth of this mobile ecosystem. Mobile malware has increased massively, especially malware targeting Android devices, and devastating digital payment thefts and other attacks threaten mobile security. Despite the security measures of the Android platform and of mobile antivirus tools, sophisticated mobile malware continues to infiltrate mobile systems, and the widespread use of mobile devices exposes users to multiple risks. Android-based mobile malware detection systems are therefore urgently needed (McLaughlin et al., 2017). A malware family is a group of malicious apps that share code: different malware samples are built on the family's codebase, and all samples that are interpreted the same way are grouped together.
Faruki et al. (2014) explore the characteristics of a large assortment of malware and categorize existing mobile malware detection methods as static, dynamic, or traffic-based. Static analysis has been used in several previous studies to discover data leakage, malware, and security breaches in Android apps (Zhu et al., 2018). Nevertheless, static analysis of malware is challenging because of code polymorphism and obfuscation, techniques that are used to produce malware variants that evade detection. Many dynamic analysis techniques modify the device's operating system to monitor how confidential information is accessed at runtime (Bader, Lichy, Hajaj, Dubin, & Dvir, 2022; Ucci, Aniello, & Baldoni, 2019). Such methods are helpful, but they require a massive number of executions to cover all of an app's behavioral patterns (Ahmed, Lin, & Srivastava, 2021).
1.1 Motivation
Many malware detection techniques concentrate on the network traffic generated by mobile apps and identify malware by its abnormal network behavior patterns. This type of detection system has the potential to be effective because the majority of Android malware performs its malicious functions via network traffic (Zhou & Jiang, 2012): the malware must communicate with a remote server over the Internet to carry out malicious tasks, and the traces it leaves can be used to identify and track down specific malware. Furthermore, detection strategies based on network characteristics are more straightforward to design and implement than static or dynamic analytic approaches. For example, methods based on traffic detection can be installed at an access point or gateway. These methods rely solely on user-generated network traffic data, ensuring that users do not lose access to their mobile resources, and they do not require any user action aside from granting permission to the detection service (W. Li, Bao, Zhang, & Li, 2022; S. Wang et al., 2020). The goal of network traffic-based approaches is to find distinguishing features that can be used to classify malware more effectively. Selecting efficient features, however, is a difficult task. We concentrate our investigation on malware samples that use the HTTP/HTTPS protocol to send data. We chose HTTP because it accounts for 70% of the network traffic generated by Android apps (Dai, Tongaonkar, Wang, Nucci, & Song, 2013). However, because HTTPS traffic is encrypted, extracting useful information from it is extremely difficult.