VCGERG: Vulnerability Classification With Graph Embedding Algorithm on Vulnerability Report Graphs

Yashu Liu, Xiaoyi Zhao, Xiaohua Qiu, Han-Bing Yan
Copyright: © 2024 | Pages: 21
DOI: 10.4018/IJISP.342596

Abstract

Vulnerabilities can lead to data loss, privacy leakage, and financial loss. Accurate detection and identification of vulnerabilities is essential to prevent information leakage and APT attacks. This paper explores how to mine the valuable information in vulnerability reports more deeply. We propose a new model, VCGERG, which constructs a graph from key information in vulnerability reports and embeds the graph into a vector space using a keywords-LINE graph embedding algorithm based on the attention of neighboring nodes. The VCGERG model then uses the OVR (one-vs-rest) random forest algorithm to classify vulnerabilities. The model captures complex local and global information of the graph on large-scale datasets. To verify its effectiveness, we evaluate the model in a series of experiments. Compared with other models, our method achieves a higher accuracy of 0.975.

Sun et al. (2021) proposed VDSimilar, a vulnerability detection model based on code similarity. It integrates BiLSTM and an attention network into a Siamese model to measure the similarity between two vulnerable functions and the difference between a vulnerable function and its patched version, and then decides whether a tested program is vulnerable by comparing it with known vulnerable code. Hu et al. (2023) extracted slices of C/C++ source code and implemented an efficient and accurate vulnerability detection and interpretation method using a graph neural network. Zou et al. (2019) proposed a multiclass vulnerability detection system. They introduced the concept of code attention, which uses local features to detect vulnerability types, and achieved multiclass detection by considering program control dependencies during program slice construction. Wartschinski et al. (2022) constructed the Vudenc vulnerability detection model for Python code. Python code is represented as vectors by a trained word2vec model. An LSTM (Long Short-Term Memory) network then classifies sequences of vulnerable code tokens at a fine-grained level and highlights, in different colors, specific regions of the source code that may contain vulnerabilities.

Compared with source code, vulnerability reports represent the characteristics of a vulnerability more intuitively. They are released by authoritative security organizations and vendors, so their content is reliable and trustworthy. Aljedaani et al. (2020) used latent Dirichlet allocation (LDA) to classify security bug reports (SBRs) in the Chromium project. They found latent topics in the SBR text and showed that these topics were very close to vulnerability types. Alperin et al. (2020) used the LIME model to interpret vulnerability descriptions and proposed the GenSim latent semantic indexing module to create a latent semantic analysis (LSA) for each category. Aota et al. (2020) vectorized the vulnerability information in the NVD with a bag-of-words (BoW) model. They used the Boruta algorithm to select meaningful features (e.g., CWE-ID) and random forest (RF) for classification. Han et al. (2017) used only vulnerability descriptions and CVSS (2022) scores from the CVE database (CVE, 2017) to predict vulnerability severity. They used a skip-gram word vector model and a single-layer CNN for classification. Han et al. (2018) extracted the CWE-ID text description and expected consequences of the vulnerability report on CWE to construct a knowledge graph. Additionally, they used a graph embedding algorithm to obtain low-dimensional vectors, effectively supporting various inference tasks for vulnerabilities.
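As a rough illustration of the bag-of-words vectorization mentioned above, the following minimal sketch turns short descriptions into term-count vectors. It is not the pipeline of Aota et al. (2020): the two example descriptions, the tokenizer, and the function names `tokenize` and `bag_of_words` are assumptions made only for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical vulnerability descriptions, invented for illustration only.
docs = ["Buffer overflow in parser", "SQL injection in login form"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each position in a vector counts how often one vocabulary term occurs in a description; word order is discarded, which is the defining property of the bag-of-words representation.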

Graph embedding algorithms map the nodes of a graph to low-dimensional vectors while preserving the structural relationships between nodes, so that otherwise complex graph data can be represented more efficiently in vector space. DeepWalk (Perozzi et al., 2014) captures the structural information of a graph by performing random walks over it and learning low-dimensional representations with the skip-gram model. Node2vec (Grover & Leskovec, 2016) introduces flexible random-walk strategies, including breadth-first and depth-first exploration, to balance local and global structure. LINE (Tang et al., 2015) is specifically designed for large-scale real-world datasets. It represents high-dimensional graph structures in a low-dimensional vector space using both the local and the global structure between nodes in the graph. The LINE graph embedding algorithm optimizes the local structure vector of each node using first-order similarity and the global structure vector using second-order similarity.
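The two LINE objectives described above can be sketched as follows. This is a minimal illustration of the first-order and second-order losses as defined by Tang et al. (2015), not the keywords-LINE variant proposed in this paper; the toy graph, the edge weights, and the function names are assumptions, and the sketch omits the training loop, negative sampling, and edge sampling that a real implementation would need.

```python
import math

# Toy undirected weighted graph; node names and weights are invented for illustration.
NODES = ["a", "b", "c", "d"]
EDGES = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def first_order_loss(emb, edges):
    """O1 = -sum over edges (i, j) of w_ij * log sigmoid(u_i . u_j).

    Models local structure: directly connected nodes should have similar vectors.
    """
    loss = 0.0
    for (i, j), w in edges.items():
        p = 1.0 / (1.0 + math.exp(-dot(emb[i], emb[j])))
        loss -= w * math.log(p)
    return loss


def second_order_loss(emb, ctx, edges, nodes):
    """O2 = -sum over directed edges (i, j) of w_ij * log softmax_j(u'_j . u_i).

    Models global structure: nodes with similar neighborhoods should have
    similar vectors. `ctx` holds the separate "context" vectors u'.
    """
    loss = 0.0
    for (i, j), w in edges.items():
        # Treat each undirected edge as two directed edges.
        for src, dst in ((i, j), (j, i)):
            scores = {k: dot(ctx[k], emb[src]) for k in nodes}
            log_z = math.log(sum(math.exp(s) for s in scores.values()))
            loss -= w * (scores[dst] - log_z)
    return loss


# Compare an uninformative embedding with one that aligns connected nodes.
zero = {n: [0.0, 0.0] for n in NODES}
aligned = {n: [1.0, 0.0] for n in NODES}
print(first_order_loss(zero, EDGES), first_order_loss(aligned, EDGES))
print(second_order_loss(zero, zero, EDGES, NODES))
```

In LINE, the two objectives are optimized separately and the resulting vectors are concatenated per node. Without negative samples the first-order loss can be driven down simply by making all vectors large and aligned, which is one reason the full algorithm relies on negative sampling.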
