Introduction
Clustering is one of the most popular and effective techniques for pattern analysis, in which data objects are partitioned into meaningful groups called clusters (Bhatnagar, Kaur & Mignet, 2009; Wan, Gao & Li, 2012). Traditional clustering techniques such as k-means work well when applied to small homogeneous datasets (Van Hieu & Meesad, 2015; Sreedhar, Kasiviswanath, & Chenna Reddy, 2017). However, as the data size grows, it becomes increasingly difficult to find meaningful and well-formed clusters. In addition, the attributes of large real-world datasets are rarely homogeneous and typically include both continuous and categorical attributes (Ji et al., 2019). Restricting the analysis to a single homogeneous type, such as continuous attributes only, may limit the ability to detect hidden clusters that would be found when heterogeneous attributes are utilized (D’Urso & Massari, 2019). In the vast literature on traditional data mining and big data, there is very limited work on mixed attribute clustering (Madhuri et al., 2014). Previous approaches to heterogeneous data clustering attempt to cluster the data by converting categorical attributes into continuous attributes. On the other hand, similarity measures proposed specifically for categorical data may not truly capture the inherent nature of the datasets involved, given that different similarity coefficients may lead to different outcomes (Lewis & Janeja, 2011). Our paper addresses this gap in the literature by proposing an approach that combines continuous and categorical features for cluster analysis in large datasets (Foss & Markatou, 2018). Specifically, we propose a unified clustering approach that combines features from multiple heterogeneous datasets to detect similarity.
This approach utilizes a combined similarity function, which measures similarity across numeric and categorical features, and employs this function in a clustering algorithm to identify patient similarity. However, given that clustering large heterogeneous data may result in malformed clusters, we also propose an iterative unified clustering approach, which extends our unified clustering by drilling down into such malformed clusters in order to improve clustering outcomes. Heterogeneous attributes are commonly found in many real-world applications that generate massive amounts of data. For example, in health care, individuals can have varying degrees of susceptibility to a disease, which poses challenges to developing personalized treatments (U.S. Food and Drug Administration). Consider a 55-year-old male patient who weighs 190 lbs, has a Body Mass Index of 29.7 kg/m², and has a history of hypertension and dyslipidemia. He has a family history of Type 2 Diabetes, Coronary Artery Disease and renal insufficiency. He presents with some of the common symptoms of Type 2 Diabetes, such as weight gain, and takes medication to reduce his cholesterol levels (Hickner, 2011). Based on this assessment, does the patient meet the criteria for a diagnosis of Type 2 Diabetes? We should also consider his genomic makeup, because many cases of Type 2 Diabetes are caused by genetic predispositions. In addition, patients with certain genetic makeups respond differently to different treatment plans (Wu et al., 2014). Diabetes has confounded researchers and practitioners alike and remains a highly prevalent and well-studied condition (Wu et al., 2014). It is therefore important to integrate genomic factors with the clinical data of diabetes patients in order to identify a well-designed treatment plan. Specifically, similar past diabetes cases can be retrieved to treat new cases based on similar clinical and genomic factors.
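To make the idea of a combined similarity function concrete, the following minimal sketch scores two patient records across numeric and categorical features and retrieves the most similar past case. It is an illustration only and not the similarity function developed in this paper: it assumes a Gower-style average of range-normalized numeric similarity and simple categorical matching, and the function names, feature names, value ranges, and patient records are all hypothetical.

```python
from typing import Dict, List, Tuple, Union

Value = Union[float, str]
Record = Dict[str, Value]


def combined_similarity(
    a: Record,
    b: Record,
    numeric_ranges: Dict[str, Tuple[float, float]],
    categorical_keys: List[str],
) -> float:
    """Gower-style similarity in [0, 1] (an assumed stand-in, not the paper's function)."""
    scores = []
    # Numeric features: 1 minus the absolute difference scaled by the feature range.
    for key, (low, high) in numeric_ranges.items():
        span = high - low
        diff = abs(float(a[key]) - float(b[key]))
        scores.append(1.0 - min(diff / span, 1.0) if span > 0 else 1.0)
    # Categorical features: 1 if the values match, 0 otherwise.
    for key in categorical_keys:
        scores.append(1.0 if a[key] == b[key] else 0.0)
    return sum(scores) / len(scores)


def most_similar(
    query: Record,
    cases: List[Record],
    numeric_ranges: Dict[str, Tuple[float, float]],
    categorical_keys: List[str],
    k: int = 1,
) -> List[Record]:
    """Return the k past cases ranked highest by combined similarity to the query."""
    ranked = sorted(
        cases,
        key=lambda c: combined_similarity(query, c, numeric_ranges, categorical_keys),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical feature ranges and records, chosen only for illustration.
numeric_ranges = {"age": (20.0, 80.0), "bmi": (18.0, 40.0)}
categorical_keys = ["hypertension", "family_history_t2d"]

query = {"age": 55, "bmi": 29.7, "hypertension": "yes", "family_history_t2d": "yes"}
cases = [
    {"id": "case_a", "age": 58, "bmi": 31.0, "hypertension": "yes", "family_history_t2d": "yes"},
    {"id": "case_b", "age": 30, "bmi": 22.0, "hypertension": "no", "family_history_t2d": "no"},
]

best = most_similar(query, cases, numeric_ranges, categorical_keys, k=1)
print(best[0]["id"])
```

Averaging per-feature scores weights every attribute equally. A genomic attribute could be added as one more categorical key under the same assumption, and a pairwise matrix of such scores could then serve as input to a clustering algorithm.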