The Effects of Dimensionality Reduction in the Classification of Network Traffic Datasets Via Clustering
Abstract: Unsupervised learning has emerged
as an alternative meta-learning approach that is capable of accurately
classifying the massive volumes of data generated by modern-day applications. It
helps network administrators monitor traffic actively and
provide improved service quality. Extracting
the optimal and most essential features
with high discriminative power remains one of the critical challenges in unsupervised learning because class labels are
absent. The main objective of this research is to determine the effects
of dimensionality reduction, by means of feature selection, on the clustering of internet
traffic datasets. To achieve this goal,
internet traffic datasets were retrieved, analyzed,
and clustered into application classes. Reduced versions of these datasets were
obtained using feature selection techniques and then clustered. The clustering results for the
original and reduced datasets were compared
and evaluated. The effects of two feature reduction techniques,
Correlation-based Feature Selection (CFS) and the Information Gain Attribute
Evaluator, were examined with the K-means,
Expectation Maximization, and Farthest-first clustering algorithms. The
effectiveness of the candidate clustering algorithms was
evaluated using overall accuracy, precision, recall, and Receiver
Operating Characteristic (ROC) area metrics. Results revealed that both CFS and
information gain significantly increased the performance of the three algorithms.
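
As an illustration of the information gain criterion named above, the following minimal sketch scores discrete features by how much they reduce class entropy, IG = H(class) - H(class | feature), and ranks them. It is not the implementation used in this study: the function names, the toy flow records, and the feature names `port_class` and `size_bucket` are hypothetical, and the sketch assumes that features are already discretized and that class labels are available for scoring.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Return H(class) - H(class | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the class entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy flows: (port_class, size_bucket), with application labels.
rows = [("80", "small"), ("80", "large"), ("25", "small"), ("25", "large")]
labels = ["web", "web", "mail", "mail"]
ranking = rank_features(rows, labels, ["port_class", "size_bucket"])
```

In this toy data `port_class` separates the two classes perfectly while `size_bucket` carries no class information, so a reduced dataset would keep the former and drop the latter before clustering.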