Traffic classification using distributions of latent space in software-defined networks: An experimental evaluation
Jang, Y., Kim, N., & Lee, B. D. (2023).
Engineering Applications of Artificial Intelligence, vol. 119, 105736.

Abstract

With the emergence of new Internet services and the drastic increase in Internet traffic, traffic classification has become increasingly important to effectively satisfy users' quality-of-service requirements. A traffic classification system should be resilient, operate smoothly regardless of network conditions or performance, and be capable of handling various classes of Internet services. This paper proposes a traffic classification method for a software-defined network environment that employs a variational autoencoder (VAE) to accomplish this. The proposed method trains the VAE using six statistical features and extracts the distributions of latent features for the flows in each service class. It then classifies the query traffic by comparing the distributions of latent features for the query traffic with the learned distributions of the service classes. For the experiment, the statistical features of network flows were collected from real-world domestic and overseas Internet services for training and testing. According to the experimental results, the proposed method has an average accuracy of 89%. This accuracy was 52%, 47%, 39%, 59%, and 26% higher than those of the conventional statistics-based classification method, MLP, AE+MLP, VAE+MLP, and SVM, respectively. This result clearly suggests that probability distributions of latent features, rather than specific values of latent features, can serve as more stable features.


Keywords

Flow classification
Jensen–Shannon divergence
Latent features
Software-defined network
Variational autoencoder

1. Introduction

Various devices and contents have been developed and deployed as a result of recent advances in information technology (IT), leading to a rapid increase in network traffic. This increase has created the need for systems that efficiently manage, distribute, and allocate traffic while satisfying quality of service (QoS) for users. Consequently, various systems have been developed for efficient network management. Among them, the software-defined network (SDN) (Kreutz et al., 2015) is one of the most popular recent technologies. SDN overcomes the limitations of conventional network infrastructure by separating the control plane from the data plane, thereby enabling efficient control of the network through software applications. It manages the network topology, paths, and flows through controllers, calculates the optimal path, and monitors the network. Furthermore, new network functions or modules are programmed through controllers, enabling efficient addition or modification of functionality and, consequently, more effective network operation. In SDN, packets are sent through the flow, which is a set of rules for sending packets to their destination. Therefore, switches transmit packets in accordance with the policy defined in the flow. In this case, the packets to be sent are classified into service classes such as VoIP, Game, and Web, based on the type of application, such as Skype, YouTube, and Facebook. Each of these service classes requires a different QoS that must be ensured. Therefore, controllers should send packets by generating flow entries that meet the QoS requirements of the corresponding services (Xie et al., 2018; Nguyen and Armitage, 2008).
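The flow-entry idea described above can be illustrated with a toy lookup. This is a minimal sketch, not an actual SDN controller API; all field names, port numbers, and queue names below are hypothetical.

```python
# Hypothetical flow table: each entry maps match fields to a forwarding
# action (here, a QoS queue) and the service class it serves.
FLOW_TABLE = [
    ({"dst_port": 3478, "proto": "UDP"}, "queue_low_latency", "VoIP"),
    ({"dst_port": 443, "proto": "TCP"}, "queue_best_effort", "Web"),
]

def match_flow(packet):
    """Return (action, service_class) of the first matching entry, or None.

    A None result corresponds to a table miss, where a real controller
    would classify the flow and install a new flow entry.
    """
    for fields, action, svc in FLOW_TABLE:
        if all(packet.get(k) == v for k, v in fields.items()):
            return action, svc
    return None
```

The point of the sketch is the lookup structure: per-flow rules decide both forwarding and QoS treatment, which is why accurate service classification matters when entries are installed.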

The classification of traffic is becoming increasingly complex owing to the emergence of new applications and to the characteristics of Internet services changing with the given environment. There are various flow classification approaches: port-, payload-, statistics-, and machine learning-based. The port-based classification approach (Karagiannis et al., 2004; Karagiannis et al., 2005; Schneider, 1996) performs classification using the port numbers assigned to the respective applications. However, with the recent introduction of applications that use more than two ports or arbitrary port numbers, it has become difficult to classify flows using this approach. The payload-based classification approach (Zhang et al., 2012; Finamore et al., 2010; Lin et al., 2008a; Wang et al., 2011; Fernandes et al., 2009; Risso et al., 2008) classifies traffic by analyzing the payload of a packet. This approach has high accuracy because it identifies the content of each packet for classification. However, it relies on expert opinion when extracting payload features, and it is unable to classify encrypted traffic. The statistics-based classification approach (Santiago del Rio et al., 2012; Zuev and Moore, 2005; Crotti et al., 2006; Lin et al., 2008b; Roughan et al., 2004; Crotti et al., 2007) classifies flows using the statistical properties of each service class, such as the distribution of flow duration or the average packet size. Decoupling the control plane from the data plane in SDN gives machine learning techniques a great opportunity to be applied to network management (Xu et al., 2018; Malik et al., 2020; Setiawan et al., 2021; Bhowmik and Gayen, 2021). Deep learning-based approaches aim to learn the discriminative features of network traffic using only flow characteristics; they are applicable to a wide range of traffic classification problems, including those with encrypted traffic.
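The limitation of the port-based approach can be seen in a minimal sketch. The port assignments below are illustrative only; any application using an unregistered or arbitrary port falls through to "unknown".

```python
# Toy port-to-service map (illustrative assignments, not IANA-complete).
PORT_MAP = {80: "Web", 443: "Web", 5060: "VoIP"}

def classify_by_port(port):
    """Port-based classification: a plain dictionary lookup.

    Applications that negotiate arbitrary ports (or tunnel everything
    over 443) defeat this scheme, motivating statistical approaches.
    """
    return PORT_MAP.get(port, "unknown")
```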

Classification methods based on statistics and deep learning often use raw information collected from network flows or leverage important features learned from network flow data. However, such approaches have a limitation in that different services can have similar statistical values. Hence, some flows cannot be properly classified based solely on flow characteristics. For instance, Fig. 1 shows the packet-level statistical features (e.g., packet length and inter-arrival time) of the Game and VoIP service classes used for testing, in which a significant number of data values overlap. The figure shows that the Game service has a wide distribution at the bottom and the VoIP service has a tall vertical distribution on the left, demonstrating different distribution patterns. Therefore, comparing the distributions of latent features of services, rather than utilizing latent feature values directly, would be more effective.

To address these problems, a new traffic classification method is proposed through which the empirical distributions of latent features for the network flows are learned using the variational autoencoder (VAE) (Kingma and Welling, 2013), and the flows are classified based on the learned distributions. The proposed method trains the VAE first using the monitored statistical features of the flows for each service class. The trained VAE is then used to extract the distributions of the latent features for each class, which are then saved in the reference database. Subsequently, when a new flow to be classified enters, the monitored features of this flow are fed into the VAE to learn the empirical distributions of the latent features of the flow. Finally, the flow is assigned to a specific service class based on the distribution in the database that is most similar to this distribution.

Fig. 1. Distribution of features for Game and VoIP service classes (the number of flows per service class is 2100).

This work has made the following contributions:

The proposed traffic classification method leverages the empirical distribution of latent features to represent the discriminative features for flows using VAE, and then it classifies flows by comparing the distributions. In this way, the proposed method overcomes the limitations of the conventional traffic classification methods.

The experiments used input data gathered from various real-world Internet services at various times and places. Training and test data were collected separately from various Internet services for each service class. As a result, the overall performance of classification methods in real-world Internet environments is assessed using several new services.

A significant number of experiments, conducted while changing configurable parameters, demonstrated that the proposed traffic classification method can achieve competitive performance with high classification accuracy.

The remainder of this paper is organized as follows: Section 2 discusses the related work. Section 3 explains the proposed traffic classification method. Section 4 describes the defined service classes and the traffic collection method used for the experiments. Section 5 verifies the performance of the proposed method through experiments using various hyperparameters and compares it with that of existing classification techniques. Finally, Section 6 summarizes the conclusions and outlines future research plans.

2. Related works

2.1. Machine-learning-based traffic classification

Traffic classification is a long-studied problem, and a significant number of algorithms and methods have been proposed, ranging from simple port-based approaches to recent deep learning-based approaches. However, as this study involves learning-based traffic classification in the SDN environment, our review of related research focuses primarily on methods based on machine learning. Common machine learning algorithms used for traffic classification include the support vector machine (SVM), K-nearest neighbors (KNN), multi-layer perceptron (MLP), random forest, and decision tree. For instance, Zhongsheng et al. (2019) analyzed the performance of an SVM-based traffic classification method by conducting an empirical experiment on actual network traffic data. While the method showed reliable and accurate results, applying SVM to multi-class traffic classification requires considerable algorithm and parameter adjustment. Perera et al. (2017) compared six machine learning algorithms for traffic classification: Naïve Bayes, Bayesian network, random forest, Naïve Bayes tree, decision tree, and MLP. The experimental results showed that the random forest and decision tree algorithms achieved the highest classification accuracy as well as computational efficiency. A similar analysis was conducted by Parsaei et al. (2017), but the empirical results were different. In their study, four machine learning algorithms (Naïve Bayes, feed-forward neural network, MLP, and Levenberg–Marquardt SVM) were compared, and the Naïve Bayes algorithm achieved the highest classification accuracy. Liu et al. (2018) proposed an SVM-based traffic identification method in which the SVM was used to extract traffic features and classify them into 28 traffic patterns. CMSVM (Dong, 2022) is an SVM variant that solves the network traffic dataset imbalance problem by introducing weights and active learning for each traffic class. Possebon et al. (2019) proposed ensemble learning for network traffic classification, in which three different methods, MLP, decision tree, and KNN, independently classify the network traffic and the corresponding results are merged through majority voting to produce the final prediction.

As deep learning technologies have rapidly evolved in recent years and their application areas have expanded, many studies have tried to apply deep learning to network traffic classification and recognition. Ikram et al. (2021) developed a robust anomaly traffic detection model for both encrypted and unencrypted traffic using an ensemble of deep neural network models: MLP, backpropagation network, and long short-term memory. XGBoost then integrates the results of each deep learning model to achieve higher classification accuracy. As another example of ensemble learning, Setiawan et al. (2021) stacked an MLP, a convolutional neural network, and a stacked autoencoder (AE) to build an encrypted data packet classifier. Bayat et al. (2021) proposed a traffic classification method that uses inter-arrival time, payload length, and packet length represented in time series as features. The proposed deep learning architecture primarily uses a one-dimensional convolutional neural network and gated recurrent units, a type of recurrent neural network. The former captures dependencies between feature vectors in consecutive time slots, whereas the latter captures dependencies from time series data. Li et al. (2017) used a VAE to create a traffic classifier by first transforming some useful HTTP session fields from the original traffic into a meaningful image. Comprehensive surveys on machine learning and deep learning-based traffic classification methods can be found in the literature (Wang et al., 2019; Li and Pan, 2020; Alzoman and Alenazi, 2021; Shahraki et al., 2022).

Most of the previous research concentrated on extracting informative features from packet-level or statistical network flow data, such as packet size and packet inter-arrival time. The learned features were then used to determine the traffic class. These approaches are limited in that network flow information may be the same across different traffic classes and may change depending on the environment, which may reduce the effectiveness of feature learning. Our method, however, compares the empirical distributions of latent features learned from the monitored statistical information of the SDN flows with reference distributions of traffic classes for identification, rather than leveraging learned features directly. Although this approach is similar to that of Li et al. (2017) in the feature representation phase, the methodology differs in the traffic classification phase, regarding the representation of probability distributions of latent features and the comparison between the distributions of latent features.

2.2. Variational autoencoder

The VAE (Kingma and Welling, 2013) is an autoencoder-based generative model that learns the probability distribution of the data to generate new data. It is widely used in a variety of applications, including image synthesis, text generation, image super-resolution, and intrusion detection. The VAE is trained with the help of two neural networks: an encoder and a decoder. The encoder generates the latent variable z from the input data x, and the decoder restores x from the latent variable z. New data are generated by sampling latent variables from the distribution produced by the encoder. The encoder must find an ideal probability distribution p(z|x) that allows the decoder to restore the original input data. However, the ideal probability distribution p(z|x) is intractable to calculate. To solve this problem, the VAE approximates p(z|x) with a distribution q(z|x) that is easy to calculate, using variational inference, and adjusts the parameters through training to make q(z|x) as close to p(z|x) as possible. The probability distribution q(z|x) is assumed to be Gaussian; in this case, the encoder outputs the mean and variance parameters of the distribution for sampling z. The VAE finds a distribution close to the ideal one by employing a loss function based on variational inference. The defined loss function is

L(θ, φ; x) = −E_{q_φ(z|x)}[log p_θ(x|z)] + KL(q_φ(z|x) ‖ p(z))   (1)

Eq. (1) is the negative of the evidence lower bound (ELBO), which the VAE is trained to minimize. Its two terms denote the reconstruction error and the regularization error, respectively. Given z, the reconstruction error measures how well x is restored, and the goal of the decoder is to minimize this error. A small reconstruction error indicates that the parameters θ of the decoder network are functioning well. The regularization error is a term for calculating the difference between the actual distribution of z and the distribution of z estimated from the input data. This error is calculated using the KL-divergence (KLD) (Goldberger et al., 2003), which measures the similarity between two probability distributions. KLD is always at least zero, and the two distributions are identical if it is zero. The estimated distribution can be made close to the actual distribution of z by adjusting the parameters φ of the encoder. As the training progresses, the value of the first term decreases while the value of the second term approaches zero. As a result, the VAE can find the best distribution of z that represents x and restore the data via z.
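The sampling of z from the encoder's Gaussian output is usually implemented with the reparameterization trick, so that gradients can flow through the mean and variance during training. The following is a minimal sketch of that standard technique, not code from the paper.

```python
import random

def sample_latent(mu, var):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).

    Writing the sample as a deterministic function of (mu, var) plus
    external noise keeps the draw differentiable w.r.t. mu and var.
    """
    eps = random.gauss(0.0, 1.0)
    return mu + var ** 0.5 * eps
```

With var = 0 the sample collapses to the mean, and over many draws the empirical mean approaches mu.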

3. Proposed method

This section describes the proposed method based on the VAE to achieve effective flow classification. Fig. 2 depicts the overall process of classifying flows in an SDN environment using the VAE. The solid line represents the VAE training, and the dotted line represents the test flow classification.

Out of the 11 network flow features employed in Amaral et al. (2016), six statistical features were chosen in this study, as shown in Table 1. These features were manually selected based on the following observations. First, the selected statistical information can be measured in the same manner for both encrypted and unencrypted flows without any additional processing. Second, they are frequently used in many network traffic classification studies because they can, to some extent, represent the characteristics of flows of different classes. For example, the time-related features (such as the average and maximum inter-arrival times) are distinctive for web traffic because packets are generated only when a user moves to another web page. Cloud traffic, which allows large files to be uploaded and large packets to be transmitted quickly, differs from web traffic in that its time-related features are very small while its packet sizes are large. The five static features not used in this study are Src IP, Dst IP, Src Port, Dst Port, and Protocol Type. These features may have different values depending on the network, which limits their usefulness.

The flow monitor analyzes and extracts the statistical features of the flows. To prevent the controller from becoming overloaded, the flow monitor extracts the six statistical features from only the first N packets (i.e., N_packet) that enter the system. The extracted features are used as input for the VAE training. Through training, the VAE learns the distributions of latent features that best represent the flow. The latent vectors generated by the VAE encoder are characterized by the mean and variance of a normal distribution. The distributions of the latent features for the flows of each service class are saved in the reference database and used for service classification of newly inputted flows. The size of the latent features is a hyperparameter set during model training. The procedure for establishing the reference database is summarized in Fig. 3. Lines 1–10 show a typical VAE training procedure using stochastic gradient descent, as described in Section 2.2. Once the VAE model training is completed, k flows for each service type are randomly selected and their empirical distributions of latent features are stored in the database for later use (lines 11–17).
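The reference-database step can be sketched as follows. This is illustrative only: `encode` stands in for the trained VAE encoder (mapping a feature vector to per-dimension means and variances), and the flow representation is an assumption.

```python
import random

def build_reference_db(flows_by_class, encode, k=10):
    """Store the latent distributions of k randomly chosen flows per class.

    flows_by_class: dict mapping service class -> list of feature vectors.
    encode:         callable returning (mus, vars) per latent dimension,
                    standing in for the trained VAE encoder.
    Returns a dict: service class -> list of k (mus, vars) pairs.
    """
    db = {}
    for svc, flows in flows_by_class.items():
        sample = random.sample(flows, min(k, len(flows)))
        db[svc] = [encode(f) for f in sample]
    return db
```

At query time, these stored (mean, variance) pairs are the reference distributions against which a new flow's latent distribution is compared.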

Table 1. Flow statistical features.

Feature              Description
Duration             Duration of N packets
Avg inter-arrival    Average inter-arrival time of N packets
Max inter-arrival    Maximum inter-arrival time of N packets
Avg packet size      Average packet size of N packets
Max packet size      Maximum packet size of N packets
Total packet size    Total packet size up to N packets
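The six statistics of Table 1 could be computed from a packet trace roughly as follows. This is a sketch under the assumption that each packet is represented as a (timestamp in seconds, size in bytes) pair; the function name and dictionary keys are illustrative.

```python
def flow_features(packets):
    """Compute the six Table 1 statistics from the first N packets of a flow.

    packets: list of (timestamp_seconds, size_bytes) tuples, in arrival order.
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    iats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "duration": times[-1] - times[0],
        "avg_iat": sum(iats) / len(iats),
        "max_iat": max(iats),
        "avg_size": sum(sizes) / len(sizes),
        "max_size": max(sizes),
        "total_size": sum(sizes),
    }
```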

Fig. 2. Overall process of flow classification of the proposed system.

The process of classifying a query flow using the distributions of latent features extracted through the VAE is as follows. First, the flow monitor is used to extract the statistical features of the query flow. The extracted statistical features are fed into the VAE model, which then learns the distributions of latent features that represent the query flow. The learned distributions are compared to the distributions in the database. To improve the generalizability of the classification, the distributions of k flows per service class are sampled from the database and compared to the distributions of the query flow. The similarity between two distributions is calculated using the Jensen–Shannon divergence (JSD) (Fuglede and Topsoe, 2004), which is defined by (2):

JSD(p, q) = (1/2) KL(p ‖ m) + (1/2) KL(q ‖ m),  where m = (p + q)/2   (2)

JSD is a transformation of KLD that compares two probability distributions p and q. When the calculated value is close to zero, both JSD and KLD indicate that the two probability distributions are similar. However, because of its asymmetry, KLD cannot be used as a distance metric between them. JSD, unlike KLD, is symmetric and can be used to calculate the distance between two probability distributions. Therefore, JSD was used to compare distributions.
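For discrete distributions, Eq. (2) can be implemented directly; the sketch below (not the paper's code) also makes the symmetry and boundedness visible.

```python
import math

def kld(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence, Eq. (2): average KLD to the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)
```

Unlike raw KLD, jsd(p, q) == jsd(q, p), it is zero exactly when p == q, and (with natural logarithms) it is bounded above by ln 2.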

Fig. 4 illustrates the pseudo code of the JSD comparison module of Fig. 2 for classifying a query flow, where the k similarity scores are summarized into a single prediction score per class (e.g., Class_avg). For each service class, the empirical distributions of latent features of k flows (i.e., N_ServiceClass) are randomly selected from the reference DB (lines 1–6). The distribution of each latent feature is represented as a normal distribution N(μ, σ²), where μ and σ² are the mean and variance, respectively. The dimension of the latent vectors (i.e., N_latent) is a hyperparameter that is set when designing the model. The latent vector of the query flow is compared against those of the service classes using JSD (lines 7–10). The distributions of latent features generated by the VAE are mutually independent because they are normal distributions based on a diagonal covariance matrix (Blei et al., 2017; Do, 2008). Thus, the JSD comparisons of latent features between a query flow and a service-class flow can be calculated independently (lines 11–14). After completing the JSD comparisons for a service class, the flow similarity scores are combined into the final service-class similarity score. There are several ways to summarize the k flow similarity scores. Among them, two representative methods were used: Class_min and Class_avg. Class_min selects the smallest value among the k flow similarity scores, and Class_avg calculates the mean value of the k flow similarity scores.
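The per-class comparison and summarization could be sketched as follows. This is an illustrative approximation, not the paper's algorithm verbatim: since JSD between two Gaussians has no closed form, each latent dimension is discretized on a fixed grid before applying Eq. (2), and the function and parameter names are hypothetical.

```python
import math

def _gauss_pmf(mu, var, grid):
    """Discretized, normalized Gaussian density over a fixed grid."""
    w = [math.exp(-(x - mu) ** 2 / (2 * var)) for x in grid]
    s = sum(w)
    return [v / s for v in w]

def _jsd(p, q):
    def kld(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def classify(query, ref_db, summarize=min):
    """Assign the query flow to the closest service class.

    query / reference flows: list of (mu, var) pairs, one per latent dim.
    Per-flow score = sum of per-dimension JSDs (dims are independent).
    summarize = min corresponds to Class_min; pass a mean for Class_avg.
    """
    grid = [x / 10 for x in range(-80, 81)]  # discretization grid
    best_svc, best_score = None, float("inf")
    for svc, flows in ref_db.items():
        scores = []
        for flow in flows:
            scores.append(sum(
                _jsd(_gauss_pmf(mq, vq, grid), _gauss_pmf(mr, vr, grid))
                for (mq, vq), (mr, vr) in zip(query, flow)))
        score = summarize(scores)
        if score < best_score:
            best_svc, best_score = svc, score
    return best_svc
```

Summing per-dimension JSDs is valid here because the diagonal covariance makes the latent dimensions independent, as noted above.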

Fig. 3. Procedure for VAE training and reference database establishment.

The computational complexity of the inference phase of the proposed method is determined by two factors. The first is using the VAE to extract the latent features of the target flow. Its computational complexity, dominated by matrix multiplication, is approximately C = Σ_{l=2}^{L−1} n_{l−1} × n_l + 2 × n_{L−1} × z, where n_l refers to the number of nodes in the l-th layer, L is the total number of layers, and z is the dimension of the latent space. Although this complexity scales with the number of hidden layers and the number of nodes in each hidden layer, the VAE can be well optimized to have a complexity linear in the number of nodes when the nodes are sparsely connected (Tian et al., 2014). The second factor is the process of summarizing the k flow similarity scores, with a complexity of O(k × s), where s represents the number of service classes to be identified. As a result, the proposed method can be applied to the SDN environment without incurring significant overhead.
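As a worked example of the first factor, assuming the complexity expression above, consider an encoder with layers of 6, 32, and 16 nodes and latent dimension z = 4 (the Section 5 configuration): the hidden layers contribute n_{l−1} × n_l multiply-accumulates each, and the output contributes 2 × n_{L−1} × z for the mean and variance heads.

```python
# Multiply-accumulate count for an encoder 6 -> 32 -> 16 with z = 4.
layers = [6, 32, 16]
z = 4
macs = sum(a * b for a, b in zip(layers, layers[1:])) + 2 * layers[-1] * z
# 6*32 + 32*16 + 2*16*4 = 192 + 512 + 128 = 832
```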

4. Datasets

4.1. Service class

Service classes represent clusters of Internet services that have similar characteristics, and they must be defined to effectively describe the characteristics of various Internet services. In this study, six service classes were defined with reference to Chen et al. (2004); they are summarized in Table 2. These six classes have distinct characteristics and flow patterns, as described below.

Fig. 4. Algorithm for comparing distributions of latent features between a query flow and reference flows of each service class.

The defined service classes are largely divided into real-time and non-real-time classes. The real-time service classes represent interactive services in which two or more users participate and communicate with each other in real time. The VoIP class is for two or more users talking over the phone using video and audio. To avoid disconnection of the video or audio during the conversation, it is critical in the VoIP class to minimize the delay of flows. Because a VoIP flow continuously transmits small packets to deliver video and audio accurately while maintaining the call, the VoIP class has a small packet size and a small inter-arrival time. In the case of the Game class, the flow delay and error rates should be minimized to maintain acceptable QoS levels for gaming users. Furthermore, the Game class can exhibit different characteristics depending on the game category. For example, flows of role-playing games transmit many packets only when many users are present or an event occurs, such as attacking or killing; if no event occurs, only a small number of packets are sent. However, strategy or competitive games have a short inter-arrival time because packets are frequently transmitted to reflect rapidly changing scenarios without delay. Finally, because videos must be transmitted seamlessly for smooth real-time broadcasting, the real-time streaming class must send packets quickly to minimize packet delay. The characteristics of the real-time streaming class are a small inter-arrival time and a large packet size, owing to the low delay requirement and large video frames.

The non-real-time streaming class is for watching previously uploaded videos. Unlike real-time streaming, packets no longer need to be generated continuously and can be transmitted in batches in the middle of watching a video. As a result, the flow has a long inter-arrival time, and the packet size varies depending on video quality. The web class, like non-real-time streaming, is for browsing the web and generates packets in bulk only when a page is moved or an event occurs. It generates a few packets when maintaining a connection. Finally, the cloud storage class is for delivering files. It has the characteristics of large packet sizes and small inter-arrival time because relatively large data are transferred.

Table 2. Service classes.

Real-time service classes
  VoIP: Video conferencing
  Game: Interactive games with multiple users
  Real-time streaming: Live video streaming

Non-real-time service classes
  Non-real-time streaming: Non-conversational video (buffered streaming)
  Cloud storage: Large file transfer
  Web: WWW browsing

Table 3. Internet services for collecting training traffic.

Service classService providers
VoIPGoToMeeting, Hangout, Imo, Kakaotalk, Line, Skype, Slack, Nateon, Chime
GameBattleground, Starcraft, FifaOnline, BlackDesert, LeagueofLegend, Lostark, Mabinogi, Valorant, Dungeon&Fighter
Real-time streamingYouTube, Mixer, Twitchtv, Afreecatv, Bigo, Douyu, Kbs
Non real-time streamingNetflix, Vlive, Navertv, Melontv, Kakaotv, Facebook, Dailymotion
Cloud storageNaverCloud, MegaBox, GoogleCloud, FTP, Dropbox, Digoo, Cloudlike, Cloudberry, Box, Bighard
WebGoogle, Naver, Yahoo, Zum, Instagram, Dreamx, Dreamwiz, Chol, AOL, Amazon

Table 4. Internet services for collecting test traffic.

Service classService providers
VoIPDiscord, WeChat, WebEx, VooV
GameCyphers, Kartrider, CrazyArcade, SpecialForce,
Baramuenara, Dota2
Real-time streamingTving, Wave, Huyatv, Periscope, Smashcast
Non real-time streamingPandoratv, Wave, Mgoon, Afreecatv, Rumble
Cloud storageOneDrive, Yandex, Ubox, Teambox
WebDaum, Nate, MSN, Tistory, Zdnet, Korea.com

4.2. Traffic collection through internet services

Tables 3 and 4 list the Internet services used to collect traffic information for training and testing, respectively. Wireshark was used to collect each service's traffic for about 10 min. The following methods were used to collect traffic for each service class:

VoIP: a camera was installed to capture the entire meeting room for continuous VoIP packet transmission.

Game: packets were collected in the middle of game play, ensuring that packets were collected in an environment similar to that of the actual play.

Real-time streaming: packets were collected through an actual online broadcast service with more than 1000 viewers.

Non-real-time streaming: packets were collected by watching a video for 10 min or longer with the highest quality possible.

Cloud storage: packets generated during the uploading of photo and video files were collected.

Web: packets were collected while browsing web sites for news, online shopping, or performing web searches, similar to the activities of actual web surfing.

To reduce bias in traffic collection, flows were collected at both schools and homes, which have different network characteristics. To capture time-variant characteristics, traffic at various time intervals was collected. The number of Internet users fluctuates over time, which may cause changes in traffic patterns or speed. For example, early in the morning, there are few users, resulting in less traffic and few cases of delay, but in the evening, the number of users increases, resulting in heavy traffic and delays. As these examples show, traffic can vary depending on when the services are used. Therefore, the data were collected by dividing the collection time into the morning (10:00–12:00), afternoon (14:00–16:00), and evening (20:00–22:00). Furthermore, traffic information for each service class was collected from domestic and international Internet services to capture geographical diversity. The entire sequences of traffic information collected from each server were then divided into subsequences of certain sizes, and the subsequences were regarded as independent flows belonging to the same service class and used as training samples. Since the traffic pattern may change over time even for the same service, this approach facilitates capturing diverse traffic patterns for a given service.
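The subsequence-splitting step described above can be sketched as follows. This is a minimal illustration; the subsequence length is a free parameter, and the exact handling of the trailing remainder in the paper is not specified (here it is dropped).

```python
def split_into_flows(packets, n_per_flow):
    """Split one long capture into fixed-size subsequences.

    Each subsequence is treated as an independent flow of the same
    service class; a trailing remainder shorter than n_per_flow is dropped.
    """
    return [packets[i:i + n_per_flow]
            for i in range(0, len(packets) - n_per_flow + 1, n_per_flow)]
```

Because traffic patterns drift over a 10-minute capture, splitting yields training samples that cover several distinct patterns of the same service.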

Table 5. Flow information used in the experiment.

Values are mean ± standard deviation per service class; training values are shown with the corresponding test values in parentheses.

Duration of N packets (s)
  VoIP: 0.29297 ± 0.23903 (0.21378 ± 0.09620)
  Game: 1.79104 ± 2.17950 (1.68036 ± 1.58020)
  Real-time streaming: 0.17763 ± 0.46098 (0.26201 ± 0.55080)
  Non-real-time streaming: 0.06423 ± 0.25041 (0.49826 ± 2.04210)
  Cloud storage: 0.02629 ± 0.19908 (0.02416 ± 0.05310)
  Web: 0.92238 ± 2.95601 (0.74256 ± 1.82900)

Average inter-arrival time (s)
  VoIP: 0.00425 ± 0.00346 (0.00310 ± 0.00139)
  Game: 0.02596 ± 0.03159 (0.02435 ± 0.02290)
  Real-time streaming: 0.00257 ± 0.00668 (0.00379 ± 0.00798)
  Non-real-time streaming: 0.00093 ± 0.00363 (0.00722 ± 0.02955)
  Cloud storage: 0.00038 ± 0.00289 (0.00035 ± 0.00077)
  Web: 0.01337 ± 0.04284 (0.01076 ± 0.02651)

Maximum inter-arrival time (s)
  VoIP: 0.02835 ± 0.06418 (0.02200 ± 0.00884)
  Game: 0.20129 ± 0.36994 (0.15147 ± 0.24914)
  Real-time streaming: 0.08080 ± 0.41704 (0.13256 ± 0.48545)
  Non-real-time streaming: 0.04124 ± 0.21986 (0.13599 ± 0.89665)
  Cloud storage: 0.01093 ± 0.16845 (0.00747 ± 0.03468)
  Web: 0.57375 ± 2.09464 (0.46492 ± 1.51939)

Average packet size (bytes)
  VoIP: 714.75 ± 187.99 (730.85 ± 159.53)
  Game: 259.85 ± 299.79 (186.17 ± 135.19)
  Real-time streaming: 1078.17 ± 184.75 (1094.83 ± 217.97)
  Non-real-time streaming: 1160.39 ± 189.29 (1174.22 ± 189.71)
  Cloud storage: 2188.68 ± 2161.54 (3152.19 ± 1440.18)
  Web: 925.99 ± 277.76 (891.31 ± 306.48)

Maximum packet size (bytes)
  VoIP: 1158.35 ± 167.33 (1213.13 ± 65.09)
  Game: 753.83 ± 538.09 (484.11 ± 433.63)
  Real-time streaming: 1481.49 ± 67.48 (1645.92 ± 349.67)
  Non-real-time streaming: 1485.44 ± 59.09 (1514.21 ± 43.66)
  Cloud storage: 12 093.37 ± 15 391.9 (23 453.4 ± 13 021.9)
  Web: 1595.45 ± 537.86 (1766.39 ± 1251.95)

Total packet size (bytes)
  VoIP: 50 032 ± 13 159 (51 159 ± 11 166)
  Game: 18 189 ± 20 985 (13 032 ± 9463)
  Real-time streaming: 75 471 ± 12 932 (76 637 ± 15 257)
  Non-real-time streaming: 81 227 ± 13 250 (82 195 ± 13 279)
  Cloud storage: 153 207 ± 151 307 (220 653 ± 100 812)
  Web: 64 819 ± 19 442 (62 391 ± 21 453)

Number of training flows: 2100 per service class
Number of testing flows: 500 per service class

5. Experimental results

To evaluate the effectiveness of the proposed method, a vanilla VAE was used for feature extraction, with MLPs with Gaussian outputs in both the encoder and decoder. The encoder MLP has one input layer, two hidden layers (with 32 and 16 neurons, respectively), and latent vectors with sizes ranging from one to six. The decoder MLP mirrored the encoder, except that the means were constrained to the interval (0, 1) using a sigmoidal activation function at the decoder output (Kingma and Welling, 2013). Each MLP used a ReLU activation function. The mean squared error loss was used for the reconstruction error and the KLD loss for the regularization error in the training. The total loss is presented in (3):

Loss = MSELoss + KLDLoss   (3)
MSELoss = (1/n) Σ_{i=1}^{n} (y_i − t_i)²
KLDLoss = (1/2) Σ_{j=1}^{z} (μ_j² + σ_j² − ln σ_j² − 1)
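Eq. (3) can be implemented directly. Below is a minimal dependency-free sketch (illustrative, not the authors' training code); y and t are the reconstruction and target vectors, and mu and var are the encoder's per-dimension outputs.

```python
import math

def vae_loss(y, t, mu, var):
    """Total loss of Eq. (3): MSE reconstruction plus closed-form KLD.

    The KLD term measures each latent N(mu_j, var_j) against the
    standard-normal prior N(0, 1).
    """
    mse = sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y)
    kld = 0.5 * sum(m ** 2 + v - math.log(v) - 1.0 for m, v in zip(mu, var))
    return mse + kld
```

Both terms vanish exactly when the reconstruction is perfect and the latent posterior matches the prior, so the minimum of the loss is zero.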

These MLP networks were initialized with random weights generated by the Xavier initializer (Glorot and Bengio, 2010). The entire VAE network was built with the PyTorch framework and a CUDA backend, and it was trained end-to-end with the Adam optimizer at a learning rate of 1e−4 for 20 epochs. An NVIDIA GeForce RTX 2080 Super GPU was used for training and testing. Furthermore, no data augmentation was applied to the deep learning models used in this study.
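A rough sketch of this training setup (Xavier initialization, Adam at a learning rate of 1e−4, 20 epochs) is shown below; the network is a simplified stand-in and the data are random placeholders, not the paper's flows:

```python
import torch
import torch.nn as nn

# Stand-in network; the actual VAE architecture is described in the text above.
model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 6), nn.Sigmoid(),
)

def xavier_init(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)  # Xavier/Glorot initialization
        nn.init.zeros_(m.bias)

model.apply(xavier_init)

device = "cuda" if torch.cuda.is_available() else "cpu"  # CUDA backend when available
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate 1e-4
loss_fn = nn.MSELoss()

x = torch.rand(2100, 6, device=device)  # placeholder for the 2100 training flows
for epoch in range(20):  # 20 epochs, as in the text
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)  # reconstruction-style objective on the stand-in
    loss.backward()
    optimizer.step()
```
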

The statistical features of network flows were extracted from real-world Internet traffic using Wireshark. Table 5 presents the mean values ± standard deviations of the individual statistical features of the flows of each service class collected for training and testing. The data in parentheses are the statistical features of the 500 flows collected for each service class for testing. The classification performance was evaluated using the metric of accuracy, defined as (TP + TN) / (TP + TN + FP + FN) × 100, where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative test samples, respectively.
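For reference, this metric is straightforward to compute; the counts in the usage comments are illustrative, not the paper's:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy (%) = (TP + TN) / (TP + TN + FP + FN) * 100."""
    return (tp + tn) / (tp + tn + fp + fn) * 100

# Example: 89 true positives and 11 false positives out of 100 samples.
print(accuracy(89, 0, 11, 0))   # 89.0
```
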

5.1. Size of the latent vectors

Fig. 5 shows the classification accuracy according to the size of the latent vectors. As the experimental parameters, N_ServiceClass = 10, N_Packet = 70, and the Min_Class method were used, which showed the best performance. The size of the latent vectors, N_Latent, was varied from one to six. The accuracy is 80% or higher for all values of N_Latent. The accuracy increases, with stably small variations, as N_Latent grows. However, the accuracy decreases when N_Latent is larger than four because the data become farther apart and sparser as N_Latent increases; in such cases, more data are required to improve accuracy. Conversely, the accuracy also decreases when N_Latent is very small because not all of the important information in the six features can be embedded, making it difficult to express the characteristics of the traffic.

Fig. 5. Accuracy according to the size of the latent vectors.

5.2. Number of service class flows

It is critical to improve performance while minimizing the classification time when classifying flows. As previously stated, the classification time of the proposed method is the sum of (1) the time required to extract the empirical distribution of the latent features for the query flow using the VAE and (2) the time required to compute the difference between the empirical distributions of the query flow and the reference distributions of the service classes. Table 6 and Fig. 6 show the comparison time and accuracy according to the number of service class flows when N_Latent = 4, N_Packet = 70, and the Min_Class method are used. The experimental results show that both the comparison time and the accuracy increase with the number of service class flows. The classification accuracy saturates at N_ServiceClass = 20 or higher; however, at these points, the comparison time increases significantly, which is incompatible with an SDN environment that must process data in real time. Therefore, N_ServiceClass = 10 can be considered the best setting when both the comparison time and the accuracy are taken into account.

Table 6. Comparison time according to number of service class flows.

N_ServiceClass    Time (s)
2                 0.010713 ± 0.009
5                 0.044350 ± 0.020
10                0.058290 ± 0.025
20                0.107868 ± 0.080
40                0.253541 ± 0.130

Fig. 6. Accuracy according to the number of service class flows.

5.3. Number of required packets

The number of required packets for classification is critical in reducing the overhead of the flow monitor in the controller. When flows are compared using many packets, the controller must collect more information from the switches and spends more time extracting the statistical features of the flow. Therefore, the characteristics of a flow must be extracted from a small number of packets while maintaining high accuracy. The number of required packets for extracting flow characteristics in this study was set to 20, 50, 70, 100, 150, 200, and 300. Fig. 7 shows the classification accuracy when N_Latent = 4, N_ServiceClass = 10, and the Min_Class method are used. The accuracy increases with the number of required packets and reaches a high level from N_Packet = 70 onwards. This implies that an appropriate distribution of latent features for the flows can be found by the VAE when the statistical information of flows is extracted from 70 or more packets. The accuracy decreases the most at N_Packet = 20 because such a small number of packets is insufficient to convey the characteristics of flows, which in turn makes it difficult for the VAE to extract an appropriate distribution of latent features. Furthermore, if the number of required packets is insufficient, a flow may be misclassified because similar features can be extracted from sections of flows belonging to different service classes.
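The per-flow extraction step could be sketched as follows; the paper's six statistical features are defined in its earlier sections, so the feature names below are placeholders chosen for illustration only:

```python
import numpy as np

def flow_features(packets, n_required=70):
    """Illustrative statistics computed from the first n_required packets of a flow.

    packets: list of (timestamp_seconds, size_bytes) tuples.
    Returns None until enough packets have been observed, mirroring the idea
    that a flow cannot be characterized from too few packets.
    """
    if len(packets) < n_required:
        return None  # not enough packets to characterize the flow yet
    ts, sizes = zip(*packets[:n_required])
    iat = np.diff(ts)  # packet inter-arrival times
    return {
        "size_mean": float(np.mean(sizes)),
        "size_std": float(np.std(sizes)),
        "iat_mean": float(np.mean(iat)),
        "iat_std": float(np.std(iat)),
    }
```
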

Fig. 7. Accuracy according to the number of required packets.

5.4. Similarity calculation methods

In this study, two similarity calculation methods were used to summarize the similarities with the k service class flows: Min_Class and Avg_Class. Min_Class selects the smallest value among the k similarity scores, whereas Avg_Class calculates the mean of the k similarity scores. Fig. 8 shows the accuracy of the two methods when N_Packet = 70 and N_ServiceClass = 10. The Min_Class method shows approximately 30% higher performance than Avg_Class, regardless of the size of the latent vectors. Avg_Class uses the mean of the similarity values of the k service class flows; in this case, there may be a service class flow whose similarity to the others in the same service class is quite different. This problem lowers the performance of Avg_Class and is often observed in methods that rely on averages. In contrast, Min_Class selects the flow with the smallest difference. As a result, the likelihood of selecting an incorrect service class is reduced even if flows with varying degrees of similarity are included in the correct service class.
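A minimal sketch of the two summarizing strategies is given below. The histogram-style JSD over discrete distributions is an assumption made for illustration; the paper compares latent-feature distributions via JSD, but the exact discretization is not shown in this section:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete probability vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))  # KL divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def classify(query, class_flows, mode="min"):
    """Assign the query distribution to the most similar service class.

    class_flows: {class_name: list of k reference distributions}.
    mode="min" mimics Min_Class (smallest of the k scores);
    mode="avg" mimics Avg_Class (mean of the k scores).
    """
    scores = {}
    for name, refs in class_flows.items():
        d = [jsd(query, r) for r in refs]
        scores[name] = min(d) if mode == "min" else sum(d) / len(d)
    return min(scores, key=scores.get)  # class with the smallest divergence
```
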

5.5. Performance comparison

The performance of the proposed method was verified by comparing its accuracy with those of five other classification methods: statistics-based classification, MLP, AE+MLP, VAE+MLP, and SVM. The statistics-based classification method specifies ranges of the six statistical features for each service class and assigns flows to one of the six ranges. The MLP is one of the most basic and widely used deep learning classifiers, and the SVM is a machine learning algorithm commonly used in traffic classification research (Wang et al., 2019; Li and Pan, 2020; Alzoman and Alenazi, 2021; Shahraki et al., 2022). The MLP model used was composed of one input layer and two hidden layers (with 32 and 16 neurons, respectively). The output layer consisted of six neurons with a softmax classifier. The AE model shared the same network architecture as the VAE model except for the layer producing the means and standard deviations. The AE+MLP uses only the encoder part of the trained AE and combines it with the MLP instead of the decoder. The VAE+MLP likewise uses the trained VAE encoder. The SVM model was implemented using the scikit-learn machine learning suite (Scikit-learn, 2021), with the same training and testing settings as for the VAE model.
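The SVM and MLP baselines could be set up roughly as follows with scikit-learn. The data here are random placeholders, not the paper's flows, and the hyperparameters beyond the stated hidden-layer sizes are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Random placeholder data: 6 statistical features, 6 service classes.
X = rng.random((600, 6))
y = rng.integers(0, 6, 600)

# SVM baseline (the paper uses scikit-learn's SVM implementation).
svm = SVC().fit(X, y)

# MLP baseline with the hidden-layer sizes stated in the text (32 and 16 neurons);
# scikit-learn's MLPClassifier applies a softmax output for multi-class problems.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300).fit(X, y)
```
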

The performances of the classification methods combining the MLP with the autoencoder (AE) and the VAE were first evaluated against the proposed method. Fig. 9 depicts the accuracy of the AE-based classification methods according to the size of the latent vectors. The accuracy increases with the size of the latent vectors for all three methods, and the highest accuracy is observed at N_Latent = 4. However, the accuracy decreases when N_Latent > 4 because more training data are required owing to the curse of dimensionality.

Fig. 10 shows the classification accuracies of all the compared methods when N_Packet = 70, N_ServiceClass = 10, N_Latent = 4, and the Min_Class method are used, the settings that showed the best performance in the previous experiments. The statistics-based classification, MLP, AE+MLP, VAE+MLP, and SVM show accuracies of 37%, 42%, 50%, 30%, and 63%, respectively, whereas the proposed classification method achieves a high accuracy of 89%. The accuracy of the VAE+MLP is the lowest at 30%.

The MLP in the VAE+MLP is trained using latent features sampled from the VAE output distribution. However, even if flows belong to the same service class, different values may be obtained when sampled values from the output distribution are used. Furthermore, randomly sampled data from the distribution may overlap with data from different service classes. As a result, identical values sampled from different service classes make it difficult to train the network. The proposed classification method outperforms the other methods because it extracts the distribution of latent features from the VAE encoder and compares the distributions using JSD, without sampling. This result clearly suggests that probability distributions of latent features, rather than specific values of latent features, can serve as more stable features, as in Possebon et al. (2019). Furthermore, as shown in Fig. 11, the proposed method demonstrated high classification accuracy across all service classes, whereas the classification accuracy of the other methods varied across service classes. Except for the SVM and the proposed method, all methods tended to assign network flows to specific service classes regardless of their actual service classes; for instance, the MLP, AE+MLP, and VAE+MLP frequently misclassified flows as the real-time streaming service class. Another important factor behind the improved accuracy and robustness is the use of the similarity scores of the k service class flows, which plays a role similar to ensemble learning: by comparing the distributions of the latent features of the target flow with those of several flow instances of the same service class, it is more likely that flows following similar probability distributions will be found.

6. Discussion and conclusions

This paper proposed a new flow classification method that uses a VAE in an SDN environment to leverage empirical distributions of latent features learned from network flow information. The service class of a query flow is determined by summarizing the k flow similarity scores obtained by comparing the distributions through JSD. The proposed method exhibited an average accuracy of 89%, which was 40 percentage points higher on average than those of the conventional statistics-based and machine learning-based classification methods, namely MLP, AE+MLP, VAE+MLP, and SVM. Although it is intended for use in the SDN environment, where flow information can be easily collected at an SDN controller, the proposed method is not limited to SDN and can be used in traditional network architectures whenever network traffic information can be collected and monitored.

Future research lies in several areas. The input to a deep learning model during training and testing has a significant impact on model performance. Instead of manually selecting important network flow information, one of our future research goals is to develop algorithms that automatically extract latent features from a wide range of raw data. Furthermore, the proposed method should be improved to support dynamic environments where new types of services that have not been seen before are added or where the characteristics of services change dynamically in real time through online training. Finally, we plan to evaluate the performance of the proposed method using various public datasets and assess its behaviors according to the characteristics of the training samples such as volume and imbalance.

CRediT authorship contribution statement

Yehoon Jang: Methodology, Software, Validation, Investigation, Writing – original draft, Writing – review & editing. Namgi Kim: Methodology, Validation, Formal analysis, Investigation, Visualization, Writing – original draft, Writing – review & editing. Byoung-Dai Lee: Conceptualization, Validation, Writing – original draft, Writing – review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (grant number 2020R1A6A1A03040583).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.


¹ Equally contributed.