2022-09-07
Molding the symbiosis between human and machine : contributions to anomaly detection, model evaluation, and active learning
Publication
The enormous amounts of data make it increasingly difficult for a human to assign meaning to individual data points. For example, if several users are active on different devices in a computer network, all kinds of connections are registered: browsing websites, scrolling through social media, video calling with friends, and so on. However, these 'human' labels are usually not stored by the computer. And, simply due to time restrictions, it is not possible for a cyber expert to interpret all the traffic. Fortunately, over the past decades, the field of data science has grown at lightning speed. Machine Learning (ML) is often used to extract information from all kinds of data. ML comprises mathematical algorithms that learn patterns in data and can make predictions about it. Since a computer can perform calculations far faster than a human, huge datasets can be interpreted in considerably less time. However, ML typically has difficulty recognising patterns in data that it has not been able to learn from before. People, on the other hand, are good at assigning meaning to deviating data points by using domain knowledge. Therefore, we investigate how we can combine human knowledge and computer power to arrive at better and more understandable predictions. This thesis focuses on molding the symbiosis between human and machine. We consider the following three themes: Anomaly Detection, Model Evaluation and Active Learning. Anomaly detection algorithms are widely used within network intrusion detection, the field that deals with finding cyber attacks in network data.
In Chapters 2 and 3, we apply algorithms to detect malicious points in, respectively, real cyber data and online aeroplane booking data that contains fraudulent reservations. First, we remove redundant or uninformative data features and create new variables by using domain knowledge and by combining or standardising other features. In addition, we demonstrate how we deal with unlabelled and partially labelled data. Finally, we involve human experts by having them evaluate the results. By having the experts assess data points in a targeted manner, we efficiently obtain an indication of the predictive power of ML algorithms. Moreover, malicious data can be found faster. It is difficult to determine the predictive power of an ML model when too little fully labelled evaluation data is available.
Therefore, in Chapter 4 we develop a new evaluation metric that provides a good and robust estimate of the predictive power when no negative labels are available.
In Chapter 5, we consider Model Evaluation more fundamentally using a common problem that arises when evaluating an algorithm. Often an evaluation metric is used to determine predictive power. For example, a metric score of 80% sounds good, but what is this judgement based on? To answer this question, we introduce the Dutch Draw as a method to generate general, simple, but informative baselines. In the Active Learning (AL) paradigm, the human expert becomes an integral part of the methodology as a labelling source. AL is important in fields where labelled data is relatively scarce.
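To make the question about an 80% score concrete, the sketch below computes the expected accuracy of a random baseline that labels a fixed fraction of the points positive, chosen uniformly at random. The abstract does not spell out how the Dutch Draw is defined, so this is only an illustration of the general idea of a random-guess baseline; the function names and the parameter `theta` are assumptions made for this example.

```python
from fractions import Fraction


def expected_accuracy_random_baseline(positives, negatives, theta):
    """Expected accuracy when round(M * theta) of the M points, drawn
    uniformly at random, are predicted positive (hypothetical helper)."""
    total = positives + negatives
    predicted_positive = round(total * theta)
    frac = Fraction(predicted_positive, total)
    # A uniformly drawn subset contains positives and negatives in the
    # same proportions as the full dataset, in expectation.
    expected_tp = frac * positives
    expected_tn = (1 - frac) * negatives
    return (expected_tp + expected_tn) / total


def best_accuracy_baseline(positives, negatives):
    """Best expected accuracy over all possible fractions theta."""
    total = positives + negatives
    return max(
        expected_accuracy_random_baseline(positives, negatives, Fraction(k, total))
        for k in range(total + 1)
    )
```

On a dataset with 20 positive and 80 negative points, `best_accuracy_baseline(20, 80)` equals 4/5: a model that reports 80% accuracy there does no better than never predicting the positive class.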
In Chapters 6 and 7, we develop new AL methods for network intrusion detection. The first method, Jasmine, has a dynamic selection procedure built in. This means that unlabelled data is given meaning in a more effective manner, allowing malicious network traffic to be discovered at an earlier stage. In Chapter 7, we improve the Jasmine methodology and extend it to Plusmine. We refine the dynamic selection procedure and include Automatic Classification (AC), which increases the speed of labelling. An important advantage of our method is that it is relatively simple and therefore requires few extra computational resources.
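As a point of reference for what a selection procedure does in active learning, the sketch below shows plain uncertainty sampling: the points whose predicted score is closest to 0.5 are sent to the human expert first. This is a generic textbook strategy, not the dynamic selection procedure of Jasmine or Plusmine, which the abstract does not describe; the function name, the score list and the budget are hypothetical.

```python
def select_queries(scores, budget):
    """Return the indices of the `budget` most uncertain points.

    `scores` holds a model's estimated probability that each unlabelled
    point is malicious (assumed input); a score near 0.5 means the model
    is least sure, so that point is the most informative to label.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:budget]


# Hypothetical scores for five unlabelled connections.
scores = [0.05, 0.48, 0.90, 0.55, 0.30]
to_label = select_queries(scores, budget=2)
```

Here `to_label` contains the indices of the two connections with scores 0.48 and 0.55; after the expert labels them, the model would be retrained and the selection repeated.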
Additional Metadata | |
---|---|
R.D. van der Mei (Rob), S. Bhulai (Sandjai) | |
Vrije Universiteit Amsterdam | |
hdl.handle.net/1871.1/4716b1e2-bb4a-45ab-94ca-f4a986a89e54 | |
Organisation | Stochastics |
Klein, J. (2022, September 7). Molding the symbiosis between human and machine : contributions to anomaly detection, model evaluation, and active learning. Retrieved from http://hdl.handle.net/1871.1/4716b1e2-bb4a-45ab-94ca-f4a986a89e54 |