11 - Towards Comparability of Intrusion Detection Systems: New Data Sets

Robert Koch (UNIBW), Mario Golling and Gabi Dreo Rodosek (UNIBW)

Alongside the overwhelming success of the Internet and its spread into all areas of life, attacks have risen dramatically over the past few years, both in quality and in quantity. As a consequence, intrusion detection systems (IDSs), in addition to other mechanisms, are under intense investigation. Different systems have been put on the market, and numerous research contributions continuously try to improve their performance with respect to detection rates, false alarm rates and the data rates that can be processed. Despite more than 30 years of research, the practical deployment of IDSs still suffers from substantial weaknesses: knowledge-based systems are no longer able to withstand the latest malware adequately. The signature bases in use have kept growing over the past years.
However, this growth solves the real problem less and less. First, a signature must be developed for each new piece of malware, which delays detection. Second, as a study by McAfee shows, these signatures are becoming less effective: on average, a signature takes effect fewer than 10 times worldwide. Behaviour-based systems, on the other hand, still suffer from high false alarm rates, which complicates their practical use.
Moreover, an adequate evaluation of proposed improvements to IDSs is itself a major challenge. Even today, evaluations are often based on the DARPA 98/99 data sets. However, these data sets have several shortcomings and have been widely criticized, and researchers have therefore advised against using them for evaluation purposes. In the meantime, several other data sets have been released, for example a redesign of the DARPA data set, the data of the MAWI Working Group, the MoMe Cluster, and data from the Consortium Internet 2, ACM SigComm, the Cooperative Association for Internet Data Analysis (CAIDA), RIPE or the Internet Archive. Unfortunately, none of them has gained broad acceptance in the community: specific scenarios, limited availability, a lack of ground truth, etc., prevent wide application and acceptance.
To address this issue, the corresponding poster aims to encourage Internet Service Providers (ISPs) and companies (small and medium-sized businesses as well as large companies) to contribute to a new, realistic data set that can serve as the basis for a robust evaluation of IDSs. In contrast to other work, we aim to produce a data set that is realistic, up to date and generally applicable. From backbone operators, we want to receive real flow data, while for company networks we propose to collect flow data as well as full packet captures. For legal and privacy reasons, all data will be pseudonymized. In addition, we plan to keep this data set up to date by producing a new release twice a year.
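
As an illustration of what pseudonymizing flow data could look like, the following minimal sketch replaces the IPv4 addresses of a flow record with stable pseudonyms derived from a keyed hash (HMAC-SHA256). It is not the procedure used for the proposed data set, which the abstract does not specify. The record layout (a dict with the keys `src`, `dst`, `sport`, `dport`) and the function names are assumptions made for this example. Note that this simple mapping is not prefix-preserving and that truncating the digest to 32 bits means two different addresses can, in rare cases, receive the same pseudonym.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ip(address: str, key: bytes) -> str:
    """Map an IPv4 address to a stable pseudonym using a keyed hash.

    The same (address, key) pair always yields the same pseudonym, so
    communication patterns between hosts remain analysable.
    """
    packed = ipaddress.IPv4Address(address).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    # Assumption for this sketch: use the first 4 digest bytes as the
    # pseudonymous IPv4 address (collisions are possible but rare).
    return str(ipaddress.IPv4Address(digest[:4]))


def pseudonymize_flow(flow: dict, key: bytes) -> dict:
    """Return a copy of a flow record with both endpoints pseudonymized."""
    out = dict(flow)  # leave the original record untouched
    out["src"] = pseudonymize_ip(flow["src"], key)
    out["dst"] = pseudonymize_ip(flow["dst"], key)
    return out


if __name__ == "__main__":
    secret = b"example-key-kept-by-the-data-provider"  # hypothetical key
    record = {"src": "192.0.2.10", "dst": "198.51.100.7",
              "sport": 51234, "dport": 443}
    print(pseudonymize_flow(record, secret))
```

Keeping the key with the data provider means the mapping cannot be recomputed by recipients of the data set, while a fixed key per release keeps pseudonyms consistent across all records of that release.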
