Network Intrusion Data Generation via Deep Representation Learning

Gabin Noblet PhD defense

03.04.26 - 03.04.26

sara

This thesis addresses the challenge of generating labeled synthetic network traffic data, which is essential for training and evaluating machine learning-based intrusion detection systems. The scarcity of diverse, contemporary, labeled datasets limits the effectiveness of these systems, particularly against sophisticated attacks such as Advanced Persistent Threats (APTs).
We develop NetGlyphizer, a discrete representation learning method inspired by VQ-VAE that converts network traffic into sequences of discrete tokens (NetGlyphs). To achieve this, we propose Nexus, a tool that represents network traffic in the Nxcap format, a minimalist network data format that represents flows as packet sequences. This format captures the temporal and structural distribution of packets within a flow in greater detail than sets of descriptive statistics do. NetGlyphizer encodes these flows into NetGlyphs, which a label-conditioned generative model based on the Transformer architecture uses to produce labeled sequences. These sequences are then decoded back into network flows and exported to Pcap format via the Nexus tool. Label conditioning enables the generation of specific traffic for various scenarios or traffic classes while preserving the statistical and protocol properties of the original data.
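To illustrate the idea of turning a flow into discrete tokens, here is a minimal sketch of the vector quantization step that VQ-VAE-style methods rely on: each vector is replaced by the index of its nearest entry in a codebook, and decoding looks the entry back up. This is not the NetGlyphizer implementation. The per-packet features (size, inter-arrival time, direction), the hand-written codebook, and the function names `quantize` and `dequantize` are all assumptions made for illustration; in a VQ-VAE the codebook is learned and quantization is applied to encoder outputs rather than to raw packet features.

```python
import numpy as np

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    # Squared Euclidean distances, shape (n_vectors, n_codes).
    diffs = vectors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def dequantize(tokens, codebook):
    """Map token indices back to their codebook vectors (lossy decoding)."""
    return codebook[tokens]

# Hypothetical flow: one row per packet,
# columns = (size in bytes, inter-arrival time in ms, direction 0/1).
flow = np.array([
    [60.0, 0.0, 0.0],
    [1500.0, 2.0, 1.0],
    [64.0, 1.0, 0.0],
])

# Hypothetical two-entry codebook (a learned codebook would be far larger).
codebook = np.array([
    [64.0, 0.5, 0.0],
    [1400.0, 2.0, 1.0],
])

tokens = quantize(flow, codebook)             # discrete token sequence
reconstructed = dequantize(tokens, codebook)  # approximate flow
print(tokens.tolist())
```

A sequence model can then be trained on token sequences like `tokens`, and generated sequences can be passed through the decoding step to recover approximate flows.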
Results confirm that the synthetic traffic faithfully reproduces the characteristics of real traffic in terms of statistical distributions and protocol compliance. This work introduces: (1) the Nxcap format and the Nexus tool, (2) NetGlyphizer, a discrete representation learning mechanism for network traffic, and (3) a Transformer-based method for controlled generation of synthetic traffic that is compatible with existing analysis tools.

published on 01.04.26