| Dataset | Train | Eval | Domain | Role in your work | Size | Format | Verified details |
|---|---|---|---|---|---|---|---|
| NetConfEval (Wang et al., KTH / Red Hat, 2024) | ⚠ partial | yes | Network config | Main benchmark | Hundreds of tasks (5 iterations per task type) | JSONL, instruction-output pairs | 4 tasks verified: (1) formal spec translation from Config2Spec policies, (2) API/function-call generation, (3) routing-algorithm code (Python), (4) low-level config for OSPF/RIP/BGP/RIFT. Runner-up best paper, ACM CoNEXT 2024. Key finding: small models handle spec/API tasks; only larger models (GPT-4 scale) handle routing code generation. |
| TeleQnA (Maatouk et al., 2023) | no | yes | Telecom | Domain benchmark | 10,000 MCQ questions | JSON, multiple-choice | 5 categories: telecom lexicon, research overview, publications, standards overview, 3GPP standards specs. Drawn from 3GPP and IEEE sources. Dataset access via GitHub (password: teleqnadataset). GPT-4 and GPT-3.5 both struggle on complex standards questions; LLMs rival telecom professionals on easier categories. |
| Tele-Eval (Maatouk et al. / Tele-LLMs, 2024) | no | yes | Telecom | Domain benchmark | 750,000 QA pairs | Open-ended QA (LLM-judged) | Used to benchmark the Tele-LLMs series. Evaluation uses Mixtral 8x7B-Instruct as judge (yes/no correctness scoring). Covers scholarly material, standards, and general telecom. Used for eval only, not for fine-tuning in your work. |
| Tele-Data (Maatouk et al. / Tele-LLMs, 2024) | yes | no | Telecom | Domain pretraining | Large (arXiv papers + 3GPP + Wikipedia + CommonCrawl) | Raw text (continual pretraining) | 4 sources: arXiv CS/EE papers (filtered from 610k), 3GPP standards, telecom Wikipedia articles, CommonCrawl telecom pages. Used for continual pretraining before instruction tuning. Useful if you want domain-vocabulary grounding before fine-tuning on NetConfEval. |
| NetBench (arXiv 2024) | to verify | yes | Network traffic | Supplementary eval | ~5,390 samples | Packet/traffic classification | Correction from previous table: NetBench is a network traffic analysis benchmark (packet classification, encrypted traffic), not a general networking QA dataset. May be less directly relevant to config-generation tasks in NetConfEval; verify task alignment before use. |
| Cisco CCNA / exam materials | yes | yes | Network config | Domain tuning | Medium | QA / instruction-completion | Cisco-specific syntax (IOS CLI), OSPF/BGP configuration exercises. Useful for config-syntax grounding. Not publicly standardized; source and quality may vary depending on collection method. |
| GLUE | no | yes | General NLP | Method comparison only | Large (multiple tasks) | Classification / NLI / QA | Used only to compare fine-tuning methods (LoRA vs LISA vs selective FT) on standard benchmarks. Not network-specific. Useful for showing your method does not cause catastrophic forgetting of general NLP capabilities. |
| SQuAD | no | yes | General NLP | Forgetting check | 100k+ questions | Extractive QA | Reading comprehension. Used as a forgetting/retention benchmark: does the model still answer general QA after domain fine-tuning? Eval only in your work. |
| Mobile-LLaMA data (Kan et al., IEEE 2024) | no | reference only | 5G / network | Not recommended | 15,111 self-instruct sets | JSON self-instruct | Real 5G data: BGP routing tables, pcap captures, UE traffic traces from IEEE Dataport. Used internally for LLaMA-2 13B fine-tuning. Highly specific to 5G NWDAF analytics; not suitable for config-generation tasks. Reference for methodology only. |
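Since NetConfEval (row 1) distributes its tasks as JSONL instruction-output pairs, a minimal loader is useful as a starting point. This is a sketch only: the field names `instruction` and `output` are assumptions, and the real schema may differ per task, so check the released files before relying on them.

```python
import json

def load_jsonl_pairs(path, instruction_key="instruction", output_key="output"):
    """Load (instruction, output) pairs from a JSONL file: one JSON object per line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:          # tolerate blank lines
                continue
            record = json.loads(line)
            pairs.append((record[instruction_key], record[output_key]))
    return pairs
```

The `instruction_key`/`output_key` parameters let the same loader handle a task file that uses different field names.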
| Paper | Venue | Method | Layer finding | vs Full FT | vs LoRA | Params | Usable for networking? |
|---|---|---|---|---|---|---|---|
| LISA (Pan et al., 2024; arXiv 2403.17919) | NeurIPS 2024 | Observes weight-norm skewness in LoRA (on Alpaca-GPT4). Always keeps embedding + LM head active; randomly samples 2 middle layers per step. | Embedding and LM head have much larger weight norms than middle layers in LoRA; middle layers can be randomly frozen without loss. | Beats or matches | +10–35% MT-Bench | Embedding + head + 2 layers | Recommended. Memory as low as LoRA. Tested on LLaMA-2 7B–70B, Mistral, Phi-2, TinyLlama. Key: the weight-norm observation was made on general data; whether it holds for network config data is an open question your work can answer. |
| IST / OWS (outlier-weighted layerwise sampling) | NAACL 2025 | Non-uniform layer sampling based on outlier gradient norms. By default samples 2 layers like LISA but weights sampling toward high-norm layers. | Layer importance is non-uniform and skewed toward outlier layers. Uniform sampling (vanilla LoRA) wastes capacity on unimportant layers. | On par | Beats vanilla LoRA | Subset of layers | Recommended. Can be combined with LISA. A gradient-norm probe on your network data identifies which specific layers are high-norm for config tasks: a 1-mini-batch operation. |
| Surgical fine-tuning (Lee et al., 2023) | ICLR 2023 | Tunes only 1 contiguous block of layers; the block is chosen based on the type of distribution shift. | Early layers → input-level shift; middle layers → semantic/feature shift; last layers → output/label shift. Full FT forgets pretrained features on small data. | Beats Full FT | Not compared | ~1 block only | Partial. Informs which block to probe first for each NetConfEval sub-task: config generation = output shift (last layers?); formal spec = semantic shift (middle?). Needs empirical validation on your data. |
| Similarity-metric layer selection (arXiv 2602.05988, 2025) | arXiv 2025 | Measures cosine similarity / CKA between each layer's input and output using the pretrained model only. High similarity = layer doing little = freeze it. | Not all layers transform representations equally; many middle layers have very high input-output similarity and can be safely frozen. | Matches (up to 50% param reduction) | Beats on math/code | Up to 50% reduction | Can use as baseline. No dataset needed; purely model-intrinsic. Good ablation baseline to compare against task-aware selection. Does not know your task is networking. |
| FLoE (Fisher layer selection) | arXiv | Computes Fisher information scores per parameter block using a small task-data sample; applies LoRA only to high-scoring blocks. | Fisher scores identify which parameters are most sensitive to the task loss; not all layers need LoRA adapters. | On par | Beats vanilla LoRA | Sparse LoRA | Can use. Uses your task data (small sample). More principled than gradient norm but more expensive. Good for a thorough ablation if compute allows. |
| DoRA (Liu et al., 2024) | ICML 2024 (Oral) | Decomposes each weight matrix into magnitude + direction components; applies LoRA only to the direction component. | LoRA changes both magnitude and direction simultaneously; separating them enables more stable, controllable updates. | Matches / slightly below | +3.7–4.4 on reasoning benchmarks | Low-rank + magnitude | Can use. Consistently outperforms vanilla LoRA; a drop-in replacement for it. Useful as a stronger LoRA baseline in your ablations. |
| CHILD-TUNING (Xu et al., 2021) | EMNLP 2021 | Bernoulli mask over gradients; the task-driven variant (CHILD-TUNING_D) uses Fisher gradients to select the child network. | Task-driven child-network selection outperforms full FT on all 10 GLUE tasks; not all parameters contribute equally. | Beats Full FT | Beats LoRA | 0.1–0.4% | Informational. Supports the selective-FT motivation. Gradient masking is more aggressive than LISA; less directly applicable to decoder LLMs but validates the principle. |
| GPS (gradient parameter selection) | CVPR 2024 | Parameter-level gradient-based selection across all layers. | Individual parameter importance varies; not all gradients are useful. | Matches | Beats LoRA (ViT) | Sparse params | Not applicable. Designed for Vision Transformers; operates at the individual-parameter level, not layer level, and processes all gradients, which is too expensive for large LLMs. Do not use. |
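The LISA row above boils down to a simple per-step schedule: embedding and LM head are always trainable, plus two randomly sampled transformer layers. A minimal sketch of that schedule, assuming LLaMA-style module names (`embed_tokens`, `lm_head`, `layers.{i}`); the actual freezing/unfreezing of parameters inside the training loop is omitted.

```python
import random

def lisa_active_layers(num_layers, num_sampled=2, rng=random):
    """LISA-style sampling for one training step: embedding and LM head are
    always trainable, plus `num_sampled` randomly chosen transformer layers."""
    sampled = rng.sample(range(num_layers), num_sampled)  # without replacement
    return {"embed_tokens", "lm_head"} | {f"layers.{i}" for i in sorted(sampled)}
```

Calling this once per step (or once per K steps, as LISA does) yields the set of module-name prefixes whose parameters get `requires_grad=True` for that interval.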
| Paper | Venue | Base model | Fine-tuning method | Task | Dataset used | Key verified result |
|---|---|---|---|---|---|---|
| NetLLM (Wu et al., 2024) | SIGCOMM 2024 | LLaMA-2 7B | DD-LRNA: low-rank matrices (0.31% of params) + data-driven RL. Not standard LoRA; uses an offline experience pool to eliminate live environment interaction. | Viewport prediction (VP), adaptive bitrate streaming (ABR), cluster job scheduling (CJS) | Envivio-Dash3, FCC bandwidth traces, TPC-H (ISPASS'16), Jin2022 (SIGMM) | Low-rank matrices = 0.31% of total parameters. Cuts GPU memory by 60.9% and training time by 15.1% vs full FT. NetLLM-adapted LLaMA-2 significantly outperforms SoTA DNN baselines on all 3 tasks. First "one model for all networking tasks" framework. |
| NetConfEval (Wang et al., 2024) | CoNEXT 2024 | GPT-4, GPT-4-Turbo, GPT-4o, open HuggingFace models | Zero-shot and few-shot prompting only (no fine-tuning in the paper itself); the benchmark is designed for both evaluating and fine-tuning. | 4 tasks: formal spec, API call generation, routing-algorithm code, low-level device config (OSPF/RIP/BGP/RIFT) | Config2Spec policy dataset (task 1), Kathará network emulator scenarios (task 4) | Small models suffice for spec/API tasks; GPT-4 required for routing code generation. Breaking tasks into subtasks significantly improves accuracy. GPT-4 handles simple policy conflicts but struggles with complex ones. Runner-up best paper at CoNEXT 2024. |
| Tele-LLMs (Maatouk et al., 2024) | arXiv 2409.05314 | TinyLlama-1.1B, Gemma-2B, Gemma-2-2B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3-8B | 2-stage: (1) continual pretraining on Tele-Data with full params, (2) instruction fine-tuning. LoRA was also tested in initial experiments; full-param FT was found better for this domain. | Telecom QA, standards understanding, mathematical modeling of telecom systems | Tele-Data (arXiv + 3GPP + Wikipedia + CommonCrawl), Tele-Eval (750k QA), TeleQnA | 25% average relative improvement on Tele-Eval. Smaller adapted models rival larger general models on telecom benchmarks. Retain general capabilities (MMLU, commonsense): no catastrophic forgetting. Full-param fine-tuning outperformed LoRA for this domain-adaptation task. |
| Mobile-LLaMA (Kan et al., 2024) | IEEE Network 2024 | LLaMA-2 13B | Instruction fine-tuning via self-instruct (15,111 instruction sets generated with OpenAI APIs from real 5G data) | Packet-capture analysis, IP routing-table analysis, performance analysis for 5G NWDAF | Real 5G datasets: BGP routing tables, pcap files, UE traffic traces (IEEE Dataport) | Scored 247/300 vs GPT-3.5's 209/300 on code-generation tasks. Shows domain-specific instruction fine-tuning on real network data outperforms general models. Dataset is NWDAF-specific; not recommended for config-generation tasks. |
| TeleQnA benchmark (Maatouk et al., 2023) | IEEE / arXiv 2023 | GPT-3.5, GPT-4, Mixtral 8x7B (evaluation only) | Zero-shot evaluation; no fine-tuning | Telecom MCQ: 5 categories from standards and research | TeleQnA (10,000 questions from 3GPP + IEEE) | LLMs struggle with complex 3GPP standards questions. Performance improves significantly when relevant knowledge context is provided (RAG-style). LLMs rival active telecom professionals on general telecom categories. First telecom-specific LLM benchmark. |
| MeshAgent (Zaoxing et al., 2026) | SIGMETRICS 2026 | Not confirmed (preprint) | LLM-based multi-agent framework for mesh network config | Mesh network configuration, multi-agent coordination | Not confirmed (preprint, PDF inaccessible) | Preprint only; full details unverified. Applies LLM agents to mesh networking. Relevant as emerging work in the space. Treat as reference only until a published version is available. |
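Several rows above quote trainable-parameter fractions (e.g. NetLLM's 0.31%). For a standard low-rank adapter on a (d_out, d_in) weight matrix, the added parameter count is r·(d_in + d_out), which makes such fractions easy to sanity-check. The sketch below uses that standard formula with illustrative shapes; it is not NetLLM's exact DD-LRNA configuration.

```python
def lowrank_fraction(layer_shapes, rank):
    """Fraction of trainable parameters when each (d_out, d_in) weight matrix
    gets a rank-`rank` adapter of rank * (d_in + d_out) parameters."""
    base = sum(d_out * d_in for d_out, d_in in layer_shapes)
    adapter = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return adapter / base
```

For a single 4096×4096 projection at rank 8 this gives 65,536 / 16,777,216 ≈ 0.39%, the same order of magnitude as the 0.31% quoted for NetLLM.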
| Method | Source | Dataset needed? | What data used and how | Pre or during? | Selection basis | Cost | Task-aware? | Use in your work |
|---|---|---|---|---|---|---|---|---|
| Similarity metric (arXiv 2602.05988, 2025) | arXiv 2025 | No dataset; purely model-intrinsic | None. Runs a forward pass through the pretrained model and measures cosine similarity or CKA between each layer's input and output representations. | Pre-selection | High input-output similarity = layer transforming little = safe to freeze; low similarity = layer is active = apply LoRA. | Very low (single forward pass, no labels) | No (knows nothing about networking or your task) | Use as ablation baseline. Compare your task-aware selection against this model-intrinsic baseline to show domain-awareness matters. |
| Gradient norm probe (IST / OWS, NAACL 2025) | NAACL 2025 | Small batch (1 mini-batch of task data) | Run 1 forward + backward pass on a small batch of your task data (e.g. NetConfEval samples); compute the gradient norm per layer from your task's loss signal. | Pre-selection | Layers with high gradient norm = most sensitive to your task loss = important to update. | Very low (minutes, 1 mini-batch) | Partially (uses your data but measures statistical sensitivity, not semantic importance) | First step. Run before training on NIT/NetConfEval data to get initial candidate layers. Cheap and informative. |
| Fisher information scoring (FLoE) | arXiv | Small batch (sample of task data) | Estimate the Fisher information matrix using a sample of your task data; the Fisher score per parameter block = sensitivity of the task loss to that parameter. | Pre-selection | High Fisher score = parameter is task-critical = apply LoRA there; low score = safe to freeze. | Medium (Fisher estimation is more expensive than gradient norm) | Partially (task-loss driven but not interpretable about what each layer semantically encodes) | Can use. More principled than gradient norm. Good for an ablation study comparing selection strategies. |
| Probing classifiers | General practice | Full dataset (labeled task data required) | Attach a small linear classifier to each transformer layer's output; train each probe on your labeled task data (e.g. NetConfEval instruction-output pairs) and measure probe accuracy per layer. | Pre-selection | Layers where probe accuracy is highest = most task-relevant representations = tune these. | Medium (train one probe per layer) | Yes (directly measures which layer representations encode task-relevant features) | Best for your research. The only method that tells you what each layer semantically encodes for networking. Core contribution of your domain-aware study. |
| Binary mask learning (ILA) | arXiv | Full dataset (full task data for a short pre-run) | Short pre-training run on your task data while learning a binary mask over layers; the mask converges to select the layers that reduce task loss most effectively. | Pre-selection | The learned mask identifies which layers contribute most to task-loss reduction. | Medium-high (requires a full short training run) | Partially (task-loss driven, but the mask is binary and not interpretable about what each layer encodes) | Optional. Useful if compute allows. Less interpretable than probing classifiers for your research goals. |
| LISA sampling (Pan et al., NeurIPS 2024) | NeurIPS 2024 | Full dataset (used at runtime during training) | Full training dataset used during training. Layer-sampling probabilities follow the weight-norm skewness observed across layers during training steps. Always active: embedding + LM head; randomly sampled: 2 middle layers per step. | During training (not pre-selection; adapts dynamically each step) | Weight-norm skewness of middle layers (observed in LoRA); embedding and LM head always have the highest norms, so they stay active. | Low (no extra cost; runs inside the training loop) | Partially (adapts to your data dynamically, but the selection basis is weight norms, not semantic domain knowledge) | Recommended training strategy. Replace vanilla LoRA with LISA as default; combine with the gradient-norm probe for pre-selection. Key open question: does the weight-norm skewness pattern hold for network config data? |
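The similarity-metric row needs no task data at all: cache each layer's input and output activations from one forward pass, score their cosine similarity, and freeze the highest-similarity layers. A minimal numpy sketch on plain arrays; activation caching from a real transformer (e.g. via forward hooks) is omitted, the function names are illustrative, and CKA would replace the cosine score in a fuller version.

```python
import numpy as np

def layer_similarity_scores(layer_inputs, layer_outputs):
    """Mean cosine similarity between each layer's input and output activations
    (rows = tokens). High similarity = the layer transforms representations little."""
    scores = []
    for x, y in zip(layer_inputs, layer_outputs):
        num = np.sum(x * y, axis=-1)
        denom = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-12
        scores.append(float(np.mean(num / denom)))
    return scores

def freeze_candidates(scores, freeze_fraction=0.5):
    """Indices of the highest-similarity layers: the candidate freeze set."""
    n_freeze = int(len(scores) * freeze_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_freeze])
```

Running the same two functions on both general text and network configs would show directly whether the model-intrinsic freeze set shifts with the domain, which is the comparison the table recommends.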
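The gradient-norm probe row is a one-mini-batch computation: one forward + backward pass, then the L2 norm of each layer's weight gradient. A self-contained sketch on a toy MLP with manual backprop; a real run would use autograd on the actual model with its LM loss, so the tanh/MSE choices and shapes here are illustrative only.

```python
import numpy as np

def gradient_norm_probe(weights, x, y):
    """One forward + backward pass through a small tanh MLP (linear output,
    MSE loss); returns per-layer L2 gradient norms and a high-to-low ranking."""
    # Forward pass, caching each layer's input activation.
    acts = [x]
    for i, w in enumerate(weights):
        z = acts[-1] @ w
        acts.append(np.tanh(z) if i < len(weights) - 1 else z)
    # Backward pass: g starts as dL/d(output) for L = 0.5 * mean((out - y)^2).
    g = (acts[-1] - y) / x.shape[0]
    norms = {}
    for i in range(len(weights) - 1, -1, -1):
        norms[f"layer_{i}"] = float(np.linalg.norm(acts[i].T @ g))
        g = g @ weights[i].T                  # gradient w.r.t. layer i's input
        if i > 0:
            g = g * (1.0 - acts[i] ** 2)      # tanh' at the cached activation
    ranked = sorted(norms, key=norms.get, reverse=True)
    return norms, ranked
```

Ranking layers by these norms mirrors the table's selection basis: the high-norm layers are the first candidates to update on NetConfEval data.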