Datasets
Layer selection papers
Networking LLM papers
Pre-selection methods
Corrections: NetConfEval fine-tune column updated; it is primarily an evaluation benchmark, with partial fine-tuning use only via the Hugging Face dataset. NetBench clarified as a network traffic benchmark, not an NLP QA dataset. Mobile-LLaMA dataset correctly marked as not reusable for your work.
Datasets compared below. Fields per entry: train/eval use, domain, role in your work, size, format, and verified details.
NetConfEval (Wang et al., KTH / Red Hat, 2024)
Train: partial (⚠). Eval: yes. Domain: network config. Role: main benchmark. Size: hundreds of tasks (5 iterations per task type). Format: JSONL, instruction-output pairs.
Verified details: 4 tasks verified: (1) formal spec translation from Config2Spec policies, (2) API/function-call generation, (3) routing algorithm code (Python), (4) low-level config for OSPF/RIP/BGP/RIFT. Runner-up best paper at ACM CoNEXT 2024. Key finding: small models handle the spec/API tasks; only larger models (GPT-4 scale) handle routing code generation.
TeleQnA (Maatouk et al., 2023)
Train: no. Eval: yes. Domain: telecom. Role: domain benchmark. Size: 10,000 MCQ questions. Format: JSON, multiple-choice.
Verified details: 5 categories: telecom lexicon, research overview, publications, standards overview, 3GPP standards specs. Drawn from 3GPP and IEEE sources. Dataset access via GitHub (password: teleqnadataset). GPT-4 and GPT-3.5 both struggle on complex standards questions; LLMs rival telecom professionals on the easier categories.
Tele-Eval (Maatouk et al. / Tele-LLMs, 2024)
Train: no. Eval: yes. Domain: telecom. Role: domain benchmark. Size: 750,000 QA pairs. Format: open-ended QA (LLM-judged).
Verified details: Used to benchmark the Tele-LLMs series. Evaluation uses Mixtral 8x7B-Instruct as judge (Yes/No correctness scoring). Covers scholarly material, standards, and general telecom. Used for eval only in your work, not for fine-tuning.
Tele-Data (Maatouk et al. / Tele-LLMs, 2024)
Train: yes. Eval: no. Domain: telecom. Role: domain pretraining. Size: large (arXiv papers + 3GPP + Wikipedia + CommonCrawl). Format: raw text (continual pretraining).
Verified details: 4 sources: arXiv CS/EE papers (filtered from 610k), 3GPP standards, telecom Wikipedia articles, and CommonCrawl telecom pages. Used for continual pretraining before instruction tuning. Useful if you want domain-vocabulary grounding before fine-tuning on NetConfEval.
NetBench (arXiv 2024)
Train: needs checking. Eval: yes. Domain: network traffic. Role: supplementary eval. Size: ~5,390 samples. Format: packet/traffic classification.
Verified details: Correction from the previous table: NetBench is a network traffic analysis benchmark (packet classification, encrypted traffic), not a general networking QA dataset. It may be less directly relevant to the config generation tasks in NetConfEval; verify task alignment before use.
Cisco CCNA / Exams
Train: yes. Eval: yes. Domain: network config. Role: domain tuning. Size: medium. Format: QA / instruction-completion.
Verified details: Cisco-specific syntax (IOS CLI), OSPF/BGP configuration exercises. Useful for config-syntax grounding. Not publicly standardized; source and quality may vary depending on collection method.
GLUE
Train: no. Eval: yes. Domain: general NLP. Role: method comparison only. Size: large (multiple tasks). Format: classification / NLI / QA.
Verified details: Used only to compare fine-tuning methods (LoRA vs LISA vs selective FT) on standard benchmarks. Not network-specific. Useful for showing your method does not cause catastrophic forgetting of general NLP capabilities.
SQuAD
Train: no. Eval: yes. Domain: general NLP. Role: forgetting check. Size: 100k+ questions. Format: extractive QA.
Verified details: Reading comprehension. Used as a forgetting/retention benchmark: does the model still answer general QA after domain fine-tuning? Eval only in your work.
Mobile-LLaMA data (Kan et al., IEEE 2024)
Train: no. Eval: reference only. Domain: 5G / network. Role: not recommended. Size: 15,111 self-instruct sets. Format: JSON self-instruct.
Verified details: Real 5G data: BGP routing tables, pcap captures, and UE traffic traces from IEEE Dataport. Used internally for LLaMA-2 13B fine-tuning. Highly specific to 5G NWDAF analytics; not suitable for config generation tasks. Reference for methodology only.
Corrections: LISA venue is NeurIPS 2024 (a full conference paper, not a poster). LISA's selection basis is a model-intrinsic weight-norm observation from LoRA experiments on Alpaca-GPT4, not purely task data. The correct venue for IST/OWS is NAACL 2025. GPS (CVPR) confirmed not applicable. DoRA confirmed as an ICML 2024 oral.
Layer selection papers compared below. Fields per entry: venue, method, layer finding, results vs full FT and vs LoRA, trainable params, and usability for networking.
LISA (Pan et al., 2024; arXiv 2403.17919)
Venue: NeurIPS 2024.
Method: Observe weight-norm skewness in LoRA (on Alpaca-GPT4). Always keep the embedding and LM head active; randomly sample 2 middle layers per step.
Layer finding: The embedding and LM head have much larger weight norms than the middle layers in LoRA; middle layers can be randomly frozen without loss.
vs Full FT: beats or matches. vs LoRA: +10–35% MT-Bench. Params: embedding + head + 2 layers.
Usable for networking? Recommended. Memory as low as LoRA. Tested on LLaMA-2 7B–70B, Mistral, Phi-2, TinyLlama. Key: the weight-norm observation was made on general data; whether it holds for network config data is an open question your work can answer.
IST / OWS (outlier-weighted layerwise sampling)
Venue: NAACL 2025.
Method: Non-uniform layer sampling based on outlier gradient norms. By default samples 2 layers like LISA, but weights the sampling toward high-norm layers.
Layer finding: Layer importance is non-uniform and skewed toward outlier layers; uniform sampling (vanilla LoRA) wastes capacity on unimportant layers.
vs Full FT: on par. vs LoRA: beats vanilla LoRA. Params: subset of layers.
Usable for networking? Recommended; can be combined with LISA. A gradient-norm probe on your network data identifies which specific layers are high-norm for config tasks, a one-mini-batch operation.
Surgical fine-tuning (Lee et al., 2023)
Venue: ICLR 2023.
Method: Tune only one contiguous block of layers; the block is chosen based on the type of distribution shift.
Layer finding: Early layers handle input-level shift, middle layers semantic/feature shift, last layers output/label shift. Full FT forgets pretrained features on small data.
vs Full FT: beats full FT. vs LoRA: not compared. Params: ~1 block only.
Usable for networking? Partial. Informs which block to probe first for each NetConfEval sub-task: config generation = output shift (last layers?); formal spec = semantic shift (middle?). Needs empirical validation on your data.
Similarity metric layer selection (arXiv 2602.05988, 2025)
Venue: arXiv 2025.
Method: Measure cosine similarity / CKA between each layer's input and output using the pretrained model only. High similarity means the layer is doing little, so freeze it.
Layer finding: Not all layers transform representations equally; many middle layers have very high input-output similarity and can be safely frozen.
vs Full FT: matches (with up to 50% parameter reduction). vs LoRA: beats on math/code. Params: up to 50% reduction.
Usable for networking? Can use as a baseline. No dataset needed, purely model-intrinsic; a good ablation baseline to compare against task-aware selection. Does not know your task is networking.
FLoE (Fisher layer selection)
Venue: arXiv.
Method: Compute Fisher information scores per parameter block using a small task-data sample; apply LoRA only to high-scoring blocks.
Layer finding: Fisher scores identify which parameters are most sensitive to the task loss; not all layers need LoRA adapters.
vs Full FT: on par. vs LoRA: beats vanilla LoRA. Params: sparse LoRA.
Usable for networking? Can use. Uses your task data (small sample). More principled than gradient norm but more expensive; good for a thorough ablation if compute allows.
DoRA (Liu et al., 2024)
Venue: ICML 2024 (oral).
Method: Decompose each weight matrix into magnitude and direction components; apply LoRA only on the direction component.
Layer finding: LoRA changes both magnitude and direction simultaneously; separating them enables more stable, controllable updates.
vs Full FT: matches / slightly below. vs LoRA: +3.7–4.4 on reasoning benchmarks. Params: low-rank + magnitude.
Usable for networking? Can use. Consistently outperforms vanilla LoRA and is a drop-in replacement; useful as a stronger LoRA baseline in your ablations.
CHILD-TUNING (Xu et al., 2021)
Venue: EMNLP 2021.
Method: Bernoulli mask over gradients; the task-driven variant (CHILD-TUNING_D) uses Fisher gradients to select the child network.
Layer finding: Task-driven child-network selection outperforms full FT on all 10 GLUE tasks; not all parameters contribute equally.
vs Full FT: beats full FT. vs LoRA: beats LoRA. Params: 0.1–0.4%.
Usable for networking? Informational; supports the selective-FT motivation. The gradient-masking approach is more aggressive than LISA and less directly applicable to decoder LLMs, but it validates the principle.
GPS (gradient param selection)
Venue: CVPR 2024.
Method: Parameter-level gradient-based selection across all layers.
Layer finding: Individual parameter importance varies; not all gradients are useful.
vs Full FT: matches. vs LoRA: beats LoRA (ViT). Params: sparse.
Usable for networking? Not applicable. Designed for Vision Transformers, operates at the individual-parameter level rather than the layer level, and processes all gradients, which is too expensive for large LLMs. Do not use.
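To make the DoRA entry above concrete, here is a minimal numpy sketch of its magnitude/direction decomposition. The per-column norm and the zero placeholder delta (standing in for a LoRA update B @ A) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((4, 6))   # pretrained weight matrix

# DoRA-style decomposition: per-column magnitude and unit direction.
m = np.linalg.norm(W, axis=0, keepdims=True)   # magnitude, shape (1, 6)
V = W / m                                      # direction, unit-norm columns

# A LoRA delta would be added to the direction only; the magnitude is its own
# trainable vector. With a zero delta the recomposition must reproduce W.
delta = np.zeros_like(W)                       # placeholder for B @ A
V_new = V + delta
W_new = m * (V_new / np.linalg.norm(V_new, axis=0, keepdims=True))

print(np.allclose(W_new, W))
```

The point of the split is that training can move the direction (via the low-rank delta) and the magnitude independently, which vanilla LoRA cannot do.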
Corrections: NetLLM fine-tuning method is DD-LRNA (low-rank + data-driven RL), not standard LoRA. Tele-LLMs base models updated to verified list: TinyLlama-1.1B, Gemma-2B, Gemma-2-2B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3-8B. NetConfEval tasks confirmed as 4 (not 3). MeshAgent marked as preprint with limited verified detail.
Networking LLM papers compared below. Fields per entry: venue, base model, fine-tuning method, task, dataset used, and key verified result.
NetLLM (Wu et al., 2024)
Venue: SIGCOMM 2024. Base model: LLaMA-2 7B.
Fine-tuning method: DD-LRNA: low-rank matrices (0.31% of params) plus data-driven RL. Not standard LoRA; uses an offline experience pool to eliminate live environment interaction.
Tasks: viewport prediction (VP), adaptive bitrate streaming (ABR), cluster job scheduling (CJS).
Datasets: Envivio-Dash3, FCC bandwidth traces, TPC-H (ISPASS'16), Jin2022 (SIGMM).
Key verified result: Low-rank matrices are 0.31% of total parameters, reducing fine-tuning cost by 60.9% GPU memory and 15.1% training time vs full FT. The NetLLM-adapted LLaMA-2 significantly outperforms SoTA DNN baselines on all 3 tasks. First "one model for all networking tasks" framework.
NetConfEval (Wang et al., 2024)
Venue: CoNEXT 2024. Models: GPT-4, GPT-4-Turbo, GPT-4o, Hugging Face open models.
Fine-tuning method: Zero-shot and few-shot prompting only (no fine-tuning in the paper itself); the benchmark is designed for both evaluating and fine-tuning.
Tasks: 4 tasks: formal spec, API call generation, routing algorithm code, low-level device config (OSPF/RIP/BGP/RIFT).
Datasets: Config2Spec policy dataset (task 1), Kathará network emulator scenarios (task 4).
Key verified result: Small models are sufficient for the spec/API tasks; GPT-4 is required for routing code generation. Breaking tasks into subtasks significantly improves accuracy. GPT-4 handles simple policy conflicts but struggles with complex ones. Runner-up best paper at CoNEXT 2024.
Tele-LLMs (Maatouk et al., 2024)
Venue: arXiv 2409.05314. Base models: TinyLlama-1.1B, Gemma-2B, Gemma-2-2B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3-8B.
Fine-tuning method: 2-stage: (1) continual pretraining on Tele-Data with full parameters, (2) instruction fine-tuning. LoRA was also tested in initial experiments; full-parameter FT was found better for this domain.
Tasks: telecom QA, standards understanding, mathematical modeling of telecom systems.
Datasets: Tele-Data (arXiv + 3GPP + Wikipedia + CommonCrawl), Tele-Eval (750k QA), TeleQnA.
Key verified result: 25% average relative improvement on Tele-Eval. Smaller adapted models rival larger general models on telecom benchmarks, and they retain general capabilities (MMLU, commonsense) with no catastrophic forgetting. Full-parameter fine-tuning outperformed LoRA for this domain adaptation task.
Mobile-LLaMA (Kan et al., 2024)
Venue: IEEE Network 2024. Base model: LLaMA-2 13B.
Fine-tuning method: Instruction fine-tuning via self-instruct (15,111 instruction sets generated with OpenAI APIs from real 5G data).
Tasks: packet capture analysis, IP routing table analysis, performance analysis for 5G NWDAF.
Datasets: real 5G datasets: BGP routing tables, pcap files, UE traffic traces (IEEE Dataport).
Key verified result: Scores 247/300 vs GPT-3.5's 209/300 on code generation tasks, showing that domain-specific instruction fine-tuning on real network data outperforms general models. The dataset is NWDAF-specific and not recommended for config generation tasks.
TeleQnA benchmark (Maatouk et al., 2023)
Venue: IEEE / arXiv 2023. Models: GPT-3.5, GPT-4, Mixtral 8x7B (evaluation only).
Fine-tuning method: zero-shot evaluation, no fine-tuning.
Tasks: telecom MCQ, 5 categories from standards and research.
Dataset: TeleQnA (10,000 questions from 3GPP + IEEE).
Key verified result: LLMs struggle with complex 3GPP standards questions; performance improves significantly when relevant knowledge context is provided (RAG-style). LLMs rival active telecom professionals on general telecom categories. First telecom-specific LLM benchmark.
MeshAgent (Zaoxing et al., 2026)
Venue: SIGMETRICS 2026. Base model: not confirmed (preprint).
Method: LLM-based multi-agent framework for mesh network config.
Tasks: mesh network configuration, multi-agent coordination.
Dataset: not confirmed (preprint, PDF inaccessible).
Key verified result: Preprint only; full details unverified. Applies LLM agents to mesh networking and is relevant as emerging work in the space. Treat as reference only until a published version is available.
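Several entries above hinge on low-rank adapters being a tiny fraction of total parameters (e.g. NetLLM's 0.31%). A back-of-the-envelope sketch of where such sub-percent fractions come from, using generic LoRA-style counting; the dimensions and the base-parameter estimate below are illustrative assumptions, not NetLLM's exact DD-LRNA accounting:

```python
def low_rank_fraction(d_model, n_layers, n_adapted_mats, rank):
    """Fraction of trainable params when each adapted (d x d) matrix gets a
    rank-r pair A (r x d) and B (d x r), against a rough base-model count."""
    base = n_layers * n_adapted_mats * d_model * d_model
    adapter = n_layers * n_adapted_mats * 2 * rank * d_model
    return adapter / base

# The trainable share reduces to 2r/d, so a low rank on a large hidden size
# gives sub-percent fractions (here roughly 0.39% for r=8, d=4096).
frac = low_rank_fraction(d_model=4096, n_layers=32, n_adapted_mats=4, rank=8)
print(round(100 * frac, 3))
```

The exact percentage reported by a paper depends on which matrices are adapted and on embedding/head sizes, but the 2r/d scaling explains the order of magnitude.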
Corrections from the previous table: (1) all methods except the similarity metric use a task dataset (corrected and confirmed); (2) LISA is during-training, not pre-selection (confirmed); (3) a "Task-aware" column added with an honest assessment; (4) the data-requirement column now distinguishes no-data / small-batch / full-dataset accurately.
Pre-selection methods compared below. Fields per entry: source, data requirement, what data is used and how, pre- vs during-training, selection basis, cost, task-awareness, and use in your work.
Similarity metric (arXiv 2602.05988, 2025)
Dataset needed: none; purely model-intrinsic.
What data used and how: None. Runs a forward pass through the pretrained model and measures cosine similarity or CKA between each layer's input and output representations.
Pre or during: pre-selection.
Selection basis: high input-output similarity means the layer transforms little and is safe to freeze; low similarity means the layer is active, so apply LoRA there.
Cost: very low (a single forward pass, no labels).
Task-aware: no; knows nothing about networking or your task.
Use in your work: ablation baseline. Compare your task-aware selection against this model-intrinsic baseline to show that domain-awareness matters.
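The similarity probe can be sketched in a few lines. This is a toy illustration: random matrices stand in for transformer layers, plain cosine similarity is used rather than CKA, and the 0.9 freeze threshold is an assumption, not the paper's exact recipe.

```python
import numpy as np

def layer_io_similarity(x, layers):
    """Mean cosine similarity between each layer's input and output.

    x: (batch, dim) activations entering the first layer.
    layers: list of callables mapping (batch, dim) -> (batch, dim).
    Returns one score per layer (high = layer changes representations little).
    """
    sims = []
    for layer in layers:
        y = layer(x)
        num = np.sum(x * y, axis=1)
        den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + 1e-8
        sims.append(float(np.mean(num / den)))
        x = y  # this layer's output feeds the next layer
    return sims

rng = np.random.default_rng(0)
dim = 16
# Toy "model": layer 0 is near-identity (residual-dominated), layer 1 rotates hard.
near_identity = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
rotation = np.linalg.qr(rng.standard_normal((dim, dim)))[0]
layers = [lambda x: x @ near_identity.T, lambda x: x @ rotation.T]

x0 = rng.standard_normal((32, dim))
sims = layer_io_similarity(x0, layers)
freeze = [i for i, s in enumerate(sims) if s > 0.9]  # assumed threshold
print(sims, freeze)
```

On a real model the callables would be the transformer blocks and x0 a batch of token activations; the freeze list is the set of layers left without adapters.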
Gradient norm probe (IST / OWS, NAACL 2025)
Dataset needed: small batch (1 mini-batch of task data).
What data used and how: Run 1 forward + backward pass on a small batch of your task data (e.g. NetConfEval samples) and compute the gradient norm per layer from your task's loss signal.
Pre or during: pre-selection.
Selection basis: layers with high gradient norm are the most sensitive to your task loss and therefore important to update.
Cost: very low (minutes, 1 mini-batch).
Task-aware: partially; uses your data but measures statistical sensitivity, not semantic importance.
Use in your work: first step. Run before training on NIT/NetConfEval data to get initial candidate layers. Cheap and informative.
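The one-mini-batch probe can be sketched with a tiny two-layer linear network and manual backprop standing in for an LLM; in practice you would run a single loss.backward() in PyTorch and read each layer's gradient norm. The toy shapes and squared-error loss are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_h, d_out = 8, 6, 5, 4

# Toy stand-in for a pretrained model: two linear layers, squared-error loss.
W1 = rng.standard_normal((d_h, d_in))
W2 = rng.standard_normal((d_out, d_h))
X = rng.standard_normal((n, d_in))    # one mini-batch of task inputs
Y = rng.standard_normal((n, d_out))   # task targets

# Forward pass.
H = X @ W1.T
P = H @ W2.T
loss = 0.5 * np.mean((P - Y) ** 2)

# Backward pass: per-layer gradients of the task loss.
dP = (P - Y) / (n * d_out)   # dL/dP for the mean squared error above
g_W2 = dP.T @ H              # dL/dW2
dH = dP @ W2                 # backprop into the hidden activations
g_W1 = dH.T @ X              # dL/dW1

# The probe: rank layers by gradient norm; high norm = sensitive to task loss.
grad_norms = {"layer1": float(np.linalg.norm(g_W1)),
              "layer2": float(np.linalg.norm(g_W2))}
ranked = sorted(grad_norms, key=grad_norms.get, reverse=True)
print(grad_norms, ranked)
```

The ranked list is the candidate-layer ordering; on NetConfEval data you would keep the top few layers as the initial unfreeze set.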
Fisher information scoring (FLoE, arXiv)
Dataset needed: small batch (sample of task data).
What data used and how: Estimate the Fisher information matrix using a sample of your task data; the Fisher score per parameter block is the sensitivity of the task loss to that block.
Pre or during: pre-selection.
Selection basis: a high Fisher score marks a parameter block as task-critical, so apply LoRA there; a low score means it is safe to freeze.
Cost: medium (Fisher estimation is more expensive than gradient norms).
Task-aware: partially; task-loss driven but not interpretable about what each layer semantically encodes.
Use in your work: can use. More principled than gradient norm; good for an ablation study comparing selection strategies.
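A sketch of the diagonal empirical Fisher on a toy linear model: the score per parameter is the mean squared per-example gradient. FLoE's exact blockwise scoring is not reproduced here; the linear model, the per-feature "blocks", and the top-k rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 32, 6

# Toy model: linear scorer w with squared-error loss on a task sample.
w = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X[:, 0] * 2.0 + 0.1 * rng.standard_normal(n)  # feature 0 drives the task

# Empirical (diagonal) Fisher: mean of squared per-example gradients.
resid = X @ w - y                       # per-example dL/dprediction
per_example_grads = resid[:, None] * X  # per-example dL/dw, shape (n, d)
fisher = np.mean(per_example_grads ** 2, axis=0)

# FLoE-style rule (sketch): adapt only the top-k highest-Fisher blocks.
k = 2
top_blocks = np.argsort(fisher)[::-1][:k]
print(fisher.round(3), top_blocks)
```

In an LLM each "block" would be a layer's weight matrices, and the Fisher estimate would come from squared gradients accumulated over a small sample of NetConfEval batches.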
Probing classifiers (general practice)
Dataset needed: full dataset (labeled task data required).
What data used and how: Attach a small linear classifier to each transformer layer's output, train each probe on your labeled task data (e.g. NetConfEval instruction-output pairs), and measure probe accuracy per layer.
Pre or during: pre-selection.
Selection basis: layers where probe accuracy is highest hold the most task-relevant representations, so tune these.
Cost: medium (train one probe per layer).
Task-aware: yes; directly measures which layer representations encode task-relevant features.
Use in your work: best for your research. The only method that tells you what each layer semantically encodes for networking; the core contribution of your domain-aware study.
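The per-layer probe can be sketched as follows. The representations are synthetic (one "layer" linearly encodes the label, another is noise), and the probe is fit by least squares rather than a trained logistic classifier; both are simplifying assumptions to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 8
labels = rng.integers(0, 2, size=n)

# Toy per-layer representations: layer "mid" linearly encodes the task label,
# layer "late" carries no information about it.
reps = {
    "mid": np.c_[labels + 0.1 * rng.standard_normal(n),
                 rng.standard_normal((n, d - 1))],
    "late": rng.standard_normal((n, d)),
}

def probe_accuracy(H, y):
    """Fit a linear probe by least squares and report its accuracy."""
    Hb = np.c_[H, np.ones(len(H))]  # add a bias column
    w, *_ = np.linalg.lstsq(Hb, 2.0 * y - 1.0, rcond=None)  # +-1 targets
    return float(np.mean((Hb @ w > 0) == (y == 1)))

acc = {name: probe_accuracy(H, labels) for name, H in reps.items()}
best_layer = max(acc, key=acc.get)
print(acc, best_layer)
```

On a real model, H would be each layer's hidden states for your labeled NetConfEval examples (with held-out evaluation), and the accuracy-per-layer curve is the map of where task-relevant features live.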
Binary mask learning (ILA, arXiv)
Dataset needed: full dataset (full task data for a short pre-run).
What data used and how: A short pre-training run on your task data while learning a binary mask over layers; the mask converges to the layers that minimize task loss most effectively.
Pre or during: pre-selection.
Selection basis: the learned mask identifies which layers contribute most to task-loss reduction.
Cost: medium-high (requires a full short training run).
Task-aware: partially; task-loss driven, but the binary mask is not interpretable about what each layer encodes.
Use in your work: optional. Useful if compute allows; less interpretable than probing classifiers for your research goals.
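The mask-learning idea can be sketched with a relaxed (sigmoid) mask over two candidate layer contributions, trained by gradient descent on a toy loss; this is a simplification of ILA, whose actual algorithm is not reproduced here, and the toy objective and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 64
x = rng.standard_normal(n)

# Two candidate "layer contributions": layer 0 matches the task, layer 1 is noise.
f = np.stack([2.0 * x, rng.standard_normal(n)])   # shape (2, n)
target = 2.0 * x

s = np.zeros(2)                          # mask logits, one per layer
for _ in range(300):
    m = 1.0 / (1.0 + np.exp(-s))         # relaxed binary mask in (0, 1)
    pred = m @ f
    err = pred - target                  # dL/dpred for L = 0.5 * mean(err^2)
    grad_s = (f @ err / n) * m * (1.0 - m)  # chain rule through the sigmoid
    s -= 1.0 * grad_s                    # plain gradient descent step

mask = (1.0 / (1.0 + np.exp(-s)) > 0.5).astype(int)
print(mask)
```

The mask converges to keep the useful layer and drop the useless one; in the real method the masked objects are transformer layers and the pre-run uses your task data.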
LISA sampling (Pan et al., NeurIPS 2024)
Dataset needed: full dataset (used at runtime during training).
What data used and how: The full training dataset is used during training. Layer sampling probabilities are determined by the weight-norm skewness observed across layers during training steps. Always active: embedding + LM head. Randomly sampled: 2 middle layers per step.
Pre or during: during training; not pre-selection, adapts dynamically each step.
Selection basis: weight-norm skewness of the middle layers (observed in LoRA); the embedding and LM head always have the highest norms, so they stay active.
Cost: low (no extra cost, runs inside the training loop).
Task-aware: partially; adapts to your data dynamically, but the selection basis is weight norms, not semantic domain knowledge.
Use in your work: recommended training strategy. Replace vanilla LoRA with LISA as the default and combine it with the gradient-norm probe for pre-selection. Key open question: does the weight-norm skewness pattern hold for network config data?
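The LISA sampling schedule can be sketched structurally: each optimizer step activates the embedding, the LM head, and a fresh random pair of middle layers. In real training you would toggle requires_grad on the matching parameter groups; the layer names below are illustrative.

```python
import random

def lisa_active_layers(n_middle, k=2, rng=random):
    """One LISA sampling step: embedding and LM head are always active,
    plus k randomly chosen middle layers out of n_middle."""
    middle = rng.sample(range(n_middle), k)
    return ["embed"] + [f"layer_{i}" for i in sorted(middle)] + ["lm_head"]

random.seed(0)
# Simulated training loop: a fresh pair of middle layers per optimizer step.
schedule = [lisa_active_layers(32) for _ in range(4)]
for step, active in enumerate(schedule):
    print(step, active)
```

Combining this with the gradient-norm probe would mean replacing the uniform rng.sample with sampling weighted toward the probe's high-norm layers, which is essentially the IST/OWS refinement.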
The three-level hierarchy for your research framing
Level 1 — No data (similarity metric): finds structurally active layers in the pretrained model. No task awareness.
Level 2 — Data-aware (gradient norm, Fisher, LISA, mask): finds layers statistically sensitive to your task loss. Knows your data but not its domain meaning.
Level 3 — Domain-aware (probing classifiers + your study): finds which specific layers encode network protocol knowledge. Only your work answers this for networking tasks.