Datasets
Layer selection papers
Networking LLM papers
Pre-selection methods
Corrections: NetConfEval fine-tune column updated; it is primarily an evaluation benchmark, with partial fine-tuning use only via the Hugging Face dataset. NetBench clarified as a network traffic benchmark, not an NLP QA dataset. Mobile-LLaMA dataset correctly marked as not reusable for your work.
Datasets compared below. Fields per entry: train/eval use, domain, role in your work, size, format, and verified details.
NetConfEval (Wang et al., KTH / Red Hat, 2024)
Train: partial (⚠). Eval: yes. Domain: network config. Role: main benchmark. Size: hundreds of tasks (5 iterations per task type). Format: JSONL, instruction-output pairs.
Verified details: 4 tasks verified: (1) formal spec translation from Config2Spec policies, (2) API/function-call generation, (3) routing algorithm code (Python), (4) low-level config for OSPF/RIP/BGP/RIFT. Runner-up best paper at ACM CoNEXT 2024. Key finding: small models handle the spec/API tasks; only larger models (GPT-4 scale) handle routing code generation.
TeleQnA (Maatouk et al., 2023)
Train: no. Eval: yes. Domain: telecom. Role: domain benchmark. Size: 10,000 MCQ questions. Format: JSON, multiple-choice.
Verified details: 5 categories: telecom lexicon, research overview, publications, standards overview, 3GPP standards specs. Drawn from 3GPP and IEEE sources. Dataset access via GitHub (password: teleqnadataset). GPT-4 and GPT-3.5 both struggle on complex standards questions; LLMs rival telecom professionals on the easier categories.
Tele-Eval (Maatouk et al. / Tele-LLMs, 2024)
Train: no. Eval: yes. Domain: telecom. Role: domain benchmark. Size: 750,000 QA pairs. Format: open-ended QA (LLM-judged).
Verified details: Used to benchmark the Tele-LLMs series. Evaluation uses Mixtral 8x7B-Instruct as judge (Yes/No correctness scoring). Covers scholarly material, standards, and general telecom. Used for eval only in your work, not for fine-tuning.
Tele-Data (Maatouk et al. / Tele-LLMs, 2024)
Train: yes. Eval: no. Domain: telecom. Role: domain pretraining. Size: large (arXiv papers + 3GPP + Wikipedia + CommonCrawl). Format: raw text (continual pretraining).
Verified details: 4 sources: arXiv CS/EE papers (filtered from 610k), 3GPP standards, telecom Wikipedia articles, and CommonCrawl telecom pages. Used for continual pretraining before instruction tuning. Useful if you want domain-vocabulary grounding before fine-tuning on NetConfEval.
NetBench (arXiv 2024)
Train: needs checking. Eval: yes. Domain: network traffic. Role: supplementary eval. Size: ~5,390 samples. Format: packet/traffic classification.
Verified details: Correction from the previous table: NetBench is a network traffic analysis benchmark (packet classification, encrypted traffic), not a general networking QA dataset. It may be less directly relevant to the config generation tasks in NetConfEval; verify task alignment before use.
Cisco CCNA / Exams
Train: yes. Eval: yes. Domain: network config. Role: domain tuning. Size: medium. Format: QA / instruction-completion.
Verified details: Cisco-specific syntax (IOS CLI), OSPF/BGP configuration exercises. Useful for config-syntax grounding. Not publicly standardized; source and quality may vary depending on collection method.
GLUE
Train: no. Eval: yes. Domain: general NLP. Role: method comparison only. Size: large (multiple tasks). Format: classification / NLI / QA.
Verified details: Used only to compare fine-tuning methods (LoRA vs LISA vs selective FT) on standard benchmarks. Not network-specific. Useful for showing your method does not cause catastrophic forgetting of general NLP capabilities.
SQuAD
Train: no. Eval: yes. Domain: general NLP. Role: forgetting check. Size: 100k+ questions. Format: extractive QA.
Verified details: Reading comprehension. Used as a forgetting/retention benchmark: does the model still answer general QA after domain fine-tuning? Eval only in your work.
Mobile-LLaMA data (Kan et al., IEEE 2024)
Train: no. Eval: reference only. Domain: 5G / network. Role: not recommended. Size: 15,111 self-instruct sets. Format: JSON self-instruct.
Verified details: Real 5G data: BGP routing tables, pcap captures, and UE traffic traces from IEEE Dataport. Used internally for LLaMA-2 13B fine-tuning. Highly specific to 5G NWDAF analytics; not suitable for config generation tasks. Reference for methodology only.
Corrections: LISA venue is NeurIPS 2024 (a full conference paper, not a poster). LISA's selection basis is a model-intrinsic weight-norm observation from LoRA experiments on Alpaca-GPT4, not purely task data. The correct venue for IST/OWS is NAACL 2025. GPS (CVPR) confirmed not applicable. DoRA confirmed as an ICML 2024 oral.
Layer selection papers compared below. Fields per entry: venue, method, layer finding, results vs full FT and vs LoRA, trainable params, and usability for networking.
LISA (Pan et al., 2024; arXiv 2403.17919)
Venue: NeurIPS 2024.
Method: Observe weight-norm skewness in LoRA (on Alpaca-GPT4). Always keep the embedding and LM head active; randomly sample 2 middle layers per step.
Layer finding: The embedding and LM head have much larger weight norms than the middle layers in LoRA; middle layers can be randomly frozen without loss.
vs Full FT: beats or matches. vs LoRA: +10–35% MT-Bench. Params: embedding + head + 2 layers.
Usable for networking? Recommended. Memory as low as LoRA. Tested on LLaMA-2 7B–70B, Mistral, Phi-2, TinyLlama. Key: the weight-norm observation was made on general data; whether it holds for network config data is an open question your work can answer.
IST / OWS (outlier-weighted layerwise sampling)
Venue: NAACL 2025.
Method: Non-uniform layer sampling based on outlier gradient norms. By default samples 2 layers like LISA, but weights the sampling toward high-norm layers.
Layer finding: Layer importance is non-uniform and skewed toward outlier layers; uniform sampling (vanilla LoRA) wastes capacity on unimportant layers.
vs Full FT: on par. vs LoRA: beats vanilla LoRA. Params: subset of layers.
Usable for networking? Recommended; can be combined with LISA. A gradient-norm probe on your network data identifies which specific layers are high-norm for config tasks, a one-mini-batch operation.
Surgical fine-tuning (Lee et al., 2023)
Venue: ICLR 2023.
Method: Tune only one contiguous block of layers; the block is chosen based on the type of distribution shift.
Layer finding: Early layers handle input-level shift, middle layers semantic/feature shift, last layers output/label shift. Full FT forgets pretrained features on small data.
vs Full FT: beats full FT. vs LoRA: not compared. Params: ~1 block only.
Usable for networking? Partial. Informs which block to probe first for each NetConfEval sub-task: config generation = output shift (last layers?); formal spec = semantic shift (middle?). Needs empirical validation on your data.
Similarity metric layer selection (arXiv 2602.05988, 2025)
Venue: arXiv 2025.
Method: Measure cosine similarity / CKA between each layer's input and output using the pretrained model only. High similarity means the layer is doing little, so freeze it.
Layer finding: Not all layers transform representations equally; many middle layers have very high input-output similarity and can be safely frozen.
vs Full FT: matches (with up to 50% parameter reduction). vs LoRA: beats on math/code. Params: up to 50% reduction.
Usable for networking? Can use as a baseline. No dataset needed, purely model-intrinsic; a good ablation baseline to compare against task-aware selection. Does not know your task is networking.
FLoE (Fisher layer selection)
Venue: arXiv.
Method: Compute Fisher information scores per parameter block using a small task-data sample; apply LoRA only to high-scoring blocks.
Layer finding: Fisher scores identify which parameters are most sensitive to the task loss; not all layers need LoRA adapters.
vs Full FT: on par. vs LoRA: beats vanilla LoRA. Params: sparse LoRA.
Usable for networking? Can use. Uses your task data (small sample). More principled than gradient norm but more expensive; good for a thorough ablation if compute allows.
DoRA (Liu et al., 2024)
Venue: ICML 2024 (oral).
Method: Decompose each weight matrix into magnitude and direction components; apply LoRA only on the direction component.
Layer finding: LoRA changes both magnitude and direction simultaneously; separating them enables more stable, controllable updates.
vs Full FT: matches / slightly below. vs LoRA: +3.7–4.4 on reasoning benchmarks. Params: low-rank + magnitude.
Usable for networking? Can use. Consistently outperforms vanilla LoRA and is a drop-in replacement; useful as a stronger LoRA baseline in your ablations.
CHILD-TUNING (Xu et al., 2021)
Venue: EMNLP 2021.
Method: Bernoulli mask over gradients; the task-driven variant (CHILD-TUNING_D) uses Fisher gradients to select the child network.
Layer finding: Task-driven child-network selection outperforms full FT on all 10 GLUE tasks; not all parameters contribute equally.
vs Full FT: beats full FT. vs LoRA: beats LoRA. Params: 0.1–0.4%.
Usable for networking? Informational; supports the selective-FT motivation. The gradient-masking approach is more aggressive than LISA and less directly applicable to decoder LLMs, but it validates the principle.
GPS (gradient param selection)
Venue: CVPR 2024.
Method: Parameter-level gradient-based selection across all layers.
Layer finding: Individual parameter importance varies; not all gradients are useful.
vs Full FT: matches. vs LoRA: beats LoRA (ViT). Params: sparse.
Usable for networking? Not applicable. Designed for Vision Transformers, operates at the individual-parameter level rather than the layer level, and processes all gradients, which is too expensive for large LLMs. Do not use.
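To make the DoRA entry above concrete, here is a minimal numpy sketch of its magnitude/direction decomposition. The per-column norm and the zero placeholder delta (standing in for a LoRA update B @ A) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((4, 6))   # pretrained weight matrix

# DoRA-style decomposition: per-column magnitude and unit direction.
m = np.linalg.norm(W, axis=0, keepdims=True)   # magnitude, shape (1, 6)
V = W / m                                      # direction, unit-norm columns

# A LoRA delta would be added to the direction only; the magnitude is its own
# trainable vector. With a zero delta the recomposition must reproduce W.
delta = np.zeros_like(W)                       # placeholder for B @ A
V_new = V + delta
W_new = m * (V_new / np.linalg.norm(V_new, axis=0, keepdims=True))

print(np.allclose(W_new, W))
```

The point of the split is that training can move the direction (via the low-rank delta) and the magnitude independently, which vanilla LoRA cannot do.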
Corrections: NetLLM fine-tuning method is DD-LRNA (low-rank + data-driven RL), not standard LoRA. Tele-LLMs base models updated to verified list: TinyLlama-1.1B, Gemma-2B, Gemma-2-2B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3-8B. NetConfEval tasks confirmed as 4 (not 3). MeshAgent marked as preprint with limited verified detail.
Networking LLM papers compared below. Fields per entry: venue, base model, fine-tuning method, task, dataset used, and key verified result.
NetLLM (Wu et al., 2024)
Venue: SIGCOMM 2024. Base model: LLaMA-2 7B.
Fine-tuning method: DD-LRNA: low-rank matrices (0.31% of params) plus data-driven RL. Not standard LoRA; uses an offline experience pool to eliminate live environment interaction.
Tasks: viewport prediction (VP), adaptive bitrate streaming (ABR), cluster job scheduling (CJS).
Datasets: Envivio-Dash3, FCC bandwidth traces, TPC-H (ISPASS'16), Jin2022 (SIGMM).
Key verified result: Low-rank matrices are 0.31% of total parameters, reducing fine-tuning cost by 60.9% GPU memory and 15.1% training time vs full FT. The NetLLM-adapted LLaMA-2 significantly outperforms SoTA DNN baselines on all 3 tasks. First "one model for all networking tasks" framework.
NetConfEval (Wang et al., 2024)
Venue: CoNEXT 2024. Models: GPT-4, GPT-4-Turbo, GPT-4o, Hugging Face open models.
Fine-tuning method: Zero-shot and few-shot prompting only (no fine-tuning in the paper itself); the benchmark is designed for both evaluating and fine-tuning.
Tasks: 4 tasks: formal spec, API call generation, routing algorithm code, low-level device config (OSPF/RIP/BGP/RIFT).
Datasets: Config2Spec policy dataset (task 1), Kathará network emulator scenarios (task 4).
Key verified result: Small models are sufficient for the spec/API tasks; GPT-4 is required for routing code generation. Breaking tasks into subtasks significantly improves accuracy. GPT-4 handles simple policy conflicts but struggles with complex ones. Runner-up best paper at CoNEXT 2024.
Tele-LLMs (Maatouk et al., 2024)
Venue: arXiv 2409.05314. Base models: TinyLlama-1.1B, Gemma-2B, Gemma-2-2B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3-8B.
Fine-tuning method: 2-stage: (1) continual pretraining on Tele-Data with full parameters, (2) instruction fine-tuning. LoRA was also tested in initial experiments; full-parameter FT was found better for this domain.
Tasks: telecom QA, standards understanding, mathematical modeling of telecom systems.
Datasets: Tele-Data (arXiv + 3GPP + Wikipedia + CommonCrawl), Tele-Eval (750k QA), TeleQnA.
Key verified result: 25% average relative improvement on Tele-Eval. Smaller adapted models rival larger general models on telecom benchmarks, and they retain general capabilities (MMLU, commonsense) with no catastrophic forgetting. Full-parameter fine-tuning outperformed LoRA for this domain adaptation task.
Mobile-LLaMA (Kan et al., 2024)
Venue: IEEE Network 2024. Base model: LLaMA-2 13B.
Fine-tuning method: Instruction fine-tuning via self-instruct (15,111 instruction sets generated with OpenAI APIs from real 5G data).
Tasks: packet capture analysis, IP routing table analysis, performance analysis for 5G NWDAF.
Datasets: real 5G datasets: BGP routing tables, pcap files, UE traffic traces (IEEE Dataport).
Key verified result: Scores 247/300 vs GPT-3.5's 209/300 on code generation tasks, showing that domain-specific instruction fine-tuning on real network data outperforms general models. The dataset is NWDAF-specific and not recommended for config generation tasks.
TeleQnA benchmark (Maatouk et al., 2023)
Venue: IEEE / arXiv 2023. Models: GPT-3.5, GPT-4, Mixtral 8x7B (evaluation only).
Fine-tuning method: zero-shot evaluation, no fine-tuning.
Tasks: telecom MCQ, 5 categories from standards and research.
Dataset: TeleQnA (10,000 questions from 3GPP + IEEE).
Key verified result: LLMs struggle with complex 3GPP standards questions; performance improves significantly when relevant knowledge context is provided (RAG-style). LLMs rival active telecom professionals on general telecom categories. First telecom-specific LLM benchmark.
MeshAgent (Zaoxing et al., 2026)
Venue: SIGMETRICS 2026. Base model: not confirmed (preprint).
Method: LLM-based multi-agent framework for mesh network config.
Tasks: mesh network configuration, multi-agent coordination.
Dataset: not confirmed (preprint, PDF inaccessible).
Key verified result: Preprint only; full details unverified. Applies LLM agents to mesh networking and is relevant as emerging work in the space. Treat as reference only until a published version is available.
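Several entries above hinge on low-rank adapters being a tiny fraction of total parameters (e.g. NetLLM's 0.31%). A back-of-the-envelope sketch of where such sub-percent fractions come from, using generic LoRA-style counting; the dimensions and the base-parameter estimate below are illustrative assumptions, not NetLLM's exact DD-LRNA accounting:

```python
def low_rank_fraction(d_model, n_layers, n_adapted_mats, rank):
    """Fraction of trainable params when each adapted (d x d) matrix gets a
    rank-r pair A (r x d) and B (d x r), against a rough base-model count."""
    base = n_layers * n_adapted_mats * d_model * d_model
    adapter = n_layers * n_adapted_mats * 2 * rank * d_model
    return adapter / base

# The trainable share reduces to 2r/d, so a low rank on a large hidden size
# gives sub-percent fractions (here roughly 0.39% for r=8, d=4096).
frac = low_rank_fraction(d_model=4096, n_layers=32, n_adapted_mats=4, rank=8)
print(round(100 * frac, 3))
```

The exact percentage reported by a paper depends on which matrices are adapted and on embedding/head sizes, but the 2r/d scaling explains the order of magnitude.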
Corrections from the previous table: (1) all methods except the similarity metric use a task dataset (corrected and confirmed); (2) LISA is during-training, not pre-selection (confirmed); (3) a "Task-aware" column added with an honest assessment; (4) the data-requirement column now distinguishes no-data / small-batch / full-dataset accurately.
Pre-selection methods compared below. Fields per entry: source, data requirement, what data is used and how, pre- vs during-training, selection basis, cost, task-awareness, and use in your work.
Similarity metric (arXiv 2602.05988, 2025)
Dataset needed: none; purely model-intrinsic.
What data used and how: None. Runs a forward pass through the pretrained model and measures cosine similarity or CKA between each layer's input and output representations.
Pre or during: pre-selection.
Selection basis: high input-output similarity means the layer transforms little and is safe to freeze; low similarity means the layer is active, so apply LoRA there.
Cost: very low (a single forward pass, no labels).
Task-aware: no; knows nothing about networking or your task.
Use in your work: ablation baseline. Compare your task-aware selection against this model-intrinsic baseline to show that domain-awareness matters.
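The similarity probe can be sketched in a few lines. This is a toy illustration: random matrices stand in for transformer layers, plain cosine similarity is used rather than CKA, and the 0.9 freeze threshold is an assumption, not the paper's exact recipe.

```python
import numpy as np

def layer_io_similarity(x, layers):
    """Mean cosine similarity between each layer's input and output.

    x: (batch, dim) activations entering the first layer.
    layers: list of callables mapping (batch, dim) -> (batch, dim).
    Returns one score per layer (high = layer changes representations little).
    """
    sims = []
    for layer in layers:
        y = layer(x)
        num = np.sum(x * y, axis=1)
        den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + 1e-8
        sims.append(float(np.mean(num / den)))
        x = y  # this layer's output feeds the next layer
    return sims

rng = np.random.default_rng(0)
dim = 16
# Toy "model": layer 0 is near-identity (residual-dominated), layer 1 rotates hard.
near_identity = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
rotation = np.linalg.qr(rng.standard_normal((dim, dim)))[0]
layers = [lambda x: x @ near_identity.T, lambda x: x @ rotation.T]

x0 = rng.standard_normal((32, dim))
sims = layer_io_similarity(x0, layers)
freeze = [i for i, s in enumerate(sims) if s > 0.9]  # assumed threshold
print(sims, freeze)
```

On a real model the callables would be the transformer blocks and x0 a batch of token activations; the freeze list is the set of layers left without adapters.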
Gradient norm probe (IST / OWS, NAACL 2025)
Dataset needed: small batch (1 mini-batch of task data).
What data used and how: Run 1 forward + backward pass on a small batch of your task data (e.g. NetConfEval samples) and compute the gradient norm per layer from your task's loss signal.
Pre or during: pre-selection.
Selection basis: layers with high gradient norm are the most sensitive to your task loss and therefore important to update.
Cost: very low (minutes, 1 mini-batch).
Task-aware: partially; uses your data but measures statistical sensitivity, not semantic importance.
Use in your work: first step. Run before training on NIT/NetConfEval data to get initial candidate layers. Cheap and informative.
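The one-mini-batch probe can be sketched with a tiny two-layer linear network and manual backprop standing in for an LLM; in practice you would run a single loss.backward() in PyTorch and read each layer's gradient norm. The toy shapes and squared-error loss are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_h, d_out = 8, 6, 5, 4

# Toy stand-in for a pretrained model: two linear layers, squared-error loss.
W1 = rng.standard_normal((d_h, d_in))
W2 = rng.standard_normal((d_out, d_h))
X = rng.standard_normal((n, d_in))    # one mini-batch of task inputs
Y = rng.standard_normal((n, d_out))   # task targets

# Forward pass.
H = X @ W1.T
P = H @ W2.T
loss = 0.5 * np.mean((P - Y) ** 2)

# Backward pass: per-layer gradients of the task loss.
dP = (P - Y) / (n * d_out)   # dL/dP for the mean squared error above
g_W2 = dP.T @ H              # dL/dW2
dH = dP @ W2                 # backprop into the hidden activations
g_W1 = dH.T @ X              # dL/dW1

# The probe: rank layers by gradient norm; high norm = sensitive to task loss.
grad_norms = {"layer1": float(np.linalg.norm(g_W1)),
              "layer2": float(np.linalg.norm(g_W2))}
ranked = sorted(grad_norms, key=grad_norms.get, reverse=True)
print(grad_norms, ranked)
```

The ranked list is the candidate-layer ordering; on NetConfEval data you would keep the top few layers as the initial unfreeze set.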
Fisher information scoring (FLoE, arXiv)
Dataset needed: small batch (sample of task data).
What data used and how: Estimate the Fisher information matrix using a sample of your task data; the Fisher score per parameter block is the sensitivity of the task loss to that block.
Pre or during: pre-selection.
Selection basis: a high Fisher score marks a parameter block as task-critical, so apply LoRA there; a low score means it is safe to freeze.
Cost: medium (Fisher estimation is more expensive than gradient norms).
Task-aware: partially; task-loss driven but not interpretable about what each layer semantically encodes.
Use in your work: can use. More principled than gradient norm; good for an ablation study comparing selection strategies.
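A sketch of the diagonal empirical Fisher on a toy linear model: the score per parameter is the mean squared per-example gradient. FLoE's exact blockwise scoring is not reproduced here; the linear model, the per-feature "blocks", and the top-k rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 32, 6

# Toy model: linear scorer w with squared-error loss on a task sample.
w = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X[:, 0] * 2.0 + 0.1 * rng.standard_normal(n)  # feature 0 drives the task

# Empirical (diagonal) Fisher: mean of squared per-example gradients.
resid = X @ w - y                       # per-example dL/dprediction
per_example_grads = resid[:, None] * X  # per-example dL/dw, shape (n, d)
fisher = np.mean(per_example_grads ** 2, axis=0)

# FLoE-style rule (sketch): adapt only the top-k highest-Fisher blocks.
k = 2
top_blocks = np.argsort(fisher)[::-1][:k]
print(fisher.round(3), top_blocks)
```

In an LLM each "block" would be a layer's weight matrices, and the Fisher estimate would come from squared gradients accumulated over a small sample of NetConfEval batches.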
Probing classifiers (general practice)
Dataset needed: full dataset (labeled task data required).
What data used and how: Attach a small linear classifier to each transformer layer's output, train each probe on your labeled task data (e.g. NetConfEval instruction-output pairs), and measure probe accuracy per layer.
Pre or during: pre-selection.
Selection basis: layers where probe accuracy is highest hold the most task-relevant representations, so tune these.
Cost: medium (train one probe per layer).
Task-aware: yes; directly measures which layer representations encode task-relevant features.
Use in your work: best for your research. The only method that tells you what each layer semantically encodes for networking; the core contribution of your domain-aware study.
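The per-layer probe can be sketched as follows. The representations are synthetic (one "layer" linearly encodes the label, another is noise), and the probe is fit by least squares rather than a trained logistic classifier; both are simplifying assumptions to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 8
labels = rng.integers(0, 2, size=n)

# Toy per-layer representations: layer "mid" linearly encodes the task label,
# layer "late" carries no information about it.
reps = {
    "mid": np.c_[labels + 0.1 * rng.standard_normal(n),
                 rng.standard_normal((n, d - 1))],
    "late": rng.standard_normal((n, d)),
}

def probe_accuracy(H, y):
    """Fit a linear probe by least squares and report its accuracy."""
    Hb = np.c_[H, np.ones(len(H))]  # add a bias column
    w, *_ = np.linalg.lstsq(Hb, 2.0 * y - 1.0, rcond=None)  # +-1 targets
    return float(np.mean((Hb @ w > 0) == (y == 1)))

acc = {name: probe_accuracy(H, labels) for name, H in reps.items()}
best_layer = max(acc, key=acc.get)
print(acc, best_layer)
```

On a real model, H would be each layer's hidden states for your labeled NetConfEval examples (with held-out evaluation), and the accuracy-per-layer curve is the map of where task-relevant features live.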
Binary mask learning (ILA, arXiv)
Dataset needed: full dataset (full task data for a short pre-run).
What data used and how: A short pre-training run on your task data while learning a binary mask over layers; the mask converges to the layers that minimize task loss most effectively.
Pre or during: pre-selection.
Selection basis: the learned mask identifies which layers contribute most to task-loss reduction.
Cost: medium-high (requires a full short training run).
Task-aware: partially; task-loss driven, but the binary mask is not interpretable about what each layer encodes.
Use in your work: optional. Useful if compute allows; less interpretable than probing classifiers for your research goals.
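The mask-learning idea can be sketched with a relaxed (sigmoid) mask over two candidate layer contributions, trained by gradient descent on a toy loss; this is a simplification of ILA, whose actual algorithm is not reproduced here, and the toy objective and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 64
x = rng.standard_normal(n)

# Two candidate "layer contributions": layer 0 matches the task, layer 1 is noise.
f = np.stack([2.0 * x, rng.standard_normal(n)])   # shape (2, n)
target = 2.0 * x

s = np.zeros(2)                          # mask logits, one per layer
for _ in range(300):
    m = 1.0 / (1.0 + np.exp(-s))         # relaxed binary mask in (0, 1)
    pred = m @ f
    err = pred - target                  # dL/dpred for L = 0.5 * mean(err^2)
    grad_s = (f @ err / n) * m * (1.0 - m)  # chain rule through the sigmoid
    s -= 1.0 * grad_s                    # plain gradient descent step

mask = (1.0 / (1.0 + np.exp(-s)) > 0.5).astype(int)
print(mask)
```

The mask converges to keep the useful layer and drop the useless one; in the real method the masked objects are transformer layers and the pre-run uses your task data.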
LISA sampling (Pan et al., NeurIPS 2024)
Dataset needed: full dataset (used at runtime during training).
What data used and how: The full training dataset is used during training. Layer sampling probabilities are determined by the weight-norm skewness observed across layers during training steps. Always active: embedding + LM head. Randomly sampled: 2 middle layers per step.
Pre or during: during training; not pre-selection, adapts dynamically each step.
Selection basis: weight-norm skewness of the middle layers (observed in LoRA); the embedding and LM head always have the highest norms, so they stay active.
Cost: low (no extra cost, runs inside the training loop).
Task-aware: partially; adapts to your data dynamically, but the selection basis is weight norms, not semantic domain knowledge.
Use in your work: recommended training strategy. Replace vanilla LoRA with LISA as the default and combine it with the gradient-norm probe for pre-selection. Key open question: does the weight-norm skewness pattern hold for network config data?
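The LISA sampling schedule can be sketched structurally: each optimizer step activates the embedding, the LM head, and a fresh random pair of middle layers. In real training you would toggle requires_grad on the matching parameter groups; the layer names below are illustrative.

```python
import random

def lisa_active_layers(n_middle, k=2, rng=random):
    """One LISA sampling step: embedding and LM head are always active,
    plus k randomly chosen middle layers out of n_middle."""
    middle = rng.sample(range(n_middle), k)
    return ["embed"] + [f"layer_{i}" for i in sorted(middle)] + ["lm_head"]

random.seed(0)
# Simulated training loop: a fresh pair of middle layers per optimizer step.
schedule = [lisa_active_layers(32) for _ in range(4)]
for step, active in enumerate(schedule):
    print(step, active)
```

Combining this with the gradient-norm probe would mean replacing the uniform rng.sample with sampling weighted toward the probe's high-norm layers, which is essentially the IST/OWS refinement.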
The three-level hierarchy for your research framing
Level 1 — No data (similarity metric): finds structurally active layers in the pretrained model. No task awareness.
Level 2 — Data-aware (gradient norm, Fisher, LISA, mask): finds layers statistically sensitive to your task loss. Knows your data but not its domain meaning.
Level 3 — Domain-aware (probing classifiers + your study): finds which specific layers encode network protocol knowledge. Only your work answers this for networking tasks.