Overview
Most “bad prompt” problems are actually specification problems. This team reframes user intent as a contract: what the model must assume, what it must not assume, what evidence it may use, and how output will be validated. It replaces vague goals with operational constraints — length limits, tone, structure, citation rules, and explicit refusal behavior — so the model’s freedom is productive rather than chaotic.
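As a sketch of what that contract can look like when written down, the structure below captures assumptions, evidence rules, constraints, and refusal behavior in one place; the field names and the rendering helper are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class PromptContract:
    """Illustrative container for the 'contract' view of user intent."""
    must_assume: list[str]         # facts the model may take as given
    must_not_assume: list[str]     # gaps it must not fill on its own
    allowed_evidence: list[str]    # sources it may draw on or cite
    output_constraints: list[str]  # length, tone, structure, citation rules
    refusal_behavior: str          # what to do when the request falls outside scope

def render_system_rules(contract: PromptContract) -> str:
    """Turn the contract into explicit, checkable system-prompt rules."""
    lines = ["Follow these rules exactly:"]
    lines += [f"- You may assume: {item}" for item in contract.must_assume]
    lines += [f"- Never assume: {item}" for item in contract.must_not_assume]
    lines += [f"- Evidence you may use: {item}" for item in contract.allowed_evidence]
    lines += [f"- Output constraints: {item}" for item in contract.output_constraints]
    lines.append(f"- If the request is out of scope: {contract.refusal_behavior}")
    return "\n".join(lines)
```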
Context enrichment is deliberate, not maximal. The team distinguishes stable facts (style guides, glossary, product definitions) from volatile facts (user-provided documents, ticket text) and structures them so attention lands on the right material. It avoids dumping irrelevant context that increases latency, cost, and hallucination risk.
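A minimal sketch of that separation, assuming an XML-style labeling convention (the tag names are arbitrary); the point is that stable reference material and volatile per-request documents arrive in clearly distinguishable blocks.

```python
def assemble_context(stable: dict[str, str], volatile: dict[str, str]) -> str:
    """Label stable reference material separately from per-request documents
    so instructions can refer to each block unambiguously."""
    parts = []
    for name, text in stable.items():
        parts.append(f'<reference name="{name}">\n{text}\n</reference>')
    for name, text in volatile.items():
        parts.append(f'<user_document name="{name}">\n{text}\n</user_document>')
    return "\n\n".join(parts)

context = assemble_context(
    stable={"style_guide": "Use sentence case. Avoid superlatives."},
    volatile={"ticket_1234": "Customer reports a login loop after password reset."},
)
```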
Instruction design follows layered patterns: system role, task decomposition, tool-use rules (when applicable), and self-check steps. The team chooses formats that match downstream parsers — JSON Schema, XML tags, fenced sections — and avoids ambiguous “return JSON” requests without keys and types. The model is not asked to guess the schema.
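For example, instead of "return JSON", the prompt can carry a concrete schema. The triage fields below are hypothetical, but every key, type, and bound is stated so the model never guesses the structure.

```python
import json

# Hypothetical output schema for a ticket-triage task: keys, types, and bounds
# are all stated up front rather than left for the model to invent.
TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "how_to", "other"]},
        "severity": {"type": "integer", "minimum": 1, "maximum": 4},
        "summary": {"type": "string", "maxLength": 280},
        "needs_human": {"type": "boolean"},
    },
    "required": ["category", "severity", "summary", "needs_human"],
    "additionalProperties": False,
}

FORMAT_INSTRUCTION = (
    "Return a single JSON object that validates against this JSON Schema. "
    "Do not add keys, prose, or code fences.\n" + json.dumps(TRIAGE_SCHEMA, indent=2)
)
```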
Optimization is iterative. The team builds small evaluation sets: representative inputs, gold expectations, and failure modes. It tunes prompts against rubrics (accuracy, completeness, format validity, safety) rather than vibes. When models update, prompts are regression-tested against the same suite to catch silent behavior drift.
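A small sketch of what an evaluation case and its score can look like; the cases and field names are placeholders, and the model call that produces the output is assumed to happen elsewhere.

```python
import json

# Each case pairs a representative input with expectations that can be checked
# mechanically. The model call that produces `raw_output` lives elsewhere.
TEST_CASES = [
    {"input": "Reset my password", "expected_category": "how_to"},
    {"input": "I was charged twice this month", "expected_category": "billing"},
]

def score_case(case: dict, raw_output: str) -> dict:
    """Score one output against the rubric dimensions that can be automated."""
    result = {"format_valid": False, "correct_category": False}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return result
    result["format_valid"] = True
    result["correct_category"] = parsed.get("category") == case["expected_category"]
    return result
```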
Finally, the team integrates safety and policy alignment without turning prompts into legal essays. It defines disallowed content, handling of sensitive PII, escalation language, and when to abstain. The goal is reliable assistance that organizations can deploy, not clever one-shot prompts that break on the next model release.
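One way to keep that compact is a short, explicit safety block in the system prompt; the categories and escalation wording below are placeholders, not a vetted policy.

```python
# Placeholder safety section for a system prompt; real deployments should use
# reviewed policy language rather than this illustrative wording.
SAFETY_RULES = """Safety and escalation rules:
- Do not provide legal, medical, or financial advice; recommend a qualified professional.
- Never repeat full account numbers or government IDs; refer to them as [REDACTED].
- If the user reports an urgent safety issue, reply only with: "I'm connecting you with a human specialist now."
- If the provided sources do not answer the question, say so plainly instead of guessing."""
```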
Team Members
1. Requirements Analyst
- Role: Extracts goals, constraints, and success criteria from vague or underspecified requests
- Expertise: Stakeholder interviewing, scope control, ambiguity resolution, and acceptance criteria
- Responsibilities:
- Convert free-text requests into explicit objectives, inputs, outputs, and non-goals
- Identify missing information and propose minimal questions to unblock prompt design
- Define audience, domain, and tone (formal, technical, consumer) with concrete examples
- Specify priority ordering when objectives conflict (accuracy vs brevity vs creativity)
- Capture regulatory or brand constraints (claims, medical disclaimers, copyright boundaries)
- Translate business language into testable statements suitable for rubric scoring (see the sketch after this list)
- Flag high-risk use cases requiring human-in-the-loop or tool verification
- Document edge cases: empty inputs, multilingual input, adversarial prompts, and tool misuse
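A sketch of that translation step: a vague request such as "make the summary sound better" becomes statements a rubric can score, plus explicit non-goals. The wording is hypothetical.

```python
# Hypothetical translation of "make the summary sound better" into statements a
# rubric can score mechanically, plus explicit non-goals.
ACCEPTANCE_CRITERIA = [
    "Summary is three sentences or fewer.",
    "Tone is formal: no exclamation marks or marketing superlatives.",
    "Every numeric claim appears verbatim in the source ticket.",
    "If the ticket text is empty, the output is exactly 'NO CONTENT PROVIDED'.",
]

NON_GOALS = [
    "Do not translate the ticket.",
    "Do not propose fixes; only summarize the reported problem.",
]
```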
2. Prompt Structure Architect
- Role: Designs system/user/developer message structure, sections, and decomposition strategies
- Expertise: Instruction hierarchy, chain-of-thought gating, few-shot design, and format control
- Responsibilities:
- Choose prompt structure: single-shot vs multi-step, plan-then-answer vs direct answer
- Place immutable rules in system prompts; keep user content in user messages when applicable
- Design few-shot examples that cover diversity without overfitting to narrow templates
- Use delimiters and labeled sections to reduce instruction mixing and ambiguity (see the message sketch after this list)
- Specify output format (Markdown, JSON, tables) with field definitions and required ordering
- Add self-check instructions (verify citations, re-read constraints) appropriate to task risk
- Prevent unsafe chain-of-thought exposure when policies require hidden reasoning
- Align prompt length with model context limits and cost constraints
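A sketch of that layering using the common role-based chat-message convention; the rules, schema reference, and few-shot pair are illustrative and should be adapted to the provider's client library.

```python
# Sketch of a layered message structure. The role/content dict shape follows the
# common chat-completion convention; adapt it to your provider's client library.
SYSTEM_RULES = """You are a support-ticket triage assistant.
Immutable rules:
1. Use only the material inside <user_document> tags as evidence.
2. Output one JSON object matching the provided schema; no extra text.
3. If the document is missing or unreadable, set "needs_human" to true."""

FEW_SHOT_USER = "<user_document>App crashes when I upload a photo.</user_document>"
FEW_SHOT_ASSISTANT = (
    '{"category": "bug", "severity": 2, '
    '"summary": "Crash on photo upload.", "needs_human": false}'
)

def build_messages(document: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": FEW_SHOT_USER},
        {"role": "assistant", "content": FEW_SHOT_ASSISTANT},
        {"role": "user", "content": f"<user_document>{document}</user_document>"},
    ]
```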
3. Context & Knowledge Curator
- Role: Curates retrieved context, documents, and knowledge snippets for faithful grounding
- Expertise: RAG basics, chunking, citation policies, deduplication, and conflict resolution
- Responsibilities:
- Select what context to provide: excerpts vs summaries, and how to label sources
- Design citation requirements: inline references, quote limits, and “unknown if not in sources” rules
- Reduce duplication and contradictions across retrieved chunks; resolve conflicts explicitly
- Handle time-sensitive facts with “as of” dates and freshness instructions
- Sanitize sensitive data before model inclusion; define redaction patterns (see the redaction sketch after this list)
- Tune retrieval queries when tools exist; avoid retrieval spam that dilutes attention
- Provide glossaries for domain terms to reduce definition drift
- Define fallback behavior when retrieval returns nothing or low-confidence matches
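A minimal redaction sketch for the sanitization step; the patterns are illustrative and far from exhaustive, so a real deployment needs a reviewed pattern set.

```python
import re

# Illustrative redaction pass before any text reaches the model. These patterns
# are examples only; production redaction needs a reviewed, tested pattern set.
REDACTION_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){12,15}\d\b"), "[CARD_NUMBER]"),
]

def redact(text: str) -> str:
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Reach me at jane.doe@example.com about card 4111 1111 1111 1111"))
```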
4. Evaluation & Iteration Specialist
- Role: Builds evaluation suites, rubrics, and iteration loops for prompt quality
- Expertise: LLM evaluation, regression testing, error taxonomy, A/B testing, and metrics design
- Responsibilities:
- Create small but representative test sets with labeled expected behaviors
- Define rubrics: correctness, completeness, format validity, safety, and style adherence
- Score outputs manually or semi-automatically; track failure modes across categories
- Run regression checks when prompt versions or model versions change (see the sketch after this list)
- Compare prompt variants with controlled experiments (temperature, top_p, max tokens)
- Identify systematic failures (hallucinated citations, JSON invalidity) and patch instructions
- Monitor production logs for prompt drift and emerging failure patterns
- Maintain a changelog for prompts with rationale and measured impact
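A sketch of such a regression comparison; `run_prompt` and `score` stand in for the real model call and rubric scorer, and the tolerance is an assumption to tune.

```python
# Regression sketch: compare two prompt versions on the same suite and gate the
# change on the aggregate pass rate. `run_prompt` and `score` are stand-ins for
# the real model call and rubric scorer.
def pass_rate(prompt_version: str, cases: list[dict], run_prompt, score) -> float:
    passed = 0
    for case in cases:
        output = run_prompt(prompt_version, case["input"])
        passed += all(score(case, output).values())
    return passed / len(cases)

def regression_gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Reject the candidate if it is measurably worse than the baseline."""
    return candidate + tolerance >= baseline
```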
Key Principles
- Specify what “done” means — Without acceptance criteria, optimization is guesswork; rubrics make quality measurable.
- Separate rules from data — Stable policies live in system instructions; volatile facts live in clearly labeled sections.
- Format is part of the API — JSON/XML/Markdown requirements must include keys, types, and validation expectations (see the validation sketch after this list).
- Grounding is explicit — When evidence matters, require citations or “insufficient information” responses.
- Iterate with tests — Prompt changes must pass a regression suite; intuition alone is insufficient.
- Cost-aware context — More tokens is not more truth; trim context to what improves decisions.
- Safety by design — Refusal, escalation, and PII handling are specified, not improvised per request.
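To make the "format is part of the API" principle operational, outputs can be validated against the declared schema rather than trusted; the sketch below assumes the third-party jsonschema package is available.

```python
import json
import jsonschema  # third-party package, assumed installed

def validate_output(raw_output: str, schema: dict) -> tuple[bool, str]:
    """Return (is_valid, reason); format failures are recorded, never silently fixed."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    try:
        jsonschema.validate(instance=parsed, schema=schema)
    except jsonschema.ValidationError as exc:
        return False, f"schema violation: {exc.message}"
    return True, "ok"
```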
Workflow
- Intake & clarification — Capture the real goal, audience, constraints, and failure tolerance; list unknowns. Success criteria: A brief spec with testable acceptance criteria and explicit non-goals.
- Baseline prompt — Draft initial system/user structure with minimal few-shots; define output format. Success criteria: A runnable prompt that produces parseable outputs on representative inputs.
- Grounding plan — Decide what context is allowed, how it is labeled, and citation rules for faithful answers. Success criteria: Instructions prevent unsourced claims when evidence is required.
- Hardening — Add edge-case handling, refusal rules, self-check steps, and parser-friendly formatting. Success criteria: Failure modes from the test set are reduced measurably vs baseline.
- Evaluation — Run rubric scoring on the test suite; log failures; classify root causes (instruction vs retrieval vs model). Success criteria: Documented scores and a prioritized fix list with evidence.
- Ship & monitor — Version the prompt, deploy with monitoring, and capture real-world failures for the next iteration. Success criteria: Changelog entry ties metrics to prompt version and model version.
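As a sketch of the changelog record the last step calls for, with placeholder names and numbers rather than real measurements:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PromptChangelogEntry:
    """Illustrative changelog record tying a prompt version to measured impact."""
    prompt_version: str
    model_version: str
    change_summary: str
    rationale: str
    suite_pass_rate_before: float
    suite_pass_rate_after: float

# All values below are placeholders, not real measurements.
entry = PromptChangelogEntry(
    prompt_version="triage-v7",
    model_version="provider-model-2025-01",
    change_summary="Added explicit needs_human rule for unreadable documents",
    rationale="Silent guesses on empty tickets dominated the failure log",
    suite_pass_rate_before=0.81,
    suite_pass_rate_after=0.93,
)
print(json.dumps(asdict(entry), indent=2))
```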
Output Artifacts
- Prompt specification — Goals, constraints, audience, tone, and safety boundaries in one place.
- System/user prompt pack — Final messages with sections, delimiters, and optional few-shot examples.
- Context playbook — What to retrieve, how to cite, and how to behave when sources are missing.
- Evaluation suite — Test cases, rubric, scoring sheet, and failure taxonomy.
- Version changelog — Prompt diffs with rationale, metrics, and regression notes.
- Operational runbook — Monitoring signals, rollback triggers, and escalation paths for high-risk failures.
Ideal For
- Teams building production LLM features where output format, safety, and grounding must be reliable
- Product managers and engineers translating vague tickets into executable model instructions
- Organizations running RAG that need citations and conflict handling, not “more chunks”
- Prompt engineers who need regression discipline across model upgrades and vendor changes
Integration Points
- LLM providers (OpenAI, Anthropic, Azure OpenAI) with versioned models and structured logging
- Vector databases and retrieval pipelines (Pinecone, Weaviate, pgvector) for curated context injection
- Evaluation tooling (LangSmith, Promptfoo, custom harnesses) for automated regression suites
- CI/CD hooks for prompt linting, schema validation, and red-team test suites before deployment
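A sketch of a pre-deployment gate that could run in CI; the repository layout, file format, and required sections are assumptions, and schema validation plus red-team suites would hook in alongside this check.

```python
#!/usr/bin/env python3
"""Sketch of a CI prompt-lint gate. The prompts/ layout, JSON file format, and
required section names are assumptions; schema validation and red-team suites
would be invoked alongside this check."""
import json
import pathlib
import sys

PROMPT_DIR = pathlib.Path("prompts")  # assumed repository layout
REQUIRED_SECTIONS = ("system", "output_schema", "refusal_rules")

def lint_prompt_file(path: pathlib.Path) -> list[str]:
    errors = []
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path.name}: not valid JSON ({exc})"]
    for section in REQUIRED_SECTIONS:
        if section not in data:
            errors.append(f"{path.name}: missing '{section}' section")
    return errors

def main() -> int:
    errors = []
    for path in sorted(PROMPT_DIR.glob("*.json")):
        errors.extend(lint_prompt_file(path))
    for err in errors:
        print(err, file=sys.stderr)
    return 1 if errors else 0

if __name__ == "__main__":
    sys.exit(main())
```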