THN Interview Prep

Modern Scenarios (AI & Scale)

As systems engineering evolves, modern interview loops require an understanding of AI infrastructure, GPU resource management, and safety frameworks. This page covers these real-world 2026 scenarios in depth.


🤖 1. LLM Serving & Embedding Pipelines

Serving Large Language Models (LLMs) is highly resource-intensive and latency-sensitive. A plain HTTP request/response model fits poorly with LLM text generation, where each response is produced incrementally, token by token, over a number of decoding steps that varies per request.

Key LLM Serving Architectures

Request ──► API Gateway / Load Balancer
                    │
                    ▼
       [ Continuous Batching Router ]
                    │
                    ▼
     [ GPU Serving Nodes (PagedAttention) ]
       ├── Tensor Parallelism (Intra-Layer)
       └── Pipeline Parallelism (Layer-by-Layer)

  1. Continuous Batching (Iteration-Level Batching):
    • Problem: Static batching waits for every sequence in a batch to finish generating before the batch's GPU resources are released. Because generation length varies, short sequences leave compute idle while the longest one finishes.
    • Solution: Continuous batching schedules at the granularity of a single token iteration. As soon as one sequence completes (reaches the end-of-sequence token), a waiting request takes over its slot in the next iteration.
  2. KV Cache Management (PagedAttention):
    • Problem: The attention mechanism of LLMs requires storing keys/values for all previous tokens in GPU memory. This is called the KV Cache. Servers that pre-allocate one contiguous region per request, sized for the maximum sequence length, can waste 60-80% of KV Cache memory to fragmentation and over-reservation.
    • Solution: PagedAttention divides the KV Cache into fixed-size pages (blocks), similar to virtual memory in operating systems. A sequence's pages can be non-contiguous in physical memory, so the server allocates memory on demand as sequences grow. This removes external fragmentation and limits waste to the last, partially filled page of each sequence.
  3. Model Parallelism:
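    • Tensor Parallelism (Intra-operator): Splits the weight matrices inside each layer across multiple GPUs, which exchange partial results at every layer. Required when the weights of a single layer exceed a single GPU's memory.
    • Pipeline Parallelism (Inter-operator): Splits model layers sequentially across a chain of GPUs. Each GPU holds a contiguous group of layers and passes its activations to the next stage.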
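
The gap between static and continuous batching can be illustrated with a toy simulation. This is a minimal sketch, not how any particular serving engine is implemented: it assumes each request is fully described by the number of tokens it will generate, ignores the prefill phase, and counts one "step" per decode iteration. The function names are made up for this example.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: free slots are refilled at every decode iteration."""
    waiting = deque(lengths)   # tokens each queued request will generate (>= 1)
    running = []               # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active sequence emits one token.
        running = [remaining - 1 for remaining in running]
        # Sequences that hit end-of-sequence release their slot immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps

lengths = [2, 8, 3, 8, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

With two slots, the static scheduler spends 18 iterations because every batch is held open by its longest sequence, while the continuous scheduler finishes the same 25 tokens in 13 iterations.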

Embedding & Vector Ingestion Pipelines

  • Vector Database Indexing: High-dimensional search is accelerated using Approximate Nearest Neighbor (ANN) index structures:
    • HNSW (Hierarchical Navigable Small World): Builds a multi-layered graph where searches start in the sparse upper layers and descend into the dense bottom layer. Offers fast lookup times but high memory consumption.
    • IVF (Inverted File Index): Clusters vectors and limits the search to the clusters whose centroids are nearest to the query. Reduces memory footprint at the expense of accuracy.
  • Ingestion Pipeline: Reads source documents → Splits text using semantic chunking → Computes embeddings via embedding worker pools → Writes to Vector DB while index rebuild runs asynchronously.
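
The IVF trade-off described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real vector database API: the `IVFIndex` class and its `nprobe` parameter are named here for the example, and the centroids are fixed by hand, whereas a real index would typically learn them with k-means.

```python
import math

class IVFIndex:
    """Toy inverted-file index: one list of vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = centroids             # assumed given; normally learned
        self.lists = [[] for _ in centroids]   # inverted lists, one per cluster

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: math.dist(vec, self.centroids[i]))
        return order[:n]

    def add(self, item_id, vec):
        # Each vector is stored only in the list of its closest centroid.
        cluster = self._nearest_centroids(vec, 1)[0]
        self.lists[cluster].append((item_id, vec))

    def search(self, query, k=1, nprobe=1):
        # Scan only the nprobe clusters closest to the query.
        candidates = []
        for cluster in self._nearest_centroids(query, nprobe):
            candidates.extend(self.lists[cluster])
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [item_id for item_id, _ in candidates[:k]]

index = IVFIndex(centroids=[(0.0, 0.0), (10.0, 0.0)])
index.add("a", (4.9, 0.0))   # lands in cluster 0
index.add("b", (5.6, 0.0))   # lands in cluster 1
index.add("c", (1.0, 0.0))   # lands in cluster 0

query = (5.1, 0.0)
print(index.search(query, k=1, nprobe=1))  # ['b'] (misses the true neighbour)
print(index.search(query, k=1, nprobe=2))  # ['a'] (exact, but scans everything)
```

The query sits just past the cluster boundary, so probing one cluster returns "b" even though "a" is closer. Raising `nprobe` recovers the exact answer at the cost of scanning more vectors, which is the accuracy versus cost dial that IVF exposes.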

🏎️ 2. GPU Resource Management

Orchestrating GPU workloads is fundamentally different from CPU container scheduling due to high hardware costs and rigid memory boundaries.

  • Fractional GPU Sharing (MIG): NVIDIA Multi-Instance GPU (MIG) partitions a single physical GPU into up to 7 isolated hardware instances, each with its own dedicated compute, memory, and cache, so a workload on one instance cannot read or starve another instance's memory.
  • GPU Queueing & Schedulers: Schedulers (e.g., Run:AI, Kubernetes custom schedulers) prioritize workloads. Training jobs are batch-scheduled (throughput-oriented), while inference tasks are real-time scheduled (latency-oriented, requiring preemptive priority to guarantee SLAs).
  • Warm Booting vs Cold Start: LLM weights can take minutes to load into GPU memory. Systems therefore keep a pool of warm instances with weights already loaded and route traffic spikes to them, rather than making requests wait on a cold start.
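
The scheduling split between training and inference can be sketched as a priority queue with preemption. This is a simplified model under several assumptions: every job needs exactly one GPU, there are only two priority classes, and preempting a training job is free (a real system would checkpoint it first). The `GpuScheduler` class is hypothetical and does not reflect the API of Run:AI or Kubernetes.

```python
import heapq
import itertools

INFERENCE, TRAINING = 0, 1   # lower value means higher priority

class GpuScheduler:
    def __init__(self, num_gpus):
        self.free = num_gpus
        self.queue = []                 # heap of (priority, arrival_seq, job)
        self.running = {}               # job name -> priority
        self._seq = itertools.count()   # tie-breaker keeps FIFO order per class

    def submit(self, name, priority):
        heapq.heappush(self.queue, (priority, next(self._seq), name))
        self._dispatch()

    def finish(self, name):
        del self.running[name]
        self.free += 1
        self._dispatch()

    def _dispatch(self):
        while self.queue:
            priority, _, name = self.queue[0]
            if self.free == 0:
                # Only jobs of strictly lower priority may be preempted.
                victims = [n for n, p in self.running.items() if p > priority]
                if not victims:
                    return
                victim = max(victims, key=self.running.get)
                # Requeue the victim so it resumes when a GPU frees up.
                heapq.heappush(
                    self.queue,
                    (self.running.pop(victim), next(self._seq), victim))
                self.free += 1
            heapq.heappop(self.queue)   # still the job peeked above
            self.running[name] = priority
            self.free -= 1

sched = GpuScheduler(num_gpus=1)
sched.submit("train-a", TRAINING)
sched.submit("chat-1", INFERENCE)   # preempts train-a
print(sorted(sched.running))        # ['chat-1']
sched.finish("chat-1")
print(sorted(sched.running))        # ['train-a']
```

The inference request takes the only GPU immediately, and the training job goes back to the queue and resumes once the GPU is released, which mirrors the latency-oriented versus throughput-oriented split described above.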

🛡️ 3. Safeguards for AI Action Systems (AI Agents)

When an AI system can take actions on behalf of a user (e.g., sending emails, editing code, making purchases), security and safety safeguards must be engineered into the architecture.

User Request ──► [ Prompt Sanitizer ] ──► [ LLM Processor ] ──► [ Action Plan ]


[ Human-in-the-Loop Approval ] ◄──► [ Sandbox Execution ] ◄──► [ Output Guardrail ]

Safety Architecture Layers

  1. Prompt Injection & Input Guardrails:
    • Input Sanitization: Scan user queries for malicious jailbreak patterns, prompt injection sequences, or attempts to bypass system constraints before forwarding to the model.
  2. Isolated Execution Sandboxes:
    • If the agent writes or executes code, it must run inside a highly isolated, ephemeral sandbox environment (e.g., Firecracker microVMs or Docker containers with read-only root filesystems and restricted network access).
  3. Output Analysis & Action Guardrails:
    • Structured Output Validation: Ensure the LLM output conforms to a strict JSON schema before executing actions.
    • Policy Checks: Compare planned actions against a security policy engine (e.g., blocking writes to sensitive API endpoints).
  4. Human-In-The-Loop (HITL) Authorizations:
    • Sensitive Mutations: Any sensitive mutation (like a financial transfer or a database delete) must trigger an asynchronous confirmation request to the user (for example, via a webhook), pausing agent execution until the action is manually approved.
  5. Strict Token Rate-Limiting:
    • Limit the rate of actions and model tokens consumed per user session to prevent runaway agent loops (e.g., an agent executing an infinite write loop).
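
Layers 3 to 5 above can be combined into a single gate that every planned action passes through before execution. The following is a minimal sketch under stated assumptions: the action names, the `ALLOWED_ACTIONS` table, and the `ActionGate` class are all hypothetical, the "schema" is just a set of required argument names, and human approval is reduced to a boolean flag rather than a real webhook round trip.

```python
import time
from dataclasses import dataclass, field

# Hypothetical policy: allowed actions and the exact argument names they take.
ALLOWED_ACTIONS = {
    "send_email": {"to", "body"},
    "delete_record": {"table", "id"},
}
NEEDS_APPROVAL = {"delete_record"}   # sensitive mutations require a human

@dataclass
class ActionGate:
    max_actions: int = 3             # actions allowed per sliding window
    window_seconds: float = 60.0
    _timestamps: list = field(default_factory=list)

    def check(self, action, approved=False, now=None):
        now = time.monotonic() if now is None else now

        # 1. Structured output validation: reject anything off-schema.
        name = action.get("name") if isinstance(action, dict) else None
        if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
            return "reject: unknown action"
        args = action.get("args")
        if not isinstance(args, dict) or set(args) != ALLOWED_ACTIONS[name]:
            return "reject: bad arguments"

        # 2. Rate limit: stop runaway loops within the sliding window.
        self._timestamps = [t for t in self._timestamps
                            if now - t < self.window_seconds]
        if len(self._timestamps) >= self.max_actions:
            return "reject: rate limit"

        # 3. Human-in-the-loop: pause sensitive actions until approved.
        if name in NEEDS_APPROVAL and not approved:
            return "pending: human approval"

        self._timestamps.append(now)
        return "allow"

gate = ActionGate(max_actions=2)
email = {"name": "send_email", "args": {"to": "a@example.com", "body": "hi"}}
delete = {"name": "delete_record", "args": {"table": "users", "id": 7}}

print(gate.check(email, now=0.0))                  # allow
print(gate.check(delete, now=1.0))                 # pending: human approval
print(gate.check(delete, approved=True, now=2.0))  # allow
print(gate.check(email, now=3.0))                  # reject: rate limit
```

Only actions that actually run count against the rate limit, so a request that is still waiting on approval does not consume budget. The gate decides whether to run an action; the execution itself would still happen inside the sandbox from layer 2.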

📦 4. Distributing Massive Files to Thousands of Machines

Distributing large files (such as 100GB+ ML model weights or large container images) to tens of thousands of machines concurrently creates massive network bottlenecks at the origin servers.

Architectural Approaches

  1. Peer-to-Peer (P2P) Distribution (BitTorrent / Kraken style):
    • Mechanism: The origin server splits the large file into small, hashed chunks (e.g., 4 MB), so each chunk can be verified independently of where it was downloaded from.
    • Flow: The initial machines download chunks from the origin and immediately act as "seeders" to share those chunks with other peer nodes in the network.
    • Tracker/DHT: A centralized Tracker server or Distributed Hash Table (DHT) coordinates which nodes hold which chunks.
    • Benefit: Aggregate upload bandwidth grows roughly linearly with the number of nodes, because every downloader also serves chunks, so the origin stops being the bottleneck.
  2. Master-Worker Distribution Trees:
    • Mechanism: Nodes are arranged in a hierarchical tree topology.
    • Flow: The master node sends the file to a small group of tier-1 worker nodes. Each tier-1 node then distributes it to tier-2 nodes, repeating down the tree.
    • Benefit: Limits concurrent connections on the source while offering controlled, deterministic delivery.
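
The chunk-and-hash mechanism behind the P2P approach can be sketched as follows. This is an illustrative outline rather than the BitTorrent or Kraken wire format: the function names are invented for the example, SHA-256 is an assumed choice of hash, and the demo uses a tiny chunk size in place of the 4 MB mentioned above.

```python
import hashlib

def build_manifest(data: bytes, chunk_size: int) -> list:
    """Origin side: split the file and publish one hash per chunk."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def verify_chunk(manifest: list, index: int, chunk: bytes) -> bool:
    """Peer side: check a chunk against the manifest before accepting it."""
    return hashlib.sha256(chunk).hexdigest() == manifest[index]

def assemble(manifest: list, received: dict) -> bytes:
    """Rebuild the file from chunks that arrived from any peer, in any order."""
    parts = []
    for index in range(len(manifest)):
        chunk = received.get(index)
        if chunk is None or not verify_chunk(manifest, index, chunk):
            raise ValueError(f"chunk {index} is missing or corrupt")
        parts.append(chunk)
    return b"".join(parts)

data = bytes(range(256)) * 4                 # stand-in for a large file
manifest = build_manifest(data, chunk_size=100)

# Simulate chunks arriving out of order from different peers.
received = {i: data[i * 100:(i + 1) * 100]
            for i in reversed(range(len(manifest)))}
assert assemble(manifest, received) == data

# A chunk tampered with by an untrusted peer fails verification.
print(verify_chunk(manifest, 0, b"garbage"))  # False
```

Because each chunk is verified against the manifest, a node can safely accept chunks from untrusted peers and start re-serving a chunk as soon as it passes the check, without waiting for the whole file.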
