Google Competition

MLSys DAG Scheduling Challenge

Designing optimal execution schedulers for computational graphs on memory-constrained AI accelerators.

Problem Statement

Status: Active

Challenge: Execute massive computational graphs on hardware with limited on-chip memory by orchestrating complex data movement strategies.

Design a scheduler that analyzes a Directed Acyclic Graph (DAG) of operations and generates an execution strategy that minimizes total latency while respecting all memory capacity limits.

Core Concepts

Memory Hierarchy

  • Slow Memory: Infinite capacity, limited bandwidth - all inputs start here, all outputs must end here
  • Fast Memory: High-speed scratchpad with finite capacity (e.g., 50KB) - compute cores can only access data here
  • Ephemeral Data: When operations are grouped, intermediate data passes directly between ops without touching fast memory (zero capacity cost)

Execution Granularity

Control how the hardware slices computation using a 3D configuration [w, h, k]:

  • w, h: Spatial dimensions - size of output slice (w × h)
  • k: Reduction depth - for MatMul operations, controls dot product step size

Key Constraints

  • Working Set Limit: Sum of input slices + output slices must fit in fast memory
  • Native Granularity: Hardware has minimum execution size (e.g., 128×128) - smaller sizes incur padding overhead
  • Roofline Model: Latency = max(compute_time, memory_transfer_time)
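
The constraints above can be sketched in a few lines of Python; the function names and the concrete numbers are ours, chosen to mirror the examples later in this page (e.g. bandwidth 10), not part of the official spec:

```python
def roofline_latency(compute_cost, bytes_moved, bandwidth):
    """Latency of one execution step under a simple roofline model:
    the step is bound by whichever of compute or memory transfer is slower."""
    transfer_time = bytes_moved / bandwidth
    return max(compute_cost, transfer_time)

def working_set_fits(input_slices, output_slices, fast_capacity):
    """Working-set constraint: all input and output slices resident for a
    step must fit in fast memory simultaneously."""
    return sum(input_slices) + sum(output_slices) <= fast_capacity

# Illustrative step: 3,276.8 compute cycles vs. 65,536 bytes at bandwidth 10.
lat = roofline_latency(3276.8, 65536, 10)          # memory bound: 6,553.6
ok = working_set_fits([16384, 16384], [16384], 50000)
```

A step whose transfer time exceeds its compute time is "memory bound", which is exactly the regime the spilling examples below fall into.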

Optimization Strategies

1. Subgraph Grouping

Group connected operations to make intermediate data ephemeral, eliminating memory transfers and capacity consumption.

2. Granularity Tuning

Balance between memory footprint and compute efficiency by adjusting tile sizes. Smaller tiles fit in memory but may incur padding overhead.

3. Data Residency Management

Decide which tensors to keep resident in fast memory vs. spill to slow memory: retaining costs fast-memory capacity, spilling costs memory bandwidth, and discarding costs recomputation.

4. Traversal Order Optimization

Process tiles in "zig-zag" or "snake" patterns to maximize data reuse and minimize revisits to slow memory.
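A snake (boustrophedon) order over a grid of tiles can be generated as below; the function name is ours, and the 2×2 grid matches the MatMul revisit example later on this page:

```python
def snake_order(rows, cols):
    """Yield (row, col) tile coordinates in a snake/zig-zag pattern:
    even rows left-to-right, odd rows right-to-left, so consecutive tiles
    are always adjacent and resident operand slices can be reused."""
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in cs:
            yield (r, c)

# 2x2 tile grid (e.g. a 128x128 MatMul at 64x64 granularity):
order = list(snake_order(2, 2))  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Compared with naive raster order, the snake never jumps back to column 0, which is the source of the reload-every-tile cost in the naive schedule.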

5. Split-K Pipelining

For chained MatMuls, use small reduction depths to stream data and minimize intermediate tensor size in fast memory.
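The footprint saving is simple arithmetic; the helper below is an illustrative sketch (the exact pipelining semantics are defined by the competition, not by this formula):

```python
def intermediate_elems(w, h, k=None):
    """Elements of the (A @ B) intermediate held in fast memory:
    full materialization keeps the whole w x h tensor; a split-K
    pipeline with reduction step k streams only a w x k slice at a time."""
    return w * h if k is None else w * k

full = intermediate_elems(128, 128)        # 16384 elements
split = intermediate_elems(128, 128, 32)   # 4096 elements: 4x smaller
```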

Input/Output Format

Input Specification (JSON)

The problem is specified in a JSON file containing graph topology and hardware specs:

  • widths, heights: Dimensions of each tensor
  • inputs: List of tensor IDs consumed by each operation
  • outputs: List of tensor IDs produced by each operation
  • base_costs: Compute cost for each operation
  • op_types: Operation type (MatMul or Pointwise)
  • fast_memory_capacity: Available fast memory (e.g., 25000)
  • slow_memory_bandwidth: Transfer rate (e.g., 10)
  • native_granularity: Hardware native execution size [w, h]
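
A minimal problem instance with these fields might look like the following; every concrete value here is invented for illustration and is not taken from the official examples:

```python
import json

# Hypothetical instance: one Pointwise op mapping tensor 0 to tensor 1.
problem = {
    "widths": [128, 128],
    "heights": [128, 128],
    "inputs": [[0]],              # op 0 consumes tensor 0
    "outputs": [[1]],             # op 0 produces tensor 1
    "base_costs": [3276.8],
    "op_types": ["Pointwise"],
    "fast_memory_capacity": 25000,
    "slow_memory_bandwidth": 10,
    "native_granularity": [128, 128],
}
spec = json.loads(json.dumps(problem))   # round-trips as valid JSON
```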

Output Specification (JSON)

Must provide a complete execution schedule with:

  • subgraphs: List of operation groups to execute
  • granularities: [w, h, k] configuration for each subgraph
  • tensors_to_retain: Which tensors to keep in fast memory after each step
  • traversal_orders: Optional custom tile execution order (e.g., "snake" pattern)
  • subgraph_latencies: Calculated latency for each execution step
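
A schedule answering a single-op problem could then be expressed as below; again, the concrete values are invented placeholders:

```python
import json

# Hypothetical schedule with one single-op subgraph; all values invented.
schedule = {
    "subgraphs": [[0]],                  # one group containing op 0
    "granularities": [[128, 128, 128]],  # [w, h, k] per subgraph
    "tensors_to_retain": [[]],           # nothing kept resident afterwards
    "traversal_orders": [None],          # default raster order
    "subgraph_latencies": [3276.8],
}
text = json.dumps(schedule)              # serializes as valid JSON
```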

Key Trade-offs

Spilling vs. Recomputation

Spilling: Save intermediate results to slow memory (costs bandwidth, saves compute)
Recomputation: Discard intermediates and recompute when needed (costs compute, saves bandwidth)

Granularity Selection

Large tiles: Efficient compute (matches native granularity) but high memory footprint
Small tiles: Fit in memory but incur padding overhead and more execution steps
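
The padding penalty is easy to quantify. In the sketch below (the function is ours, assuming each dimension is rounded up to the native granularity), a 64×64 tile on 128×128 hardware pays for a full native execution:

```python
import math

def padded_compute_cost(tile_w, tile_h, native_w, native_h, cost_per_native):
    """Compute cost of one tile when the hardware rounds each dimension up
    to its native granularity before executing."""
    w = math.ceil(tile_w / native_w) * native_w
    h = math.ceil(tile_h / native_h) * native_h
    return cost_per_native * (w * h) / (native_w * native_h)

small = padded_compute_cost(64, 64, 128, 128, 100.0)    # 100.0: 4x padding waste
large = padded_compute_cost(256, 128, 128, 128, 100.0)  # 200.0: two native units
```

Under this model, four 64×64 tiles cost as much compute as four full 128×128 executions, which is why the 64×64 mega-group in Example 1 becomes compute bound.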

Grouping Strategy

Mega-grouping: Combine many ops to eliminate intermediate transfers
Fine-grained: Execute ops separately for more flexibility in memory management

Example Scenarios

Example 1: Baseline (128×128 Pointwise Chain)

Two sequential pointwise operations. Strategy comparison:

  • Always Spill: 6,553.6 latency (memory bound, 2× slower)
  • Mega-Group (128×128): 3,276.8 latency (memory bound, optimal)
  • Mega-Group (64×64): 4,400 latency (compute bound, padding overhead)

Example 2: Diamond Graph (Skip Connection)

Intermediate tensor needed by two downstream branches:

  • Spilling (Cache): 11,468.8 latency (save to slow memory, reload later)
  • Recomputation (Flash): 6,276.8 latency (45% faster, recompute when needed)
  • Selective Residency (Hybrid): 4,638.4 latency (60% faster, keep in fast memory)

Example 3: MatMul with Revisit Optimization

128×128 MatMul with 64×64 granularity creates 4 tiles:

  • Naive Raster Order: 8,192 latency (reload data every tile)
  • Zig-Zag Traversal: 6,548 latency (20% faster, reuse resident data)

Example 4: Chained MatMul with Split-K

Computing (A @ B) @ C with tight memory:

  • Full Materialization (k=128): OOM error (intermediate too large)
  • Split-K Pipeline (k=32): 6,915.2 latency (streams data, fits in memory)

Implementation Approach

  • Languages: Python for scheduler logic, C++ for performance-critical paths
  • Graph Analysis: NetworkX for DAG traversal and dependency analysis
  • Optimization: Dynamic programming, greedy heuristics, genetic algorithms
  • Simulation: Custom roofline model simulator for latency calculation
  • Validation: JSON schema validation, memory constraint checking
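
As a starting point, any valid schedule must respect dependency order in the DAG. The stdlib `graphlib` sketch below shows the idea (NetworkX's `topological_sort` behaves analogously); the 4-op diamond graph is invented:

```python
from graphlib import TopologicalSorter

# Hypothetical 4-op diamond DAG, expressed as predecessors of each op:
# op 0 feeds ops 1 and 2, which both feed op 3.
deps = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}

order = list(TopologicalSorter(deps).static_order())
# op 0 must come first and op 3 last in any valid schedule
```

A scheduler can then enumerate groupings and granularities along this order, scoring each candidate with the roofline simulator and rejecting any step that violates the fast-memory capacity.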

Progress & Updates

Current Status

Phase: Algorithm development and baseline implementation
Focus: Implementing core scheduling algorithms and testing on example cases

Updates on competition progress, experimental results, and optimization discoveries will be posted here as work progresses.
