Warm pools eliminate most cold starts; autoscaling is token-aware. Sub-300 ms p95 is theoretically achievable for short prompts on Gemma 2B with a warm replica.
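As a rough illustration of what "token-aware" autoscaling could mean, the sketch below sizes the replica count from the backlog of pending tokens rather than from the request count, and never scales below a warm-pool floor. The text does not describe the actual policy, so the function name, parameters, and the drain-time target are all hypothetical assumptions.

```python
import math


def desired_replicas(
    pending_tokens: int,
    tokens_per_sec_per_replica: float,
    target_drain_seconds: float,
    warm_pool_min: int,
    max_replicas: int,
) -> int:
    """Return a replica count that could drain the token backlog in time.

    Hypothetical policy: every name and parameter here is an assumption
    made for illustration, not the system's documented behavior.
    """
    if tokens_per_sec_per_replica <= 0 or target_drain_seconds <= 0:
        raise ValueError("throughput and drain target must be positive")

    # Tokens one replica can process within the target drain window.
    capacity_per_replica = tokens_per_sec_per_replica * target_drain_seconds

    # Replicas needed for the current backlog, rounded up.
    needed = math.ceil(pending_tokens / capacity_per_replica)

    # Keep at least the warm pool (avoids cold starts), cap at the maximum.
    return max(warm_pool_min, min(max_replicas, needed))
```

Scaling on tokens instead of requests means one long prompt and many short prompts are weighted by the work they actually represent. The warm-pool floor is what keeps an idle service from scaling to zero and paying a cold start on the next request.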