Intelligence
One shared layer picks the right model for any request. Two tools surface it: recommend for a single generation step, plan_generation for multi-step workflows.
How model picking works
Every pick — whether it lands in recommend's ranked list or fills a step in a plan_generation pipeline — flows through the same resolver:
- Filter on hard constraints. Capability, required input features, cost / latency caps, duration. Models that don't qualify drop out.
- Enrich with live data. Join the per-capability performance variant (Arena ELO, p50 generation latency, p50 cost per asset — measured from 180 days of production traffic) plus curated usage insights and tags.
- Score and sort. Tag-based subcategory match (drops off-topic candidates), fuzzy boost on inferred intent (style, use case), priority metric (quality / speed / cost), recency tiebreak. Tags carry signals like `deprecated:<successor>`, so retired models are surfaced behind their replacements automatically.
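The three stages above can be sketched in Python. This is a minimal illustration, not Scenario's implementation: the `Candidate` fields, the `CATALOG` and `PERF` sample data, and the sort key are all hypothetical. Only the stage order (filter, enrich, score and sort) and the `deprecated:<successor>` tag convention come from the text, and the fuzzy intent boost and recency tiebreak are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model_id: str
    tags: frozenset
    elo: float        # Arena ELO, higher is better
    latency_s: float  # p50 generation latency in seconds
    cost: float       # p50 cost per asset


# Hypothetical catalog and performance data, for illustration only.
CATALOG = [
    {"id": "model-a", "capability": "txt2img", "tags": {"logo"}},
    {"id": "model-b", "capability": "txt2img", "tags": {"logo", "photo"}},
    {"id": "model-old", "capability": "txt2img", "tags": {"logo", "deprecated:model-a"}},
    {"id": "model-v", "capability": "txt2video", "tags": set()},
]
PERF = {  # (model id, capability) -> (elo, p50 latency, p50 cost)
    ("model-a", "txt2img"): (1200, 10.0, 0.04),
    ("model-b", "txt2img"): (1100, 3.0, 0.01),
    ("model-old", "txt2img"): (1250, 8.0, 0.03),
    ("model-v", "txt2video"): (1150, 60.0, 0.50),
}

# Each priority maps to a sort key where smaller means better.
PRIORITY_KEYS = {
    "quality": lambda c: -c.elo,
    "speed": lambda c: c.latency_s,
    "cost": lambda c: c.cost,
}


def resolve(capability, priority="quality", max_cost=None, subcategory=None):
    candidates = []
    for entry in CATALOG:
        # 1. Filter on hard constraints (capability first).
        if entry["capability"] != capability:
            continue
        # 2. Enrich with the per-capability performance record.
        elo, latency_s, cost = PERF[(entry["id"], capability)]
        # The cost cap is also a hard constraint; it is checked once cost is known.
        if max_cost is not None and cost > max_cost:
            continue
        candidates.append(
            Candidate(entry["id"], frozenset(entry["tags"]), elo, latency_s, cost)
        )

    # 3. Score and sort: the tag match drops off-topic candidates.
    if subcategory is not None:
        candidates = [c for c in candidates if subcategory in c.tags]

    def deprecated(c):
        return any(tag.startswith("deprecated:") for tag in c.tags)

    metric = PRIORITY_KEYS[priority]
    # Deprecated models sort after every live model, then by the priority metric.
    return sorted(candidates, key=lambda c: (deprecated(c), metric(c)))


print([c.model_id for c in resolve("txt2img")])
```

Sorting on the tuple `(deprecated, metric)` is one simple way to keep a retired model behind its replacement even when its ELO is higher.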
The full data set (Arena scores, latency / cost percentiles, model tags, curated insights) refreshes weekly and is based on real platform traffic. There's no editorial table of "best models per category"; the data does the ranking, with editorial knowledge limited to behavior notes the data can't capture (e.g. "use the same model for every clip when chaining").
Because the rankings are computed from platform traffic, they become more accurate the more Scenario is used.
Top picks today
For each major capability, the top model the resolver returns when you ask for quality.
| Capability | Slug | Top pick (priority=quality) |
|---|---|---|
| Text → image | txt2img | GPT Image 2 — model_openai-gpt-image-2 |
| Image → image (edit) | img2img | GPT Image 2 — model_openai-gpt-image-2 |
| Text → video | txt2video | Seedance 2.0 — model_bytedance-seedance-2-0 |
| Image → video | img2video | Seedance 2.0 — model_bytedance-seedance-2-0 |
| Video → video | video2video | Grok Extend Video — model_xai-grok-extend-video |
| Text → audio | txt2audio | xAI Grok TTS — model_xai-grok-tts |
| Image → 3D | img23d | Pixal3D — model_pixal3d |
| Text → 3D | txt23d | Uthana Text-to-Motion — model_uthana-text-to-motion-bucmd |
Platform-aware formatting
Mention a target platform in the prompt — "TikTok", "Instagram story", "YouTube thumbnail" — and the resolver applies the matching aspect ratio and resolution without manual configuration. Recognized platforms:
| Platform | Aspect Ratio | Resolution | Note |
|---|---|---|---|
| instagram_post | 4:5 | 1080px | Instagram feed post |
| instagram_story | 9:16 | 1080px | Instagram/TikTok story |
| instagram_reel | 9:16 | 1080px | Instagram Reel |
| tiktok | 9:16 | 1080px | TikTok vertical video |
| youtube_thumbnail | 16:9 | 1280px | YouTube thumbnail |
| youtube_video | 16:9 | 1920px | YouTube video frame |
| twitter_post | 16:9 | 1200px | Twitter/X post image |
| linkedin_post | 1:1 | 1080px | LinkedIn square post |
| facebook_post | 1:1 | 1080px | Facebook post |
| facebook_cover | 16:9 | 1640px | Facebook cover photo |
| | 2:3 | 1000px | Pinterest pin |
| app_icon | 1:1 | 1024px | Mobile app icon |
| game_asset | 1:1 | 1024px | Game asset (power of 2) |
| game_texture | 1:1 | 1024px | Seamless game texture |
| print_a4 | 3:4 | 2480px | A4 print at 300 DPI |
| print_poster | 2:3 | 3000px | Poster print |
| wallpaper_desktop | 16:9 | 2560px | Desktop wallpaper |
| wallpaper_phone | 9:16 | 1440px | Phone wallpaper |
| banner_web | 16:9 | 1920px | Website hero banner |
| email_header | 16:9 | 600px | Email header image |
| avatar | 1:1 | 512px | Profile picture / avatar |
| ultrawide | 21:9 | 2560px | Ultrawide monitor |
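As a rough sketch of how a platform mention could turn into output dimensions, the Python below looks up a few rows of the table and derives a width and height. It rests on two assumptions that the text does not confirm: that the Resolution column is the output width in pixels, and that detection is a plain phrase match on the platform key. The function names `detect_platform` and `dimensions` are hypothetical.

```python
import re

# A subset of the table above: key -> ((ratio width, ratio height), resolution in px).
PLATFORMS = {
    "instagram_story": ((9, 16), 1080),
    "tiktok": ((9, 16), 1080),
    "youtube_thumbnail": ((16, 9), 1280),
    "linkedin_post": ((1, 1), 1080),
}


def detect_platform(prompt):
    """Return the longest platform key whose phrase appears in the prompt, else None."""
    # Lowercase and collapse punctuation so "YouTube-thumbnail" matches too.
    text = re.sub(r"[^a-z0-9]+", " ", prompt.lower())
    for key in sorted(PLATFORMS, key=len, reverse=True):
        if key.replace("_", " ") in text:
            return key
    return None


def dimensions(key):
    """Return (width, height) in pixels, treating the resolution as the width (assumption)."""
    (ratio_w, ratio_h), width_px = PLATFORMS[key]
    return width_px, round(width_px * ratio_h / ratio_w)


platform = detect_platform("A bold YouTube thumbnail for a cooking channel")
print(platform, dimensions(platform))
```

Under the width assumption, `instagram_story` works out to 1080 × 1920 and `youtube_thumbnail` to 1280 × 720.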
recommend
Single-step model picker. Pass a capability and a prompt describing your goal; recommend returns a ranked list with explanations citing real numbers (ELO rating, p50 latency, p50 cost), tradeoffs called out per candidate, and suggested input parameters ready to feed into the next generation call. A focused reasoning step sits on top of the shared resolver to produce the explanations.
plan_generation
Multi-step pipeline composer. Describe a workflow, such as "product video with voiceover", "character sheet", or "talking-head explainer", and plan_generation returns the recommended sequence of stages with model hints. Templates carry the workflow shape; each step calls the same resolver above to fill in the model. When a description doesn't match any template, an LLM fallback composes a custom plan, using the templates as worked examples plus editorial guidance.
The templates below show the shape and editorial rules for each workflow. Per-step model picks come from the same resolver behind recommend; see the "Top picks today" table above for what each capability resolves to right now.
product_video: Product image → hero video → optional voiceover → optional music
1. run_model: Generate hero product image
2. analyze: Clean transparent product image
3. run_model: Image-to-video product reveal
4. run_model: Generate voiceover (optional)
5. run_model: Generate background music (optional)
- Generate the hero image cleanly first; image-to-video quality follows directly from input quality.
- Voiceover and music are optional — only add when the request mentions them.
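The product_video template above can be sketched as data plus a small fill-in loop. This is only an illustration under stated assumptions: the `Step` class, the capability assigned to each step, and the keyword check for optional steps are hypothetical, and `TOP_PICKS` is a stand-in for the shared resolver that simply reuses three rows of the "Top picks today" table. A real resolver would also use the step description, for example to pick a music model rather than a TTS model for the last step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    tool: str
    description: str
    capability: Optional[str] = None         # assumed mapping; None means no model pick
    only_if_mentioned: Optional[str] = None  # optional steps need this word in the request


PRODUCT_VIDEO = [
    Step("run_model", "Generate hero product image", "txt2img"),
    Step("analyze", "Clean transparent product image"),
    Step("run_model", "Image-to-video product reveal", "img2video"),
    Step("run_model", "Generate voiceover", "txt2audio", only_if_mentioned="voiceover"),
    Step("run_model", "Generate background music", "txt2audio", only_if_mentioned="music"),
]

# Stand-in for the shared resolver: capability -> top pick from the table above.
TOP_PICKS = {
    "txt2img": "model_openai-gpt-image-2",
    "img2video": "model_bytedance-seedance-2-0",
    "txt2audio": "model_xai-grok-tts",
}


def plan(request, template=PRODUCT_VIDEO, resolve=TOP_PICKS.get):
    """Return (tool, description, model) stages, skipping optional steps not requested."""
    request = request.lower()
    stages = []
    for step in template:
        if step.only_if_mentioned and step.only_if_mentioned not in request:
            continue
        model = resolve(step.capability) if step.capability else None
        stages.append((step.tool, step.description, model))
    return stages


for stage in plan("Product video with voiceover"):
    print(stage)
```

Keeping the template as plain data and passing the resolver in as a function mirrors the split described above: the template owns the workflow shape, the resolver owns the model choice.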
talking_head: Portrait → voice → lipsync video
1. run_model: Generate portrait/headshot
2. run_model: Generate voiceover
3. run_model: Lipsync portrait + audio → talking video
- Pick a voice up front and reuse it across the whole sequence — switching voices breaks identity.
- Lipsync quality depends on the portrait being front-facing with a closed mouth.
game_asset_3d: Concept art → clean background → upscale → 3D conversion
1. run_model: Generate concept art (white bg, isometric)
2. analyze: Remove background for clean 3D input
3. run_model: Upscale 2x for better 3D texture quality (optional)
4. run_model: Convert to 3D model
- Background removal before 3D conversion eliminates phantom geometry.
- Upscaling 2-4x before 3D conversion yields cleaner textures.
- Isometric or front-facing concept views work best for single-image 3D conversion.
character_sheet: Reference character → multiple scenes via reference-image editing
1. run_model: Generate strong reference character image
2. run_model: Scene 1 (pass reference image + new scene prompt)
3. run_model: Scene 2 (same reference image + new scene prompt)
- Generate one strong reference image, then pass it as an image input to every scene generation.
- Models with the most reference-image slots maintain identity best across scenes — prefer them when the request mentions consistency.
- LoRA training is for repeated brand-style work at scale, not single character runs.
logo_to_vector: Generate logo → vectorize to SVG
1. run_model: Generate logo as vector
2. run_model: Vectorize raster output (alternative path) (optional)
- Direct SVG output is faster than raster→vectorize when the model supports it.
- If only raster is available, upscale before vectorization to get cleaner paths.
video_chain_long: Chain multiple video clips via last-frame handoff for longer content
1. run_model: Generate hero image for first frame
2. run_model: Image→video clip 1
3. manage_assets: Read clip 1's lastFrame asset ID
4. run_model: LastFrame→video clip 2 (same model as clip 1)
- Use the SAME image-to-video model for every clip — switching models mid-chain breaks visual continuity.
- Pick a model whose inputs include `endImage` (last-frame conditioning); without it you cannot chain cleanly.
- Store each clip's last-frame asset ID before kicking off the next clip.
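The handoff loop behind these rules can be sketched as follows. The two callables are hypothetical stand-ins: `run_model(model_id, image_id)` is assumed to return a clip ID, and `read_last_frame(clip_id)` is assumed to return that clip's last-frame asset ID, which is the role manage_assets plays in the template. The real tool inputs and outputs are not shown in this section, so the stubs below only exist to make the sketch runnable.

```python
def chain_clips(run_model, read_last_frame, model_id, first_frame_id, n_clips):
    """Generate n_clips in sequence, seeding each clip with the previous clip's last frame."""
    clip_ids = []
    frame_id = first_frame_id
    for _ in range(n_clips):
        # The same model_id is used for every clip to keep visual continuity.
        clip_id = run_model(model_id, frame_id)
        clip_ids.append(clip_id)
        # Store the last-frame asset ID before kicking off the next clip.
        frame_id = read_last_frame(clip_id)
    return clip_ids


# Stub tools that record calls instead of generating anything.
calls = []


def fake_run_model(model_id, image_id):
    calls.append((model_id, image_id))
    return f"clip{len(calls)}"


def fake_read_last_frame(clip_id):
    return f"{clip_id}-last"


clips = chain_clips(fake_run_model, fake_read_last_frame, "model_x", "hero", 3)
print(clips)
print(calls)
```

Because the model ID is a single argument to the loop rather than a per-clip choice, a mid-chain model switch cannot happen by accident.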
video_extend: Initial clip → extend repeatedly with native-audio extension
1. run_model: Generate initial clip
2. run_model: Extend clip (chainable)
- Use a model that supports the `video2video` extend capability for seamless audio continuity.
- Each extension chains onto the previous one; latency scales linearly with the number of extensions.
style_transfer: Source image → edit with target-style prompt
1. run_model: Edit source image with style description
- Pass the source as an image input and describe the target style in the prompt — don't try to encode style purely in text.
photorealism_polish: Generate base image → polish with creative upscaler to remove the AI look
1. run_model: Generate base photoreal image
2. run_model: Polish with creative upscaler (photorealistic preset)
- Generate the base composition first, then run a creative upscaler with photorealistic settings to break the plastic AI texture.
Training Recommendation
Picking the right base architecture before a custom training run is its own decision. The recommend_training tool maps a plain-language description of your dataset and goal to the exact type string to pass to manage_models create, so you don't have to memorize the catalog or guess at variant tradeoffs.
Why it's a separate tool from recommend:
- Different vocabulary. Training types (flux.2-dev-lora, qwen-image-edit-2511-lora, …) are disjoint from inference capabilities (txt2img, img2video, …).
- Different tradeoffs. Generation picks weigh Arena ELO, latency, and cost; training picks weigh dataset shape (single images vs before/after pairs), output style (photoreal vs stylized), and the right variant within a family for your speed-vs-quality budget.
- Family lock-in matters. A LoRA trained on one family runs only on that family at inference. The recommendation surfaces this explicitly so you don't commit to a training run that locks you into a family you didn't mean to pick.
Same intelligence layer, tuned for the training decision tree: an LLM turns your prose into a structured intent (modality, dataset shape, style, subject, priority), then a deterministic picker scores the catalog and returns one recommended variant plus up to three alternatives, each with its own "when this is better" note. Voice cloning shortcuts to a static lookup since the decision tree there is essentially binary (short rough sample → Instant Voice Cloning; long pristine recording → Professional Voice Clone).
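The deterministic half of that flow, scoring the catalog and returning one recommendation plus up to three alternatives, might look like the sketch below. Everything specific in it is an assumption made for illustration: the `Intent` and `TrainingType` fields, the 1 to 5 quality and speed scores, the dataset shape assigned to each type, and the `example-*` entries, which are placeholders rather than real catalog types. The LLM step that produces the structured intent is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    dataset_shape: str  # "single_images" or "pairs"
    priority: str       # "quality" or "speed"


@dataclass(frozen=True)
class TrainingType:
    type: str
    dataset_shape: str
    quality: int  # hypothetical 1-5 score
    speed: int    # hypothetical 1-5 score


# Illustrative catalog; scores and dataset shapes are assumptions, not platform data.
CATALOG = [
    TrainingType("flux.2-dev-lora", "single_images", 5, 2),
    TrainingType("qwen-image-edit-2511-lora", "pairs", 4, 3),
    TrainingType("example-a-lora", "single_images", 4, 3),
    TrainingType("example-b-lora", "single_images", 3, 4),
    TrainingType("example-c-lora", "single_images", 2, 5),
    TrainingType("example-d-lora", "single_images", 1, 1),
]


def pick(intent, catalog=CATALOG, max_alternatives=3):
    """Return (recommended, alternatives) for a structured intent."""
    # Hard filter: the training type must accept the dataset shape.
    fits = [t for t in catalog if t.dataset_shape == intent.dataset_shape]
    # Rank by the requested priority; sorted() is stable, so ties keep catalog order.
    ranked = sorted(fits, key=lambda t: getattr(t, intent.priority), reverse=True)
    if not ranked:
        return None, []
    return ranked[0], ranked[1:1 + max_alternatives]


best, alternatives = pick(Intent("single_images", "quality"))
print(best.type, [t.type for t in alternatives])
```

Capping the alternatives with a slice keeps the output shape fixed regardless of how many catalog entries fit the dataset.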
For the full input shape, output schema, and worked examples, see recommend_training in the Tools reference.