Intelligence

One shared layer picks the right model for any request. Two tools surface it: recommend for a single generation step, plan_generation for multi-step workflows.

How model picking works

Every pick — whether it lands in recommend's ranked list or fills a step in a plan_generation pipeline — flows through the same resolver:

  1. Filter on hard constraints. Capability, required input features, cost / latency caps, duration. Models that don't qualify drop out.
  2. Enrich with live data. Join the per-capability performance variant (Arena ELO, p50 generation latency, p50 cost per asset — measured from 180 days of production traffic) plus curated usage insights and tags.
  3. Score and sort. Tag-based subcategory match (drops off-topic candidates), fuzzy boost on inferred intent (style, use case), priority metric (quality / speed / cost), recency tiebreak. Tags carry signals like deprecated:<successor> so retired models are surfaced behind their replacements automatically.
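The three steps above can be sketched roughly as follows. This is a minimal illustration, not the platform's resolver: the Model record, its field names, the resolve function, and the way the intent boost is counted are all assumptions made for the example. Enrichment (step 2) is assumed to have already happened, so each record arrives with its metrics and tags attached.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Model:
    # Hypothetical record; field names are illustrative, not the platform schema.
    model_id: str
    capability: str
    elo: float            # Arena ELO, higher is better
    p50_latency_s: float  # median generation latency in seconds
    p50_cost: float       # median cost per asset
    released: date        # used only for the recency tiebreak
    tags: set = field(default_factory=set)


def resolve(models, capability, priority="quality", subcategory=None,
            intent_tags=frozenset(), max_cost=None, max_latency_s=None):
    """Return qualifying models, best first."""
    # Step 1: hard constraints. Anything that fails drops out.
    pool = [
        m for m in models
        if m.capability == capability
        and (max_cost is None or m.p50_cost <= max_cost)
        and (max_latency_s is None or m.p50_latency_s <= max_latency_s)
    ]
    # Step 2 (enrichment) is assumed done: each Model already carries
    # its metrics and tags.

    # Step 3a: subcategory match drops off-topic candidates.
    if subcategory is not None:
        pool = [m for m in pool if subcategory in m.tags]

    def sort_key(m):
        deprecated = any(t.startswith("deprecated:") for t in m.tags)
        boost = len(set(intent_tags) & m.tags)
        metric = {
            "quality": -m.elo,
            "speed": m.p50_latency_s,
            "cost": m.p50_cost,
        }[priority]
        # Tuples sort ascending: live models first, then bigger boost,
        # then the priority metric, then the newest release.
        return (deprecated, -boost, metric, -m.released.toordinal())

    # Step 3b: score and sort.
    return sorted(pool, key=sort_key)
```

Putting the deprecated flag first in the sort key is one simple way to keep retired models behind their replacements regardless of the priority metric; the real resolver may weigh these signals differently.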

The full data set (Arena scores, latency / cost percentiles, model tags, curated insights) refreshes weekly from real platform traffic. There's no editorial table of "best models per category"; the data does the ranking, with editorial knowledge limited to behavior notes the data can't capture (e.g. "use the same model for every clip when chaining").

The more Scenario is used, the more accurate these rankings become.

Top picks today

For each major capability, the top model the resolver returns when you ask for quality.

| Capability | Slug | Top pick (priority=quality) |
| --- | --- | --- |
| Text → image | txt2img | GPT Image 2 (model_openai-gpt-image-2) |
| Image → image (edit) | img2img | GPT Image 2 (model_openai-gpt-image-2) |
| Text → video | txt2video | Seedance 2.0 (model_bytedance-seedance-2-0) |
| Image → video | img2video | Seedance 2.0 (model_bytedance-seedance-2-0) |
| Video → video | video2video | Grok Extend Video (model_xai-grok-extend-video) |
| Text → audio | txt2audio | xAI Grok TTS (model_xai-grok-tts) |
| Image → 3D | img23d | Pixal3D (model_pixal3d) |
| Text → 3D | txt23d | Uthana Text-to-Motion (model_uthana-text-to-motion-bucmd) |

Platform-aware formatting

Mention a target platform in the prompt — "TikTok", "Instagram story", "YouTube thumbnail" — and the resolver applies the matching aspect ratio and resolution without manual configuration. Recognized platforms:

| Platform | Aspect Ratio | Resolution | Note |
| --- | --- | --- | --- |
| instagram_post | 4:5 | 1080px | Instagram feed post |
| instagram_story | 9:16 | 1080px | Instagram/TikTok story |
| instagram_reel | 9:16 | 1080px | Instagram Reel |
| tiktok | 9:16 | 1080px | TikTok vertical video |
| youtube_thumbnail | 16:9 | 1280px | YouTube thumbnail |
| youtube_video | 16:9 | 1920px | YouTube video frame |
| twitter_post | 16:9 | 1200px | Twitter/X post image |
| linkedin_post | 1:1 | 1080px | LinkedIn square post |
| facebook_post | 1:1 | 1080px | Facebook post |
| facebook_cover | 16:9 | 1640px | Facebook cover photo |
| pinterest | 2:3 | 1000px | Pinterest pin |
| app_icon | 1:1 | 1024px | Mobile app icon |
| game_asset | 1:1 | 1024px | Game asset (power of 2) |
| game_texture | 1:1 | 1024px | Seamless game texture |
| print_a4 | 3:4 | 2480px | A4 print at 300 DPI |
| print_poster | 2:3 | 3000px | Poster print |
| wallpaper_desktop | 16:9 | 2560px | Desktop wallpaper |
| wallpaper_phone | 9:16 | 1440px | Phone wallpaper |
| banner_web | 16:9 | 1920px | Website hero banner |
| email_header | 16:9 | 600px | Email header image |
| avatar | 1:1 | 512px | Profile picture / avatar |
| ultrawide | 21:9 | 2560px | Ultrawide monitor |
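As a rough illustration of how a platform mention could map to a preset, here is a minimal sketch. It copies a handful of rows from the table above; the phrase-matching rule (the preset key with underscores read as spaces) and the detect_platform name are assumptions for the example, since the resolver's actual detection logic isn't described here.

```python
import re

# Subset of the presets table: platform key -> (aspect ratio, resolution in px).
PLATFORM_PRESETS = {
    "instagram_story": ("9:16", 1080),
    "tiktok": ("9:16", 1080),
    "youtube_thumbnail": ("16:9", 1280),
    "linkedin_post": ("1:1", 1080),
    "pinterest": ("2:3", 1000),
}


def detect_platform(prompt):
    """Return (key, aspect_ratio, resolution_px) for a mentioned preset, or None."""
    # Normalize to lowercase words separated by single spaces.
    text = re.sub(r"[^a-z0-9]+", " ", prompt.lower())
    # Try longer keys first so a more specific phrase wins over a shorter one.
    for key in sorted(PLATFORM_PRESETS, key=len, reverse=True):
        phrase = key.replace("_", " ")
        if re.search(rf"\b{re.escape(phrase)}\b", text):
            aspect_ratio, resolution_px = PLATFORM_PRESETS[key]
            return key, aspect_ratio, resolution_px
    return None  # no platform mentioned: leave formatting to the caller
```

For example, detect_platform("Make a YouTube thumbnail of a dragon") returns the youtube_thumbnail preset, while a prompt with no platform mention returns None.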

recommend

Single-step model picker. Pass a capability and a prompt describing your goal; recommend returns a ranked list with explanations citing real numbers (ELO rating, p50 latency, p50 cost), tradeoffs called out per candidate, and suggested input parameters ready to feed into the next generation call. A focused reasoning step sits on top of the shared resolver to produce the explanations.

plan_generation

Multi-step pipeline composer. Describe a workflow ("product video with voiceover", "character sheet", "talking-head explainer") and plan_generation returns the recommended sequence of stages with model hints. Templates carry the workflow shape; each step calls the same resolver above to fill in the model. When a description doesn't match any template, an LLM fallback composes a custom plan from the templates as worked examples plus editorial guidance.
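The template-or-fallback decision could look something like the sketch below. It assumes matching is a simple phrase lookup against each template's keyword list, which is a guess made for illustration; the actual matcher may be fuzzier. The match_template name and the three-template subset are likewise illustrative.

```python
# Keyword phrases per template (a subset, copied from the template cards).
TEMPLATES = {
    "product_video": ["product video", "product reveal", "product launch",
                      "commercial", "ad video"],
    "talking_head": ["talking head", "spokesperson", "presenter", "avatar video",
                     "welcome video", "app intro", "lipsync"],
    "video_extend": ["extend video", "longer video", "video extend",
                     "60 second clip", "video continuation"],
}


def match_template(description):
    """Return the template with the most keyword hits, or None if nothing matches."""
    text = description.lower()
    best_name, best_hits = None, 0
    for name, phrases in TEMPLATES.items():
        hits = sum(1 for phrase in phrases if phrase in text)
        if hits > best_hits:
            best_name, best_hits = name, hits
    # None signals "no template matched": hand the request to the LLM fallback.
    return best_name
```

A caller would use the returned template's step list when a name comes back, and route to the fallback composer when the result is None.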

The templates below show the shape and editorial rules for each workflow. Per-step model picks come from the same resolver behind recommend; see the "Top picks today" table above for what each capability resolves to right now.

product_video: Product image → hero video → optional voiceover → optional music
Keywords: product video, product reveal, product launch, commercial, ad video
  1. run_model — Generate hero product image
  2. analyze — Clean transparent product image
  3. run_model — Image-to-video product reveal
  4. run_model — Generate voiceover (optional)
  5. run_model — Generate background music (optional)
  • Generate the hero image cleanly first; image-to-video quality follows directly from input quality.
  • Voiceover and music are optional — only add when the request mentions them.
talking_head: Portrait → voice → lipsync video
Keywords: talking head, spokesperson, presenter, avatar video, welcome video, app intro, lipsync
  1. run_model — Generate portrait/headshot
  2. run_model — Generate voiceover
  3. run_model — Lipsync portrait + audio → talking video
  • Pick a voice up front and reuse it across the whole sequence — switching voices breaks identity.
  • Lipsync quality depends on the portrait being front-facing with a closed mouth.
game_asset_3d: Concept art → clean background → upscale → 3D conversion
Keywords: game asset 3d, 3d game asset, 3d model for game, game prop 3d, concept to 3d
  1. run_model — Generate concept art (white bg, isometric)
  2. analyze — Remove background for clean 3D input
  3. run_model — Upscale 2x for better 3D texture quality (optional)
  4. run_model — Convert to 3D model
  • Background removal before 3D conversion eliminates phantom geometry.
  • Upscaling 2-4x before 3D conversion yields cleaner textures.
  • Isometric or front-facing concept views work best for single-image 3D conversion.
character_sheet: Reference character → multiple scenes via reference-image editing
Keywords: character sheet, character turnaround, character reference, character consistency, same character
  1. run_model — Generate strong reference character image
  2. run_model — Scene 1: pass reference image + new scene prompt
  3. run_model — Scene 2: same reference image + new scene prompt
  • Generate one strong reference image, then pass it as an image input to every scene generation.
  • Models with the most reference-image slots maintain identity best across scenes — prefer them when the request mentions consistency.
  • LoRA training is for repeated brand-style work at scale, not single character runs.
logo_to_vector: Generate logo → vectorize to SVG
Keywords: logo vector, vectorize logo, svg logo, logo for print, scalable logo
  1. run_model — Generate logo as vector
  2. run_model — Vectorize raster output (alternative path, optional)
  • Direct SVG output is faster than raster→vectorize when the model supports it.
  • If only raster is available, upscale before vectorization to get cleaner paths.
video_chain_long: Chain multiple video clips via last-frame handoff for longer content
Keywords: long video, 30 second video, 60 second video, extended video, multi-clip
  1. run_model — Generate hero image for first frame
  2. run_model — Image→video clip 1
  3. manage_assets — Read clip 1's lastFrame asset ID
  4. run_model — LastFrame→video clip 2 (same model as clip 1)
  • Use the SAME image-to-video model for every clip — switching models mid-chain breaks visual continuity.
  • Pick a model whose inputs include `endImage` (last-frame conditioning); without it you cannot chain cleanly.
  • Store each clip's last-frame asset ID before kicking off the next clip.
video_extend: Initial clip → extend repeatedly with native-audio extension
Keywords: extend video, longer video, video extend, 60 second clip, video continuation
  1. run_model — Generate initial clip
  2. run_model — Extend clip (chainable)
  • Use a model that supports the `video2video` extend capability for seamless audio continuity.
  • Each extension chains onto the previous one; latency scales linearly with the number of extensions.
style_transfer: Source image → edit with target-style prompt
Keywords: style transfer, make it look like, ghibli style, anime style, oil painting style, watercolor style, pixel art style
  1. run_model — Edit source image with style description
  • Pass the source as an image input and describe the target style in the prompt — don't try to encode style purely in text.
photorealism_polish: Generate base image → polish with creative upscaler to remove the AI look
Keywords: photorealistic, remove ai look, polish photo, look real, photo polish
  1. run_model — Generate base photoreal image
  2. run_model — Polish with creative upscaler (photorealistic preset)
  • Generate the base composition first, then run a creative upscaler with photorealistic settings to break the plastic AI texture.
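
The last-frame handoff in the video_chain_long template above can be sketched as a simple loop. This is an illustration under stated assumptions: generate_clip is a caller-supplied function with a hypothetical signature and return shape, standing in for the real run_model and manage_assets calls, whose actual parameters are not shown in this document.

```python
def chain_clips(first_frame_id, model_id, n_clips, generate_clip):
    """Generate n_clips in sequence, seeding each clip with the previous last frame.

    generate_clip is supplied by the caller (hypothetical signature):
    generate_clip(model_id, start_image_id) -> {"video_id": ..., "last_frame_id": ...}
    """
    video_ids = []
    frame_id = first_frame_id
    for _ in range(n_clips):
        # The same model_id is used for every clip to preserve visual continuity.
        clip = generate_clip(model_id, frame_id)
        video_ids.append(clip["video_id"])
        # Store the handoff frame before kicking off the next clip.
        frame_id = clip["last_frame_id"]
    return video_ids


def fake_generate_clip(model_id, start_image_id):
    # Stand-in for a real generation call, so the sketch runs offline.
    return {
        "video_id": f"video_from_{start_image_id}",
        "last_frame_id": f"{start_image_id}_last",
    }
```

Calling chain_clips("hero", "some_model", 3, fake_generate_clip) walks the chain three times, each clip starting from the frame the previous one ended on.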

Training recommendation

Picking the right base architecture before a custom training run is its own decision. The recommend_training tool maps a plain-language description of your dataset and goal to the exact type string to pass to manage_models create, so you don't have to memorize the catalog or guess at variant tradeoffs.

Why it's a separate tool from recommend:

  • Different vocabulary. Training types (flux.2-dev-lora, qwen-image-edit-2511-lora, …) are disjoint from inference capabilities (txt2img, img2video, …).
  • Different tradeoffs. Generation picks weigh Arena ELO, latency, and cost; training picks weigh dataset shape (single images vs before/after pairs), output style (photoreal vs stylized), and the right variant within a family for your speed-vs-quality budget.
  • Family lock-in matters. A LoRA trained on one family runs only on that family at inference. The recommendation surfaces this explicitly so you don't commit to a training run that locks you into a family you didn't mean to pick.

Same intelligence layer, tuned for the training decision tree: an LLM turns your prose into a structured intent (modality, dataset shape, style, subject, priority), then a deterministic picker scores the catalog and returns one recommended variant plus up to three alternatives, each with its own "when this is better" note. Voice cloning shortcuts to a static lookup since the decision tree there is essentially binary (short rough sample → Instant Voice Cloning; long pristine recording → Professional Voice Clone).
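
A minimal sketch of the deterministic half of that flow is shown below, assuming the LLM has already produced the structured intent. The Intent and Variant fields, the scoring weights, and the function names are all invented for illustration, and the catalog entries a caller passes in would be placeholders rather than real training types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    # Hypothetical structured intent, as produced upstream from the user's prose.
    modality: str        # e.g. "image"
    dataset_shape: str   # "single_images" or "pairs"
    style: str           # "photoreal" or "stylized"
    priority: str        # "speed" or "quality"


@dataclass(frozen=True)
class Variant:
    # Hypothetical catalog entry; `type` is the string passed to training.
    type: str
    modality: str
    dataset_shape: str
    style: str
    priority: str


def score(intent, variant):
    # Weights are arbitrary: dataset shape counts most, then style, then priority.
    return (3 * (variant.dataset_shape == intent.dataset_shape)
            + 2 * (variant.style == intent.style)
            + 1 * (variant.priority == intent.priority))


def pick_training_type(intent, catalog):
    """Return (recommended, alternatives) with at most three alternatives."""
    eligible = [v for v in catalog if v.modality == intent.modality]
    ranked = sorted(eligible, key=lambda v: score(intent, v), reverse=True)
    if not ranked:
        return None, []
    return ranked[0], ranked[1:4]


def pick_voice_clone(long_pristine_recording):
    # Static lookup: the voice-cloning decision is essentially binary.
    if long_pristine_recording:
        return "Professional Voice Clone"
    return "Instant Voice Cloning"
```

Because the picker is a pure function of the intent and the catalog, the same description always yields the same recommendation once the intent is fixed, which is the main reason to keep the LLM out of the scoring step.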

For the full input shape, output schema, and worked examples, see recommend_training in the Tools reference.