Intelligence

One shared layer picks the right model for any request. Two tools surface it: recommend for a single generation step, plan_generation for multi-step workflows.

How model picking works

Every pick — whether it lands in recommend's ranked list or fills a step in a plan_generation pipeline — flows through the same resolver:

Filter on hard constraints. Capability, required input features, cost / latency caps, duration. Models that don't qualify drop out.
Enrich with live data. Join the per-capability performance variant (Arena ELO, p50 generation latency, p50 cost per asset — measured from 180 days of production traffic) plus curated usage insights and tags.
Score and sort. Tag-based subcategory match (drops off-topic candidates), fuzzy boost on inferred intent (style, use case), priority metric (quality / speed / cost), recency tiebreak. Tags carry signals like deprecated:<successor> so retired models are surfaced behind their replacements automatically.
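
The three stages above can be sketched as follows. This is a minimal illustration, not the actual resolver: the `Model` record, its field names, and the `resolve` signature are assumptions made for the example, enrichment is assumed to have already been joined onto each record, and the fuzzy intent boost is left out. Deprecated models are simply sorted behind all active ones, which places them behind their successors whenever the successor is in the pool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Model:
    # Hypothetical record; the real catalog schema is not shown here.
    model_id: str
    capability: str
    elo: float            # Arena ELO, higher is better
    p50_latency_s: float  # median generation latency, seconds
    p50_cost: float       # median cost per asset
    released: str         # ISO date, used as the recency tiebreak
    tags: Set[str] = field(default_factory=set)


def successor_of(model: Model) -> Optional[str]:
    """Return the successor id from a deprecated:<successor> tag, if present."""
    for tag in model.tags:
        if tag.startswith("deprecated:"):
            return tag.split(":", 1)[1]
    return None


# Priority metric: each function returns a value where larger means better.
PRIORITY_METRICS = {
    "quality": lambda m: m.elo,
    "speed": lambda m: -m.p50_latency_s,
    "cost": lambda m: -m.p50_cost,
}


def resolve(models: List[Model], capability: str, priority: str = "quality",
            max_cost: Optional[float] = None,
            max_latency_s: Optional[float] = None,
            subcategory: Optional[str] = None) -> List[Model]:
    # Step 1: filter on hard constraints.
    pool = [
        m for m in models
        if m.capability == capability
        and (max_cost is None or m.p50_cost <= max_cost)
        and (max_latency_s is None or m.p50_latency_s <= max_latency_s)
    ]
    # Step 2 (enrichment) is assumed done: metrics and tags already sit on Model.
    # Step 3: drop off-topic candidates, then sort.
    if subcategory is not None:
        pool = [m for m in pool if subcategory in m.tags]
    metric = PRIORITY_METRICS[priority]
    # Sort key: active models before deprecated ones, then the priority
    # metric, then the newer release date.
    return sorted(
        pool,
        key=lambda m: (successor_of(m) is None, metric(m), m.released),
        reverse=True,
    )
```

Keeping the priority metric as a lookup of "larger is better" functions means quality, speed, and cost share one sort path; only the selected function changes.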

The full data set (Arena scores, latency and cost percentiles, model tags, curated insights) refreshes weekly and is based on real platform traffic. There is no editorial table of "best models per category"; the data does the ranking, and editorial knowledge is limited to behavior notes the data can't capture (e.g. "use the same model for every clip when chaining").

Because the rankings are built from real platform traffic, picks become more accurate the more Scenario is used.

Top picks today

For each major capability, the table lists the top model the resolver currently returns when you ask for quality.

Capability	Slug	Top pick (priority=quality)
Text → image	txt2img	GPT Image 2 — model_openai-gpt-image-2
Image → image (edit)	img2img	GPT Image 2 — model_openai-gpt-image-2
Text → video	txt2video	Gemini Omni — model_google-omni-flash
Image → video	img2video	Seedance 2.0 — model_bytedance-seedance-2-0
Video → video	video2video	Grok Extend Video — model_xai-grok-extend-video
Text → audio	txt2audio	Seed Audio 1.0 Multilingual — model_byteplus-seed-audio-1-0-multilingual
Image → 3D	img23d	Tripo P1 — model_tripo-p1-image-to-3d
Text → 3D	txt23d	Cartwheel Text to Motion — model_cartwheel-text-to-motion

Platform-aware formatting

Mention a target platform in the prompt — "TikTok", "Instagram story", "YouTube thumbnail" — and the resolver applies the matching aspect ratio and resolution without manual configuration. Recognized platforms:

Platform	Aspect Ratio	Resolution	Note
instagram_post	4:5	1080px	Instagram feed post
instagram_story	9:16	1080px	Instagram/TikTok story
instagram_reel	9:16	1080px	Instagram Reel
tiktok	9:16	1080px	TikTok vertical video
youtube_thumbnail	16:9	1280px	YouTube thumbnail
youtube_video	16:9	1920px	YouTube video frame
twitter_post	16:9	1200px	Twitter/X post image
linkedin_post	1:1	1080px	LinkedIn square post
facebook_post	1:1	1080px	Facebook post
facebook_cover	16:9	1640px	Facebook cover photo
pinterest	2:3	1000px	Pinterest pin
app_icon	1:1	1024px	Mobile app icon
game_asset	1:1	1024px	Game asset (power of 2)
game_texture	1:1	1024px	Seamless game texture
print_a4	3:4	2480px	A4 print at 300 DPI
print_poster	2:3	3000px	Poster print
wallpaper_desktop	16:9	2560px	Desktop wallpaper
wallpaper_phone	9:16	1440px	Phone wallpaper
banner_web	16:9	1920px	Website hero banner
email_header	16:9	600px	Email header image
avatar	1:1	512px	Profile picture / avatar
ultrawide	21:9	2560px	Ultrawide monitor
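
One way to picture the lookup is a keyword scan over the prompt. The sketch below is an assumption about how such matching could work, not a description of the resolver's actual detection logic; `PLATFORM_PRESETS` copies a few rows from the table above, and `detect_platform` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# A subset of the table above: platform key -> (aspect ratio, resolution in px).
PLATFORM_PRESETS = {
    "instagram_post": ("4:5", 1080),
    "instagram_story": ("9:16", 1080),
    "instagram_reel": ("9:16", 1080),
    "tiktok": ("9:16", 1080),
    "youtube_thumbnail": ("16:9", 1280),
    "youtube_video": ("16:9", 1920),
}


def detect_platform(prompt: str) -> Optional[Tuple[str, str, int]]:
    """Return (platform, aspect_ratio, resolution_px) or None if nothing matches."""
    # Normalize to lowercase words separated by single spaces.
    text = " " + re.sub(r"[^a-z0-9]+", " ", prompt.lower()).strip() + " "
    # Check longer platform names first so the most specific phrase wins.
    for key in sorted(PLATFORM_PRESETS, key=len, reverse=True):
        phrase = " " + key.replace("_", " ") + " "
        if phrase in text:
            ratio, resolution = PLATFORM_PRESETS[key]
            return key, ratio, resolution
    return None
```

With this sketch, a prompt such as "A cat dancing for TikTok" maps to the `tiktok` row, while a prompt that names no platform returns `None` and leaves formatting to explicit parameters.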

recommend

Single-step model picker. Pass a capability and a prompt describing your goal; recommend returns a ranked list with explanations citing real numbers (ELO rating, p50 latency, p50 cost), tradeoffs called out per candidate, and suggested input parameters ready to feed into the next generation call. A focused reasoning step sits on top of the shared resolver to produce the explanations.

plan_generation

Multi-step pipeline composer. Describe a workflow (for example "product video with voiceover", "character sheet", or "talking-head explainer") and plan_generation returns the recommended sequence of stages with model hints. Templates carry the workflow shape; each step calls the same resolver above to fill in the model. When a description doesn't match any template, an LLM fallback composes a custom plan, using the templates as worked examples together with editorial guidance.

The templates below show the shape and editorial rules for each workflow. Per-step model picks come from the same resolver behind recommend; see the "Top picks today" table above for what each capability resolves to right now.
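
The template-or-fallback decision can be sketched as a phrase match. This is an illustration under two assumptions: that the short phrases listed under each template act as match keywords, and that the longest matching phrase wins. `TEMPLATE_TRIGGERS`, `match_template`, and `plan` are hypothetical names, and the LLM fallback is represented only by a marker value.

```python
from typing import Dict, Optional

# Phrases copied from three of the templates in this section.
TEMPLATE_TRIGGERS = {
    "product_video": ["product video", "product reveal", "product launch",
                      "commercial", "ad video"],
    "talking_head": ["talking head", "spokesperson", "presenter", "avatar video",
                     "welcome video", "app intro", "lipsync"],
    "logo_to_vector": ["logo vector", "vectorize logo", "svg logo",
                       "logo for print", "scalable logo"],
}


def match_template(description: str) -> Optional[str]:
    """Return the template whose longest phrase appears in the description."""
    text = description.lower()
    best, best_len = None, 0
    for name, phrases in TEMPLATE_TRIGGERS.items():
        for phrase in phrases:
            if phrase in text and len(phrase) > best_len:
                best, best_len = name, len(phrase)
    return best


def plan(description: str) -> Dict[str, Optional[str]]:
    template = match_template(description)
    if template is None:
        # No template matched: this is where an LLM would compose a custom plan.
        return {"source": "llm_fallback", "template": None}
    return {"source": "template", "template": template}
```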

product_video — Product image → hero video → optional voiceover → optional music

product video, product reveal, product launch, commercial, ad video

model_run — Generate hero product image
model_run — Remove background for a clean product image
model_run — Image-to-video product reveal
model_run — Generate voiceover (optional)
model_run — Generate background music (optional)

Generate the hero image cleanly first; image-to-video quality follows directly from input quality.
Voiceover and music are optional — only add when the request mentions them.

talking_head — Portrait → voice → lipsync video

talking head, spokesperson, presenter, avatar video, welcome video, app intro, lipsync

model_run — Generate portrait/headshot
model_run — Generate voiceover
model_run — Lipsync portrait + audio → talking video

Pick a voice up front and reuse it across the whole sequence — switching voices breaks identity.
Lipsync quality depends on the portrait being front-facing with a closed mouth.

game_asset_3d — Concept art → clean background → upscale → 3D conversion

game asset 3d, 3d game asset, 3d model for game, game prop 3d, concept to 3d

model_run — Generate concept art (white bg, isometric)
model_run — Remove background for clean 3D input
model_run — Upscale 2x for better 3D texture quality (optional)
model_run — Convert to 3D model

Background removal before 3D conversion eliminates phantom geometry.
Upscaling 2-4x before 3D conversion yields cleaner textures.
Isometric or front-facing concept views work best for single-image 3D conversion.

character_sheet — Reference character → multiple scenes via reference-image editing

character sheet, character turnaround, character reference, character consistency, same character

model_run — Generate strong reference character image
model_run — Scene 1: pass reference image + new scene prompt
model_run — Scene 2: same reference image + new scene prompt

Generate one strong reference image, then pass it as an image input to every scene generation.
Models with the most reference-image slots maintain identity best across scenes — prefer them when the request mentions consistency.
LoRA training is for repeated brand-style work at scale, not single character runs.
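
The reference-reuse rule can be sketched as a small plan builder. The step dictionary layout (`tool`, `id`, `prompt`, `image_inputs`) and the function name are assumptions for illustration; the real plan_generation output schema is not shown in this section.

```python
from typing import Dict, List


def character_sheet_steps(reference_prompt: str,
                          scene_prompts: List[str]) -> List[Dict]:
    """Build one reference step, then one scene step per prompt."""
    steps = [{"tool": "model_run", "id": "reference", "prompt": reference_prompt}]
    for index, scene_prompt in enumerate(scene_prompts, start=1):
        steps.append({
            "tool": "model_run",
            "id": f"scene_{index}",
            "prompt": scene_prompt,
            # Every scene points back at the same reference image.
            "image_inputs": ["reference"],
        })
    return steps
```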

logo_to_vector — Generate logo → vectorize to SVG

logo vector, vectorize logo, svg logo, logo for print, scalable logo

model_run — Generate logo as vector
model_run — Vectorize raster output (alternative path) (optional)

Direct SVG output is faster than raster→vectorize when the model supports it.
If only raster is available, upscale before vectorization to get cleaner paths.

video_chain_long — Chain multiple video clips via last-frame handoff for longer content

long video, 30 second video, 60 second video, extended video, multi-clip

model_run — Generate hero image for first frame
model_run — Image→video clip 1
asset_get — Read clip 1's lastFrame asset ID
model_run — LastFrame→video clip 2 (same model as clip 1)

Use the SAME image-to-video model for every clip — switching models mid-chain breaks visual continuity.
Pick a model whose inputs include `endImage` (last-frame conditioning); without it you cannot chain cleanly.
Store each clip's last-frame asset ID before kicking off the next clip.
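
The handoff loop can be sketched as below. The `model_run` and `asset_get` callables stand in for the tools named in the steps above; their signatures, the `image` and `prompt` parameter names, and the dictionary returned by `asset_get` are assumptions for this example. Only the `lastFrame` field name comes from the template itself.

```python
from typing import Callable, Dict, List


def chain_clips(model_run: Callable[[str, Dict], str],
                asset_get: Callable[[str], Dict],
                model_id: str,
                hero_image_id: str,
                prompts: List[str]) -> List[str]:
    """Generate one clip per prompt, seeding each with the previous last frame."""
    clip_ids = []
    start_image = hero_image_id
    for prompt in prompts:
        # The same model_id is used for every clip to keep visual continuity.
        clip_id = model_run(model_id, {"image": start_image, "prompt": prompt})
        clip_ids.append(clip_id)
        # Read and keep the last-frame asset ID for the next iteration.
        start_image = asset_get(clip_id)["lastFrame"]
    return clip_ids
```

Passing the two tools in as callables keeps the loop independent of any particular client and makes it easy to exercise with stubs.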

video_extend — Initial clip → extend repeatedly with native-audio extension

extend video, longer video, video extend, 60 second clip, video continuation

model_run — Generate initial clip
model_run — Extend clip (chainable)

Use a model that supports the `video2video` extend capability for seamless audio continuity.
Each extension chains onto the previous one; latency scales linearly with the number of extensions.
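
Since latency grows linearly, a rough budget is easy to compute before starting. The helper names and the idea of a fixed per-extension duration and latency are assumptions for this sketch; real values depend on the model picked.

```python
import math


def extensions_needed(target_s: float, initial_clip_s: float,
                      extension_s: float) -> int:
    """Number of extend calls required to reach the target duration."""
    remaining = max(0.0, target_s - initial_clip_s)
    return math.ceil(remaining / extension_s)


def estimated_latency_s(n_extensions: int, initial_latency_s: float,
                        per_extension_latency_s: float) -> float:
    """Linear estimate: one initial generation plus n sequential extensions."""
    return initial_latency_s + n_extensions * per_extension_latency_s
```

For example, reaching 60 seconds from a 10-second clip with 10-second extensions takes five extend calls under these assumptions.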

style_transfer — Source image → edit with target-style prompt

style transfer, make it look like, ghibli style, anime style, oil painting style, watercolor style, pixel art style

model_run — Edit source image with style description

Pass the source as an image input and describe the target style in the prompt — don't try to encode style purely in text.

photorealism_polish — Generate base image → polish with creative upscaler to remove the AI look

photorealistic, remove ai look, polish photo, look real, photo polish

model_run — Generate base photoreal image
model_run — Polish with creative upscaler (photorealistic preset)

Generate the base composition first, then run a creative upscaler with photorealistic settings to break the plastic AI texture.

Training Recommendation

Picking the right base architecture before a custom training run is its own decision. The recommend_training tool maps a plain-language description of your dataset and goal to the exact type string to pass to model_create, so you don't have to memorize the catalog or guess at variant tradeoffs.

Why it's a separate tool from recommend:

Different vocabulary. Training types (flux.2-dev-lora, qwen-image-edit-2511-lora, …) are disjoint from inference capabilities (txt2img, img2video, …).
Different tradeoffs. Generation picks weigh Arena ELO, latency, and cost; training picks weigh dataset shape (single images vs before/after pairs), output style (photoreal vs stylized), and the right variant within a family for your speed-vs-quality budget.
Family lock-in matters. A LoRA trained on one family runs only on that family at inference. The recommendation surfaces this explicitly so you don't commit to a training run that locks you into a family you didn't mean to pick.

Same intelligence layer, tuned for the training decision tree: an LLM turns your prose into a structured intent (modality, dataset shape, style, subject, priority), then a deterministic picker scores the catalog and returns one recommended variant plus up to three alternatives, each with its own "when this is better" note. Voice cloning shortcuts to a static lookup since the decision tree there is essentially binary (short rough sample → Instant Voice Cloning; long pristine recording → Professional Voice Clone).
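
The deterministic half of that flow can be sketched as follows. Everything here is illustrative: the `Intent` fields are a simplified guess at the structured intent, the catalog entries use made-up type strings and scores rather than real training types, and the scoring weights are arbitrary. Only the two voice cloning product names and the "one pick plus up to three alternatives" shape come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Intent:
    # Hypothetical structured intent, as an LLM might extract it from prose.
    modality: str                       # "image" or "voice"
    dataset_shape: str = ""             # "single_images" or "pairs"
    style: str = ""                     # "photoreal" or "stylized"
    priority: str = "quality"           # "quality" or "speed"
    long_pristine_sample: bool = False  # voice only


# Made-up catalog; the type strings are placeholders, not real training types.
CATALOG: List[Dict] = [
    {"type": "example-a-lora", "shape": "single_images", "style": "photoreal", "quality": 3, "speed": 1},
    {"type": "example-a-fast-lora", "shape": "single_images", "style": "photoreal", "quality": 2, "speed": 3},
    {"type": "example-b-lora", "shape": "single_images", "style": "stylized", "quality": 3, "speed": 2},
    {"type": "example-b-fast-lora", "shape": "single_images", "style": "stylized", "quality": 1, "speed": 3},
    {"type": "example-edit-lora", "shape": "pairs", "style": "photoreal", "quality": 3, "speed": 1},
]


def score(entry: Dict, intent: Intent) -> int:
    points = 0
    if entry["shape"] == intent.dataset_shape:
        points += 10  # dataset shape dominates the ranking
    if entry["style"] == intent.style:
        points += 5
    return points + entry[intent.priority]


def pick_training_type(intent: Intent) -> Dict:
    if intent.modality == "voice":
        # Static shortcut: the voice decision is essentially binary.
        name = ("Professional Voice Clone" if intent.long_pristine_sample
                else "Instant Voice Cloning")
        return {"recommended": name, "alternatives": []}
    ranked = sorted(CATALOG, key=lambda e: score(e, intent), reverse=True)
    return {
        "recommended": ranked[0]["type"],
        "alternatives": [e["type"] for e in ranked[1:4]],  # at most three
    }
```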

For the full input shape, output schema, and worked examples, see recommend_training in the Tools reference.