Guías de decisión

Self-host the model, or call an API?

Four questions about your constraints, and an honest answer about whether you should run your own inference.

Esta página aún no está totalmente traducida a tu idioma — mostrando la versión en inglés.

Running an open-weight model yourself looks cheaper on a spreadsheet and feels like control. In practice it's a standing engineering commitment — GPUs, scaling, uptime, version upgrades, MLOps — that most teams badly underestimate. So the honest default is the hosted API, and self-hosting is something you should be *forced* into: by data that legally can't leave your control, a real need to own the model's lifecycle, or economics at very high steady volume — and only if you have the operational muscle to run it. When just part of your traffic is sensitive, the answer is usually a split. Whatever you pick, keep the model behind a swappable seam so the choice isn't permanent.

Can your data be sent to a third-party provider?

Yes — ordinary data, fine to send to a vendor's APINo — sensitive or regulated data that can't leave your control

How much traffic will this handle?

Low or spiky — modest, uneven, or still finding product-market fitVery high and steady — enough that per-token cost dominates the bill

How much do you need to own the model itself?

Not much — riding the vendor's models and roadmap is fineA lot — pin a version, deep-fine-tune, no surprise deprecations

What's your in-house infrastructure capacity?

Light — a small team, no real GPU or MLOps muscleStrong — you can run GPUs, scaling, and model ops in-house

Responde todas las preguntas para ver tu recomendación.

Todas las opciones de un vistazo

Call a hosted API

Someone else runs the GPUs, the scaling, and the upgrades; you make a call and get an answer. It's the right default for the vast majority of products: no infrastructure to own, instant access to frontier models, and you pay only for what you use. Don't take on inference as a job until something forces you to.

Elige esto cuando

Your data is fine to send to a vendor
Volume is modest, spiky, or still growing
You'd rather build your product than run a model platform

Compensaciones

You pay per token — watch the bill as volume climbs
You ride the vendor's roadmap: price changes, rate limits, deprecations
Sensitive data leaves your perimeter — a non-starter in regulated domains

Self-host an open-weight model

You run the model on infrastructure you control — your cloud, your servers, your rules. Justified when a hard constraint forces it and you have the ops capacity to back it: regulated data that can't leave, a real need to own the model's lifecycle, or economics that flip at very high steady volume. Powerful, and a real ongoing commitment.

Elige esto cuando

Sensitive data can't be sent to a third party
You need to pin a version, fine-tune deeply, or guarantee availability
Volume is high and steady enough that owning the hardware pays off

Compensaciones

GPUs, scaling, uptime, and upgrades are now your problem, forever
Without real MLOps capacity, the savings evaporate into ops cost and downtime
You're responsible for keeping pace as better open models ship

Hybrid — split by sensitivity

Route the part that must stay private to infrastructure you control, and send everything else to the API. You get the API's ease for the bulk of traffic and keep the sensitive slice in-house — without taking on a full self-hosted platform you don't have the capacity to run. Often the pragmatic answer when only some of your data is sensitive.

Elige esto cuando

Some traffic is regulated or sensitive, but plenty of it isn't
You need control over part of the system, not all of it
You can run a focused private deployment, but not your whole inference stack

Compensaciones

Two paths to build, route between, and keep consistent
You have to classify requests correctly — and fail closed on the sensitive ones
Still some self-hosting: the private slice carries the ops burden