Four questions about your constraints, and an honest answer about whether you should run your own inference.
Esta página aún no está totalmente traducida a tu idioma — mostrando la versión en inglés.
Running an open-weight model yourself looks cheaper on a spreadsheet and feels like control. In practice it's a standing engineering commitment — GPUs, scaling, uptime, version upgrades, MLOps — that most teams badly underestimate. So the honest default is the hosted API, and self-hosting is something you should be *forced* into: by data that legally can't leave your control, a real need to own the model's lifecycle, or economics at very high steady volume — and only if you have the operational muscle to run it. When just part of your traffic is sensitive, the answer is usually a split. Whatever you pick, keep the model behind a swappable seam so the choice isn't permanent.
Responde todas las preguntas para ver tu recomendación.
Todas las opciones de un vistazo
Call a hosted API
Someone else runs the GPUs, the scaling, and the upgrades; you make a call and get an answer. It's the right default for the vast majority of products: no infrastructure to own, instant access to frontier models, and you pay only for what you use. Don't take on inference as a job until something forces you to.
Elige esto cuando
Your data is fine to send to a vendor
Volume is modest, spiky, or still growing
You'd rather build your product than run a model platform
Compensaciones
You pay per token — watch the bill as volume climbs
You ride the vendor's roadmap: price changes, rate limits, deprecations
Sensitive data leaves your perimeter — a non-starter in regulated domains
Self-host an open-weight model
You run the model on infrastructure you control — your cloud, your servers, your rules. Justified when a hard constraint forces it and you have the ops capacity to back it: regulated data that can't leave, a real need to own the model's lifecycle, or economics that flip at very high steady volume. Powerful, and a real ongoing commitment.
Elige esto cuando
Sensitive data can't be sent to a third party
You need to pin a version, fine-tune deeply, or guarantee availability
Volume is high and steady enough that owning the hardware pays off
Compensaciones
GPUs, scaling, uptime, and upgrades are now your problem, forever
Without real MLOps capacity, the savings evaporate into ops cost and downtime
You're responsible for keeping pace as better open models ship
Hybrid — split by sensitivity
Route the part that must stay private to infrastructure you control, and send everything else to the API. You get the API's ease for the bulk of traffic and keep the sensitive slice in-house — without taking on a full self-hosted platform you don't have the capacity to run. Often the pragmatic answer when only some of your data is sensitive.
Elige esto cuando
Some traffic is regulated or sensitive, but plenty of it isn't
You need control over part of the system, not all of it
You can run a focused private deployment, but not your whole inference stack
Compensaciones
Two paths to build, route between, and keep consistent
You have to classify requests correctly — and fail closed on the sensitive ones
Still some self-hosting: the private slice carries the ops burden