Dedicated Deployments
Deploying Base Models
Private instances are helpful if you're expecting significant request traffic. Predibase supports serving base models from Hugging Face; see the list of available models. Dedicated deployments are billed by GPU-time.
For the base_model parameter, use the short names provided in the list of available models.
- Python SDK
- CLI
from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

pb.deployments.create(
    name="my-mistral-7b-instruct",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        # cooldown_time=3600  # Value in seconds; defaults to 0, meaning the deployment is always on
    ),
    # description="",  # Optional
)
pbase deploy llm --deployment-name llama-2-7b --model-name hf://meta-llama/Llama-2-7b-hf --engine-template llm-gpu-small --wait
Note: Dedicated deployments are always on by default. To change this, set the cooldown_time parameter.
Dedicated fine-tuned adapter deployments
If you are looking for a private, dedicated instance of your fine-tuned adapter, we recommend deploying a base model (above) and using LoRAX to run inference on your adapter. LoRAX enables you to serve an unlimited number of adapters on a single base model.
If you would still like a dedicated deployment of your fine-tuned model, we can serve it for you; reach out to support@predibase.com.
Customize Compute
By default, Predibase automatically right-sizes your deployment, choosing a suitable accelerator for the LLM you intend to deploy. You may also specify a particular accelerator if you'd like.
- Python SDK
- CLI
pb.deployments.create(
    name="my-mistral-7b-instruct",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        accelerator="a10_24gb_100",
        # cooldown_time=3600  # Value in seconds; defaults to 0, meaning the deployment is always on
    )
)
pbase deploy llm --deployment-name my-first-llm --model-name google/flan-t5-xl --engine-template llm-gpu-small
Available Accelerators
The available accelerators and associated tiers are listed below. See our pricing.
Accelerator | ID | Predibase Tiers | GPUs | GPU SKU |
---|---|---|---|---|
1 A10G 24GB | a10_24gb_100 | Developer | 1 | A10G |
4 A10G 24GB | a10_24gb_400 | Enterprise (VPC) | 4 | A10G |
0.25 A100 80GB | a100_80gb_025 | Enterprise (Predibase AI Cloud) | 0.25 | A100 |
0.5 A100 80GB | a100_80gb_050 | Enterprise (Predibase AI Cloud) | 0.50 | A100 |
1 A100 80GB | a100_80gb_100 | Enterprise (Predibase AI Cloud) | 1 | A100 |
2 A100 80GB * | a100_80gb_200 | Enterprise (Predibase AI Cloud) | 2 | A100 |
*To deploy on 2x A100s or upgrade to Enterprise, please reach out to us at sales@predibase.com
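The accelerator IDs in the table appear to follow a simple convention: the trailing number is the GPU count multiplied by 100 (e.g. a100_80gb_025 is a quarter of an A100). The helper below is purely illustrative, not part of the Predibase SDK, and only sketches that convention:

```python
# Illustrative sketch: decode an accelerator ID from the table above into
# its GPU SKU prefix and GPU count. Not part of the Predibase SDK.
def parse_accelerator_id(accelerator_id: str) -> tuple[str, float]:
    """Split an accelerator ID into its SKU prefix and GPU count."""
    prefix, _, fraction = accelerator_id.rpartition("_")
    return prefix, int(fraction) / 100

print(parse_accelerator_id("a100_80gb_025"))  # ('a100_80gb', 0.25)
print(parse_accelerator_id("a10_24gb_400"))   # ('a10_24gb', 4.0)
```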
Prompting
Prompt the Base Model
Dedicated LLMs can be prompted via the Python SDK or REST API once they have been deployed.
- Python SDK
- REST
# Specify the dedicated deployment by name
lorax_client = pb.deployments.client("my-mistral-7b-instruct")
print(lorax_client.generate("""<<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
[INST] What is the best pizza restaurant in New York? [/INST]""", max_new_tokens=100).generated_text)
# Export environment variables
export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>" # Settings > My Profile > Generate API Token
export PREDIBASE_TENANT_ID="<YOUR TENANT ID>" # Settings > My Profile > Overview > Tenant ID
export PREDIBASE_DEPLOYMENT="my-mistral-7b-instruct"
# query the LLM deployment
curl -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 256}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
Note: you can also use the /generate_stream endpoint to have tokens streamed from the deployment.
See REST API for more parameters.
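The same generate call can also be made from Python with the standard library, assuming the environment variables above are set. This is a minimal sketch; the build_generate_request helper is illustrative, not part of the SDK:

```python
import json
import os
import urllib.request

def build_generate_request(tenant_id: str, deployment: str, prompt: str,
                           max_new_tokens: int = 256):
    """Assemble the URL, headers, and JSON body for the /generate endpoint."""
    url = (f"https://serving.app.predibase.com/{tenant_id}"
           f"/deployments/v2/llms/{deployment}/generate")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['PREDIBASE_API_TOKEN']}",
    }
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return url, headers, payload

# Only fire the request when the environment is actually configured.
if "PREDIBASE_TENANT_ID" in os.environ:
    url, headers, payload = build_generate_request(
        os.environ["PREDIBASE_TENANT_ID"],
        os.environ["PREDIBASE_DEPLOYMENT"],
        "What is your name?",
    )
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["generated_text"])
```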
Prompt a Fine-Tuned Adapter (with LoRAX)
- Python SDK
- REST
# Specify the dedicated deployment of the base model which was fine-tuned
lorax_client = pb.deployments.client("my-mistral-7b-instruct")
# Specify your adapter_id as "adapter-repo-name/adapter-version-number"
print(lorax_client.generate("""<<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
[INST] What is the best pizza restaurant in New York? [/INST]""", adapter_id="adapter-repo-name/1", max_new_tokens=100).generated_text)
# Export environment variables
export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>" # Settings > My Profile > Generate API Token
export PREDIBASE_TENANT_ID="<YOUR TENANT ID>" # Settings > My Profile > Overview > Tenant ID
export PREDIBASE_DEPLOYMENT="my-mistral-7b-instruct"
# query the LLM deployment
curl -d "{\"inputs\": \"What is your name?\", \"parameters\": {\"api_token\": \"${PREDIBASE_API_TOKEN}\", \"adapter_source\": \"pbase\", \"adapter_id\": \"<finetuned_model_repo_name>/<finetuned_model_version>\", \"max_new_tokens\": 256}}" \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
Note: you can also use the /generate_stream endpoint to have tokens streamed from the deployment.
See REST API for more parameters.
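The adapter_id values above follow a "repo-name/version" convention, with versions starting at 1. The small helper below is purely illustrative (not part of the Predibase SDK) and only sketches how to assemble and sanity-check such a string:

```python
# Illustrative helper, not part of the Predibase SDK: build the
# "adapter-repo-name/version" string expected by the adapter_id parameter.
def make_adapter_id(repo_name: str, version: int) -> str:
    if "/" in repo_name:
        raise ValueError("repo name must not contain '/'")
    if version < 1:
        raise ValueError("adapter versions start at 1")
    return f"{repo_name}/{version}"

print(make_adapter_id("adapter-repo-name", 1))  # adapter-repo-name/1
```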
Delete a Deployment
Deployments can be deleted via the SDK or CLI to free up compute resources.
- Python SDK
- CLI
pb.deployments.delete("my-mistral-7b-instruct")
pbase delete llm --deployment-name llama-2-7b
Other helpful methods
- List LLM Deployments - Method for fetching a list of your LLM deployments
- Get LLM Status - Method for checking your deployment's status and seeing whether it is ready for prompting