diff --git a/docs/api/authentication.mdx b/docs/api/authentication.mdx new file mode 100644 index 000000000..03d802fb3 --- /dev/null +++ b/docs/api/authentication.mdx @@ -0,0 +1,63 @@ +--- +title: Authentication +--- + +No authentication is required when accessing Ollama's API locally via `http://localhost:11434`. + +Authentication is required for the following: + +* Running cloud models via ollama.com +* Publishing models +* Downloading private models + +Ollama supports two authentication methods: + +* **Signing in**: sign in from your local installation, and Ollama will automatically authenticate requests to ollama.com when running commands +* **API keys**: keys for programmatic access to ollama.com's API + +## Signing in + +To sign in to ollama.com from your local installation of Ollama, run: + +``` +ollama signin +``` + +Once signed in, Ollama will automatically authenticate commands as required: + +``` +ollama run gpt-oss:120b-cloud +``` + +Similarly, when accessing a local API endpoint that requires cloud access, Ollama will automatically authenticate the request: + +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "gpt-oss:120b-cloud", + "prompt": "Why is the sky blue?" +}' +``` + +## API keys + +For direct access to ollama.com's API served at `https://ollama.com/api`, authentication via API keys is required. + +First, create an [API key](https://ollama.com/settings/keys), then set the `OLLAMA_API_KEY` environment variable: + +```shell +export OLLAMA_API_KEY=your_api_key +``` + +Then use the API key in the `Authorization` header: + +```shell +curl https://ollama.com/api/generate \ + -H "Authorization: Bearer $OLLAMA_API_KEY" \ + -d '{ + "model": "gpt-oss:120b", + "prompt": "Why is the sky blue?", + "stream": false + }' +``` + +API keys don't currently expire, but you can revoke them at any time in your [API keys settings](https://ollama.com/settings/keys). 
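The API key flow in `authentication.mdx` above can also be sketched in Python, for readers who are not using `curl`. This is a minimal illustration using only the standard library, not something the patch itself adds. The helper name `build_generate_request` is hypothetical, and the explicit `Content-Type: application/json` header is an assumption (the `curl` example does not set it). The URL, model name, `OLLAMA_API_KEY` variable, and `Bearer` scheme are taken from the documentation text.

```python
import json
import os
import urllib.request

# Base URL for direct access to ollama.com's API, as stated in the docs.
OLLAMA_CLOUD_BASE_URL = "https://ollama.com/api"


def build_generate_request(prompt, model="gpt-oss:120b", api_key=None):
    """Build (but do not send) an authenticated POST to /api/generate.

    Hypothetical helper for illustration only.
    """
    # Fall back to the OLLAMA_API_KEY environment variable, like the curl example.
    api_key = api_key or os.environ.get("OLLAMA_API_KEY")
    if not api_key:
        raise ValueError("set OLLAMA_API_KEY or pass api_key explicitly")

    # Same JSON body as the curl example, with streaming disabled.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})

    return urllib.request.Request(
        f"{OLLAMA_CLOUD_BASE_URL}/generate",
        data=body.encode("utf-8"),
        headers={
            # Bearer scheme, as in the documented Authorization header.
            "Authorization": f"Bearer {api_key}",
            # Assumption: declare the body as JSON explicitly.
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. Building the request separately from sending it keeps the key handling easy to inspect and test without network access.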
diff --git a/docs/api/errors.mdx b/docs/api/errors.mdx new file mode 100644 index 000000000..15a8809ea --- /dev/null +++ b/docs/api/errors.mdx @@ -0,0 +1,36 @@ +--- +title: Errors +--- + +## Status codes + +Endpoints return appropriate HTTP status codes based on the success or failure of the request in the HTTP status line (e.g. `HTTP/1.1 200 OK` or `HTTP/1.1 400 Bad Request`). Common status codes are: + +- `200`: Success +- `400`: Bad Request (missing parameters, invalid JSON, etc.) +- `404`: Not Found (model doesn't exist, etc.) +- `429`: Too Many Requests (e.g. when a rate limit is exceeded) +- `500`: Internal Server Error +- `502`: Bad Gateway (e.g. when a cloud model cannot be reached) + +## Error messages + +Errors are returned in the `application/json` format with the following structure, with the error message in the `error` property: + +```json +{ + "error": "the model failed to generate a response" +} +``` + +## Errors that occur while streaming + +If an error occurs mid-stream, the error will be returned as an object in the `application/x-ndjson` format with an `error` property. Since the response has already started, the status code of the response will not be changed. + +```json +{"model":"gemma3","created_at":"2025-10-26T17:21:21.196249Z","response":" Yes","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:21:21.207235Z","response":".","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:21:21.219166Z","response":"I","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:21:21.231094Z","response":"can","done":false} +{"error":"an error was encountered while running the model"} +``` diff --git a/docs/api/index.mdx b/docs/api/index.mdx index f47af63c6..bdda0a62e 100644 --- a/docs/api/index.mdx +++ b/docs/api/index.mdx @@ -1,1873 +1,47 @@ -# API +--- +title: "Introduction" +--- -## Endpoints +Ollama's API allows you to run and interact with models programmatically. 
-- [Generate a completion](#generate-a-completion) -- [Generate a chat completion](#generate-a-chat-completion) -- [Create a Model](#create-a-model) -- [List Local Models](#list-local-models) -- [Show Model Information](#show-model-information) -- [Copy a Model](#copy-a-model) -- [Delete a Model](#delete-a-model) -- [Pull a Model](#pull-a-model) -- [Push a Model](#push-a-model) -- [Generate Embeddings](#generate-embeddings) -- [List Running Models](#list-running-models) -- [Version](#version) +## Get started -## Conventions +If you're just getting started, follow the [quickstart](/quickstart) documentation to get up and running with Ollama's API. -### Model names +## Base URL -Model names follow a `model:tag` format, where `model` can have an optional namespace such as `example/model`. Some examples are `orca-mini:3b-q8_0` and `llama3:70b`. The tag is optional and, if not provided, will default to `latest`. The tag is used to identify a specific version. - -### Durations - -All durations are returned in nanoseconds. - -### Streaming responses - -Certain endpoints stream responses as JSON objects. Streaming can be disabled by providing `{"stream": false}` for these endpoints. - -## Generate a completion +After installation, Ollama's API is served by default at: ``` -POST /api/generate +http://localhost:11434/api ``` -Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request. 
+For running cloud models on **ollama.com**, the same API is available with the following base URL: -### Parameters +``` +https://ollama.com/api +``` -- `model`: (required) the [model name](#model-names) -- `prompt`: the prompt to generate a response for -- `suffix`: the text after the model response -- `images`: (optional) a list of base64-encoded images (for multimodal models such as `llava`) -- `think`: (for thinking models) should the model think before responding? +## Example request -Advanced parameters (optional): - -- `format`: the format to return a response in. Format can be `json` or a JSON schema -- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature` -- `system`: system message to (overrides what is defined in the `Modelfile`) -- `template`: the prompt template to use (overrides what is defined in the `Modelfile`) -- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects -- `raw`: if `true` no formatting will be applied to the prompt. You may choose to use the `raw` parameter if you are specifying a full templated prompt in your request to the API -- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) -- `context` (deprecated): the context parameter returned from a previous request to `/generate`, this can be used to keep a short conversational memory - -#### Structured outputs - -Structured outputs are supported by providing a JSON schema in the `format` parameter. The model will generate a response that matches the schema. See the [structured outputs](#request-structured-outputs) example below. - -#### JSON mode - -Enable JSON mode by setting the `format` parameter to `json`. This will structure the response as a valid JSON object. See the JSON mode [example](#request-json-mode) below. 
- -> [!IMPORTANT] -> It's important to instruct the model to use JSON in the `prompt`. Otherwise, the model may generate large amounts whitespace. - -### Examples - -#### Generate request (Streaming) - -##### Request +Once Ollama is running, its API is automatically available and can be accessed via `curl`: ```shell curl http://localhost:11434/api/generate -d '{ - "model": "llama3.2", + "model": "gemma3", "prompt": "Why is the sky blue?" }' ``` -##### Response +## Libraries -A stream of JSON objects is returned: +Ollama has official libraries for Python and JavaScript: -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T08:52:19.385406455-07:00", - "response": "The", - "done": false -} -``` +- [Python](https://github.com/ollama/ollama-python) +- [JavaScript](https://github.com/ollama/ollama-js) -The final response in the stream also includes additional data about the generation: - -- `total_duration`: time spent generating the response -- `load_duration`: time spent in nanoseconds loading the model -- `prompt_eval_count`: number of tokens in the prompt -- `prompt_eval_duration`: time spent in nanoseconds evaluating the prompt -- `eval_count`: number of tokens in the response -- `eval_duration`: time in nanoseconds spent generating the response -- `context`: an encoding of the conversation used in this response, this can be sent in the next request to keep a conversational memory -- `response`: empty if the response was streamed, if not streamed, this will contain the full response - -To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` / `eval_duration` * `10^9`. 
- -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T19:22:45.499127Z", - "response": "", - "done": true, - "context": [1, 2, 3], - "total_duration": 10706818083, - "load_duration": 6338219291, - "prompt_eval_count": 26, - "prompt_eval_duration": 130079000, - "eval_count": 259, - "eval_duration": 4232710000 -} -``` - -#### Request (No streaming) - -##### Request - -A response can be received in one reply when streaming is off. - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "llama3.2", - "prompt": "Why is the sky blue?", - "stream": false -}' -``` - -##### Response - -If `stream` is set to `false`, the response will be a single JSON object: - -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T19:22:45.499127Z", - "response": "The sky is blue because it is the color of the sky.", - "done": true, - "context": [1, 2, 3], - "total_duration": 5043500667, - "load_duration": 5025959, - "prompt_eval_count": 26, - "prompt_eval_duration": 325953000, - "eval_count": 290, - "eval_duration": 4709213000 -} -``` - -#### Request (with suffix) - -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "codellama:code", - "prompt": "def compute_gcd(a, b):", - "suffix": " return result", - "options": { - "temperature": 0 - }, - "stream": false -}' -``` - -##### Response - -```json5 -{ - "model": "codellama:code", - "created_at": "2024-07-22T20:47:51.147561Z", - "response": "\n if a == 0:\n return b\n else:\n return compute_gcd(b % a, a)\n\ndef compute_lcm(a, b):\n result = (a * b) / compute_gcd(a, b)\n", - "done": true, - "done_reason": "stop", - "context": [...], - "total_duration": 1162761250, - "load_duration": 6683708, - "prompt_eval_count": 17, - "prompt_eval_duration": 201222000, - "eval_count": 63, - "eval_duration": 953997000 -} -``` - -#### Request (Structured outputs) - -##### Request - -```shell -curl -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{ - "model": 
"llama3.1:8b", - "prompt": "Ollama is 22 years old and is busy saving the world. Respond using JSON", - "stream": false, - "format": { - "type": "object", - "properties": { - "age": { - "type": "integer" - }, - "available": { - "type": "boolean" - } - }, - "required": [ - "age", - "available" - ] - } -}' -``` - -##### Response - -```json -{ - "model": "llama3.1:8b", - "created_at": "2024-12-06T00:48:09.983619Z", - "response": "{\n \"age\": 22,\n \"available\": true\n}", - "done": true, - "done_reason": "stop", - "context": [1, 2, 3], - "total_duration": 1075509083, - "load_duration": 567678166, - "prompt_eval_count": 28, - "prompt_eval_duration": 236000000, - "eval_count": 16, - "eval_duration": 269000000 -} -``` - -#### Request (JSON mode) - -> [!IMPORTANT] -> When `format` is set to `json`, the output will always be a well-formed JSON object. It's important to also instruct the model to respond in JSON. - -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "llama3.2", - "prompt": "What color is the sky at different times of the day? 
Respond using JSON", - "format": "json", - "stream": false -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2023-11-09T21:07:55.186497Z", - "response": "{\n\"morning\": {\n\"color\": \"blue\"\n},\n\"noon\": {\n\"color\": \"blue-gray\"\n},\n\"afternoon\": {\n\"color\": \"warm gray\"\n},\n\"evening\": {\n\"color\": \"orange\"\n}\n}\n", - "done": true, - "context": [1, 2, 3], - "total_duration": 4648158584, - "load_duration": 4071084, - "prompt_eval_count": 36, - "prompt_eval_duration": 439038000, - "eval_count": 180, - "eval_duration": 4196918000 -} -``` - -The value of `response` will be a string containing JSON similar to: - -```json -{ - "morning": { - "color": "blue" - }, - "noon": { - "color": "blue-gray" - }, - "afternoon": { - "color": "warm gray" - }, - "evening": { - "color": "orange" - } -} -``` - -#### Request (with images) - -To submit images to multimodal models such as `llava` or `bakllava`, provide a list of base64-encoded `images`: - -#### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "llava", - "prompt":"What is in this picture?", - "stream": false, - "images": 
["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw
4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q
4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"] -}' -``` - -#### Response - -```json -{ - "model": "llava", - "created_at": "2023-11-03T15:36:02.583064Z", - "response": "A happy cartoon character, which is cute and cheerful.", - "done": true, - "context": [1, 2, 3], - "total_duration": 2938432250, - "load_duration": 2559292, - "prompt_eval_count": 1, - "prompt_eval_duration": 2195557000, - "eval_count": 44, - "eval_duration": 736432000 -} -``` - -#### Request (Raw Mode) - -In some cases, you may wish to bypass the templating system and provide a full prompt. In this case, you can use the `raw` parameter to disable templating. Also note that raw mode will not return a context. - -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "mistral", - "prompt": "[INST] why is the sky blue? 
[/INST]", - "raw": true, - "stream": false -}' -``` - -#### Request (Reproducible outputs) - -For reproducible outputs, set `seed` to a number: - -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "mistral", - "prompt": "Why is the sky blue?", - "options": { - "seed": 123 - } -}' -``` - -##### Response - -```json -{ - "model": "mistral", - "created_at": "2023-11-03T15:36:02.583064Z", - "response": " The sky appears blue because of a phenomenon called Rayleigh scattering.", - "done": true, - "total_duration": 8493852375, - "load_duration": 6589624375, - "prompt_eval_count": 14, - "prompt_eval_duration": 119039000, - "eval_count": 110, - "eval_duration": 1779061000 -} -``` - -#### Generate request (With options) - -If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the `options` parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override. 
- -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "llama3.2", - "prompt": "Why is the sky blue?", - "stream": false, - "options": { - "num_keep": 5, - "seed": 42, - "num_predict": 100, - "top_k": 20, - "top_p": 0.9, - "min_p": 0.0, - "typical_p": 0.7, - "repeat_last_n": 33, - "temperature": 0.8, - "repeat_penalty": 1.2, - "presence_penalty": 1.5, - "frequency_penalty": 1.0, - "penalize_newline": true, - "stop": ["\n", "user:"], - "numa": false, - "num_ctx": 1024, - "num_batch": 2, - "num_gpu": 1, - "main_gpu": 0, - "use_mmap": true, - "num_thread": 8 - } -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T19:22:45.499127Z", - "response": "The sky is blue because it is the color of the sky.", - "done": true, - "context": [1, 2, 3], - "total_duration": 4935886791, - "load_duration": 534986708, - "prompt_eval_count": 26, - "prompt_eval_duration": 107345000, - "eval_count": 237, - "eval_duration": 4289432000 -} -``` - -#### Load a model - -If an empty prompt is provided, the model will be loaded into memory. - -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "llama3.2" -}' -``` - -##### Response - -A single JSON object is returned: - -```json -{ - "model": "llama3.2", - "created_at": "2023-12-18T19:52:07.071755Z", - "response": "", - "done": true -} -``` - -#### Unload a model - -If an empty prompt is provided and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory. - -##### Request - -```shell -curl http://localhost:11434/api/generate -d '{ - "model": "llama3.2", - "keep_alive": 0 -}' -``` - -##### Response - -A single JSON object is returned: - -```json -{ - "model": "llama3.2", - "created_at": "2024-09-12T03:54:03.516566Z", - "response": "", - "done": true, - "done_reason": "unload" -} -``` - -## Generate a chat completion - -``` -POST /api/chat -``` - -Generate the next message in a chat with a provided model. 
This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using `"stream": false`. The final response object will include statistics and additional data from the request. - -### Parameters - -- `model`: (required) the [model name](#model-names) -- `messages`: the messages of the chat, this can be used to keep a chat memory -- `tools`: list of tools in JSON for the model to use if supported -- `think`: (for thinking models) should the model think before responding? - -The `message` object has the following fields: - -- `role`: the role of the message, either `system`, `user`, `assistant`, or `tool` -- `content`: the content of the message -- `thinking`: (for thinking models) the model's thinking process -- `images` (optional): a list of images to include in the message (for multimodal models such as `llava`) -- `tool_calls` (optional): a list of tools in JSON that the model wants to use -- `tool_name` (optional): add the name of the tool that was executed to inform the model of the result - -Advanced parameters (optional): - -- `format`: the format to return a response in. Format can be `json` or a JSON schema. -- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature` -- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects -- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) - -### Tool calling - -Tool calling is supported by providing a list of tools in the `tools` parameter. The model will generate a response that includes a list of tool calls. See the [Chat request (Streaming with tools)](#chat-request-streaming-with-tools) example below. - -Models can also explain the result of the tool call in the response. See the [Chat request (With history, with tools)](#chat-request-with-history-with-tools) example below. 
- -[See models with tool calling capabilities](https://ollama.com/search?c=tool). - -### Structured outputs - -Structured outputs are supported by providing a JSON schema in the `format` parameter. The model will generate a response that matches the schema. See the [Chat request (Structured outputs)](#chat-request-structured-outputs) example below. - -### Examples - -#### Chat request (Streaming) - -##### Request - -Send a chat message with a streaming response. - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "why is the sky blue?" - } - ] -}' -``` - -##### Response - -A stream of JSON objects is returned: - -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T08:52:19.385406455-07:00", - "message": { - "role": "assistant", - "content": "The", - "images": null - }, - "done": false -} -``` - -Final response: - -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T19:22:45.499127Z", - "message": { - "role": "assistant", - "content": "" - }, - "done": true, - "total_duration": 4883583458, - "load_duration": 1334875, - "prompt_eval_count": 26, - "prompt_eval_duration": 342546000, - "eval_count": 282, - "eval_duration": 4535599000 -} -``` - -#### Chat request (Streaming with tools) - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "what is the weather in tokyo?" 
- } - ], - "tools": [ - { - "type": "function", - "function": { - "name": "get_weather", - "description": "Get the weather in a given city", - "parameters": { - "type": "object", - "properties": { - "city": { - "type": "string", - "description": "The city to get the weather for" - } - }, - "required": ["city"] - } - } - } - ], - "stream": true -}' -``` - -##### Response - -A stream of JSON objects is returned: -```json -{ - "model": "llama3.2", - "created_at": "2025-07-07T20:22:19.184789Z", - "message": { - "role": "assistant", - "content": "", - "tool_calls": [ - { - "function": { - "name": "get_weather", - "arguments": { - "city": "Tokyo" - } - }, - } - ] - }, - "done": false -} -``` - -Final response: - -```json -{ - "model":"llama3.2", - "created_at":"2025-07-07T20:22:19.19314Z", - "message": { - "role": "assistant", - "content": "" - }, - "done_reason": "stop", - "done": true, - "total_duration": 182242375, - "load_duration": 41295167, - "prompt_eval_count": 169, - "prompt_eval_duration": 24573166, - "eval_count": 15, - "eval_duration": 115959084 -} -``` - -#### Chat request (No streaming) - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "why is the sky blue?" - } - ], - "stream": false -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2023-12-12T14:13:43.416799Z", - "message": { - "role": "assistant", - "content": "Hello! How are you today?" - }, - "done": true, - "total_duration": 5191566416, - "load_duration": 2154458, - "prompt_eval_count": 26, - "prompt_eval_duration": 383809000, - "eval_count": 298, - "eval_duration": 4799921000 -} -``` - -#### Chat request (No streaming, with tools) - -##### Request - - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "what is the weather in tokyo?" 
- } - ], - "tools": [ - { - "type": "function", - "function": { - "name": "get_weather", - "description": "Get the weather in a given city", - "parameters": { - "type": "object", - "properties": { - "city": { - "type": "string", - "description": "The city to get the weather for" - } - }, - "required": ["city"] - } - } - } - ], - "stream": false -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2025-07-07T20:32:53.844124Z", - "message": { - "role": "assistant", - "content": "", - "tool_calls": [ - { - "function": { - "name": "get_weather", - "arguments": { - "city": "Tokyo" - } - }, - } - ] - }, - "done_reason": "stop", - "done": true, - "total_duration": 3244883583, - "load_duration": 2969184542, - "prompt_eval_count": 169, - "prompt_eval_duration": 141656333, - "eval_count": 18, - "eval_duration": 133293625 -} -``` - -#### Chat request (Structured outputs) - -##### Request - -```shell -curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ - "model": "llama3.1", - "messages": [{"role": "user", "content": "Ollama is 22 years old and busy saving the world. Return a JSON object with the age and availability."}], - "stream": false, - "format": { - "type": "object", - "properties": { - "age": { - "type": "integer" - }, - "available": { - "type": "boolean" - } - }, - "required": [ - "age", - "available" - ] - }, - "options": { - "temperature": 0 - } -}' -``` - -##### Response - -```json -{ - "model": "llama3.1", - "created_at": "2024-12-06T00:46:58.265747Z", - "message": { "role": "assistant", "content": "{\"age\": 22, \"available\": false}" }, - "done_reason": "stop", - "done": true, - "total_duration": 2254970291, - "load_duration": 574751416, - "prompt_eval_count": 34, - "prompt_eval_duration": 1502000000, - "eval_count": 12, - "eval_duration": 175000000 -} -``` - -#### Chat request (With History) - -Send a chat message with a conversation history. 
You can use this same approach to start the conversation using multi-shot or chain-of-thought prompting. - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "why is the sky blue?" - }, - { - "role": "assistant", - "content": "due to rayleigh scattering." - }, - { - "role": "user", - "content": "how is that different than mie scattering?" - } - ] -}' -``` - -##### Response - -A stream of JSON objects is returned: - -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T08:52:19.385406455-07:00", - "message": { - "role": "assistant", - "content": "The" - }, - "done": false -} -``` - -Final response: - -```json -{ - "model": "llama3.2", - "created_at": "2023-08-04T19:22:45.499127Z", - "done": true, - "total_duration": 8113331500, - "load_duration": 6396458, - "prompt_eval_count": 61, - "prompt_eval_duration": 398801000, - "eval_count": 468, - "eval_duration": 7701267000 -} -``` - - -#### Chat request (With history, with tools) - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "what is the weather in Toronto?" 
- }, - // the message from the model appended to history - { - "role": "assistant", - "content": "", - "tool_calls": [ - { - "function": { - "name": "get_temperature", - "arguments": { - "city": "Toronto" - } - }, - } - ] - }, - // the tool call result appended to history - { - "role": "tool", - "content": "11 degrees celsius", - "tool_name": "get_temperature", - } - ], - "stream": false, - "tools": [ - { - "type": "function", - "function": { - "name": "get_weather", - "description": "Get the weather in a given city", - "parameters": { - "type": "object", - "properties": { - "city": { - "type": "string", - "description": "The city to get the weather for" - } - }, - "required": ["city"] - } - } - } - ] -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2025-07-07T20:43:37.688511Z", - "message": { - "role": "assistant", - "content": "The current temperature in Toronto is 11°C." - }, - "done_reason": "stop", - "done": true, - "total_duration": 890771750, - "load_duration": 707634750, - "prompt_eval_count": 94, - "prompt_eval_duration": 91703208, - "eval_count": 11, - "eval_duration": 90282125 -} - -``` - - -#### Chat request (with images) - -##### Request - -Send a chat message with images. The images should be provided as an array, with the individual images encoded in Base64. 
- -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llava", - "messages": [ - { - "role": "user", - "content": "what is in this image?", - "images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37su
BBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8
b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"] - } - ] -}' -``` - -##### Response - -```json -{ - "model": "llava", - "created_at": "2023-12-13T22:42:50.203334Z", - "message": { - "role": "assistant", - "content": " The image features a cute, little pig with an angry facial expression. It's wearing a heart on its shirt and is waving in the air. This scene appears to be part of a drawing or sketching project.", - "images": null - }, - "done": true, - "total_duration": 1668506709, - "load_duration": 1986209, - "prompt_eval_count": 26, - "prompt_eval_duration": 359682000, - "eval_count": 83, - "eval_duration": 1303285000 -} -``` - -#### Chat request (Reproducible outputs) - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "Hello!" 
- } - ], - "options": { - "seed": 101, - "temperature": 0 - } -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2023-12-12T14:13:43.416799Z", - "message": { - "role": "assistant", - "content": "Hello! How are you today?" - }, - "done": true, - "total_duration": 5191566416, - "load_duration": 2154458, - "prompt_eval_count": 26, - "prompt_eval_duration": 383809000, - "eval_count": 298, - "eval_duration": 4799921000 -} -``` - -#### Chat request (with tools) - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [ - { - "role": "user", - "content": "What is the weather today in Paris?" - } - ], - "stream": false, - "tools": [ - { - "type": "function", - "function": { - "name": "get_current_weather", - "description": "Get the current weather for a location", - "parameters": { - "type": "object", - "properties": { - "location": { - "type": "string", - "description": "The location to get the weather for, e.g. San Francisco, CA" - }, - "format": { - "type": "string", - "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'", - "enum": ["celsius", "fahrenheit"] - } - }, - "required": ["location", "format"] - } - } - } - ] -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at": "2024-07-22T20:33:28.123648Z", - "message": { - "role": "assistant", - "content": "", - "tool_calls": [ - { - "function": { - "name": "get_current_weather", - "arguments": { - "format": "celsius", - "location": "Paris, FR" - } - } - } - ] - }, - "done_reason": "stop", - "done": true, - "total_duration": 885095291, - "load_duration": 3753500, - "prompt_eval_count": 122, - "prompt_eval_duration": 328493000, - "eval_count": 33, - "eval_duration": 552222000 -} -``` - -#### Load a model - -If the messages array is empty, the model will be loaded into memory. 
- -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [] -}' -``` - -##### Response - -```json -{ - "model": "llama3.2", - "created_at":"2024-09-12T21:17:29.110811Z", - "message": { - "role": "assistant", - "content": "" - }, - "done_reason": "load", - "done": true -} -``` - -#### Unload a model - -If the messages array is empty and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory. - -##### Request - -```shell -curl http://localhost:11434/api/chat -d '{ - "model": "llama3.2", - "messages": [], - "keep_alive": 0 -}' -``` - -##### Response - -A single JSON object is returned: - -```json -{ - "model": "llama3.2", - "created_at":"2024-09-12T21:33:17.547535Z", - "message": { - "role": "assistant", - "content": "" - }, - "done_reason": "unload", - "done": true -} -``` - -## Create a Model - -``` -POST /api/create -``` - -Create a model from: - * another model; - * a safetensors directory; or - * a GGUF file. - -If you are creating a model from a safetensors directory or from a GGUF file, you must [create a blob](#create-a-blob) for each of the files and then use the file name and SHA256 digest associated with each blob in the `files` field. 
- -### Parameters - -- `model`: name of the model to create -- `from`: (optional) name of an existing model to create the new model from -- `files`: (optional) a dictionary of file names to SHA256 digests of blobs to create the model from -- `adapters`: (optional) a dictionary of file names to SHA256 digests of blobs for LORA adapters -- `template`: (optional) the prompt template for the model -- `license`: (optional) a string or list of strings containing the license or licenses for the model -- `system`: (optional) a string containing the system prompt for the model -- `parameters`: (optional) a dictionary of parameters for the model (see [Modelfile](./modelfile.md#valid-parameters-and-values) for a list of parameters) -- `messages`: (optional) a list of message objects used to create a conversation -- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects -- `quantize` (optional): quantize a non-quantized (e.g. float16) model - -#### Quantization types - -| Type | Recommended | -| --- | :-: | -| q4_K_M | * | -| q4_K_S | | -| q8_0 | * | - -### Examples - -#### Create a new model - -Create a new model from an existing model. - -##### Request - -```shell -curl http://localhost:11434/api/create -d '{ - "model": "mario", - "from": "llama3.2", - "system": "You are Mario from Super Mario Bros." 
-}' -``` - -##### Response - -A stream of JSON objects is returned: - -```json -{"status":"reading model metadata"} -{"status":"creating system layer"} -{"status":"using already created layer sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2"} -{"status":"using already created layer sha256:8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b"} -{"status":"using already created layer sha256:7c23fb36d80141c4ab8cdbb61ee4790102ebd2bf7aeff414453177d4f2110e5d"} -{"status":"using already created layer sha256:2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988"} -{"status":"using already created layer sha256:2759286baa875dc22de5394b4a925701b1896a7e3f8e53275c36f75a877a82c9"} -{"status":"writing layer sha256:df30045fe90f0d750db82a058109cecd6d4de9c90a3d75b19c09e5f64580bb42"} -{"status":"writing layer sha256:f18a68eb09bf925bb1b669490407c1b1251c5db98dc4d3d81f3088498ea55690"} -{"status":"writing manifest"} -{"status":"success"} -``` - -#### Quantize a model - -Quantize a non-quantized model. 
- -##### Request - -```shell -curl http://localhost:11434/api/create -d '{ - "model": "llama3.2:quantized", - "from": "llama3.2:3b-instruct-fp16", - "quantize": "q4_K_M" -}' -``` - -##### Response - -A stream of JSON objects is returned: - -```json -{"status":"quantizing F16 model to Q4_K_M","digest":"0","total":6433687776,"completed":12302} -{"status":"quantizing F16 model to Q4_K_M","digest":"0","total":6433687776,"completed":6433687552} -{"status":"verifying conversion"} -{"status":"creating new layer sha256:fb7f4f211b89c6c4928ff4ddb73db9f9c0cfca3e000c3e40d6cf27ddc6ca72eb"} -{"status":"using existing layer sha256:966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396"} -{"status":"using existing layer sha256:fcc5a6bec9daf9b561a68827b67ab6088e1dba9d1fa2a50d7bbcc8384e0a265d"} -{"status":"using existing layer sha256:a70ff7e570d97baaf4e62ac6e6ad9975e04caa6d900d3742d37698494479e0cd"} -{"status":"using existing layer sha256:56bb8bd477a519ffa694fc449c2413c6f0e1d3b1c88fa7e3c9d88d3ae49d4dcb"} -{"status":"writing manifest"} -{"status":"success"} -``` - -#### Create a model from GGUF - -Create a model from a GGUF file. The `files` parameter should be filled out with the file name and SHA256 digest of the GGUF file you wish to use. Use [/api/blobs/:digest](#push-a-blob) to push the GGUF file to the server before calling this API. 
- - -##### Request - -```shell -curl http://localhost:11434/api/create -d '{ - "model": "my-gguf-model", - "files": { - "test.gguf": "sha256:432f310a77f4650a88d0fd59ecdd7cebed8d684bafea53cbff0473542964f0c3" - } -}' -``` - -##### Response - -A stream of JSON objects is returned: - -```json -{"status":"parsing GGUF"} -{"status":"using existing layer sha256:432f310a77f4650a88d0fd59ecdd7cebed8d684bafea53cbff0473542964f0c3"} -{"status":"writing manifest"} -{"status":"success"} -``` - - -#### Create a model from a Safetensors directory - -The `files` parameter should include a dictionary of files for the safetensors model which includes the file names and SHA256 digest of each file. Use [/api/blobs/:digest](#push-a-blob) to first push each of the files to the server before calling this API. Files will remain in the cache until the Ollama server is restarted. - -##### Request - -```shell -curl http://localhost:11434/api/create -d '{ - "model": "fred", - "files": { - "config.json": "sha256:dd3443e529fb2290423a0c65c2d633e67b419d273f170259e27297219828e389", - "generation_config.json": "sha256:88effbb63300dbbc7390143fbbdd9d9fa50587b37e8bfd16c8c90d4970a74a36", - "special_tokens_map.json": "sha256:b7455f0e8f00539108837bfa586c4fbf424e31f8717819a6798be74bef813d05", - "tokenizer.json": "sha256:bbc1904d35169c542dffbe1f7589a5994ec7426d9e5b609d07bab876f32e97ab", - "tokenizer_config.json": "sha256:24e8a6dc2547164b7002e3125f10b415105644fcf02bf9ad8b674c87b1eaaed6", - "model.safetensors": "sha256:1ff795ff6a07e6a68085d206fb84417da2f083f68391c2843cd2b8ac6df8538f" - } -}' -``` - -##### Response - -A stream of JSON objects is returned: - -```shell -{"status":"converting model"} -{"status":"creating new layer sha256:05ca5b813af4a53d2c2922933936e398958855c44ee534858fcfd830940618b6"} -{"status":"using autodetected template llama3-instruct"} -{"status":"using existing layer sha256:56bb8bd477a519ffa694fc449c2413c6f0e1d3b1c88fa7e3c9d88d3ae49d4dcb"} -{"status":"writing manifest"} 
-{"status":"success"} -``` - -## Check if a Blob Exists - -```shell -HEAD /api/blobs/:digest -``` - -Ensures that the file blob (Binary Large Object) used with create a model exists on the server. This checks your Ollama server and not ollama.com. - -### Query Parameters - -- `digest`: the SHA256 digest of the blob - -### Examples - -#### Request - -```shell -curl -I http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2 -``` - -#### Response - -Return 200 OK if the blob exists, 404 Not Found if it does not. - -## Push a Blob - -``` -POST /api/blobs/:digest -``` - -Push a file to the Ollama server to create a "blob" (Binary Large Object). - -### Query Parameters - -- `digest`: the expected SHA256 digest of the file - -### Examples - -#### Request - -```shell -curl -T model.gguf -X POST http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2 -``` - -#### Response - -Return 201 Created if the blob was successfully created, 400 Bad Request if the digest used is not expected. - -## List Local Models - -``` -GET /api/tags -``` - -List models that are available locally. - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/tags -``` - -#### Response - -A single JSON object will be returned. 
- -```json -{ - "models": [ - { - "name": "deepseek-r1:latest", - "model": "deepseek-r1:latest", - "modified_at": "2025-05-10T08:06:48.639712648-07:00", - "size": 4683075271, - "digest": "0a8c266910232fd3291e71e5ba1e058cc5af9d411192cf88b6d30e92b6e73163", - "details": { - "parent_model": "", - "format": "gguf", - "family": "qwen2", - "families": [ - "qwen2" - ], - "parameter_size": "7.6B", - "quantization_level": "Q4_K_M" - } - }, - { - "name": "llama3.2:latest", - "model": "llama3.2:latest", - "modified_at": "2025-05-04T17:37:44.706015396-07:00", - "size": 2019393189, - "digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72", - "details": { - "parent_model": "", - "format": "gguf", - "family": "llama", - "families": [ - "llama" - ], - "parameter_size": "3.2B", - "quantization_level": "Q4_K_M" - } - } - ] -} -``` - -## Show Model Information - -``` -POST /api/show -``` - -Show information about a model including details, modelfile, template, parameters, license, system prompt. 
- -### Parameters - -- `model`: name of the model to show -- `verbose`: (optional) if set to `true`, returns full data for verbose response fields - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/show -d '{ - "model": "llava" -}' -``` - -#### Response - -```json5 -{ - "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE \"\"\"{{ .System }}\nUSER: {{ .Prompt }}\nASSISTANT: \"\"\"\nPARAMETER num_ctx 4096\nPARAMETER stop \"\u003c/s\u003e\"\nPARAMETER stop \"USER:\"\nPARAMETER stop \"ASSISTANT:\"", - "parameters": "num_keep 24\nstop \"<|start_header_id|>\"\nstop \"<|end_header_id|>\"\nstop \"<|eot_id|>\"", - "template": "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>", - "details": { - "parent_model": "", - "format": "gguf", - "family": "llama", - "families": [ - "llama" - ], - "parameter_size": "8.0B", - "quantization_level": "Q4_0" - }, - "model_info": { - "general.architecture": "llama", - "general.file_type": 2, - "general.parameter_count": 8030261248, - "general.quantization_version": 2, - "llama.attention.head_count": 32, - "llama.attention.head_count_kv": 8, - "llama.attention.layer_norm_rms_epsilon": 0.00001, - "llama.block_count": 32, - "llama.context_length": 8192, - "llama.embedding_length": 4096, - "llama.feed_forward_length": 14336, - "llama.rope.dimension_count": 128, - "llama.rope.freq_base": 500000, - "llama.vocab_size": 128256, - "tokenizer.ggml.bos_token_id": 128000, - "tokenizer.ggml.eos_token_id": 128009, - "tokenizer.ggml.merges": [], // populates if `verbose=true` - "tokenizer.ggml.model": 
"gpt2", - "tokenizer.ggml.pre": "llama-bpe", - "tokenizer.ggml.token_type": [], // populates if `verbose=true` - "tokenizer.ggml.tokens": [] // populates if `verbose=true` - }, - "capabilities": [ - "completion", - "vision" - ], -} -``` - -## Copy a Model - -``` -POST /api/copy -``` - -Copy a model. Creates a model with another name from an existing model. - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/copy -d '{ - "source": "llama3.2", - "destination": "llama3-backup" -}' -``` - -#### Response - -Returns a 200 OK if successful, or a 404 Not Found if the source model doesn't exist. - -## Delete a Model - -``` -DELETE /api/delete -``` - -Delete a model and its data. - -### Parameters - -- `model`: model name to delete - -### Examples - -#### Request - -```shell -curl -X DELETE http://localhost:11434/api/delete -d '{ - "model": "llama3:13b" -}' -``` - -#### Response - -Returns a 200 OK if successful, 404 Not Found if the model to be deleted doesn't exist. - -## Pull a Model - -``` -POST /api/pull -``` - -Download a model from the ollama library. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress. - -### Parameters - -- `model`: name of the model to pull -- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development. -- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/pull -d '{ - "model": "llama3.2" -}' -``` - -#### Response - -If `stream` is not specified, or set to `true`, a stream of JSON objects is returned: - -The first object is the manifest: - -```json -{ - "status": "pulling manifest" -} -``` - -Then there is a series of downloading responses. Until any of the download is completed, the `completed` key may not be included. 
The number of files to be downloaded depends on the number of layers specified in the manifest. - -```json -{ - "status": "pulling digestname", - "digest": "digestname", - "total": 2142590208, - "completed": 241970 -} -``` - -After all the files are downloaded, the final responses are: - -```json -{ - "status": "verifying sha256 digest" -} -{ - "status": "writing manifest" -} -{ - "status": "removing any unused layers" -} -{ - "status": "success" -} -``` - -if `stream` is set to false, then the response is a single JSON object: - -```json -{ - "status": "success" -} -``` - -## Push a Model - -``` -POST /api/push -``` - -Upload a model to a model library. Requires registering for ollama.ai and adding a public key first. - -### Parameters - -- `model`: name of the model to push in the form of `/:` -- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development. -- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/push -d '{ - "model": "mattw/pygmalion:latest" -}' -``` - -#### Response - -If `stream` is not specified, or set to `true`, a stream of JSON objects is returned: - -```json -{ "status": "retrieving manifest" } -``` - -and then: - -```json -{ - "status": "starting upload", - "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab", - "total": 1928429856 -} -``` - -Then there is a series of uploading responses: - -```json -{ - "status": "starting upload", - "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab", - "total": 1928429856 -} -``` - -Finally, when the upload is complete: - -```json -{"status":"pushing manifest"} -{"status":"success"} -``` - -If `stream` is set to `false`, then the response is a single JSON object: - -```json -{ "status": "success" } -``` - -## Generate 
Embeddings - -``` -POST /api/embed -``` - -Generate embeddings from a model - -### Parameters - -- `model`: name of model to generate embeddings from -- `input`: text or list of text to generate embeddings for - -Advanced parameters: - -- `truncate`: truncates the end of each input to fit within context length. Returns error if `false` and context length is exceeded. Defaults to `true` -- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature` -- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) -- `dimensions`: number of dimensions for the embedding - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/embed -d '{ - "model": "all-minilm", - "input": "Why is the sky blue?" -}' -``` - -#### Response - -```json -{ - "model": "all-minilm", - "embeddings": [[ - 0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814, - 0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348 - ]], - "total_duration": 14143917, - "load_duration": 1019500, - "prompt_eval_count": 8 -} -``` - -#### Request (Multiple input) - -```shell -curl http://localhost:11434/api/embed -d '{ - "model": "all-minilm", - "input": ["Why is the sky blue?", "Why is the grass green?"] -}' -``` - -#### Response - -```json -{ - "model": "all-minilm", - "embeddings": [[ - 0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814, - 0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348 - ],[ - -0.0098027075, 0.06042469, 0.025257962, -0.006364387, 0.07272725, - 0.017194884, 0.09032035, -0.051705178, 0.09951512, 0.09072481 - ]] -} -``` - -## List Running Models -``` -GET /api/ps -``` - -List models that are currently loaded into memory. - -#### Examples - -### Request - -```shell -curl http://localhost:11434/api/ps -``` - -#### Response - -A single JSON object will be returned. 
- -```json -{ - "models": [ - { - "name": "mistral:latest", - "model": "mistral:latest", - "size": 5137025024, - "digest": "2ae6f6dd7a3dd734790bbbf58b8909a606e0e7e97e94b7604e0aa7ae4490e6d8", - "details": { - "parent_model": "", - "format": "gguf", - "family": "llama", - "families": [ - "llama" - ], - "parameter_size": "7.2B", - "quantization_level": "Q4_0" - }, - "expires_at": "2024-06-04T14:38:31.83753-07:00", - "size_vram": 5137025024 - } - ] -} -``` - -## Generate Embedding - -> Note: this endpoint has been superseded by `/api/embed` - -``` -POST /api/embeddings -``` - -Generate embeddings from a model - -### Parameters - -- `model`: name of model to generate embeddings from -- `prompt`: text to generate embeddings for - -Advanced parameters: - -- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature` -- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/embeddings -d '{ - "model": "all-minilm", - "prompt": "Here is an article about llamas..." -}' -``` - -#### Response - -```json -{ - "embedding": [ - 0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313, - 0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281 - ] -} -``` - -## Version - -``` -GET /api/version -``` - -Retrieve the Ollama version - -### Examples - -#### Request - -```shell -curl http://localhost:11434/api/version -``` - -#### Response - -```json -{ - "version": "0.5.1" -} -``` +Several community-maintained libraries are available for Ollama. For a full list, see the [Ollama GitHub repository](https://github.com/ollama/ollama?tab=readme-ov-file#libraries-1). 
+## Versioning +Ollama's API isn't strictly versioned, but the API is expected to be stable and backwards compatible. Deprecations are rare and will be announced in the [release notes](https://github.com/ollama/ollama/releases). \ No newline at end of file diff --git a/docs/api/openai-compatibility.mdx b/docs/api/openai-compatibility.mdx index 26930124c..8329934af 100644 --- a/docs/api/openai-compatibility.mdx +++ b/docs/api/openai-compatibility.mdx @@ -1,9 +1,8 @@ -# OpenAI compatibility +--- +title: OpenAI compatibility +--- -> [!NOTE] -> OpenAI compatibility is experimental and is subject to major adjustments including breaking changes. For fully-featured access to the Ollama API, see the Ollama [Python library](https://github.com/ollama/ollama-python), [JavaScript library](https://github.com/ollama/ollama-js) and [REST API](https://github.com/ollama/ollama/blob/main/docs/api.md). - -Ollama provides experimental compatibility with parts of the [OpenAI API](https://platform.openai.com/docs/api-reference) to help connect existing applications to Ollama. +Ollama provides compatibility with parts of the [OpenAI API](https://platform.openai.com/docs/api-reference) to help connect existing applications to Ollama. 
## Usage @@ -100,49 +99,50 @@ except Exception as e: ### OpenAI JavaScript library ```javascript -import OpenAI from 'openai' +import OpenAI from "openai"; const openai = new OpenAI({ - baseURL: 'http://localhost:11434/v1/', + baseURL: "http://localhost:11434/v1/", // required but ignored - apiKey: 'ollama', -}) + apiKey: "ollama", +}); const chatCompletion = await openai.chat.completions.create({ - messages: [{ role: 'user', content: 'Say this is a test' }], - model: 'llama3.2', -}) + messages: [{ role: "user", content: "Say this is a test" }], + model: "llama3.2", +}); const response = await openai.chat.completions.create({ - model: "llava", - messages: [ + model: "llava", + messages: [ + { + role: "user", + content: [ + { type: "text", text: "What's in this image?" }, { - role: "user", - content: [ - { type: "text", text: "What's in this image?" }, - { - type: "image_url", - image_url: "", - }, - ], + type: "image_url", + image_url: + "", }, - ], -}) + ], + }, + ], +}); const completion = await openai.completions.create({ - model: "llama3.2", - prompt: "Say this is a test.", -}) + model: "llama3.2", + prompt: "Say this is a test.", +}); -const listCompletion = await openai.models.list() +const listCompletion = await openai.models.list(); -const model = await openai.models.retrieve("llama3.2") +const model = await openai.models.retrieve("llama3.2"); const embedding = await openai.embeddings.create({ model: "all-minilm", input: ["why is the sky blue?", "why is the grass green?"], -}) +}); ``` ### `curl` @@ -306,8 +306,8 @@ curl http://localhost:11434/v1/embeddings \ - [x] array of strings - [ ] array of tokens - [ ] array of token arrays -- [ ] `encoding format` -- [ ] `dimensions` +- [x] `encoding format` +- [x] `dimensions` - [ ] `user` ## Models @@ -365,4 +365,4 @@ curl http://localhost:11434/v1/chat/completions \ } ] }' -``` +``` \ No newline at end of file diff --git a/docs/api/streaming.mdx b/docs/api/streaming.mdx new file mode 100644 index 
000000000..ad77f810a --- /dev/null +++ b/docs/api/streaming.mdx @@ -0,0 +1,35 @@ +--- +title: Streaming +--- + +Certain API endpoints stream responses by default, such as `/api/generate`. These responses are provided in the newline-delimited JSON format (i.e. the `application/x-ndjson` content type). For example: + +```json +{"model":"gemma3","created_at":"2025-10-26T17:15:24.097767Z","response":"That","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:15:24.109172Z","response":"'","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:15:24.121485Z","response":"s","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:15:24.132802Z","response":" a","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:15:24.143931Z","response":" fantastic","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:15:24.155176Z","response":" question","done":false} +{"model":"gemma3","created_at":"2025-10-26T17:15:24.166576Z","response":"!","done":true, "done_reason": "stop"} +``` + +## Disabling streaming + +Streaming can be disabled by providing `{"stream": false}` in the request body for any endpoint that supports streaming. 
This will cause responses to be returned in the `application/json` format instead: + +```json +{"model":"gemma3","created_at":"2025-10-26T17:15:24.166576Z","response":"That's a fantastic question!","done":true} +``` + +## When to use streaming vs non-streaming + +**Streaming (default)**: + - Real-time response generation + - Lower perceived latency + - Better for long generations + +**Non-streaming**: + - Simpler to process + - Better for short responses, or structured outputs + - Easier to handle in some applications \ No newline at end of file diff --git a/docs/api/usage.mdx b/docs/api/usage.mdx new file mode 100644 index 000000000..8317ca841 --- /dev/null +++ b/docs/api/usage.mdx @@ -0,0 +1,36 @@ +--- +title: Usage +--- + +Ollama's API responses include metrics that can be used for measuring performance and model usage: + +* `total_duration`: How long the response took to generate +* `load_duration`: How long the model took to load +* `prompt_eval_count`: How many input tokens were processed +* `prompt_eval_duration`: How long it took to evaluate the prompt +* `eval_count`: How many output tokens were processed +* `eval_duration`: How long it took to generate the output tokens + +All timing values are measured in nanoseconds. + +## Example response + +For endpoints that return usage metrics, the response body will include the usage fields. For example, a non-streaming call to `/api/generate` may return the following response: + +```json +{ + "model": "gemma3", + "created_at": "2025-10-17T23:14:07.414671Z", + "response": "Hello! How can I help you today?", + "done": true, + "done_reason": "stop", + "total_duration": 174560334, + "load_duration": 101397084, + "prompt_eval_count": 11, + "prompt_eval_duration": 13074791, + "eval_count": 18, + "eval_duration": 52479709 +} +``` + +For endpoints that return **streaming responses**, usage fields are included as part of the final chunk, where `done` is `true`. 
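The usage fields described in `docs/api/usage.mdx` can be turned into throughput figures. The following is a minimal sketch, not part of the patch itself: it assumes the field names and nanosecond units stated in that file, reuses the values from its example response, and the helper name `tokens_per_second` is hypothetical.

```python
import json

NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Hypothetical helper: convert a token count and a nanosecond duration to tokens/s."""
    if duration_ns <= 0:
        # Guard against a missing or zero duration rather than dividing by zero.
        return 0.0
    return count / (duration_ns / NANOSECONDS_PER_SECOND)


# Usage fields copied from the example response in docs/api/usage.mdx.
response = json.loads("""
{
  "done": true,
  "total_duration": 174560334,
  "load_duration": 101397084,
  "prompt_eval_count": 11,
  "prompt_eval_duration": 13074791,
  "eval_count": 18,
  "eval_duration": 52479709
}
""")

# Prompt processing rate: input tokens divided by prompt evaluation time.
prompt_rate = tokens_per_second(
    response["prompt_eval_count"], response["prompt_eval_duration"]
)
# Generation rate: output tokens divided by generation time.
generation_rate = tokens_per_second(
    response["eval_count"], response["eval_duration"]
)

print(f"prompt: {prompt_rate:.1f} tok/s, generation: {generation_rate:.1f} tok/s")
```

For a streaming response, the same calculation would apply to the final chunk only, since that is where the usage fields appear.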
diff --git a/docs/benchmark.mdx b/docs/benchmark.mdx new file mode 100644 index 000000000..662fbe4dc --- /dev/null +++ b/docs/benchmark.mdx @@ -0,0 +1,71 @@ +--- +title: Benchmark +--- + +Go benchmark tests that measure end-to-end performance of a running Ollama server. Run these tests to evaluate model inference performance on your hardware and measure the impact of code changes. + +## When to use + +Run these benchmarks when: + +- Making changes to the model inference engine +- Modifying model loading/unloading logic +- Changing prompt processing or token generation code +- Implementing a new model architecture +- Testing performance across different hardware setups + +## Prerequisites + +- Ollama server running locally with `ollama serve` on `127.0.0.1:11434` + +## Usage and Examples + + + All commands must be run from the root directory of the Ollama project. + + +Basic syntax: + +```bash +go test -bench=. ./benchmark/... -m $MODEL_NAME +``` + +Required flags: + +- `-bench=.`: Run all benchmarks +- `-m`: Model name to benchmark + +Optional flags: + +- `-count N`: Number of times to run the benchmark (useful for statistical analysis) +- `-timeout T`: Maximum time for the benchmark to run (e.g. "10m" for 10 minutes) + +Common usage patterns: + +Single benchmark run with a model specified: + +```bash +go test -bench=. ./benchmark/... 
-m llama3.3 +``` + +## Output metrics + +The benchmark reports several key metrics: + +- `gen_tok/s`: Generated tokens per second +- `prompt_tok/s`: Prompt processing tokens per second +- `ttft_ms`: Time to first token in milliseconds +- `load_ms`: Model load time in milliseconds +- `gen_tokens`: Total tokens generated +- `prompt_tokens`: Total prompt tokens processed + +Each benchmark runs two scenarios: + +- Cold start: Model is loaded from disk for each test +- Warm start: Model is pre-loaded in memory + +Three prompt lengths are tested for each scenario: + +- Short prompt (100 tokens) +- Medium prompt (500 tokens) +- Long prompt (1000 tokens) diff --git a/docs/capabilities/embeddings.mdx b/docs/capabilities/embeddings.mdx new file mode 100644 index 000000000..99a577489 --- /dev/null +++ b/docs/capabilities/embeddings.mdx @@ -0,0 +1,113 @@ +--- +title: Embeddings +description: Generate text embeddings for semantic search, retrieval, and RAG. +--- + +Embeddings turn text into numeric vectors you can store in a vector database, search with cosine similarity, or use in RAG pipelines. The vector length depends on the model (typically 384–1024 dimensions). + +## Recommended models + +- [embeddinggemma](https://ollama.com/library/embeddinggemma) +- [qwen3-embedding](https://ollama.com/library/qwen3-embedding) +- [all-minilm](https://ollama.com/library/all-minilm) + +## Generate embeddings + +Use `/api/embed` with a single string. + + + + ```shell + curl -X POST http://localhost:11434/api/embed \ + -H "Content-Type: application/json" \ + -d '{ + "model": "embeddinggemma", + "input": "The quick brown fox jumps over the lazy dog." + }' + ``` + + + ```python + import ollama + + single = ollama.embed( + model='embeddinggemma', + input='The quick brown fox jumps over the lazy dog.' 
+ ) + print(len(single['embeddings'][0])) # vector length + ``` + + + ```javascript + import ollama from 'ollama' + + const single = await ollama.embed({ + model: 'embeddinggemma', + input: 'The quick brown fox jumps over the lazy dog.', + }) + console.log(single.embeddings[0].length) // vector length + ``` + + + + + The `/api/embed` endpoint returns L2‑normalized (unit‑length) vectors. + + +## Generate a batch of embeddings + +Pass an array of strings to `input`. + + + + ```shell + curl -X POST http://localhost:11434/api/embed \ + -H "Content-Type: application/json" \ + -d '{ + "model": "embeddinggemma", + "input": [ + "First sentence", + "Second sentence", + "Third sentence" + ] + }' + ``` + + + ```python + import ollama + + batch = ollama.embed( + model='embeddinggemma', + input=[ + 'The quick brown fox jumps over the lazy dog.', + 'The five boxing wizards jump quickly.', + 'Jackdaws love my big sphinx of quartz.', + ] + ) + print(len(batch['embeddings'])) # number of vectors + ``` + + + ```javascript + import ollama from 'ollama' + + const batch = await ollama.embed({ + model: 'embeddinggemma', + input: [ + 'The quick brown fox jumps over the lazy dog.', + 'The five boxing wizards jump quickly.', + 'Jackdaws love my big sphinx of quartz.', + ], + }) + console.log(batch.embeddings.length) // number of vectors + ``` + + + +## Tips + +- Use cosine similarity for most semantic search use cases. +- Use the same embedding model for both indexing and querying. + + diff --git a/docs/capabilities/streaming.mdx b/docs/capabilities/streaming.mdx new file mode 100644 index 000000000..1467afcd8 --- /dev/null +++ b/docs/capabilities/streaming.mdx @@ -0,0 +1,99 @@ +--- +title: Streaming +--- + +Streaming allows you to render text as it is produced by the model. + +Streaming is enabled by default through the REST API, but disabled by default in the SDKs. + +To enable streaming in the SDKs, set the `stream` parameter to `True`. + +## Key streaming concepts +1. 
Chatting: Stream partial assistant messages. Each chunk includes the `content` so you can render messages as they arrive. +1. Thinking: Thinking-capable models emit a `thinking` field alongside regular content in each chunk. Detect this field in streaming chunks to show or hide reasoning traces before the final answer arrives. +1. Tool calling: Watch for streamed `tool_calls` in each chunk, execute the requested tool, and append tool outputs back into the conversation. + +## Handling streamed chunks + + + It is necessary to accumulate the partial fields in order to maintain the history of the conversation. This is particularly important for tool calling where the thinking, tool call from the model, and the executed tool result must be passed back to the model in the next request. + + + + + ```python + from ollama import chat + + stream = chat( + model='qwen3', + messages=[{'role': 'user', 'content': 'What is 17 × 23?'}], + stream=True, + ) + + in_thinking = False + content = '' + thinking = '' + for chunk in stream: + if chunk.message.thinking: + if not in_thinking: + in_thinking = True + print('Thinking:\n', end='', flush=True) + print(chunk.message.thinking, end='', flush=True) + # accumulate the partial thinking + thinking += chunk.message.thinking + elif chunk.message.content: + if in_thinking: + in_thinking = False + print('\n\nAnswer:\n', end='', flush=True) + print(chunk.message.content, end='', flush=True) + # accumulate the partial content + content += chunk.message.content + + # append the accumulated fields to the messages for the next request + new_messages = [{'role': 'assistant', 'thinking': thinking, 'content': content}] + ``` + + + + ```javascript + import ollama from 'ollama' + + async function main() { + const stream = await ollama.chat({ + model: 'qwen3', + messages: [{ role: 'user', content: 'What is 17 × 23?' 
}], + stream: true, + }) + + let inThinking = false + let content = '' + let thinking = '' + + for await (const chunk of stream) { + if (chunk.message.thinking) { + if (!inThinking) { + inThinking = true + process.stdout.write('Thinking:\n') + } + process.stdout.write(chunk.message.thinking) + // accumulate the partial thinking + thinking += chunk.message.thinking + } else if (chunk.message.content) { + if (inThinking) { + inThinking = false + process.stdout.write('\n\nAnswer:\n') + } + process.stdout.write(chunk.message.content) + // accumulate the partial content + content += chunk.message.content + } + } + + // append the accumulated fields to the messages for the next request + const newMessages = [{ role: 'assistant', thinking: thinking, content: content }] + } + + main().catch(console.error) + ``` + + \ No newline at end of file diff --git a/docs/capabilities/structured-outputs.mdx b/docs/capabilities/structured-outputs.mdx new file mode 100644 index 000000000..da74e5970 --- /dev/null +++ b/docs/capabilities/structured-outputs.mdx @@ -0,0 +1,194 @@ +--- +title: Structured Outputs +--- + +Structured outputs let you enforce a JSON schema on model responses so you can reliably extract structured data, describe images, or keep every reply consistent. + +## Generating structured JSON + + + + ```shell + curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "gpt-oss", + "messages": [{"role": "user", "content": "Tell me about Canada in one line"}], + "stream": false, + "format": "json" + }' + ``` + + + ```python + from ollama import chat + + response = chat( + model='gpt-oss', + messages=[{'role': 'user', 'content': 'Tell me about Canada.'}], + format='json' + ) + print(response.message.content) + ``` + + + ```javascript + import ollama from 'ollama' + + const response = await ollama.chat({ + model: 'gpt-oss', + messages: [{ role: 'user', content: 'Tell me about Canada.' 
}], + format: 'json' + }) + console.log(response.message.content) + ``` + + + +## Generating structured JSON with a schema + +Provide a JSON schema to the `format` field. + + + It is ideal to also pass the JSON schema as a string in the prompt to ground the model's response. + + + + + ```shell + curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "gpt-oss", + "messages": [{"role": "user", "content": "Tell me about Canada."}], + "stream": false, + "format": { + "type": "object", + "properties": { + "name": {"type": "string"}, + "capital": {"type": "string"}, + "languages": { + "type": "array", + "items": {"type": "string"} + } + }, + "required": ["name", "capital", "languages"] + } + }' + ``` + + + Use Pydantic models and pass `model_json_schema()` to `format`, then validate the response: + + ```python + from ollama import chat + from pydantic import BaseModel + + class Country(BaseModel): + name: str + capital: str + languages: list[str] + + response = chat( + model='gpt-oss', + messages=[{'role': 'user', 'content': 'Tell me about Canada.'}], + format=Country.model_json_schema(), + ) + + country = Country.model_validate_json(response.message.content) + print(country) + ``` + + + Serialize a Zod schema with `zodToJsonSchema()` and parse the structured response: + + ```javascript + import ollama from 'ollama' + import { z } from 'zod' + import { zodToJsonSchema } from 'zod-to-json-schema' + + const Country = z.object({ + name: z.string(), + capital: z.string(), + languages: z.array(z.string()), + }) + + const response = await ollama.chat({ + model: 'gpt-oss', + messages: [{ role: 'user', content: 'Tell me about Canada.' 
}], + format: zodToJsonSchema(Country), + }) + + const country = Country.parse(JSON.parse(response.message.content)) + console.log(country) + ``` + + + +## Example: Extract structured data + +Define the objects you want returned and let the model populate the fields: + +```python +from ollama import chat +from pydantic import BaseModel + +class Pet(BaseModel): + name: str + animal: str + age: int + color: str | None + favorite_toy: str | None + +class PetList(BaseModel): + pets: list[Pet] + +response = chat( + model='gpt-oss', + messages=[{'role': 'user', 'content': 'I have two cats named Luna and Loki...'}], + format=PetList.model_json_schema(), +) + +pets = PetList.model_validate_json(response.message.content) +print(pets) +``` + +## Example: Vision with structured outputs + +Vision models accept the same `format` parameter, enabling deterministic descriptions of images: + +```python +from ollama import chat +from pydantic import BaseModel +from typing import Literal, Optional + +class Object(BaseModel): + name: str + confidence: float + attributes: str + +class ImageDescription(BaseModel): + summary: str + objects: list[Object] + scene: str + colors: list[str] + time_of_day: Literal['Morning', 'Afternoon', 'Evening', 'Night'] + setting: Literal['Indoor', 'Outdoor', 'Unknown'] + text_content: Optional[str] = None + +response = chat( + model='gemma3', + messages=[{ + 'role': 'user', + 'content': 'Describe this photo and list the objects you detect.', + 'images': ['path/to/image.jpg'], + }], + format=ImageDescription.model_json_schema(), + options={'temperature': 0}, +) + +image_description = ImageDescription.model_validate_json(response.message.content) +print(image_description) +``` + +## Tips for reliable structured outputs + +- Define schemas with Pydantic (Python) or Zod (JavaScript) so they can be reused for validation. +- Lower the temperature (e.g., set it to `0`) for more deterministic completions. 
+- Structured outputs work through the OpenAI-compatible API via `response_format` diff --git a/docs/capabilities/thinking.mdx b/docs/capabilities/thinking.mdx new file mode 100644 index 000000000..388e98582 --- /dev/null +++ b/docs/capabilities/thinking.mdx @@ -0,0 +1,153 @@ +--- +title: Thinking +--- + +Thinking-capable models emit a `thinking` field that separates their reasoning trace from the final answer. + +Use this capability to audit model steps, animate the model *thinking* in a UI, or hide the trace entirely when you only need the final response. + +## Supported models + +- [Qwen 3](https://ollama.com/library/qwen3) +- [GPT-OSS](https://ollama.com/library/gpt-oss) *(use `think` levels: `low`, `medium`, `high` — the trace cannot be fully disabled)* +- [DeepSeek-v3.1](https://ollama.com/library/deepseek-v3.1) +- [DeepSeek R1](https://ollama.com/library/deepseek-r1) +- Browse the latest additions under [thinking models](https://ollama.com/search?c=thinking) + +## Enable thinking in API calls + +Set the `think` field on chat or generate requests. Most models accept booleans (`true`/`false`). + +GPT-OSS instead expects one of `low`, `medium`, or `high` to tune the trace length. + +The `message.thinking` (chat endpoint) or `thinking` (generate endpoint) field contains the reasoning trace while `message.content` / `response` holds the final answer. + + + + ```shell + curl http://localhost:11434/api/chat -d '{ + "model": "qwen3", + "messages": [{ + "role": "user", + "content": "How many letter r are in strawberry?" 
+ }], + "think": true, + "stream": false + }' + ``` + + + ```python + from ollama import chat + + response = chat( + model='qwen3', + messages=[{'role': 'user', 'content': 'How many letter r are in strawberry?'}], + think=True, + stream=False, + ) + + print('Thinking:\n', response.message.thinking) + print('Answer:\n', response.message.content) + ``` + + + ```javascript + import ollama from 'ollama' + + const response = await ollama.chat({ + model: 'deepseek-r1', + messages: [{ role: 'user', content: 'How many letter r are in strawberry?' }], + think: true, + stream: false, + }) + + console.log('Thinking:\n', response.message.thinking) + console.log('Answer:\n', response.message.content) + ``` + + + + + GPT-OSS requires `think` to be set to `"low"`, `"medium"`, or `"high"`. Passing `true`/`false` is ignored for that model. + + +## Stream the reasoning trace + +Thinking streams interleave reasoning tokens before answer tokens. Detect the first `thinking` chunk to render a "thinking" section, then switch to the final reply once `message.content` arrives. + + + + ```python + from ollama import chat + + stream = chat( + model='qwen3', + messages=[{'role': 'user', 'content': 'What is 17 × 23?'}], + think=True, + stream=True, + ) + + in_thinking = False + + for chunk in stream: + if chunk.message.thinking and not in_thinking: + in_thinking = True + print('Thinking:\n', end='') + + if chunk.message.thinking: + print(chunk.message.thinking, end='') + elif chunk.message.content: + if in_thinking: + print('\n\nAnswer:\n', end='') + in_thinking = False + print(chunk.message.content, end='') + + ``` + + + ```javascript + import ollama from 'ollama' + + async function main() { + const stream = await ollama.chat({ + model: 'qwen3', + messages: [{ role: 'user', content: 'What is 17 × 23?' 
}], + think: true, + stream: true, + }) + + let inThinking = false + + for await (const chunk of stream) { + if (chunk.message.thinking && !inThinking) { + inThinking = true + process.stdout.write('Thinking:\n') + } + + if (chunk.message.thinking) { + process.stdout.write(chunk.message.thinking) + } else if (chunk.message.content) { + if (inThinking) { + process.stdout.write('\n\nAnswer:\n') + inThinking = false + } + process.stdout.write(chunk.message.content) + } + } + } + + main() + ``` + + + +## CLI quick reference + +- Enable thinking for a single run: `ollama run deepseek-r1 --think "Where should I visit in Lisbon?"` +- Disable thinking: `ollama run deepseek-r1 --think=false "Summarize this article"` +- Hide the trace while still using a thinking model: `ollama run deepseek-r1 --hidethinking "Is 9.9 bigger or 9.11?"` +- Inside interactive sessions, toggle with `/set think` or `/set nothink`. +- GPT-OSS only accepts levels: `ollama run gpt-oss --think=low "Draft a headline"` (replace `low` with `medium` or `high` as needed). + +Thinking is enabled by default in the CLI and API for supported models. diff --git a/docs/capabilities/tool-calling.mdx b/docs/capabilities/tool-calling.mdx new file mode 100644 index 000000000..ae1ff9599 --- /dev/null +++ b/docs/capabilities/tool-calling.mdx @@ -0,0 +1,777 @@ +--- +title: Tool calling +--- + +Ollama supports tool calling (also known as function calling) which allows a model to invoke tools and incorporate their results into its replies. + +## Calling a single tool +Invoke a single tool and include its response in a follow-up request. + +Also known as "single-shot" tool calling. 
+ + + + ```shell + curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "qwen3", + "messages": [{"role": "user", "content": "What is the temperature in New York?"}], + "stream": false, + "tools": [ + { + "type": "function", + "function": { + "name": "get_temperature", + "description": "Get the current temperature for a city", + "parameters": { + "type": "object", + "required": ["city"], + "properties": { + "city": {"type": "string", "description": "The name of the city"} + } + } + } + } + ] + }' + ``` + + **Generate a response with a single tool result** + ```shell + curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "qwen3", + "messages": [ + {"role": "user", "content": "What is the temperature in New York?"}, + { + "role": "assistant", + "tool_calls": [ + { + "type": "function", + "function": { + "index": 0, + "name": "get_temperature", + "arguments": {"city": "New York"} + } + } + ] + }, + {"role": "tool", "tool_name": "get_temperature", "content": "22°C"} + ], + "stream": false + }' + ``` + + + Install the Ollama Python SDK: + ```bash + # with pip + pip install ollama -U + + # with uv + uv add ollama + ``` + + ```python + from ollama import chat + + def get_temperature(city: str) -> str: + """Get the current temperature for a city + + Args: + city: The name of the city + + Returns: + The current temperature for the city + """ + temperatures = { + "New York": "22°C", + "London": "15°C", + "Tokyo": "18°C", + } + return temperatures.get(city, "Unknown") + + messages = [{"role": "user", "content": "What's the temperature in New York?"}] + + # pass functions directly as tools in the tools list or as a JSON schema + response = chat(model="qwen3", messages=messages, tools=[get_temperature], think=True) + + messages.append(response.message) + if response.message.tool_calls: + # only recommended for models which only return a single tool call + call = response.message.tool_calls[0] + 
result = get_temperature(**call.function.arguments) + # add the tool result to the messages + messages.append({"role": "tool", "tool_name": call.function.name, "content": str(result)}) + + final_response = chat(model="qwen3", messages=messages, tools=[get_temperature], think=True) + print(final_response.message.content) + ``` + + + Install the Ollama JavaScript library: + ```bash + # with npm + npm i ollama + + # with bun + bun i ollama + ``` + + ```typescript + import ollama from 'ollama' + + function getTemperature(city: string): string { + const temperatures: Record<string, string> = { + 'New York': '22°C', + 'London': '15°C', + 'Tokyo': '18°C', + } + return temperatures[city] ?? 'Unknown' + } + + const tools = [ + { + type: 'function', + function: { + name: 'get_temperature', + description: 'Get the current temperature for a city', + parameters: { + type: 'object', + required: ['city'], + properties: { + city: { type: 'string', description: 'The name of the city' }, + }, + }, + }, + }, + ] + + const messages = [{ role: 'user', content: "What's the temperature in New York?" }] + + const response = await ollama.chat({ + model: 'qwen3', + messages, + tools, + think: true, + }) + + messages.push(response.message) + if (response.message.tool_calls?.length) { + // only recommended for models which only return a single tool call + const call = response.message.tool_calls[0] + const args = call.function.arguments as { city: string } + const result = getTemperature(args.city) + // add the tool result to the messages + messages.push({ role: 'tool', tool_name: call.function.name, content: result }) + + // generate the final response + const finalResponse = await ollama.chat({ model: 'qwen3', messages, tools, think: true }) + console.log(finalResponse.message.content) + } + ``` + + + +## Parallel tool calling + + + + Request multiple tool calls in parallel, then send all tool responses back to the model. 
+ + ```shell + curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "qwen3", + "messages": [{"role": "user", "content": "What are the current weather conditions and temperature in New York and London?"}], + "stream": false, + "tools": [ + { + "type": "function", + "function": { + "name": "get_temperature", + "description": "Get the current temperature for a city", + "parameters": { + "type": "object", + "required": ["city"], + "properties": { + "city": {"type": "string", "description": "The name of the city"} + } + } + } + }, + { + "type": "function", + "function": { + "name": "get_conditions", + "description": "Get the current weather conditions for a city", + "parameters": { + "type": "object", + "required": ["city"], + "properties": { + "city": {"type": "string", "description": "The name of the city"} + } + } + } + } + ] + }' + ``` + + **Generate a response with multiple tool results** + ```shell + curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "qwen3", + "messages": [ + {"role": "user", "content": "What are the current weather conditions and temperature in New York and London?"}, + { + "role": "assistant", + "tool_calls": [ + { + "type": "function", + "function": { + "index": 0, + "name": "get_temperature", + "arguments": {"city": "New York"} + } + }, + { + "type": "function", + "function": { + "index": 1, + "name": "get_conditions", + "arguments": {"city": "New York"} + } + }, + { + "type": "function", + "function": { + "index": 2, + "name": "get_temperature", + "arguments": {"city": "London"} + } + }, + { + "type": "function", + "function": { + "index": 3, + "name": "get_conditions", + "arguments": {"city": "London"} + } + } + ] + }, + {"role": "tool", "tool_name": "get_temperature", "content": "22°C"}, + {"role": "tool", "tool_name": "get_conditions", "content": "Partly cloudy"}, + {"role": "tool", "tool_name": "get_temperature", "content": "15°C"}, + {"role": "tool", 
"tool_name": "get_conditions", "content": "Rainy"} + ], + "stream": false + }' + ``` + + + ```python + from ollama import chat + + def get_temperature(city: str) -> str: + """Get the current temperature for a city + + Args: + city: The name of the city + + Returns: + The current temperature for the city + """ + temperatures = { + "New York": "22°C", + "London": "15°C", + "Tokyo": "18°C" + } + return temperatures.get(city, "Unknown") + + def get_conditions(city: str) -> str: + """Get the current weather conditions for a city + + Args: + city: The name of the city + + Returns: + The current weather conditions for the city + """ + conditions = { + "New York": "Partly cloudy", + "London": "Rainy", + "Tokyo": "Sunny" + } + return conditions.get(city, "Unknown") + + + messages = [{'role': 'user', 'content': 'What are the current weather conditions and temperature in New York and London?'}] + + # The python client automatically parses functions as a tool schema so we can pass them directly + # Schemas can be passed directly in the tools list as well + response = chat(model='qwen3', messages=messages, tools=[get_temperature, get_conditions], think=True) + + # add the assistant message to the messages + messages.append(response.message) + if response.message.tool_calls: + # process each tool call + for call in response.message.tool_calls: + # execute the appropriate tool + if call.function.name == 'get_temperature': + result = get_temperature(**call.function.arguments) + elif call.function.name == 'get_conditions': + result = get_conditions(**call.function.arguments) + else: + result = 'Unknown tool' + # add the tool result to the messages + messages.append({'role': 'tool', 'tool_name': call.function.name, 'content': str(result)}) + + # generate the final response + final_response = chat(model='qwen3', messages=messages, tools=[get_temperature, get_conditions], think=True) + print(final_response.message.content) + ``` + + + ```typescript + import ollama from 'ollama' + + 
function getTemperature(city: string): string { + const temperatures: { [key: string]: string } = { + "New York": "22°C", + "London": "15°C", + "Tokyo": "18°C" + } + return temperatures[city] || "Unknown" + } + + function getConditions(city: string): string { + const conditions: { [key: string]: string } = { + "New York": "Partly cloudy", + "London": "Rainy", + "Tokyo": "Sunny" + } + return conditions[city] || "Unknown" + } + + const tools = [ + { + type: 'function', + function: { + name: 'get_temperature', + description: 'Get the current temperature for a city', + parameters: { + type: 'object', + required: ['city'], + properties: { + city: { type: 'string', description: 'The name of the city' }, + }, + }, + }, + }, + { + type: 'function', + function: { + name: 'get_conditions', + description: 'Get the current weather conditions for a city', + parameters: { + type: 'object', + required: ['city'], + properties: { + city: { type: 'string', description: 'The name of the city' }, + }, + }, + }, + } + ] + + const messages = [{ role: 'user', content: 'What are the current weather conditions and temperature in New York and London?' 
}] + + const response = await ollama.chat({ + model: 'qwen3', + messages, + tools, + think: true + }) + + // add the assistant message to the messages + messages.push(response.message) + if (response.message.tool_calls) { + // process each tool call + for (const call of response.message.tool_calls) { + // execute the appropriate tool + let result: string + if (call.function.name === 'get_temperature') { + const args = call.function.arguments as { city: string } + result = getTemperature(args.city) + } else if (call.function.name === 'get_conditions') { + const args = call.function.arguments as { city: string } + result = getConditions(args.city) + } else { + result = 'Unknown tool' + } + // add the tool result to the messages + messages.push({ role: 'tool', tool_name: call.function.name, content: result }) + } + + // generate the final response + const finalResponse = await ollama.chat({ model: 'qwen3', messages, tools, think: true }) + console.log(finalResponse.message.content) + } + ``` + + + + +## Multi-turn tool calling (Agent loop) + +An agent loop allows the model to decide when to invoke tools and incorporate their results into its replies. + +It also might help to tell the model that it is in a loop and can make multiple tool calls. 
+ + + ```python + from ollama import chat, ChatResponse + + + def add(a: int, b: int) -> int: + """Add two numbers + + Args: + a: The first number + b: The second number + + Returns: + The sum of the two numbers + """ + return a + b + + + def multiply(a: int, b: int) -> int: + """Multiply two numbers + + Args: + a: The first number + b: The second number + + Returns: + The product of the two numbers + """ + return a * b + + + available_functions = { + 'add': add, + 'multiply': multiply, + } + + messages = [{'role': 'user', 'content': 'What is (11434+12341)*412?'}] + while True: + response: ChatResponse = chat( + model='qwen3', + messages=messages, + tools=[add, multiply], + think=True, + ) + messages.append(response.message) + print("Thinking: ", response.message.thinking) + print("Content: ", response.message.content) + if response.message.tool_calls: + for tc in response.message.tool_calls: + if tc.function.name in available_functions: + print(f"Calling {tc.function.name} with arguments {tc.function.arguments}") + result = available_functions[tc.function.name](**tc.function.arguments) + print(f"Result: {result}") + # add the tool result to the messages + messages.append({'role': 'tool', 'tool_name': tc.function.name, 'content': str(result)}) + else: + # end the loop when there are no more tool calls + break + # continue the loop with the updated messages + ``` + + + ```typescript + import ollama from 'ollama' + + type ToolName = 'add' | 'multiply' + + function add(a: number, b: number): number { + return a + b + } + + function multiply(a: number, b: number): number { + return a * b + } + + const availableFunctions: Record<ToolName, (a: number, b: number) => number> = { + add, + multiply, + } + + const tools = [ + { + type: 'function', + function: { + name: 'add', + description: 'Add two numbers', + parameters: { + type: 'object', + required: ['a', 'b'], + properties: { + a: { type: 'integer', description: 'The first number' }, + b: { type: 'integer', description: 'The 
second number' 
}, + }, + }, + }, + }, + { + type: 'function', + function: { + name: 'multiply', + description: 'Multiply two numbers', + parameters: { + type: 'object', + required: ['a', 'b'], + properties: { + a: { type: 'integer', description: 'The first number' }, + b: { type: 'integer', description: 'The second number' }, + }, + }, + }, + }, + ] + + async function agentLoop() { + const messages = [{ role: 'user', content: 'What is (11434+12341)*412?' }] + + while (true) { + const response = await ollama.chat({ + model: 'qwen3', + messages, + tools, + think: true, + }) + + messages.push(response.message) + console.log('Thinking:', response.message.thinking) + console.log('Content:', response.message.content) + + const toolCalls = response.message.tool_calls ?? [] + if (toolCalls.length) { + for (const call of toolCalls) { + const fn = availableFunctions[call.function.name as ToolName] + if (!fn) { + continue + } + + const args = call.function.arguments as { a: number; b: number } + console.log(`Calling ${call.function.name} with arguments`, args) + const result = fn(args.a, args.b) + console.log(`Result: ${result}`) + messages.push({ role: 'tool', tool_name: call.function.name, content: String(result) }) + } + } else { + break + } + } + } + + agentLoop().catch(console.error) + ``` + + + + +## Tool calling with streaming + +When streaming, gather every chunk of `thinking`, `content`, and `tool_calls`, then return those fields together with any tool results in the follow-up request. 
+ + + +```python +from ollama import chat + + +def get_temperature(city: str) -> str: + """Get the current temperature for a city + + Args: + city: The name of the city + + Returns: + The current temperature for the city + """ + temperatures = { + 'New York': '22°C', + 'London': '15°C', + } + return temperatures.get(city, 'Unknown') + + +messages = [{'role': 'user', 'content': "What's the temperature in New York?"}] + +while True: + stream = chat( + model='qwen3', + messages=messages, + tools=[get_temperature], + stream=True, + think=True, + ) + + thinking = '' + content = '' + tool_calls = [] + + done_thinking = False + # accumulate the partial fields + for chunk in stream: + if chunk.message.thinking: + thinking += chunk.message.thinking + print(chunk.message.thinking, end='', flush=True) + if chunk.message.content: + if not done_thinking: + done_thinking = True + print('\n') + content += chunk.message.content + print(chunk.message.content, end='', flush=True) + if chunk.message.tool_calls: + tool_calls.extend(chunk.message.tool_calls) + print(chunk.message.tool_calls) + + # append accumulated fields to the messages + if thinking or content or tool_calls: + messages.append({'role': 'assistant', 'thinking': thinking, 'content': content, 'tool_calls': tool_calls}) + + if not tool_calls: + break + + for call in tool_calls: + if call.function.name == 'get_temperature': + result = get_temperature(**call.function.arguments) + else: + result = 'Unknown tool' + messages.append({'role': 'tool', 'tool_name': call.function.name, 'content': result}) +``` + + + +```typescript +import ollama from 'ollama' + +function getTemperature(city: string): string { + const temperatures: Record<string, string> = { + 'New York': '22°C', + 'London': '15°C', + } + return temperatures[city] ?? 
'Unknown' +} + +const getTemperatureTool = { + type: 'function', + function: { + name: 'get_temperature', + description: 'Get the current temperature for a city', + parameters: { + type: 'object', + required: ['city'], + properties: { + city: { type: 'string', description: 'The name of the city' }, + }, + }, + }, +} + +async function agentLoop() { + const messages = [{ role: 'user', content: "What's the temperature in New York?" }] + + while (true) { + const stream = await ollama.chat({ + model: 'qwen3', + messages, + tools: [getTemperatureTool], + stream: true, + think: true, + }) + + let thinking = '' + let content = '' + const toolCalls: any[] = [] + let doneThinking = false + + for await (const chunk of stream) { + if (chunk.message.thinking) { + thinking += chunk.message.thinking + process.stdout.write(chunk.message.thinking) + } + if (chunk.message.content) { + if (!doneThinking) { + doneThinking = true + process.stdout.write('\n') + } + content += chunk.message.content + process.stdout.write(chunk.message.content) + } + if (chunk.message.tool_calls?.length) { + toolCalls.push(...chunk.message.tool_calls) + console.log(chunk.message.tool_calls) + } + } + + if (thinking || content || toolCalls.length) { + messages.push({ role: 'assistant', thinking, content, tool_calls: toolCalls } as any) + } + + if (!toolCalls.length) { + break + } + + for (const call of toolCalls) { + if (call.function.name === 'get_temperature') { + const args = call.function.arguments as { city: string } + const result = getTemperature(args.city) + messages.push({ role: 'tool', tool_name: call.function.name, content: result } ) + } else { + messages.push({ role: 'tool', tool_name: call.function.name, content: 'Unknown tool' } ) + } + } + } +} + +agentLoop().catch(console.error) + ``` + + + +This loop streams the assistant response, accumulates partial fields, passes them back together, and appends the tool results so the model can complete its answer. 
+ + +## Using functions as tools with Ollama Python SDK +The Python SDK automatically parses functions as a tool schema so we can pass them directly. +Schemas can still be passed if needed. + +```python +from ollama import chat + +def get_temperature(city: str) -> str: + """Get the current temperature for a city + + Args: + city: The name of the city + + Returns: + The current temperature for the city + """ + temperatures = { + 'New York': '22°C', + 'London': '15°C', + } + return temperatures.get(city, 'Unknown') + +available_functions = { + 'get_temperature': get_temperature, +} +# directly pass the function as part of the tools list +response = chat(model='qwen3', messages=messages, tools=available_functions.values(), think=True) +``` diff --git a/docs/capabilities/vision.mdx b/docs/capabilities/vision.mdx new file mode 100644 index 000000000..3342eae25 --- /dev/null +++ b/docs/capabilities/vision.mdx @@ -0,0 +1,85 @@ +--- +title: Vision +--- + +Vision models accept images alongside text so the model can describe, classify, and answer questions about what it sees. + +## Quick start + +```shell +ollama run gemma3 ./image.png whats in this image? +``` + + +## Usage with Ollama's API +Provide an `images` array. SDKs accept file paths, URLs or raw bytes while the REST API expects base64-encoded image data. + + + + + ```shell + # 1. Download a sample image + curl -L -o test.jpg "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg" + + # 2. Encode the image + IMG=$(base64 < test.jpg | tr -d '\n') + + # 3. 
Send it to Ollama + curl -X POST http://localhost:11434/api/chat \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gemma3", + "messages": [{ + "role": "user", + "content": "What is in this image?", + "images": ["'"$IMG"'"] + }], + "stream": false + }' + ``` + + + ```python + from ollama import chat + # from pathlib import Path + + # Pass in the path to the image + path = input('Please enter the path to the image: ') + + # You can also pass in base64 encoded image data + # img = base64.b64encode(Path(path).read_bytes()).decode() + # or the raw bytes + # img = Path(path).read_bytes() + + response = chat( + model='gemma3', + messages=[ + { + 'role': 'user', + 'content': 'What is in this image? Be concise.', + 'images': [path], + } + ], + ) + + print(response.message.content) + ``` + + + ```javascript + import ollama from 'ollama' + + const imagePath = '/absolute/path/to/image.jpg' + const response = await ollama.chat({ + model: 'gemma3', + messages: [ + { role: 'user', content: 'What is in this image?', images: [imagePath] } + ], + stream: false, + }) + + console.log(response.message.content) + ``` + + diff --git a/docs/capabilities/web-search.mdx b/docs/capabilities/web-search.mdx new file mode 100644 index 000000000..641ef3812 --- /dev/null +++ b/docs/capabilities/web-search.mdx @@ -0,0 +1,360 @@ +--- +title: Web search +--- + +Ollama's web search API can be used to augment models with the latest information to reduce hallucinations and improve accuracy. + +Web search is provided as a REST API with deeper tool integrations in the Python and JavaScript libraries. This also enables models like OpenAI’s gpt-oss models to conduct long-running research tasks. + +## Authentication + +For access to Ollama's web search API, create an [API key](https://ollama.com/settings/keys). A free Ollama account is required. + +## Web search API + +Performs a web search for a single query and returns relevant results. 
+ +### Request + +`POST https://ollama.com/api/web_search` + +- `query` (string, required): the search query string +- `max_results` (integer, optional): maximum results to return (default 5, max 10) + +### Response + +Returns an object containing: + +- `results` (array): array of search result objects, each containing: + - `title` (string): the title of the web page + - `url` (string): the URL of the web page + - `content` (string): relevant content snippet from the web page + +### Examples + + + Ensure OLLAMA_API_KEY is set or it must be passed in the Authorization header. + + +#### cURL Request + +```bash +curl https://ollama.com/api/web_search \ + --header "Authorization: Bearer $OLLAMA_API_KEY" \ + -d '{ + "query":"what is ollama?" + }' +``` + +**Response** + +```json +{ + "results": [ + { + "title": "Ollama", + "url": "https://ollama.com/", + "content": "Cloud models are now available..." + }, + { + "title": "What is Ollama? Introduction to the AI model management tool", + "url": "https://www.hostinger.com/tutorials/what-is-ollama", + "content": "Ariffud M. 6min Read..." + }, + { + "title": "Ollama Explained: Transforming AI Accessibility and Language ...", + "url": "https://www.geeksforgeeks.org/artificial-intelligence/ollama-explained-transforming-ai-accessibility-and-language-processing/", + "content": "Data Science Data Science Projects Data Analysis..." + } + ] +} +``` + +#### Python library + +```python +import ollama +response = ollama.web_search("What is Ollama?") +print(response) +``` + +**Example output** + +```python + +results = [ + { + "title": "Ollama", + "url": "https://ollama.com/", + "content": "Cloud models are now available in Ollama..." + }, + { + "title": "What is Ollama? Features, Pricing, and Use Cases - Walturn", + "url": "https://www.walturn.com/insights/what-is-ollama-features-pricing-and-use-cases", + "content": "Our services..." 
+ }, + { + "title": "Complete Ollama Guide: Installation, Usage & Code Examples", + "url": "https://collabnix.com/complete-ollama-guide-installation-usage-code-examples", + "content": "Join our Discord Server..." + } +] + +``` + +More Ollama [Python example](https://github.com/ollama/ollama-python/blob/main/examples/web-search.py) + +#### JavaScript Library + +```tsx +import { Ollama } from "ollama"; + +const client = new Ollama(); +const results = await client.webSearch({ query: "what is ollama?" }); +console.log(JSON.stringify(results, null, 2)); +``` + +**Example output** + +```json +{ + "results": [ + { + "title": "Ollama", + "url": "https://ollama.com/", + "content": "Cloud models are now available..." + }, + { + "title": "What is Ollama? Introduction to the AI model management tool", + "url": "https://www.hostinger.com/tutorials/what-is-ollama", + "content": "Ollama is an open-source tool..." + }, + { + "title": "Ollama Explained: Transforming AI Accessibility and Language Processing", + "url": "https://www.geeksforgeeks.org/artificial-intelligence/ollama-explained-transforming-ai-accessibility-and-language-processing/", + "content": "Ollama is a groundbreaking..." + } + ] +} +``` + +More Ollama [JavaScript example](https://github.com/ollama/ollama-js/blob/main/examples/websearch/websearch-tools.ts) + +## Web fetch API + +Fetches a single web page by URL and returns its content. 
+ +### Request + +`POST https://ollama.com/api/web_fetch` + +- `url` (string, required): the URL to fetch + +### Response + +Returns an object containing: + +- `title` (string): the title of the web page +- `content` (string): the main content of the web page +- `links` (array): array of links found on the page + +### Examples + +#### cURL Request + +```bash +curl --request POST \ + --url https://ollama.com/api/web_fetch \ + --header "Authorization: Bearer $OLLAMA_API_KEY" \ + --header 'Content-Type: application/json' \ + --data '{ + "url": "ollama.com" + }' +``` + +**Response** + +```json +{ + "title": "Ollama", + "content": "[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama...", + "links": [ + "http://ollama.com/", + "http://ollama.com/models", + "https://github.com/ollama/ollama" + ] +} +``` + +#### Python SDK + +```python +from ollama import web_fetch + +result = web_fetch('https://ollama.com') +print(result) +``` + +**Result** + +```python +WebFetchResponse( + title='Ollama', + content='[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama\n\n**Chat & build +with open models**\n\n[Download](https://ollama.com/download) [Explore +models](https://ollama.com/models)\n\nAvailable for macOS, Windows, and Linux', + links=['https://ollama.com/', 'https://ollama.com/models', 'https://github.com/ollama/ollama'] +) +``` + +#### JavaScript SDK + +```tsx +import { Ollama } from "ollama"; + +const client = new Ollama(); +const fetchResult = await client.webFetch({ url: "https://ollama.com" }); +console.log(JSON.stringify(fetchResult, null, 2)); +``` + +**Result** + +```json +{ + "title": "Ollama", + "content": "[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama...", + "links": [ + "https://ollama.com/", + "https://ollama.com/models", + "https://github.com/ollama/ollama" + ] +} +``` + +## Building a search agent + +Use Ollama’s web search API as a tool to build a mini search agent. 
+ +This example uses Alibaba’s Qwen 3 model with 4B parameters. + +```bash +ollama pull qwen3:4b +``` + +```python +from ollama import chat, web_fetch, web_search + +available_tools = {'web_search': web_search, 'web_fetch': web_fetch} + +messages = [{'role': 'user', 'content': "what is ollama's new engine"}] + +while True: + response = chat( + model='qwen3:4b', + messages=messages, + tools=[web_search, web_fetch], + think=True + ) + if response.message.thinking: + print('Thinking: ', response.message.thinking) + if response.message.content: + print('Content: ', response.message.content) + messages.append(response.message) + if response.message.tool_calls: + print('Tool calls: ', response.message.tool_calls) + for tool_call in response.message.tool_calls: + function_to_call = available_tools.get(tool_call.function.name) + if function_to_call: + args = tool_call.function.arguments + result = function_to_call(**args) + print('Result: ', str(result)[:200]+'...') + # Result is truncated for limited context lengths + messages.append({'role': 'tool', 'content': str(result)[:2000 * 4], 'tool_name': tool_call.function.name}) + else: + messages.append({'role': 'tool', 'content': f'Tool {tool_call.function.name} not found', 'tool_name': tool_call.function.name}) + else: + break +``` + +**Result** + +``` +Thinking: Okay, the user is asking about Ollama's new engine. I need to figure out what they're referring to. Ollama is a company that develops large language models, so maybe they've released a new model or an updated version of their existing engine.... + +Tool calls: [ToolCall(function=Function(name='web_search', arguments={'max_results': 3, 'query': 'Ollama new engine'}))] +Result: results=[WebSearchResult(content='# New model scheduling\n\n## September 23, 2025\n\nOllama now includes a significantly improved model scheduling system. Ahead of running a model, Ollama’s new engine + +Thinking: Okay, the user asked about Ollama's new engine. 
Let me look at the search results. + +First result is from September 23, 2025, talking about new model scheduling. It mentions improved memory management, reduced crashes, better GPU utilization, and multi-GPU performance. Examples show speed improvements and accurate memory reporting. Supported models include gemma3, llama4, qwen3, etc... + +Content: Ollama has introduced two key updates to its engine, both released in 2025: + +1. **Enhanced Model Scheduling (September 23, 2025)** + - **Precision Memory Management**: Exact memory allocation reduces out-of-memory crashes and optimizes GPU utilization. + - **Performance Gains**: Examples show significant speed improvements (e.g., 85.54 tokens/s vs 52.02 tokens/s) and full GPU layer utilization. + - **Multi-GPU Support**: Improved efficiency across multiple GPUs, with accurate memory reporting via tools like `nvidia-smi`. + - **Supported Models**: Includes `gemma3`, `llama4`, `qwen3`, `mistral-small3.2`, and more. + +2. **Multimodal Engine (May 15, 2025)** + - **Vision Support**: First-class support for vision models, including `llama4:scout` (109B parameters), `gemma3`, `qwen2.5vl`, and `mistral-small3.1`. + - **Multimodal Tasks**: Examples include identifying animals in multiple images, answering location-based questions from videos, and document scanning. + +These updates highlight Ollama's focus on efficiency, performance, and expanded capabilities for both text and vision tasks. +``` + +### Context length and agents + +Web search results can return thousands of tokens. It is recommended to increase the context length of the model to at least ~32000 tokens. Search agents work best with full context length. [Ollama's cloud models](https://docs.ollama.com/cloud) run at the full context length. + +## MCP Server + +You can enable web search in any MCP client through the [Python MCP server](https://github.com/ollama/ollama-python/blob/main/examples/web-search-mcp.py). 
+ +### Cline + +Ollama's web search can be integrated with Cline easily using the MCP server configuration. + +`Manage MCP Servers` > `Configure MCP Servers` > Add the following configuration: + +```json +{ + "mcpServers": { + "web_search_and_fetch": { + "type": "stdio", + "command": "uv", + "args": ["run", "path/to/web-search-mcp.py"], + "env": { "OLLAMA_API_KEY": "your_api_key_here" } + } + } +} +``` + +![Cline MCP Configuration](/images/cline-mcp.png) + +### Codex + +Ollama works well with OpenAI's Codex tool. + +Add the following configuration to `~/.codex/config.toml`: + +```toml +[mcp_servers.web_search] +command = "uv" +args = ["run", "path/to/web-search-mcp.py"] +env = { "OLLAMA_API_KEY" = "your_api_key_here" } +``` + +![Codex MCP Configuration](/images/codex-mcp.png) + +### Goose + +Ollama can integrate with Goose via its MCP feature. + +![Goose MCP Configuration 1](/images/goose-mcp-1.png) + +![Goose MCP Configuration 2](/images/goose-mcp-2.png) + +### Other integrations + +Ollama can be integrated into most available tools through Ollama's API, the Python and JavaScript libraries, the OpenAI-compatible API, or the MCP server. diff --git a/docs/cli.mdx b/docs/cli.mdx new file mode 100644 index 000000000..3081838f9 --- /dev/null +++ b/docs/cli.mdx @@ -0,0 +1,91 @@ +--- +title: CLI Reference +--- + +### Run a model + +``` +ollama run gemma3 +``` + +#### Multiline input + +For multiline input, you can wrap text with `"""`: + +``` +>>> """Hello, +... world! +... """ +I'm a basic program that prints the famous "Hello, world!" message to the console. +``` + +#### Multimodal models + +``` +ollama run gemma3 "What's in this image? 
/Users/jmorgan/Desktop/smile.png" +``` + +### Download a model + +``` +ollama pull gemma3 +``` + +### Remove a model + +``` +ollama rm gemma3 +``` + +### List models + +``` +ollama ls +``` + +### Sign in to Ollama + +``` +ollama signin +``` + +### Sign out of Ollama + +``` +ollama signout +``` + +### Create a customized model + +First, create a `Modelfile`: + +``` +FROM gemma3 +SYSTEM """You are a happy cat.""" +``` + +Then run `ollama create` with a name for the new model (`mymodel` here is a placeholder): + +``` +ollama create mymodel -f Modelfile +``` + +### List running models + +``` +ollama ps +``` + +### Stop a running model + +``` +ollama stop gemma3 +``` + +### Start Ollama + +``` +ollama serve +``` + +To view the list of environment variables that can be set, run `ollama serve --help` diff --git a/docs/cloud.mdx b/docs/cloud.mdx index 300e6f5e0..654420e44 100644 --- a/docs/cloud.mdx +++ b/docs/cloud.mdx @@ -1,19 +1,33 @@ -# Cloud +--- +title: Cloud +sidebarTitle: Cloud +--- -| Ollama's cloud is currently in preview. For full documentation, see [Ollama's documentation](https://docs.ollama.com/cloud). +Ollama's cloud is currently in preview. ## Cloud Models -[Cloud models](https://ollama.com/cloud) are a new kind of model in Ollama that can run without a powerful GPU. Instead, cloud models are automatically offloaded to Ollama's cloud while offering the same capabilities as local models, making it possible to keep using your local tools while running larger models that wouldn’t fit on a personal computer. +Ollama's cloud models are a new kind of model in Ollama that can run without a powerful GPU. Instead, cloud models are automatically offloaded to Ollama's cloud service while offering the same capabilities as local models, making it possible to keep using your local tools while running larger models that wouldn't fit on a personal computer. 
Ollama currently supports the following cloud models, with more coming soon: +- `deepseek-v3.1:671b-cloud` - `gpt-oss:20b-cloud` - `gpt-oss:120b-cloud` -- `deepseek-v3.1:671b-cloud` +- `kimi-k2:1t-cloud` - `qwen3-coder:480b-cloud` +- `glm-4.6:cloud` -### Get started +### Running Cloud models + +Ollama's cloud models require an account on [ollama.com](https://ollama.com). To sign in or create an account, run: + +``` +ollama signin +``` + + + To run a cloud model, open the terminal and run: @@ -21,20 +35,201 @@ To run a cloud model, open the terminal and run: ollama run gpt-oss:120b-cloud ``` -To run cloud models with integrations that work with Ollama, first download the cloud model: + + + +First, pull a cloud model so it can be accessed: ``` -ollama pull qwen3-coder:480b-cloud +ollama pull gpt-oss:120b-cloud ``` -Then sign in to Ollama: +Next, install [Ollama's Python library](https://github.com/ollama/ollama-python): ``` -ollama signin +pip install ollama ``` -Finally, access the model using the model name `qwen3-coder:480b-cloud` via Ollama's local API or tooling. 
+Next, create and run a simple Python script: + +```python +from ollama import Client + +client = Client() + +messages = [ + { + 'role': 'user', + 'content': 'Why is the sky blue?', + }, +] + +for part in client.chat('gpt-oss:120b-cloud', messages=messages, stream=True): + print(part['message']['content'], end='', flush=True) +``` + + + + +First, pull a cloud model so it can be accessed: + +``` +ollama pull gpt-oss:120b-cloud +``` + +Next, install [Ollama's JavaScript library](https://github.com/ollama/ollama-js): + +``` +npm i ollama +``` + +Then use the library to run a cloud model: + +```typescript +import { Ollama } from "ollama"; + +const ollama = new Ollama(); + +const response = await ollama.chat({ + model: "gpt-oss:120b-cloud", + messages: [{ role: "user", content: "Explain quantum computing" }], + stream: true, +}); + +for await (const part of response) { + process.stdout.write(part.message.content); +} +``` + + + + +First, pull a cloud model so it can be accessed: + +``` +ollama pull gpt-oss:120b-cloud +``` + +Run the following cURL command to run the model via Ollama's API: + +``` +curl http://localhost:11434/api/chat -d '{ + "model": "gpt-oss:120b-cloud", + "messages": [{ + "role": "user", + "content": "Why is the sky blue?" + }], + "stream": false +}' +``` + + + ## Cloud API access -Cloud models can also be accessed directly on ollama.com's API. For more information, see the [docs](https://docs.ollama.com/cloud). +Cloud models can also be accessed directly on ollama.com's API. In this mode, ollama.com acts as a remote Ollama host. + +### Authentication + +For direct access to ollama.com's API, first create an [API key](https://ollama.com/settings/keys). + +Then, set the `OLLAMA_API_KEY` environment variable to your API key. 
+ +``` +export OLLAMA_API_KEY=your_api_key +``` + +### Listing models + +For models available directly via Ollama's API, models can be listed via: + +``` +curl https://ollama.com/api/tags +``` + +### Generating a response + + + + +First, install [Ollama's Python library](https://github.com/ollama/ollama-python) + +``` +pip install ollama +``` + +Then make a request + +```python +import os +from ollama import Client + +client = Client( + host="https://ollama.com", + headers={'Authorization': 'Bearer ' + os.environ.get('OLLAMA_API_KEY')} +) + +messages = [ + { + 'role': 'user', + 'content': 'Why is the sky blue?', + }, +] + +for part in client.chat('gpt-oss:120b', messages=messages, stream=True): + print(part['message']['content'], end='', flush=True) +``` + + + + +First, install [Ollama's JavaScript library](https://github.com/ollama/ollama-js): + +``` +npm i ollama +``` + +Next, make a request to the model: + +```typescript +import { Ollama } from "ollama"; + +const ollama = new Ollama({ + host: "https://ollama.com", + headers: { + Authorization: "Bearer " + process.env.OLLAMA_API_KEY, + }, +}); + +const response = await ollama.chat({ + model: "gpt-oss:120b", + messages: [{ role: "user", content: "Explain quantum computing" }], + stream: true, +}); + +for await (const part of response) { + process.stdout.write(part.message.content); +} +``` + + + + +Generate a response via Ollama's chat API: + +``` +curl https://ollama.com/api/chat \ + -H "Authorization: Bearer $OLLAMA_API_KEY" \ + -d '{ + "model": "gpt-oss:120b", + "messages": [{ + "role": "user", + "content": "Why is the sky blue?" + }], + "stream": false + }' +``` + + + diff --git a/docs/context-length.mdx b/docs/context-length.mdx new file mode 100644 index 000000000..43bcf0d31 --- /dev/null +++ b/docs/context-length.mdx @@ -0,0 +1,38 @@ +--- +title: Context length +--- + +Context length is the maximum number of tokens that the model has access to in memory. 
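To illustrate where this value enters a request, here is a minimal sketch that assumes the `num_ctx` field inside the `options` object of an `/api/generate` request body. `build_generate_payload` is a hypothetical helper name; it only builds the JSON payload and does not contact a server.

```python
import json


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build an /api/generate request body that asks for a given context length."""
    if num_ctx <= 0:
        raise ValueError('num_ctx must be a positive number of tokens')
    return {
        'model': model,
        'prompt': prompt,
        'stream': False,
        # num_ctx is the per-request context length, in tokens
        'options': {'num_ctx': num_ctx},
    }


payload = build_generate_payload('gemma3', 'Why is the sky blue?', 32000)
print(json.dumps(payload, indent=2))
```

The printed JSON could be sent as the body of a `curl` or SDK request; a larger `num_ctx` increases the memory needed to run the model.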
+ + + The default context length in Ollama is 4096 tokens. + + +Tasks which require large context like web search, agents, and coding tools should be set to at least 32000 tokens. + +## Setting context length + +Setting a larger context length will increase the amount of memory required to run a model. Ensure you have enough VRAM available to increase the context length. + +Cloud models are set to their maximum context length by default. + +### App + +Change the slider in the Ollama app under settings to your desired context length. +![Context length in Ollama app](./images/ollama-settings.png) + +### CLI +If editing the context length for Ollama is not possible, the context length can also be updated when serving Ollama. +``` +OLLAMA_CONTEXT_LENGTH=32000 ollama serve +``` + +### Check allocated context length and model offloading +For best performance, use the maximum context length for a model, and avoid offloading the model to CPU. Verify the split under `PROCESSOR` using `ollama ps`. +``` +ollama ps +``` +``` +NAME ID SIZE PROCESSOR CONTEXT UNTIL +gemma3:latest a2af6cc3eb7f 6.6 GB 100% GPU 65536 2 minutes from now +``` diff --git a/docs/docker.mdx b/docs/docker.mdx index dce090a27..22d2bc339 100644 --- a/docs/docker.mdx +++ b/docs/docker.mdx @@ -1,21 +1,21 @@ -# Ollama Docker image - -### CPU only +## CPU only ```shell docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama ``` -### Nvidia GPU +## Nvidia GPU + Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation). -#### Install with Apt +### Install with Apt + 1. 
Configure the repository ```shell curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \ | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \ + curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \ | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \ | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list sudo apt-get update @@ -27,37 +27,40 @@ Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud- sudo apt-get install -y nvidia-container-toolkit ``` -#### Install with Yum or Dnf +### Install with Yum or Dnf + 1. Configure the repository ```shell - curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \ + curl -fsSL https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \ | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo ``` -2. Install the NVIDIA Container Toolkit packages +2. Install the NVIDIA Container Toolkit packages ```shell sudo yum install -y nvidia-container-toolkit ``` -#### Configure Docker to use Nvidia driver +### Configure Docker to use Nvidia driver ```shell sudo nvidia-ctk runtime configure --runtime=docker sudo systemctl restart docker ``` -#### Start the container +### Start the container ```shell docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama ``` -> [!NOTE] -> If you're running on an NVIDIA JetPack system, Ollama can't automatically discover the correct JetPack version. Pass the environment variable JETSON_JETPACK=5 or JETSON_JETPACK=6 to the container to select version 5 or 6. + + If you're running on an NVIDIA JetPack system, Ollama can't automatically discover the correct JetPack version. 
+ Pass the environment variable `JETSON_JETPACK=5` or `JETSON_JETPACK=6` to the container to select version 5 or 6. + -### AMD GPU +## AMD GPU To run Ollama using Docker with AMD GPUs, use the `rocm` tag and the following command: @@ -65,7 +68,7 @@ To run Ollama using Docker with AMD GPUs, use the `rocm` tag and the following c docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm ``` -### Run model locally +## Run model locally Now you can run a model: @@ -73,6 +76,6 @@ Now you can run a model: docker exec -it ollama ollama run llama3.2 ``` -### Try different models +## Try different models More models can be found on the [Ollama library](https://ollama.com/library). diff --git a/docs/docs.json b/docs/docs.json new file mode 100644 index 000000000..18db6ed55 --- /dev/null +++ b/docs/docs.json @@ -0,0 +1,155 @@ +{ + "$schema": "https://mintlify.com/docs.json", + "name": "Ollama", + "colors": { + "primary": "#000", + "light": "#b5b5b5", + "dark": "#000" + }, + "favicon": "/images/favicon.png", + "logo": { + "light": "/images/logo.png", + "dark": "/images/logo-dark.png", + "href": "https://ollama.com" + }, + "theme": "maple", + "background": { + "color": { + "light": "#ffffff", + "dark": "#000000" + } + }, + "fonts": { + "family": "system-ui", + "heading": { + "family": "system-ui" + }, + "body": { + "family": "system-ui" + } + }, + "styling": { + "codeblocks": "system" + }, + "contextual": { + "options": ["copy"] + }, + "navbar": { + "links": [ + { + "label": "Sign in", + "href": "https://ollama.com/signin" + } + ], + "primary": { + "type": "button", + "label": "Download", + "href": "https://ollama.com/download" + } + }, + "api": { + "playground": { + "display": "simple" + }, + "examples": { + "languages": ["curl"] + } + }, + "redirects": [ + { + "source": "/openai", + "destination": "/api/openai" + } + ], + "navigation": { + "tabs": [ + { + "tab": "Documentation", + "groups": [ + { + "group": "Get 
started", + "pages": [ + "index", + "quickstart", + "/cloud" + ] + }, + { + "group": "Capabilities", + "pages": [ + "/capabilities/streaming", + "/capabilities/thinking", + "/capabilities/structured-outputs", + "/capabilities/vision", + "/capabilities/embeddings", + "/capabilities/tool-calling", + "/capabilities/web-search" + ] + }, + { + "group": "Integrations", + "pages": [ + "/integrations/vscode", + "/integrations/jetbrains", + "/integrations/codex", + "/integrations/cline", + "/integrations/droid", + "/integrations/goose", + "/integrations/zed", + "/integrations/roo-code", + "/integrations/n8n", + "/integrations/xcode" + ] + }, + { + "group": "More information", + "pages": [ + "/cli", + "/modelfile", + "/context-length", + "/linux", + "/docker", + "/faq", + "/gpu", + "/troubleshooting" + ] + } + ] + }, + { + "tab": "API Reference", + "openapi": "/openapi.yaml", + "groups": [ + { + "group": "API Reference", + "pages": [ + "/api/index", + "/api/authentication", + "/api/streaming", + "/api/usage", + "/api/errors", + "/api/openai-compatibility" + ] + }, + { + "group": "Endpoints", + "pages": [ + "POST /api/generate", + "POST /api/chat", + "POST /api/embed", + "GET /api/tags", + "GET /api/ps", + "POST /api/show", + "POST /api/create", + "POST /api/copy", + "POST /api/pull", + "POST /api/push", + "DELETE /api/delete", + "GET /api/version" + ] + } + ] + } + ] + } +} diff --git a/docs/faq.mdx b/docs/faq.mdx index 900ffba42..18a80b705 100644 --- a/docs/faq.mdx +++ b/docs/faq.mdx @@ -1,4 +1,6 @@ -# FAQ +--- +title: FAQ +--- ## How can I upgrade Ollama? @@ -20,9 +22,9 @@ Please refer to the [GPU docs](./gpu.md). ## How can I specify the context window size? -By default, Ollama uses a context window size of 4096 tokens for most models. The `gpt-oss` model has a default context window size of 8192 tokens. +By default, Ollama uses a context window size of 2048 tokens. 
-This can be overridden in Settings in the Windows and macOS App, or with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use: +This can be overridden with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use: ```shell OLLAMA_CONTEXT_LENGTH=8192 ollama serve @@ -46,8 +48,6 @@ curl http://localhost:11434/api/generate -d '{ }' ``` -Setting the context length higher may cause the model to not be able to fit onto the GPU which make the model run more slowly. - ## How can I tell if my model was loaded onto the GPU? Use the `ollama ps` command to see what models are currently loaded into memory. @@ -56,17 +56,16 @@ Use the `ollama ps` command to see what models are currently loaded into memory. ollama ps ``` -> **Output**: -> -> ``` -> NAME ID SIZE PROCESSOR CONTEXT UNTIL -> gpt-oss:20b 05afbac4bad6 16 GB 100% GPU 8192 4 minutes from now -> ``` + + **Output**: ``` NAME ID SIZE PROCESSOR UNTIL llama3:70b bcfb190ca3a7 42 GB + 100% GPU 4 minutes from now ``` + The `Processor` column will show which memory the model was loaded in to: -* `100% GPU` means the model was loaded entirely into the GPU -* `100% CPU` means the model was loaded entirely in system memory -* `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory + +- `100% GPU` means the model was loaded entirely into the GPU +- `100% CPU` means the model was loaded entirely in system memory +- `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory ## How do I configure Ollama server? @@ -78,9 +77,9 @@ If Ollama is run as a macOS application, environment variables should be set usi 1. For each environment variable, call `launchctl setenv`. - ```bash - launchctl setenv OLLAMA_HOST "0.0.0.0:11434" - ``` + ```bash + launchctl setenv OLLAMA_HOST "0.0.0.0:11434" + ``` 2. Restart Ollama application. 
@@ -92,10 +91,10 @@ If Ollama is run as a systemd service, environment variables should be set using 2. For each environment variable, add a line `Environment` under section `[Service]`: - ```ini - [Service] - Environment="OLLAMA_HOST=0.0.0.0:11434" - ``` + ```ini + [Service] + Environment="OLLAMA_HOST=0.0.0.0:11434" + ``` 3. Save and exit. @@ -126,8 +125,10 @@ On Windows, Ollama inherits your user and system environment variables. Ollama pulls models from the Internet and may require a proxy server to access the models. Use `HTTPS_PROXY` to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform. -> [!NOTE] -> Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server. + + Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only + HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server. + ### How do I use Ollama behind a proxy in Docker? @@ -150,11 +151,9 @@ docker build -t ollama-with-ca . docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca ``` -## Does Ollama send my prompts and responses back to ollama.com? +## Does Ollama send my prompts and answers back to ollama.com? -If you're running a model locally, your prompts and responses will always stay on your machine. Ollama Turbo in the App allows you to run your queries on Ollama's servers if you don't have a powerful enough GPU. Web search lets a model query the web, giving you more accurate and up-to-date information. Both Turbo and web search require sending your prompts and responses to Ollama.com. This data is neither logged nor stored. - -If you don't want to see the Turbo and web search options in the app, you can disable them in Settings by turning on Airplane mode. 
In Airplane mode, all models will run locally, and your prompts and responses will stay on your machine. +No. Ollama runs locally, and conversation data does not leave your machine. ## How can I expose Ollama on my network? @@ -216,7 +215,9 @@ Refer to the section [above](#how-do-i-configure-ollama-server) for how to set e If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory. -> Note: on Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama `. + + On Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama `. + Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform. @@ -235,7 +236,7 @@ GPU acceleration is not available for Docker Desktop in macOS due to the lack of This can impact both installing Ollama, as well as downloading models. Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` on the left panel. Find the `vEthernet (WSL)` adapter, right-click and select `Properties`. -Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. *Disable* both of these +Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. _Disable_ both of these properties. ## How can I preload a model into Ollama to get faster response times?
@@ -269,10 +270,11 @@ ollama stop llama3.2 ``` If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to: -* a duration string (such as "10m" or "24h") -* a number in seconds (such as 3600) -* any negative number which will keep the model loaded in memory (e.g. -1 or "-1m") -* '0' which will unload the model immediately after generating a response + +- a duration string (such as "10m" or "24h") +- a number in seconds (such as 3600) +- any negative number which will keep the model loaded in memory (e.g. -1 or "-1m") +- '0' which will unload the model immediately after generating a response For example, to preload a model and leave it in memory use: @@ -292,31 +294,31 @@ The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endp ## How do I manage the maximum number of requests the Ollama server can queue? -If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queue by setting `OLLAMA_MAX_QUEUE`. +If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting `OLLAMA_MAX_QUEUE`. ## How does Ollama handle concurrent requests? -Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it can be configured to allow parallel request processing. +Ollama supports two levels of concurrent processing.
If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing. -If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads. +If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads. -Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation. +Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation. The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms: -- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference. 
-- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default is 1, and will handle 1 request per model at a time. +- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 \* the number of GPUs or 3 for CPU inference. +- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory. - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512 -Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs VRAM. +Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPU's VRAM. ## How does Ollama load models on multiple GPUs? -When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs. +When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available.
If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs. ## How can I enable Flash Attention? -Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server. +Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server. ## How can I set the quantization type for the K/V cache? @@ -324,9 +326,12 @@ The K/V context cache can be quantized to significantly reduce memory usage when To use quantized K/V cache with Ollama you can set the following environment variable: -- `OLLAMA_KV_CACHE_TYPE` - The quantization type for the K/V cache. Default is `f16`. +- `OLLAMA_KV_CACHE_TYPE` - The quantization type for the K/V cache. Default is `f16`. -> Note: Currently this is a global option - meaning all models will run with the specified quantization type. + + Currently this is a global option - meaning all models will run with the + specified quantization type. + The currently available K/V cache quantization types are: @@ -334,19 +339,40 @@ The currently available K/V cache quantization types are: - `q8_0` - 8-bit quantization, uses approximately 1/2 the memory of `f16` with a very small loss in precision, this usually has no noticeable impact on the model's quality (recommended if not using f16). - `q4_0` - 4-bit quantization, uses approximately 1/4 the memory of `f16` with a small-medium loss in precision that may be more noticeable at higher context sizes. 
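As a minimal sketch of the K/V cache setting described above, assuming a POSIX-like shell: the variables can be exported before the server is started from that same shell. The choice of `q8_0` is only an illustration taken from the listed types, and enabling Flash Attention alongside it is an assumption based on the FAQ tying K/V cache quantization to Flash Attention.

```shell
# Assumption: a POSIX-like shell; q8_0 is one of the cache types listed above
# (f16 is the default).
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0

# Show what a server started from this shell would inherit.
echo "flash attention: $OLLAMA_FLASH_ATTENTION, kv cache: $OLLAMA_KV_CACHE_TYPE"

# Then start the server from the same shell:
#   ollama serve
```

Because this is a global option, every model served by that process would use the chosen cache type, so it may be worth comparing `q8_0` and `q4_0` against `f16` on your own workload before settling on one.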
-How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count. +How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count. You may need to experiment with different quantization types to find the best balance between memory usage and quality. -## How can I stop Ollama from starting when I login to my computer +## Where can I find my Ollama Public Key? -Ollama for Windows and macOS register as a login item during installation. You can disable this if you prefer not to have Ollama automatically start. Ollama will respect this setting across upgrades, unless you uninstall the application. +Your **Ollama Public Key** is the public part of the key pair that lets your local Ollama instance talk to [ollama.com](https://ollama.com). -**Windows** -- Remove `%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\Ollama.lnk` +You'll need it to: +* Push models to Ollama +* Pull private models from Ollama to your machine +* Run models hosted in [Ollama Cloud](https://ollama.com/cloud) -**MacOS Monterey (v12)** -- Open `Settings` -> `Users & Groups` -> `Login Items` and find the `Ollama` entry, then click the `-` (minus) to remove +### How to Add the Key -**MacOS Ventura (v13) and later** -- Open `Settings` and search for "Login Items", find the `Ollama` entry under "Allow in the Background`, then click the slider to disable. 
+* **Sign-in via the Settings page** in the **Mac** and **Windows App** + +* **Sign‑in via CLI** + +```shell +ollama signin +``` + +* **Manually copy & paste** the key on the **Ollama Keys** page: +[https://ollama.com/settings/keys](https://ollama.com/settings/keys) + +### Where the Ollama Public Key lives + +| OS | Path to `id_ed25519.pub` | +| :- | :- | +| macOS | `~/.ollama/id_ed25519.pub` | +| Linux | `/usr/share/ollama/.ollama/id_ed25519.pub` | +| Windows | `C:\Users\\.ollama\id_ed25519.pub` | + + + Replace <username> with your actual Windows user name. + diff --git a/docs/favicon-dark.svg b/docs/favicon-dark.svg new file mode 100644 index 000000000..672ecd01c --- /dev/null +++ b/docs/favicon-dark.svg @@ -0,0 +1,3 @@ + + + diff --git a/docs/favicon.svg b/docs/favicon.svg new file mode 100644 index 000000000..99d6b5e03 --- /dev/null +++ b/docs/favicon.svg @@ -0,0 +1,3 @@ + + + diff --git a/docs/gpu.mdx b/docs/gpu.mdx index 910f82d14..84ef2a496 100644 --- a/docs/gpu.mdx +++ b/docs/gpu.mdx @@ -1,39 +1,36 @@ -# GPU +--- +title: Hardware support +--- + ## Nvidia -Ollama supports Nvidia GPUs with compute capability 5.0+ and driver version 531 and newer. + +Ollama supports Nvidia GPUs with compute capability 5.0+. 
Check your compute compatibility to see if your card is supported: [https://developer.nvidia.com/cuda-gpus](https://developer.nvidia.com/cuda-gpus) -| Compute Capability | Family | Cards | -| ------------------ | ------------------- | ----------------------------------------------------------------------------------------------------------- | -| 12.0 | GeForce RTX 50xx | `RTX 5060` `RTX 5060 Ti` `RTX 5070` `RTX 5070 Ti` `RTX 5080` `RTX 5090` | -| | NVIDIA Professioal | `RTX PRO 4000 Blackwell` `RTX PRO 4500 Blackwell` `RTX PRO 5000 Blackwell` `RTX PRO 6000 Blackwell` | -| 11.0 | Jetson | `T4000` `T5000` (Requires driver 580 or newer) | -| 10.3 | NVIDIA Professioal | `B300` `GB300` (Requires driver 580 or newer) | -| 10.0 | NVIDIA Professioal | `B200` `GB200` (Requires driver 580 or newer) | -| 9.0 | NVIDIA | `H200` `H100` `GH200` | -| 8.9 | GeForce RTX 40xx | `RTX 4090` `RTX 4080 SUPER` `RTX 4080` `RTX 4070 Ti SUPER` `RTX 4070 Ti` `RTX 4070 SUPER` `RTX 4070` `RTX 4060 Ti` `RTX 4060` | -| | NVIDIA Professional | `L4` `L40` `RTX 6000` | -| 8.7 | Jetson | `Orin Nano` `Orin NX` `AGX Orin` | -| 8.6 | GeForce RTX 30xx | `RTX 3090 Ti` `RTX 3090` `RTX 3080 Ti` `RTX 3080` `RTX 3070 Ti` `RTX 3070` `RTX 3060 Ti` `RTX 3060` `RTX 3050 Ti` `RTX 3050` | -| | NVIDIA Professional | `A40` `RTX A6000` `RTX A5000` `RTX A4000` `RTX A3000` `RTX A2000` `A10` `A16` `A2` | -| 8.0 | NVIDIA | `A100` `A30` | -| 7.5 | GeForce GTX/RTX | `GTX 1650 Ti` `TITAN RTX` `RTX 2080 Ti` `RTX 2080` `RTX 2070` `RTX 2060` | -| | NVIDIA Professional | `T4` `RTX 5000` `RTX 4000` `RTX 3000` `T2000` `T1200` `T1000` `T600` `T500` | -| | Quadro | `RTX 8000` `RTX 6000` `RTX 5000` `RTX 4000` | -| 7.2 | Jetson | `Xavier NX` `AGX Xavier` (Jetpack 5) | -| 7.0 | NVIDIA | `TITAN V` `V100` `Quadro GV100` | -| 6.1 | NVIDIA TITAN | `TITAN Xp` `TITAN X` | -| | GeForce GTX | `GTX 1080 Ti` `GTX 1080` `GTX 1070 Ti` `GTX 1070` `GTX 1060` `GTX 1050 Ti` `GTX 1050` | -| | Quadro | `P6000` `P5200` `P4200` `P3200` `P5000` `P4000` 
`P3000` `P2200` `P2000` `P1000` `P620` `P600` `P500` `P520` | -| | Tesla | `P40` `P4` | -| 6.0 | NVIDIA | `Tesla P100` `Quadro GP100` | -| 5.2 | GeForce GTX | `GTX TITAN X` `GTX 980 Ti` `GTX 980` `GTX 970` `GTX 960` `GTX 950` | -| | Quadro | `M6000 24GB` `M6000` `M5000` `M5500M` `M4000` `M2200` `M2000` `M620` | -| | Tesla | `M60` `M40` | -| 5.0 | GeForce GTX | `GTX 750 Ti` `GTX 750` `NVS 810` | -| | Quadro | `K2200` `K1200` `K620` `M1200` `M520` `M5000M` `M4000M` `M3000M` `M2000M` `M1000M` `K620M` `M600M` `M500M` | +| Compute Capability | Family | Cards | +| ------------------ | ------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| 9.0 | NVIDIA | `H200` `H100` | +| 8.9 | GeForce RTX 40xx | `RTX 4090` `RTX 4080 SUPER` `RTX 4080` `RTX 4070 Ti SUPER` `RTX 4070 Ti` `RTX 4070 SUPER` `RTX 4070` `RTX 4060 Ti` `RTX 4060` | +| | NVIDIA Professional | `L4` `L40` `RTX 6000` | +| 8.6 | GeForce RTX 30xx | `RTX 3090 Ti` `RTX 3090` `RTX 3080 Ti` `RTX 3080` `RTX 3070 Ti` `RTX 3070` `RTX 3060 Ti` `RTX 3060` `RTX 3050 Ti` `RTX 3050` | +| | NVIDIA Professional | `A40` `RTX A6000` `RTX A5000` `RTX A4000` `RTX A3000` `RTX A2000` `A10` `A16` `A2` | +| 8.0 | NVIDIA | `A100` `A30` | +| 7.5 | GeForce GTX/RTX | `GTX 1650 Ti` `TITAN RTX` `RTX 2080 Ti` `RTX 2080` `RTX 2070` `RTX 2060` | +| | NVIDIA Professional | `T4` `RTX 5000` `RTX 4000` `RTX 3000` `T2000` `T1200` `T1000` `T600` `T500` | +| | Quadro | `RTX 8000` `RTX 6000` `RTX 5000` `RTX 4000` | +| 7.0 | NVIDIA | `TITAN V` `V100` `Quadro GV100` | +| 6.1 | NVIDIA TITAN | `TITAN Xp` `TITAN X` | +| | GeForce GTX | `GTX 1080 Ti` `GTX 1080` `GTX 1070 Ti` `GTX 1070` `GTX 1060` `GTX 1050 Ti` `GTX 1050` | +| | Quadro | `P6000` `P5200` `P4200` `P3200` `P5000` `P4000` `P3000` `P2200` `P2000` `P1000` `P620` `P600` `P500` `P520` | +| | Tesla | `P40` `P4` | +| 6.0 | NVIDIA | `Tesla P100` `Quadro GP100` | +| 5.2 | GeForce GTX | `GTX TITAN X` `GTX 980 Ti` 
`GTX 980` `GTX 970` `GTX 960` `GTX 950` | +| | Quadro | `M6000 24GB` `M6000` `M5000` `M5500M` `M4000` `M2200` `M2000` `M620` | +| | Tesla | `M60` `M40` | +| 5.0 | GeForce GTX | `GTX 750 Ti` `GTX 750` `NVS 810` | +| | Quadro | `K2200` `K1200` `K620` `M1200` `M520` `M5000M` `M4000M` `M3000M` `M2000M` `M1000M` `K620M` `M600M` `M500M` | For building locally to support older GPUs, see [developer.md](./development.md#linux-cuda-nvidia) @@ -48,51 +45,53 @@ ignore the GPUs and force CPU usage, use an invalid GPU ID (e.g., "-1") ### Linux Suspend Resume On linux, after a suspend/resume cycle, sometimes Ollama will fail to discover -your NVIDIA GPU, and fallback to running on the CPU. You can workaround this +your NVIDIA GPU, and fallback to running on the CPU. You can workaround this driver bug by reloading the NVIDIA UVM driver with `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm` ## AMD Radeon + Ollama supports the following AMD GPUs: ### Linux Support -| Family | Cards and accelerators | -| -------------- | -------------------------------------------------------------------------------------------------------------------- | -| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` | -| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` `V420` `V340` `V320` | -| AMD Instinct | `MI300X` `MI300A` `MI300` `MI250X` `MI250` `MI210` `MI200` `MI100` | + +| Family | Cards and accelerators | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` `Vega 64` `Vega 56` | +| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` `V420` `V340` `V320` `Vega II Duo` `Vega 
II` `VII` `SSG` | +| AMD Instinct | `MI300X` `MI300A` `MI300` `MI250X` `MI250` `MI210` `MI200` `MI100` `MI60` `MI50` | ### Windows Support -With ROCm v6.2, the following GPUs are supported on Windows. -| Family | Cards and accelerators | -| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | -| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` | -| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` | +With ROCm v6.1, the following GPUs are supported on Windows. -### Known Workarounds - -- The RX Vega 56 requires `HSA_ENABLE_SDMA=0` to disable SDMA +| Family | Cards and accelerators | +| -------------- | ------------------------------------------------------------------------------------------------------------------- | +| AMD Radeon RX | `7900 XTX` `7900 XT` `7900 GRE` `7800 XT` `7700 XT` `7600 XT` `7600` `6950 XT` `6900 XTX` `6900XT` `6800 XT` `6800` | +| AMD Radeon PRO | `W7900` `W7800` `W7700` `W7600` `W7500` `W6900X` `W6800X Duo` `W6800X` `W6800` `V620` | ### Overrides on Linux + Ollama leverages the AMD ROCm library, which does not support all AMD GPUs. In some cases you can force the system to try to use a similar LLVM target that is -close. For example The Radeon RX 5400 is `gfx1034` (also known as 10.3.4) +close. For example The Radeon RX 5400 is `gfx1034` (also known as 10.3.4) however, ROCm does not currently support this target. The closest support is -`gfx1030`. You can use the environment variable `HSA_OVERRIDE_GFX_VERSION` with -`x.y.z` syntax. So for example, to force the system to run on the RX 5400, you +`gfx1030`. You can use the environment variable `HSA_OVERRIDE_GFX_VERSION` with +`x.y.z` syntax. 
So for example, to force the system to run on the RX 5400, you would set `HSA_OVERRIDE_GFX_VERSION="10.3.0"` as an environment variable for the -server. If you have an unsupported AMD GPU you can experiment using the list of +server. If you have an unsupported AMD GPU you can experiment using the list of supported types below. If you have multiple GPUs with different GFX versions, append the numeric device -number to the environment variable to set them individually. For example, -`HSA_OVERRIDE_GFX_VERSION_0=10.3.0` and `HSA_OVERRIDE_GFX_VERSION_1=11.0.0` +number to the environment variable to set them individually. For example, +`HSA_OVERRIDE_GFX_VERSION_0=10.3.0` and `HSA_OVERRIDE_GFX_VERSION_1=11.0.0` At this time, the known supported GPU types on linux are the following LLVM Targets. This table shows some example GPUs that map to these LLVM targets: | **LLVM Target** | **An Example GPU** | |-----------------|---------------------| +| gfx900 | Radeon RX Vega 56 | +| gfx906 | Radeon Instinct MI50 | | gfx908 | Radeon Instinct MI100 | | gfx90a | Radeon Instinct MI210 | | gfx940 | Radeon Instinct MI300 | @@ -113,15 +112,16 @@ Reach out on [Discord](https://discord.gg/ollama) or file an If you have multiple AMD GPUs in your system and want to limit Ollama to use a subset, you can set `ROCR_VISIBLE_DEVICES` to a comma separated list of GPUs. -You can see the list of devices with `rocminfo`. If you want to ignore the GPUs -and force CPU usage, use an invalid GPU ID (e.g., "-1"). When available, use the +You can see the list of devices with `rocminfo`. If you want to ignore the GPUs +and force CPU usage, use an invalid GPU ID (e.g., "-1"). When available, use the `Uuid` to uniquely identify the device instead of numeric value. ### Container Permission In some Linux distributions, SELinux can prevent containers from -accessing the AMD GPU devices. On the host system you can run +accessing the AMD GPU devices. 
On the host system you can run `sudo setsebool container_use_devices=1` to allow containers to use devices. ### Metal (Apple GPUs) + Ollama supports GPU acceleration on Apple devices via the Metal API. diff --git a/docs/images/cline-mcp.png b/docs/images/cline-mcp.png new file mode 100644 index 000000000..9d2c746c1 Binary files /dev/null and b/docs/images/cline-mcp.png differ diff --git a/docs/images/cline-settings.png b/docs/images/cline-settings.png new file mode 100644 index 000000000..4c5c61584 Binary files /dev/null and b/docs/images/cline-settings.png differ diff --git a/docs/images/codex-mcp.png b/docs/images/codex-mcp.png new file mode 100644 index 000000000..f37c9a15b Binary files /dev/null and b/docs/images/codex-mcp.png differ diff --git a/docs/images/favicon.png b/docs/images/favicon.png new file mode 100644 index 000000000..e1130b238 Binary files /dev/null and b/docs/images/favicon.png differ diff --git a/docs/images/goose-cli.png b/docs/images/goose-cli.png new file mode 100644 index 000000000..89ac37ac4 Binary files /dev/null and b/docs/images/goose-cli.png differ diff --git a/docs/images/goose-mcp-1.png b/docs/images/goose-mcp-1.png new file mode 100644 index 000000000..6bee203db Binary files /dev/null and b/docs/images/goose-mcp-1.png differ diff --git a/docs/images/goose-mcp-2.png b/docs/images/goose-mcp-2.png new file mode 100644 index 000000000..bfe6d0d22 Binary files /dev/null and b/docs/images/goose-mcp-2.png differ diff --git a/docs/images/goose-settings.png b/docs/images/goose-settings.png new file mode 100644 index 000000000..edac26840 Binary files /dev/null and b/docs/images/goose-settings.png differ diff --git a/docs/images/intellij-chat-sidebar.png b/docs/images/intellij-chat-sidebar.png new file mode 100644 index 000000000..2c24e562f Binary files /dev/null and b/docs/images/intellij-chat-sidebar.png differ diff --git a/docs/images/intellij-current-model.png b/docs/images/intellij-current-model.png new file mode 100644 index 
000000000..96c5f2edb Binary files /dev/null and b/docs/images/intellij-current-model.png differ diff --git a/docs/images/intellij-local-models.png b/docs/images/intellij-local-models.png new file mode 100644 index 000000000..846a3786a Binary files /dev/null and b/docs/images/intellij-local-models.png differ diff --git a/docs/images/logo-dark.png b/docs/images/logo-dark.png new file mode 100644 index 000000000..e50ee0dc3 Binary files /dev/null and b/docs/images/logo-dark.png differ diff --git a/docs/images/logo.png b/docs/images/logo.png new file mode 100644 index 000000000..827de1b90 Binary files /dev/null and b/docs/images/logo.png differ diff --git a/docs/images/n8n-chat-model.png b/docs/images/n8n-chat-model.png new file mode 100644 index 000000000..cafbc7a84 Binary files /dev/null and b/docs/images/n8n-chat-model.png differ diff --git a/docs/images/n8n-chat-node.png b/docs/images/n8n-chat-node.png new file mode 100644 index 000000000..89768e20b Binary files /dev/null and b/docs/images/n8n-chat-node.png differ diff --git a/docs/images/n8n-credential-creation.png b/docs/images/n8n-credential-creation.png new file mode 100644 index 000000000..1eeb50102 Binary files /dev/null and b/docs/images/n8n-credential-creation.png differ diff --git a/docs/images/n8n-models.png b/docs/images/n8n-models.png new file mode 100644 index 000000000..c1c70aca4 Binary files /dev/null and b/docs/images/n8n-models.png differ diff --git a/docs/images/n8n-ollama-form.png b/docs/images/n8n-ollama-form.png new file mode 100644 index 000000000..2f9174de1 Binary files /dev/null and b/docs/images/n8n-ollama-form.png differ diff --git a/docs/images/ollama-settings.png b/docs/images/ollama-settings.png new file mode 100644 index 000000000..a3470f7a5 Binary files /dev/null and b/docs/images/ollama-settings.png differ diff --git a/docs/images/vscode-model-options.png b/docs/images/vscode-model-options.png new file mode 100644 index 000000000..b1cca5d08 Binary files /dev/null and 
b/docs/images/vscode-model-options.png differ diff --git a/docs/images/vscode-models.png b/docs/images/vscode-models.png new file mode 100644 index 000000000..af250eac6 Binary files /dev/null and b/docs/images/vscode-models.png differ diff --git a/docs/images/vscode-sidebar.png b/docs/images/vscode-sidebar.png new file mode 100644 index 000000000..aa4a0735f Binary files /dev/null and b/docs/images/vscode-sidebar.png differ diff --git a/docs/images/welcome.png b/docs/images/welcome.png new file mode 100644 index 000000000..88ce37b2e Binary files /dev/null and b/docs/images/welcome.png differ diff --git a/docs/images/xcode-chat-icon.png b/docs/images/xcode-chat-icon.png new file mode 100644 index 000000000..3396a8a0e Binary files /dev/null and b/docs/images/xcode-chat-icon.png differ diff --git a/docs/images/xcode-intelligence-window.png b/docs/images/xcode-intelligence-window.png new file mode 100644 index 000000000..599d2f8b8 Binary files /dev/null and b/docs/images/xcode-intelligence-window.png differ diff --git a/docs/images/xcode-locally-hosted.png b/docs/images/xcode-locally-hosted.png new file mode 100644 index 000000000..e8efd7dbc Binary files /dev/null and b/docs/images/xcode-locally-hosted.png differ diff --git a/docs/images/zed-ollama-dropdown.png b/docs/images/zed-ollama-dropdown.png new file mode 100644 index 000000000..7cacd1588 Binary files /dev/null and b/docs/images/zed-ollama-dropdown.png differ diff --git a/docs/images/zed-settings.png b/docs/images/zed-settings.png new file mode 100644 index 000000000..913882b24 Binary files /dev/null and b/docs/images/zed-settings.png differ diff --git a/docs/import.mdx b/docs/import.mdx index 104b4162c..b19596894 100644 --- a/docs/import.mdx +++ b/docs/import.mdx @@ -1,11 +1,13 @@ -# Importing a model +--- +title: Importing a Model +--- ## Table of Contents - * [Importing a Safetensors adapter](#Importing-a-fine-tuned-adapter-from-Safetensors-weights) - * [Importing a Safetensors 
model](#Importing-a-model-from-Safetensors-weights) - * [Importing a GGUF file](#Importing-a-GGUF-based-model-or-adapter) - * [Sharing models on ollama.com](#Sharing-your-model-on-ollamacom) +- [Importing a Safetensors adapter](#Importing-a-fine-tuned-adapter-from-Safetensors-weights) +- [Importing a Safetensors model](#Importing-a-model-from-Safetensors-weights) +- [Importing a GGUF file](#Importing-a-GGUF-based-model-or-adapter) +- [Sharing models on ollama.com](#Sharing-your-model-on-ollamacom) ## Importing a fine tuned adapter from Safetensors weights @@ -32,16 +34,15 @@ ollama run my-model Ollama supports importing adapters based on several different model architectures including: - * Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2); - * Mistral (including Mistral 1, Mistral 2, and Mixtral); and - * Gemma (including Gemma 1 and Gemma 2) +- Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2); +- Mistral (including Mistral 1, Mistral 2, and Mixtral); and +- Gemma (including Gemma 1 and Gemma 2) You can create the adapter using a fine tuning framework or tool which can output adapters in the Safetensors format, such as: - * Hugging Face [fine tuning framework](https://huggingface.co/docs/transformers/en/training) - * [Unsloth](https://github.com/unslothai/unsloth) - * [MLX](https://github.com/ml-explore/mlx) - +- Hugging Face [fine tuning framework](https://huggingface.co/docs/transformers/en/training) +- [Unsloth](https://github.com/unslothai/unsloth) +- [MLX](https://github.com/ml-explore/mlx) ## Importing a model from Safetensors weights @@ -53,8 +54,6 @@ FROM /path/to/safetensors/directory If you create the Modelfile in the same directory as the weights, you can use the command `FROM .`. -If you do not create the Modelfile, ollama will act as if there was a Modelfile with the command `FROM .`. 
- Now run the `ollama create` command from the directory where you created the `Modelfile`: ```shell @@ -69,19 +68,20 @@ ollama run my-model Ollama supports importing models for several different architectures including: - * Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2); - * Mistral (including Mistral 1, Mistral 2, and Mixtral); - * Gemma (including Gemma 1 and Gemma 2); and - * Phi3 +- Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2); +- Mistral (including Mistral 1, Mistral 2, and Mixtral); +- Gemma (including Gemma 1 and Gemma 2); and +- Phi3 This includes importing foundation models as well as any fine tuned models which have been _fused_ with a foundation model. + ## Importing a GGUF based model or adapter If you have a GGUF based model or adapter it is possible to import it into Ollama. You can obtain a GGUF model or adapter by: - * converting a Safetensors model with the `convert_hf_to_gguf.py` from Llama.cpp; - * converting a Safetensors adapter with the `convert_lora_to_gguf.py` from Llama.cpp; or - * downloading a model or adapter from a place such as HuggingFace +- converting a Safetensors model with the `convert_hf_to_gguf.py` from Llama.cpp; +- converting a Safetensors adapter with the `convert_lora_to_gguf.py` from Llama.cpp; or +- downloading a model or adapter from a place such as HuggingFace To import a GGUF model, create a `Modelfile` containing: @@ -98,9 +98,9 @@ ADAPTER /path/to/file.gguf When importing a GGUF adapter, it's important to use the same base model as the base model that the adapter was created with. You can use: - * a model from Ollama - * a GGUF file - * a Safetensors based model +- a model from Ollama +- a GGUF file +- a Safetensors based model Once you have created your `Modelfile`, use the `ollama create` command to build the model. 
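A minimal sketch of that step, assuming a GGUF adapter layered on a base model pulled from Ollama. The base model `llama3.2`, the adapter path, and the name `my-model` are placeholders for illustration only; the base model has to be the one the adapter was actually created with.

```shell
# Write a Modelfile that layers a GGUF adapter on a base model.
# Both the base model and the adapter path are placeholders.
cat > Modelfile <<'EOF'
FROM llama3.2
ADAPTER /path/to/file.gguf
EOF

# Build the model from the Modelfile (skipped if ollama is not installed).
if command -v ollama >/dev/null 2>&1; then
  ollama create my-model -f Modelfile || echo "ollama create failed; check the model name and adapter path"
fi
```

If the build succeeds, the result can be tried with `ollama run my-model`.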
@@ -134,13 +134,22 @@ success ### Supported Quantizations +- `q4_0` +- `q4_1` +- `q5_0` +- `q5_1` - `q8_0` #### K-means Quantizations +- `q3_K_S` +- `q3_K_M` +- `q3_K_L` - `q4_K_S` - `q4_K_M` - +- `q5_K_S` +- `q5_K_M` +- `q6_K` ## Sharing your model on ollama.com @@ -148,7 +157,7 @@ You can share any model you have created by pushing it to [ollama.com](https://o First, use your browser to go to the [Ollama Sign-Up](https://ollama.com/signup) page. If you already have an account, you can skip this step. -Sign-Up +Sign-Up The `Username` field will be used as part of your model's name (e.g. `jmorganca/mymodel`), so make sure you are comfortable with the username that you have selected. @@ -156,7 +165,7 @@ Now that you have created an account and are signed-in, go to the [Ollama Keys S Follow the directions on the page to determine where your Ollama Public Key is located. -Ollama Keys +Ollama Keys Click on the `Add Ollama Public Key` button, and copy and paste the contents of your Ollama Public Key into the text field. @@ -173,4 +182,3 @@ Once your model has been pushed, other users can pull and run it by using the co ```shell ollama run myuser/mymodel ``` - diff --git a/docs/index.mdx b/docs/index.mdx new file mode 100644 index 000000000..669d30cfb --- /dev/null +++ b/docs/index.mdx @@ -0,0 +1,58 @@ +--- +title: Ollama's documentation +sidebarTitle: Welcome +--- + + + +[Ollama](https://ollama.com) is the easiest way to get up and running with large language models such as gpt-oss, Gemma 3, DeepSeek-R1, Qwen3 and more. + + + + Get up and running with your first model + + + Download Ollama on macOS, Windows or Linux + + + Ollama's cloud models offer larger models with better performance. + + + View Ollama's API reference + + + +## Libraries + + + + The official library for using Ollama with Python + + + + The official library for using Ollama with JavaScript or TypeScript. 
+ + + View a list of 20+ community-supported libraries for Ollama + + + +## Community + + + + Join our Discord community + + + + Join our Reddit community + + diff --git a/docs/integrations/cline.mdx b/docs/integrations/cline.mdx new file mode 100644 index 000000000..371fc6288 --- /dev/null +++ b/docs/integrations/cline.mdx @@ -0,0 +1,38 @@ +--- +title: Cline +--- + +## Install + +Install [Cline](https://docs.cline.bot/getting-started/installing-cline) in your IDE. + + +## Usage with Ollama + +1. Open Cline settings > `API Configuration` and set `API Provider` to `Ollama` +2. Select a model under `Model` or type one (e.g. `qwen3`) +3. Update the context window to at least 32K tokens under `Context Window` + +Coding tools require a larger context window. It is recommended to use a context window of at least 32K tokens. See [Context length](/context-length) for more information. + +
+ Cline settings configuration showing API Provider set to Ollama +
+ + + +## Connecting to ollama.com +1. Create an [API key](https://ollama.com/settings/keys) from ollama.com +2. Click on `Use custom base URL` and set it to `https://ollama.com` +3. Enter your **Ollama API Key** +4. Select a model from the list + + +### Recommended Models + +- `qwen3-coder:480b` +- `deepseek-v3.1:671b` diff --git a/docs/integrations/codex.mdx b/docs/integrations/codex.mdx new file mode 100644 index 000000000..f9df1b858 --- /dev/null +++ b/docs/integrations/codex.mdx @@ -0,0 +1,56 @@ +--- +title: Codex +--- + + +## Install + +Install the [Codex CLI](https://developers.openai.com/codex/cli/): + +``` +npm install -g @openai/codex +``` + +## Usage with Ollama + +Codex requires a larger context window. It is recommended to use a context window of at least 32K tokens. + +To use `codex` with Ollama, use the `--oss` flag: + +``` +codex --oss +``` + +### Changing Models + +By default, codex will use the local `gpt-oss:20b` model. However, you can specify a different model with the `-m` flag: + +``` +codex --oss -m gpt-oss:120b +``` + +### Cloud Models + +``` +codex --oss -m gpt-oss:120b-cloud +``` + + +## Connecting to ollama.com + + +Create an [API key](https://ollama.com/settings/keys) from ollama.com and export it as `OLLAMA_API_KEY`. + +To use ollama.com directly, edit your `~/.codex/config.toml` file to point to ollama.com. + +```toml +model = "gpt-oss:120b" +model_provider = "ollama" + +[model_providers.ollama] +name = "Ollama" +base_url = "https://ollama.com/v1" +env_key = "OLLAMA_API_KEY" +``` + +Run `codex` in a new terminal to load the new settings. diff --git a/docs/integrations/droid.mdx b/docs/integrations/droid.mdx new file mode 100644 index 000000000..b1ba37710 --- /dev/null +++ b/docs/integrations/droid.mdx @@ -0,0 +1,76 @@ +--- +title: Droid +--- + + +## Install + +Install the [Droid CLI](https://factory.ai/): + +```bash +curl -fsSL https://app.factory.ai/cli | sh +``` + +Droid requires a larger context window. 
It is recommended to use a context window of at least 32K tokens. See [Context length](/context-length) for more information. + +## Usage with Ollama + +Add a local configuration block to `~/.factory/config.json`: + +```json +{ + "custom_models": [ + { + "model_display_name": "qwen3-coder [Ollama]", + "model": "qwen3-coder", + "base_url": "http://localhost:11434/v1/", + "api_key": "not-needed", + "provider": "generic-chat-completion-api", + "max_tokens": 32000 + } + ] +} +``` + + +## Cloud Models +`qwen3-coder:480b-cloud` is the recommended model for use with Droid. + +Add the cloud configuration block to `~/.factory/config.json`: + +```json +{ + "custom_models": [ + { + "model_display_name": "qwen3-coder [Ollama Cloud]", + "model": "qwen3-coder:480b-cloud", + "base_url": "http://localhost:11434/v1/", + "api_key": "not-needed", + "provider": "generic-chat-completion-api", + "max_tokens": 128000 + } + ] +} +``` + +## Connecting to ollama.com + +1. Create an [API key](https://ollama.com/settings/keys) from ollama.com and export it as `OLLAMA_API_KEY`. +2. Add the cloud configuration block to `~/.factory/config.json`: + + ```json + { + "custom_models": [ + { + "model_display_name": "qwen3-coder [Ollama Cloud]", + "model": "qwen3-coder:480b", + "base_url": "https://ollama.com/v1/", + "api_key": "OLLAMA_API_KEY", + "provider": "generic-chat-completion-api", + "max_tokens": 128000 + } + ] + } + ``` + +Run `droid` in a new terminal to load the new settings. \ No newline at end of file diff --git a/docs/integrations/goose.mdx b/docs/integrations/goose.mdx new file mode 100644 index 000000000..35099a3b1 --- /dev/null +++ b/docs/integrations/goose.mdx @@ -0,0 +1,49 @@ +--- +title: Goose +--- + +## Goose Desktop + +Install [Goose](https://block.github.io/goose/docs/getting-started/installation/) Desktop. + +### Usage with Ollama +1. In Goose, open **Settings** → **Configure Provider**. +
+ Goose settings Panel +
+2. Find **Ollama**, click **Configure** +3. Confirm **API Host** is `http://localhost:11434` and click Submit + + +### Connecting to ollama.com + +1. Create an [API key](https://ollama.com/settings/keys) on ollama.com and save it in your `.env` +2. In Goose, set **API Host** to `https://ollama.com` + + +## Goose CLI + +Install [Goose](https://block.github.io/goose/docs/getting-started/installation/) CLI + +### Usage with Ollama +1. Run `goose configure` +2. Select **Configure Providers** and select **Ollama** +
+ Goose CLI +
+3. Enter model name (e.g. `qwen3`) + +### Connecting to ollama.com + +1. Create an [API key](https://ollama.com/settings/keys) on ollama.com and save it in your `.env` +2. Run `goose configure` +3. Select **Configure Providers** and select **Ollama** +4. Update **OLLAMA_HOST** to `https://ollama.com` diff --git a/docs/integrations/jetbrains.mdx b/docs/integrations/jetbrains.mdx new file mode 100644 index 000000000..29fbd95b4 --- /dev/null +++ b/docs/integrations/jetbrains.mdx @@ -0,0 +1,47 @@ +--- +title: JetBrains +--- + +This example uses **IntelliJ**; the same steps apply to other JetBrains IDEs (e.g., PyCharm). + +## Install + +Install [IntelliJ](https://www.jetbrains.com/idea/). + +## Usage with Ollama + + + To use **Ollama**, you will need a [JetBrains AI Subscription](https://www.jetbrains.com/ai-ides/buy/?section=personal&billing=yearly). + + +1. In IntelliJ, click the **chat icon** located in the right sidebar + +
+ Intellij Sidebar Chat +
+ +2. Select the **current model** in the sidebar, then click **Set up Local Models** + +
+ Intellij model bottom right corner +
+ +3. Under **Third Party AI Providers**, choose **Ollama** +4. Confirm the **Host URL** is `http://localhost:11434`, then click **Ok** +5. Once connected, select a model under **Local models by Ollama** + +
+ Zed star icon in bottom right corner +
diff --git a/docs/integrations/n8n.mdx b/docs/integrations/n8n.mdx new file mode 100644 index 000000000..c58967fa5 --- /dev/null +++ b/docs/integrations/n8n.mdx @@ -0,0 +1,53 @@ +--- +title: n8n +--- + +## Install + +Install [n8n](https://docs.n8n.io/choose-n8n/). + +## Using Ollama Locally + +1. In the top right corner, click the dropdown and select **Create Credential** +
+ Create a n8n Credential +
+ +2. Under **Add new credential** select **Ollama** +
+ Select Ollama under Credential +
+3. Confirm Base URL is set to `http://localhost:11434` and click **Save** + If connecting to `http://localhost:11434` fails, use `http://127.0.0.1:11434` +4. When creating a new workflow, select **Add a first step** and select an **Ollama node** +
+ Add a first step with Ollama node +
+5. Select your model of choice (e.g. `qwen3-coder`) +
+ Set up Ollama credentials +
+ +## Connecting to ollama.com +1. Create an [API key](https://ollama.com/settings/keys) on **ollama.com**. +2. In n8n, click **Create Credential** and select **Ollama** +3. Set the **API URL** to `https://ollama.com` +4. Enter your **API Key** and click **Save** + + diff --git a/docs/integrations/roo-code.mdx b/docs/integrations/roo-code.mdx new file mode 100644 index 000000000..61c91a717 --- /dev/null +++ b/docs/integrations/roo-code.mdx @@ -0,0 +1,30 @@ +--- +title: Roo Code +--- + + +## Install + +Install [Roo Code](https://marketplace.visualstudio.com/items?itemName=RooVeterinaryInc.roo-cline) from the VS Code Marketplace. + +## Usage with Ollama + +1. Open Roo Code in VS Code and click the **gear icon** on the top right corner of the Roo Code window to open **Provider Settings** +2. Set `API Provider` to `Ollama` +3. (Optional) Update `Base URL` if your Ollama instance is running remotely. The default is `http://localhost:11434` +4. Enter a valid `Model ID` (for example `qwen3` or `qwen3-coder:480b-cloud`) +5. Adjust the `Context Window` to at least 32K tokens for coding tasks + +Coding tools require a larger context window. It is recommended to use a context window of at least 32K tokens. See [Context length](/context-length) for more information. + +## Connecting to ollama.com + +1. Create an [API key](https://ollama.com/settings/keys) from ollama.com +2. Enable `Use custom base URL` and set it to `https://ollama.com` +3. Enter your **Ollama API Key** +4. Select a model from the list + +### Recommended Models + +- `qwen3-coder:480b` +- `deepseek-v3.1:671b` diff --git a/docs/integrations/vscode.mdx b/docs/integrations/vscode.mdx new file mode 100644 index 000000000..c68f91999 --- /dev/null +++ b/docs/integrations/vscode.mdx @@ -0,0 +1,34 @@ +--- +title: VS Code +--- + +## Install + +Install [VS Code](https://code.visualstudio.com/download). + +## Usage with Ollama + +1. Open the Copilot sidebar in the top-right corner of the window +
+ VSCode chat Sidebar +
+2. Select the model dropdown > **Manage models** +
+ VSCode model picker +
+3. Enter **Ollama** under **Provider Dropdown** and select desired models (e.g. `qwen3, qwen3-coder:480b-cloud`) +
+ VSCode model options dropdown +
diff --git a/docs/integrations/xcode.mdx b/docs/integrations/xcode.mdx new file mode 100644 index 000000000..7d10317ab --- /dev/null +++ b/docs/integrations/xcode.mdx @@ -0,0 +1,45 @@ +--- +title: Xcode +--- + +## Install + +Install [Xcode](https://developer.apple.com/xcode/) + + +## Usage with Ollama + Ensure Apple Intelligence is set up and that the latest Xcode version (v26.0) is installed + +1. Click **Xcode** in the top left corner > **Settings** +
+ Xcode Intelligence window +
+ +2. Select **Locally Hosted**, enter port **11434** and click **Add** +
+ Xcode settings +
+ +3. Select the **star icon** on the top left corner and click the **dropdown** +
+ Xcode settings +
+4. Click **My Account** and select your desired model + + +## Connecting to ollama.com directly +1. Create an [API key](https://ollama.com/settings/keys) from ollama.com +2. Select **Internet Hosted** and enter URL as `https://ollama.com` +3. Enter your **Ollama API Key** and click **Add** \ No newline at end of file diff --git a/docs/integrations/zed.mdx b/docs/integrations/zed.mdx new file mode 100644 index 000000000..478d3bc8b --- /dev/null +++ b/docs/integrations/zed.mdx @@ -0,0 +1,38 @@ +--- +title: Zed +--- + +## Install + +Install [Zed](https://zed.dev/download). + +## Usage with Ollama + +1. In Zed, click the **star icon** in the bottom-right corner, then select **Configure**. + +
+ Zed star icon in bottom right corner +
+ +2. Under **LLM Providers**, choose **Ollama** +3. Confirm the **Host URL** is `http://localhost:11434`, then click **Connect** +4. Once connected, select a model under **Ollama** + +
+ Zed star icon in bottom right corner +
+ +## Connecting to ollama.com +1. Create an [API key](https://ollama.com/settings/keys) on **ollama.com** +2. In Zed, open the **star icon** → **Configure** +3. Under **LLM Providers**, select **Ollama** +4. Set the **API URL** to `https://ollama.com` + diff --git a/docs/linux.mdx b/docs/linux.mdx index ce5ed860b..c40ab0545 100644 --- a/docs/linux.mdx +++ b/docs/linux.mdx @@ -1,4 +1,6 @@ -# Linux +--- +title: Linux +--- ## Install @@ -10,15 +12,16 @@ curl -fsSL https://ollama.com/install.sh | sh ## Manual install -> [!NOTE] -> If you are upgrading from a prior version, you **MUST** remove the old libraries with `sudo rm -rf /usr/lib/ollama` first. + + If you are upgrading from a prior version, you should remove the old libraries + with `sudo rm -rf /usr/lib/ollama` first. + Download and extract the package: ```shell -curl -LO https://ollama.com/download/ollama-linux-amd64.tgz -sudo rm -rf /usr/lib/ollama -sudo tar -C /usr -xzf ollama-linux-amd64.tgz +curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz \ + | sudo tar zx -C /usr ``` Start Ollama: @@ -35,15 +38,11 @@ ollama -v ### AMD GPU install -If you have an AMD GPU, **also** download and extract the additional ROCm package: - -> [!IMPORTANT] -> The ROCm tgz contains only AMD dependent libraries. You must extract **both** `ollama-linux-amd64.tgz` and `ollama-linux-amd64-rocm.tgz` into the same location. 
- +If you have an AMD GPU, also download and extract the additional ROCm package: ```shell -curl -L https://ollama.com/download/ollama-linux-amd64-rocm.tgz -o ollama-linux-amd64-rocm.tgz -sudo tar -C /usr -xzf ollama-linux-amd64-rocm.tgz +curl -fsSL https://ollama.com/download/ollama-linux-amd64-rocm.tgz \ + | sudo tar zx -C /usr ``` ### ARM64 install @@ -51,8 +50,8 @@ sudo tar -C /usr -xzf ollama-linux-amd64-rocm.tgz Download and extract the ARM64-specific package: ```shell -curl -L https://ollama.com/download/ollama-linux-arm64.tgz -o ollama-linux-arm64.tgz -sudo tar -C /usr -xzf ollama-linux-arm64.tgz +curl -fsSL https://ollama.com/download/ollama-linux-arm64.tgz \ + | sudo tar zx -C /usr ``` ### Adding Ollama as a startup service (recommended) @@ -113,12 +112,13 @@ sudo systemctl start ollama sudo systemctl status ollama ``` -> [!NOTE] -> While AMD has contributed the `amdgpu` driver upstream to the official linux -> kernel source, the version is older and may not support all ROCm features. We -> recommend you install the latest driver from -> [AMD](https://www.amd.com/en/support/download/linux-drivers.html) for best support -> of your Radeon GPU. + + While AMD has contributed the `amdgpu` driver upstream to the official linux + kernel source, the version is older and may not support all ROCm features. We + recommend you install the latest driver from + https://www.amd.com/en/support/linux-drivers for best support of your Radeon + GPU. 
+ ## Customizing @@ -146,8 +146,8 @@ curl -fsSL https://ollama.com/install.sh | sh Or by re-downloading Ollama: ```shell -curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz -sudo tar -C /usr -xzf ollama-linux-amd64.tgz +curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz \ + | sudo tar zx -C /usr ``` ## Installing specific versions @@ -178,6 +178,12 @@ sudo systemctl disable ollama sudo rm /etc/systemd/system/ollama.service ``` +Remove ollama libraries from your lib directory (either `/usr/local/lib`, `/usr/lib`, or `/lib`): + +```shell +sudo rm -r $(which ollama | tr 'bin' 'lib') +``` + Remove the ollama binary from your bin directory (either `/usr/local/bin`, `/usr/bin`, or `/bin`): ```shell @@ -187,13 +193,7 @@ sudo rm $(which ollama) Remove the downloaded models and Ollama service user and group: ```shell -sudo rm -r /usr/share/ollama sudo userdel ollama sudo groupdel ollama -``` - -Remove installed libraries: - -```shell -sudo rm -rf /usr/local/lib/ollama +sudo rm -r /usr/share/ollama ``` diff --git a/docs/logo.svg b/docs/logo.svg new file mode 100644 index 000000000..2b410d09f --- /dev/null +++ b/docs/logo.svg @@ -0,0 +1,3 @@ + + + diff --git a/docs/modelfile.mdx b/docs/modelfile.mdx index 53a217141..c91d7310c 100644 --- a/docs/modelfile.mdx +++ b/docs/modelfile.mdx @@ -1,9 +1,8 @@ -# Ollama Model File +--- +title: Modelfile Reference +--- -> [!NOTE] -> `Modelfile` syntax is in development - -A model file is the blueprint to create and share models with Ollama. +A Modelfile is the blueprint to create and share customized models using Ollama. 
## Table of Contents @@ -73,26 +72,23 @@ To view the Modelfile of a given model, use the `ollama show --modelfile` comman ollama show --modelfile llama3.2 ``` -> **Output**: -> -> ``` -> # Modelfile generated by "ollama show" -> # To build a new Modelfile based on this one, replace the FROM line with: -> # FROM llama3.2:latest -> FROM /Users/pdevine/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 -> TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|> -> -> {{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|> -> -> {{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|> -> -> {{ .Response }}<|eot_id|>""" -> PARAMETER stop "<|start_header_id|>" -> PARAMETER stop "<|end_header_id|>" -> PARAMETER stop "<|eot_id|>" -> PARAMETER stop "<|reserved_special_token" -> ``` +``` +# Modelfile generated by "ollama show" +# To build a new Modelfile based on this one, replace the FROM line with: +# FROM llama3.2:latest +FROM /Users/pdevine/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 +TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|> +{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|> + +{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|> + +{{ .Response }}<|eot_id|>""" +PARAMETER stop "<|start_header_id|>" +PARAMETER stop "<|end_header_id|>" +PARAMETER stop "<|eot_id|>" +PARAMETER stop "<|reserved_special_token" +``` ## Instructions @@ -110,10 +106,13 @@ FROM : FROM llama3.2 ``` -A list of available base models: - -Additional models can be found at: - + + A list of available base models + + + + Additional models can be found at + #### Build from a Safetensors model @@ -124,10 +123,11 @@ FROM The model directory should contain the Safetensors weights for a supported architecture. 
Currently supported model architectures: - * Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2) - * Mistral (including Mistral 1, Mistral 2, and Mixtral) - * Gemma (including Gemma 1 and Gemma 2) - * Phi3 + +- Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2) +- Mistral (including Mistral 1, Mistral 2, and Mixtral) +- Gemma (including Gemma 1 and Gemma 2) +- Phi3 #### Build from a GGUF file @@ -137,7 +137,6 @@ FROM ./ollama-model.gguf The GGUF file location should be specified as an absolute path or relative to the `Modelfile` location. - ### PARAMETER The `PARAMETER` instruction defines a parameter that can be set when the model is run. @@ -148,18 +147,21 @@ PARAMETER #### Valid Parameters and Values -| Parameter | Description | Value Type | Example Usage | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | -------------------- | -| num_ctx | Sets the size of the context window used to generate the next token. (Default: 4096) | int | num_ctx 4096 | -| repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 | -| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 | -| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 | -| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 | -| stop | Sets the stop sequences to use. 
When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" | -| num_predict | Maximum number of tokens to predict when generating text. (Default: -1, infinite generation) | int | num_predict 42 | -| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 | -| top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9) | float | top_p 0.9 | -| min_p | Alternative to the top_p, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with *p*=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. (Default: 0.0) | float | min_p 0.05 | +| Parameter | Description | Value Type | Example Usage | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | -------------------- | +| mirostat | Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | int | mirostat 0 | +| mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. 
A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1) | float | mirostat_eta 0.1 | +| mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0) | float | mirostat_tau 5.0 | +| num_ctx | Sets the size of the context window used to generate the next token. (Default: 2048) | int | num_ctx 4096 | +| repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 | +| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 | +| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 | +| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 | +| stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" | +| num_predict | Maximum number of tokens to predict when generating text. (Default: -1, infinite generation) | int | num_predict 42 | +| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 | +| top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. 
(Default: 0.9) | float | top_p 0.9 | +| min_p | Alternative to the top_p, and aims to ensure a balance of quality and variety. The parameter _p_ represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with _p_=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. (Default: 0.0) | float | min_p 0.05 | ### TEMPLATE @@ -201,9 +203,10 @@ ADAPTER ``` Currently supported Safetensor adapters: - * Llama (including Llama 2, Llama 3, and Llama 3.1) - * Mistral (including Mistral 1, Mistral 2, and Mixtral) - * Gemma (including Gemma 1 and Gemma 2) + +- Llama (including Llama 2, Llama 3, and Llama 3.1) +- Mistral (including Mistral 1, Mistral 2, and Mixtral) +- Gemma (including Gemma 1 and Gemma 2) #### GGUF adapter @@ -237,7 +240,6 @@ MESSAGE | user | An example message of what the user could have asked. | | assistant | An example message of how the model should respond. | - #### Example conversation ``` @@ -249,7 +251,6 @@ MESSAGE user Is Ontario in Canada? MESSAGE assistant yes ``` - ## Notes - the **`Modelfile` is not case sensitive**. In the examples, uppercase instructions are used to make it easier to distinguish it from arguments.
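The `min_p` rule described in the parameter table of the Modelfile reference can be illustrated with a short sketch. This is not Ollama's sampler; it only reproduces the arithmetic from the table's example (a cutoff of 0.05 × 0.9 = 0.045). The function name `min_p_filter`, the sample distribution, and the renormalization step are assumptions made for illustration.

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top probability.

    Hypothetical helper for illustration; the renormalization of the surviving
    tokens is an assumption, not something the documentation specifies.
    """
    if not probs:
        return {}
    # The cutoff is relative to the most likely token, not an absolute value.
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Made-up distribution whose top probability is 0.9, as in the table's example.
probs = {"blue": 0.9, "clear": 0.06, "green": 0.03, "loud": 0.01}
filtered = min_p_filter(probs, min_p=0.05)
print(sorted(filtered))
```

With `min_p` set to 0.05 the cutoff is 0.045, so only the tokens at or above that probability survive. Because the cutoff scales with the top token's probability, a flatter distribution keeps more candidates than a peaked one.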
diff --git a/docs/ollama-logo.svg b/docs/ollama-logo.svg new file mode 100644 index 000000000..b215c89b9 --- /dev/null +++ b/docs/ollama-logo.svg @@ -0,0 +1,3 @@ + + + diff --git a/docs/ollama.png b/docs/ollama.png new file mode 100644 index 000000000..8cd2cf1ed Binary files /dev/null and b/docs/ollama.png differ diff --git a/docs/openapi.yaml b/docs/openapi.yaml new file mode 100644 index 000000000..28e4eed61 --- /dev/null +++ b/docs/openapi.yaml @@ -0,0 +1,1413 @@ +openapi: 3.1.0 +info: + title: Ollama API + version: 0.1.0 + description: | + OpenAPI specification for the Ollama HTTP API + +servers: + - url: http://localhost:11434 + description: Local Ollama instance +components: + securitySchemes: + bearerAuth: + type: http + scheme: bearer + bearerFormat: API Key + parameters: + DigestParam: + name: digest + in: path + required: true + description: SHA256 digest identifier, prefixed with `sha256:` + schema: + type: string + schemas: + ModelOptions: + type: object + description: Runtime options that control text generation + properties: + # Sampling Options + seed: + type: integer + description: Random seed used for reproducible outputs + temperature: + type: number + format: float + description: Controls randomness in generation (higher = more random) + top_k: + type: integer + description: Limits next token selection to the K most likely + top_p: + type: number + format: float + description: Cumulative probability threshold for nucleus sampling + min_p: + type: number + format: float + description: Minimum probability threshold for token selection + stop: + oneOf: + - type: string + - type: array + items: + type: string + description: Stop sequences that will halt generation + + # Runtime Options + num_ctx: + type: integer + description: Context length size (number of tokens) + num_predict: + type: integer + description: Maximum number of tokens to generate + additionalProperties: true + GenerateRequest: + type: object + required: [model] + properties: + model: 
+ type: string + description: Model name + prompt: + type: string + description: Text for the model to generate a response from + suffix: + type: string + description: Used for fill-in-the-middle models, text that appears after the user prompt and before the model response + images: + type: array + items: + type: string + description: Base64-encoded images for models that support image input + format: + description: Structured output format for the model to generate a response from. Supports either the string `"json"` or a JSON schema object. + oneOf: + - type: string + - type: object + system: + description: System prompt for the model to generate a response from + type: string + stream: + description: When true, returns a stream of partial responses + type: boolean + default: true + think: + type: boolean + description: When true, returns separate thinking output in addition to content + raw: + type: boolean + description: When true, returns the raw response from the model without any prompt templating + keep_alive: + oneOf: + - type: string + - type: number + description: Model keep-alive duration (for example `5m` or `0` to unload immediately) + options: + $ref: "#/components/schemas/ModelOptions" + GenerateResponse: + type: object + properties: + model: + type: string + description: Model name + created_at: + type: string + description: ISO 8601 timestamp of response creation + response: + type: string + description: The model's generated text response + thinking: + type: string + description: The model's generated thinking output + done: + type: boolean + description: Indicates whether generation has finished + done_reason: + type: string + description: Reason the generation stopped + total_duration: + type: integer + description: Time spent generating the response in nanoseconds + load_duration: + type: integer + description: Time spent loading the model in nanoseconds + prompt_eval_count: + type: integer + description: Number of input tokens in the prompt + 
prompt_eval_duration: + type: integer + description: Time spent evaluating the prompt in nanoseconds + eval_count: + type: integer + description: Number of output tokens generated in the response + eval_duration: + type: integer + description: Time spent generating tokens in nanoseconds + GenerateStreamEvent: + type: object + properties: + model: + type: string + description: Model name + created_at: + type: string + description: ISO 8601 timestamp of response creation + response: + type: string + description: The model's generated text response for this chunk + thinking: + type: string + description: The model's generated thinking output for this chunk + done: + type: boolean + description: Indicates whether the stream has finished + done_reason: + type: string + description: Reason streaming finished + total_duration: + type: integer + description: Time spent generating the response in nanoseconds + load_duration: + type: integer + description: Time spent loading the model in nanoseconds + prompt_eval_count: + type: integer + description: Number of input tokens in the prompt + prompt_eval_duration: + type: integer + description: Time spent evaluating the prompt in nanoseconds + eval_count: + type: integer + description: Number of output tokens generated in the response + eval_duration: + type: integer + description: Time spent generating tokens in nanoseconds + ChatMessage: + type: object + required: [role, content] + properties: + role: + type: string + enum: [system, user, assistant, tool] + description: Author of the message. 
+ content: + type: string + description: Message text content + images: + type: array + items: + type: string + description: Base64-encoded image content + description: Optional list of inline images for multimodal models + tool_calls: + type: array + items: + $ref: "#/components/schemas/ToolCall" + description: Tool call requests produced by the model + ToolCall: + type: object + properties: + function: + type: object + required: [name] + properties: + name: + type: string + description: Name of the function to call + description: + type: string + description: What the function does + arguments: + type: object + description: JSON object of arguments to pass to the function + ToolDefinition: + type: object + required: [type, function] + properties: + type: + type: string + enum: [function] + description: Type of tool (always `function`) + function: + type: object + required: [name, parameters] + properties: + name: + type: string + description: Function name exposed to the model + description: + type: string + description: Human-readable description of the function + parameters: + type: object + description: JSON Schema for the function parameters + ChatRequest: + type: object + required: [model, messages] + properties: + model: + type: string + description: Model name + messages: + type: array + description: Chat history as an array of message objects (each with a role and content) + items: + $ref: "#/components/schemas/ChatMessage" + tools: + type: array + description: Optional list of function tools the model may call during the chat + items: + $ref: "#/components/schemas/ToolDefinition" + format: + oneOf: + - type: string + enum: [json] + - type: object + description: Format to return a response in. 
Can be `json` or a JSON schema + options: + $ref: "#/components/schemas/ModelOptions" + stream: + type: boolean + default: true + think: + type: boolean + description: When true, returns separate thinking output in addition to content + keep_alive: + oneOf: + - type: string + - type: number + description: Model keep-alive duration (for example `5m` or `0` to unload immediately) + ChatResponse: + type: object + properties: + model: + type: string + description: Model name used to generate this message + created_at: + type: string + format: date-time + description: Timestamp of response creation (ISO 8601) + message: + type: object + properties: + role: + type: string + enum: [assistant] + description: Always `assistant` for model responses + content: + type: string + description: Assistant message text + thinking: + type: string + description: Optional deliberate thinking trace when `think` is enabled + tool_calls: + type: array + items: + $ref: "#/components/schemas/ToolCall" + description: Tool calls requested by the assistant + images: + type: array + items: + type: string + nullable: true + description: Optional base64-encoded images in the response + done: + type: boolean + description: Indicates whether the chat response has finished + done_reason: + type: string + description: Reason the response finished + total_duration: + type: integer + description: Total time spent generating in nanoseconds + load_duration: + type: integer + description: Time spent loading the model in nanoseconds + prompt_eval_count: + type: integer + description: Number of tokens in the prompt + prompt_eval_duration: + type: integer + description: Time spent evaluating the prompt in nanoseconds + eval_count: + type: integer + description: Number of tokens generated in the response + eval_duration: + type: integer + description: Time spent generating tokens in nanoseconds + ChatStreamEvent: + type: object + properties: + model: + type: string + description: Model name used for this 
stream event + created_at: + type: string + format: date-time + description: When this chunk was created (ISO 8601) + message: + type: object + properties: + role: + type: string + description: Role of the message for this chunk + content: + type: string + description: Partial assistant message text + thinking: + type: string + description: Partial thinking text when `think` is enabled + tool_calls: + type: array + items: + $ref: "#/components/schemas/ToolCall" + description: Partial tool calls, if any + images: + type: array + items: + type: string + nullable: true + description: Partial base64-encoded images, when present + done: + type: boolean + description: True for the final event in the stream + StatusEvent: + type: object + properties: + status: + type: string + description: Human-readable status message + digest: + type: string + description: Content digest associated with the status, if applicable + total: + type: integer + description: Total number of bytes expected for the operation + completed: + type: integer + description: Number of bytes transferred so far + StatusResponse: + type: object + properties: + status: + type: string + description: Current status message + EmbedRequest: + type: object + required: [model, input] + properties: + model: + type: string + description: Model name + input: + oneOf: + - type: string + - type: array + items: + type: string + description: Text or array of texts to generate embeddings for + truncate: + type: boolean + default: true + description: If true, truncate inputs that exceed the context window. If false, returns an error. 
+ dimensions: + type: integer + description: Number of dimensions to generate embeddings for + keep_alive: + type: string + description: Model keep-alive duration + options: + $ref: "#/components/schemas/ModelOptions" + EmbedResponse: + type: object + properties: + model: + type: string + description: Model that produced the embeddings + embeddings: + type: array + items: + type: array + items: + type: number + description: Array of vector embeddings + total_duration: + type: integer + description: Total time spent generating in nanoseconds + load_duration: + type: integer + description: Load time in nanoseconds + prompt_eval_count: + type: integer + description: Number of input tokens processed to generate embeddings + CreateRequest: + type: object + required: [model] + properties: + model: + type: string + description: Name for the model to create + from: + type: string + description: Existing model to create from + template: + type: string + description: Prompt template to use for the model + license: + oneOf: + - type: string + - type: array + items: + type: string + description: License string or list of licenses for the model + system: + type: string + description: System prompt to embed in the model + parameters: + type: object + description: Key-value parameters for the model + messages: + description: Message history to use for the model + type: array + items: + $ref: "#/components/schemas/ChatMessage" + quantize: + type: string + description: Quantization level to apply (e.g. 
`q4_K_M`, `q8_0`) + stream: + type: boolean + default: true + description: Stream status updates + CopyRequest: + type: object + required: [source, destination] + properties: + source: + type: string + description: Existing model name to copy from + destination: + type: string + description: New model name to create + DeleteRequest: + type: object + required: [model] + properties: + model: + type: string + description: Model name to delete + PullRequest: + type: object + required: [model] + properties: + model: + type: string + description: Name of the model to download + insecure: + type: boolean + description: Allow downloading over insecure connections + stream: + type: boolean + default: true + description: Stream progress updates + PushRequest: + type: object + required: [model] + properties: + model: + type: string + description: Name of the model to publish + insecure: + type: boolean + description: Allow publishing over insecure connections + stream: + type: boolean + default: true + description: Stream progress updates + ShowRequest: + type: object + required: [model] + properties: + model: + type: string + description: Model name to show + verbose: + type: boolean + description: If true, includes large verbose fields in the response. 
+ ShowResponse: + type: object + properties: + parameters: + type: string + description: Model parameter settings serialized as text + license: + type: string + description: The license of the model + details: + type: object + description: High-level model details + template: + type: string + description: The template used by the model to render prompts + capabilities: + type: array + items: + type: string + description: List of supported features + model_info: + type: object + description: Additional model metadata + ModelSummary: + type: object + description: Summary information for a locally available model + properties: + name: + type: string + description: Model name + modified_at: + type: string + description: Last modified timestamp in ISO 8601 format + size: + type: integer + description: Total size of the model on disk in bytes + digest: + type: string + description: SHA256 digest identifier of the model contents + details: + type: object + description: Additional information about the model's format and family + properties: + format: + type: string + description: Model file format (for example `gguf`) + family: + type: string + description: Primary model family (for example `llama`) + families: + type: array + items: + type: string + description: All families the model belongs to, when applicable + parameter_size: + type: string + description: Approximate parameter count label (for example `7B`, `13B`) + quantization_level: + type: string + description: Quantization level used (for example `Q4_0`) + ListResponse: + type: object + properties: + models: + type: array + items: + $ref: "#/components/schemas/ModelSummary" + Ps: + type: object + properties: + model: + type: string + description: Name of the running model + size: + type: integer + description: Size of the model in bytes + digest: + type: string + description: SHA256 digest of the model + details: + type: object + description: Model details such as format and family + expires_at: + type: string + 
description: Time when the model will be unloaded + size_vram: + type: integer + description: VRAM usage in bytes + PsResponse: + type: object + properties: + models: + type: array + items: + $ref: "#/components/schemas/Ps" + description: Currently running models + WebSearchRequest: + type: object + required: [query] + properties: + query: + type: string + description: Search query string + max_results: + type: integer + minimum: 1 + maximum: 10 + default: 5 + description: Maximum number of results to return + WebSearchResult: + type: object + properties: + title: + type: string + description: Page title of the result + url: + type: string + format: uri + description: Resolved URL for the result + content: + type: string + description: Extracted text content snippet + WebSearchResponse: + type: object + properties: + results: + type: array + items: + $ref: "#/components/schemas/WebSearchResult" + description: Array of matching search results + WebFetchRequest: + type: object + required: [url] + properties: + url: + type: string + format: uri + description: The URL to fetch + WebFetchResponse: + type: object + properties: + title: + type: string + description: Title of the fetched page + content: + type: string + description: Extracted page content + links: + type: array + items: + type: string + format: uri + description: Links found on the page + VersionResponse: + type: object + properties: + version: + type: string + description: Version of Ollama + ErrorResponse: + type: object + properties: + error: + type: string + description: Error message describing what went wrong +paths: + /api/generate: + post: + summary: Generate a response + description: Generates a response for the provided prompt + operationId: generate + x-mint: + href: /api/generate + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3", + "prompt": "Why is the sky blue?" 
+ }' + - lang: bash + label: Non-streaming + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3", + "prompt": "Why is the sky blue?", + "stream": false + }' + - lang: bash + label: With options + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3", + "prompt": "Why is the sky blue?", + "options": { + "temperature": 0.8, + "top_p": 0.9, + "seed": 42 + } + }' + - lang: bash + label: Structured outputs + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3", + "prompt": "What are the populations of the United States and Canada?", + "stream": false, + "format": { + "type": "object", + "properties": { + "countries": { + "type": "array", + "items": { + "type": "object", + "properties": { + "country": {"type": "string"}, + "population": {"type": "integer"} + }, + "required": ["country", "population"] + } + } + }, + "required": ["countries"] + } + }' + - lang: bash + label: With images + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3", + "prompt": "What is in this picture?", + "images": 
["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw
4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q
4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"] + }' + - lang: bash + label: Load model + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3" + }' + - lang: bash + label: Unload model + source: | + curl http://localhost:11434/api/generate -d '{ + "model": "gemma3", + "keep_alive": 0 + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/GenerateRequest" + example: + model: gemma3 + prompt: Why is the sky blue? + responses: + "200": + description: Generation responses + content: + application/json: + schema: + $ref: "#/components/schemas/GenerateResponse" + example: + model: "gemma3" + created_at: "2025-10-17T23:14:07.414671Z" + response: "Hello! How can I help you today?" 
+ done: true + done_reason: "stop" + total_duration: 174560334 + load_duration: 101397084 + prompt_eval_count: 11 + prompt_eval_duration: 13074791 + eval_count: 18 + eval_duration: 52479709 + application/x-ndjson: + schema: + $ref: "#/components/schemas/GenerateStreamEvent" + /api/chat: + post: + summary: Generate a chat message + description: Generate the next chat message in a conversation between a user and an assistant. + operationId: chat + x-mint: + href: /api/chat + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/chat -d '{ + "model": "gemma3", + "messages": [ + { + "role": "user", + "content": "why is the sky blue?" + } + ] + }' + - lang: bash + label: Non-streaming + source: | + curl http://localhost:11434/api/chat -d '{ + "model": "gemma3", + "messages": [ + { + "role": "user", + "content": "why is the sky blue?" + } + ], + "stream": false + }' + - lang: bash + label: Structured outputs + source: | + curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{ + "model": "gemma3", + "messages": [ + { + "role": "user", + "content": "What are the populations of the United States and Canada?" + } + ], + "stream": false, + "format": { + "type": "object", + "properties": { + "countries": { + "type": "array", + "items": { + "type": "object", + "properties": { + "country": {"type": "string"}, + "population": {"type": "integer"} + }, + "required": ["country", "population"] + } + } + }, + "required": ["countries"] + } + }' + - lang: bash + label: Tool calling + source: | + curl http://localhost:11434/api/chat -d '{ + "model": "qwen3", + "messages": [ + { + "role": "user", + "content": "What is the weather today in Paris?" 
+ } + ], + "stream": false, + "tools": [ + { + "type": "function", + "function": { + "name": "get_current_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The location to get the weather for, e.g. San Francisco, CA" + }, + "format": { + "type": "string", + "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'", + "enum": ["celsius", "fahrenheit"] + } + }, + "required": ["location", "format"] + } + } + } + ] + }' + - lang: bash + label: Thinking + source: | + curl http://localhost:11434/api/chat -d '{ + "model": "gpt-oss", + "messages": [ + { + "role": "user", + "content": "What is 1+1?" + } + ], + "think": "low" + }' + - lang: bash + label: Images + source: | + curl http://localhost:11434/api/chat -d '{ + "model": "gemma3", + "messages": [ + { + "role": "user", + "content": "What is in this image?", + "images": [ + "iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWp
R5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYo
GKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC" + ] + } + ] + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/ChatRequest" + responses: 
+ "200": + description: Chat response + content: + application/json: + schema: + $ref: "#/components/schemas/ChatResponse" + example: + model: "gemma3" + created_at: "2025-10-17T23:14:07.414671Z" + message: + role: "assistant" + content: "Hello! How can I help you today?" + done: true + done_reason: "stop" + total_duration: 174560334 + load_duration: 101397084 + prompt_eval_count: 11 + prompt_eval_duration: 13074791 + eval_count: 18 + eval_duration: 52479709 + application/x-ndjson: + schema: + $ref: "#/components/schemas/ChatStreamEvent" + /api/embed: + post: + summary: Generate embeddings + description: Creates vector embeddings representing the input text + operationId: embed + x-mint: + href: /api/embed + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/embed -d '{ + "model": "embeddinggemma", + "input": "Why is the sky blue?" + }' + - lang: bash + label: Multiple inputs + source: | + curl http://localhost:11434/api/embed -d '{ + "model": "embeddinggemma", + "input": [ + "Why is the sky blue?", + "Why is the grass green?" 
+ ] + }' + - lang: bash + label: Truncation + source: | + curl http://localhost:11434/api/embed -d '{ + "model": "embeddinggemma", + "input": "Generate embeddings for this text", + "truncate": true + }' + - lang: bash + label: Dimensions + source: | + curl http://localhost:11434/api/embed -d '{ + "model": "embeddinggemma", + "input": "Generate embeddings for this text", + "dimensions": 128 + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/EmbedRequest" + example: + model: embeddinggemma + input: "Generate embeddings for this text" + responses: + "200": + description: Vector embeddings for the input text + content: + application/json: + schema: + $ref: "#/components/schemas/EmbedResponse" + example: + model: "embeddinggemma" + embeddings: + - [ + 0.010071029, + -0.0017594862, + 0.05007221, + 0.04692972, + 0.054916814, + 0.008599704, + 0.105441414, + -0.025878139, + 0.12958129, + 0.031952348, + ] + total_duration: 14143917 + load_duration: 1019500 + prompt_eval_count: 8 + /api/tags: + get: + summary: List models + description: Fetch a list of models and their details + operationId: list + x-mint: + href: /api/tags + x-codeSamples: + - lang: bash + label: List models + source: | + curl http://localhost:11434/api/tags + responses: + "200": + description: List available models + content: + application/json: + schema: + $ref: "#/components/schemas/ListResponse" + example: + models: + - name: "gemma3" + modified_at: "2025-10-03T23:34:03.409490317-07:00" + size: 3338801804 + digest: "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a" + details: + format: "gguf" + family: "gemma" + families: + - "gemma" + parameter_size: "4.3B" + quantization_level: "Q4_K_M" + /api/ps: + get: + summary: List running models + description: Retrieve a list of models that are currently running + operationId: ps + x-mint: + href: /api/ps + x-codeSamples: + - lang: bash + label: List running models + source: | + curl 
http://localhost:11434/api/ps + responses: + "200": + description: Models currently loaded into memory + content: + application/json: + schema: + $ref: "#/components/schemas/PsResponse" + example: + models: + - model: "gemma3" + size: 6591830464 + digest: "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a" + details: + parent_model: "" + format: "gguf" + family: "gemma3" + families: + - "gemma3" + parameter_size: "4.3B" + quantization_level: "Q4_K_M" + expires_at: "2025-10-17T16:47:07.93355-07:00" + size_vram: 5333539264 + context_length: 4096 + /api/show: + post: + summary: Show model details + operationId: show + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/show -d '{ + "model": "gemma3" + }' + - lang: bash + label: Verbose + source: | + curl http://localhost:11434/api/show -d '{ + "model": "gemma3", + "verbose": true + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/ShowRequest" + example: + model: gemma3 + responses: + "200": + description: Model information + content: + application/json: + schema: + $ref: "#/components/schemas/ShowResponse" + example: + parameters: "temperature 0.7\nnum_ctx 2048" + license: "Gemma Terms of Use \n\nLast modified: February 21, 2024..." 
+ capabilities: + - "completion" + - "vision" + modified_at: "2025-08-14T15:49:43.634137516-07:00" + details: + parent_model: "" + format: "gguf" + family: "gemma3" + families: + - "gemma3" + parameter_size: "4.3B" + quantization_level: "Q4_K_M" + model_info: + gemma3.attention.head_count: 8 + gemma3.attention.head_count_kv: 4 + gemma3.attention.key_length: 256 + gemma3.attention.sliding_window: 1024 + gemma3.attention.value_length: 256 + gemma3.block_count: 34 + gemma3.context_length: 131072 + gemma3.embedding_length: 2560 + gemma3.feed_forward_length: 10240 + gemma3.mm.tokens_per_image: 256 + gemma3.vision.attention.head_count: 16 + gemma3.vision.attention.layer_norm_epsilon: 0.000001 + gemma3.vision.block_count: 27 + gemma3.vision.embedding_length: 1152 + gemma3.vision.feed_forward_length: 4304 + gemma3.vision.image_size: 896 + gemma3.vision.num_channels: 3 + gemma3.vision.patch_size: 14 + general.architecture: "gemma3" + general.file_type: 15 + general.parameter_count: 4299915632 + general.quantization_version: 2 + tokenizer.ggml.add_bos_token: true + tokenizer.ggml.add_eos_token: false + tokenizer.ggml.add_padding_token: false + tokenizer.ggml.add_unknown_token: false + tokenizer.ggml.bos_token_id: 2 + tokenizer.ggml.eos_token_id: 1 + tokenizer.ggml.merges: null + tokenizer.ggml.model: "llama" + tokenizer.ggml.padding_token_id: 0 + tokenizer.ggml.pre: "default" + tokenizer.ggml.scores: null + tokenizer.ggml.token_type: null + tokenizer.ggml.tokens: null + tokenizer.ggml.unknown_token_id: 3 + /api/create: + post: + summary: Create a model + operationId: create + x-mint: + href: /api/create + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/create -d '{ + "from": "gemma3", + "model": "alpaca", + "system": "You are Alpaca, a helpful AI assistant. You only answer with Emojis." 
+ }' + - lang: bash + label: Create from existing + source: | + curl http://localhost:11434/api/create -d '{ + "model": "ollama", + "from": "gemma3", + "system": "You are Ollama the llama." + }' + - lang: bash + label: Quantize + source: | + curl http://localhost:11434/api/create -d '{ + "model": "llama3.1:8b-instruct-Q4_K_M", + "from": "llama3.1:8b-instruct-fp16", + "quantize": "q4_K_M" + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/CreateRequest" + example: + model: mario + from: gemma3 + system: "You are Mario from Super Mario Bros." + responses: + "200": + description: Stream of create status updates + content: + application/json: + schema: + $ref: "#/components/schemas/StatusResponse" + example: + status: "success" + application/x-ndjson: + schema: + $ref: "#/components/schemas/StatusEvent" + example: + status: "success" + /api/copy: + post: + summary: Copy a model + operationId: copy + x-mint: + href: /api/copy + x-codeSamples: + - lang: bash + label: Copy a model to a new name + source: | + curl http://localhost:11434/api/copy -d '{ + "source": "gemma3", + "destination": "gemma3-backup" + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/CopyRequest" + example: + source: gemma3 + destination: gemma3-backup + /api/pull: + post: + summary: Pull a model + operationId: pull + x-mint: + href: /api/pull + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/pull -d '{ + "model": "gemma3" + }' + - lang: bash + label: Non-streaming + source: | + curl http://localhost:11434/api/pull -d '{ + "model": "gemma3", + "stream": false + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/PullRequest" + example: + model: gemma3 + responses: + "200": + description: Pull status updates. 
+ content: + application/json: + schema: + $ref: "#/components/schemas/StatusResponse" + example: + status: "success" + application/x-ndjson: + schema: + $ref: "#/components/schemas/StatusEvent" + example: + status: "success" + /api/push: + post: + summary: Push a model + operationId: push + x-mint: + href: /api/push + x-codeSamples: + - lang: bash + label: Push model + source: | + curl http://localhost:11434/api/push -d '{ + "model": "my-username/my-model" + }' + - lang: bash + label: Non-streaming + source: | + curl http://localhost:11434/api/push -d '{ + "model": "my-username/my-model", + "stream": false + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/PushRequest" + example: + model: my-username/my-model + responses: + "200": + description: Push status updates. + content: + application/json: + schema: + $ref: "#/components/schemas/StatusResponse" + example: + status: "success" + application/x-ndjson: + schema: + $ref: "#/components/schemas/StatusEvent" + example: + status: "success" + /api/delete: + delete: + summary: Delete a model + operationId: delete + x-mint: + href: /api/delete + x-codeSamples: + - lang: bash + label: Delete model + source: | + curl -X DELETE http://localhost:11434/api/delete -d '{ + "model": "gemma3" + }' + requestBody: + required: true + content: + application/json: + schema: + $ref: "#/components/schemas/DeleteRequest" + example: + model: gemma3 + responses: + "200": + description: Deletion status updates. 
+ content: + application/json: + schema: + $ref: "#/components/schemas/StatusResponse" + example: + status: "success" + application/x-ndjson: + schema: + $ref: "#/components/schemas/StatusEvent" + /api/version: + get: + summary: Get version + description: Retrieve the version of the Ollama + operationId: version + x-codeSamples: + - lang: bash + label: Default + source: | + curl http://localhost:11434/api/version + responses: + "200": + description: Version information + content: + application/json: + schema: + $ref: "#/components/schemas/VersionResponse" + example: + version: "0.12.6" diff --git a/docs/quickstart.mdx b/docs/quickstart.mdx new file mode 100644 index 000000000..5ef9fa825 --- /dev/null +++ b/docs/quickstart.mdx @@ -0,0 +1,103 @@ +--- +title: Quickstart +--- + +This quickstart will walk your through running your first model with Ollama. To get started, download Ollama on macOS, Windows or Linux. + + + Download Ollama + + +## Run a model + + + + Open a terminal and run the command: + + ``` + ollama run gemma3 + ``` + + + + ``` + ollama pull gemma3 + ``` + + Lastly, chat with the model: + + ```shell + curl http://localhost:11434/api/chat -d '{ + "model": "gemma3", + "messages": [{ + "role": "user", + "content": "Hello there!" 
+ }], + "stream": false + }' + ``` + + + + Start by downloading a model: + + ``` + ollama pull gemma3 + ``` + + Then install Ollama's Python library: + + ``` + pip install ollama + ``` + + Lastly, chat with the model: + + ```python + from ollama import chat + from ollama import ChatResponse + + response: ChatResponse = chat(model='gemma3', messages=[ + { + 'role': 'user', + 'content': 'Why is the sky blue?', + }, + ]) + print(response['message']['content']) + # or access fields directly from the response object + print(response.message.content) + ``` + + + + Start by downloading a model: + + ``` + ollama pull gemma3 + ``` + + Then install the Ollama JavaScript library: + ``` + npm i ollama + ``` + + Lastly, chat with the model: + + ```shell + import ollama from 'ollama' + + const response = await ollama.chat({ + model: 'gemma3', + messages: [{ role: 'user', content: 'Why is the sky blue?' }], + }) + console.log(response.message.content) + ``` + + + + +See a full list of available models [here](https://ollama.com/models). diff --git a/docs/styling.css b/docs/styling.css new file mode 100644 index 000000000..e63b6be85 --- /dev/null +++ b/docs/styling.css @@ -0,0 +1,16 @@ +body { + font-family: ui-sans-serif, system-ui, sans-serif, Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji; +} + +pre, code, .font-mono { + font-family: ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace; +} + +.nav-logo { + height: 44px; +} + +.eyebrow { + color: #666; + font-weight: 400; +} diff --git a/docs/template.mdx b/docs/template.mdx index 636d72f0d..9ebac8c08 100644 --- a/docs/template.mdx +++ b/docs/template.mdx @@ -1,4 +1,6 @@ -# Template +--- +title: Template +--- Ollama provides a powerful templating engine backed by Go's built-in templating engine to construct prompts for your large language model. This feature is a valuable tool to get the most out of your models. 
@@ -6,13 +8,13 @@ Ollama provides a powerful templating engine backed by Go's built-in templating A basic Go template consists of three main parts: -* **Layout**: The overall structure of the template. -* **Variables**: Placeholders for dynamic data that will be replaced with actual values when the template is rendered. -* **Functions**: Custom functions or logic that can be used to manipulate the template's content. +- **Layout**: The overall structure of the template. +- **Variables**: Placeholders for dynamic data that will be replaced with actual values when the template is rendered. +- **Functions**: Custom functions or logic that can be used to manipulate the template's content. Here's an example of a simple chat template: -```go +```gotmpl {{- range .Messages }} {{ .Role }}: {{ .Content }} {{- end }} @@ -20,9 +22,9 @@ Here's an example of a simple chat template: In this example, we have: -* A basic messages structure (layout) -* Three variables: `Messages`, `Role`, and `Content` (variables) -* A custom function (action) that iterates over an array of items (`range .Messages`) and displays each item +- A basic messages structure (layout) +- Three variables: `Messages`, `Role`, and `Content` (variables) +- A custom function (action) that iterates over an array of items (`range .Messages`) and displays each item ## Adding templates to your model @@ -61,7 +63,7 @@ TEMPLATE """{{- if .System }}<|start_header_id|>system<|end_header_id|> `Messages[].Role` (string): role which can be one of `system`, `user`, `assistant`, or `tool` -`Messages[].Content` (string): message content +`Messages[].Content` (string): message content `Messages[].ToolCalls` (list): list of tools the model wants to call @@ -99,9 +101,9 @@ TEMPLATE """{{- if .System }}<|start_header_id|>system<|end_header_id|> Keep the following tips and best practices in mind when working with Go templates: -* **Be mindful of dot**: Control flow structures like `range` and `with` changes the value `.` -* 
**Out-of-scope variables**: Use `$.` to reference variables not currently in scope, starting from the root -* **Whitespace control**: Use `-` to trim leading (`{{-`) and trailing (`-}}`) whitespace +- **Be mindful of dot**: Control flow structures like `range` and `with` changes the value `.` +- **Out-of-scope variables**: Use `$.` to reference variables not currently in scope, starting from the root +- **Whitespace control**: Use `-` to trim leading (`{{-`) and trailing (`-}}`) whitespace ## Examples @@ -155,13 +157,14 @@ CodeLlama [7B](https://ollama.com/library/codellama:7b-code) and [13B](https://o
 {{ .Prompt }} {{ .Suffix }} 
 ```
 
-> [!NOTE]
-> CodeLlama 34B and 70B code completion and all instruct and Python fine-tuned models do not support fill-in-middle.
+<Note>
+  CodeLlama 34B and 70B code completion and all instruct and Python fine-tuned models do not support fill-in-middle.
+</Note>
 
 #### Codestral
 
 Codestral [22B](https://ollama.com/library/codestral:22b) supports fill-in-middle.
 
-```go
+```gotmpl
 [SUFFIX]{{ .Suffix }}[PREFIX] {{ .Prompt }}
 ```
diff --git a/docs/troubleshooting.mdx b/docs/troubleshooting.mdx
index 18c014d19..ec662572a 100644
--- a/docs/troubleshooting.mdx
+++ b/docs/troubleshooting.mdx
@@ -1,4 +1,7 @@
-# How to troubleshoot issues
+---
+title: Troubleshooting
+description: How to troubleshoot issues encountered with Ollama
+---
 
 Sometimes Ollama may not perform as expected. One of the best ways to figure out what happened is to take a look at the logs. Find the logs on **Mac** by running the command:
 
@@ -23,9 +26,11 @@ docker logs 
 If manually running `ollama serve` in a terminal, the logs will be on that terminal.
 
 When you run Ollama on **Windows**, there are a few different locations. You can view them in the explorer window by hitting `+R` and type in:
-- `explorer %LOCALAPPDATA%\Ollama` to view logs.  The most recent server logs will be in `server.log` and older logs will be in `server-#.log`
+
+- `explorer %LOCALAPPDATA%\Ollama` to view logs. The most recent server logs will be in `server.log` and older logs will be in `server-#.log`
 - `explorer %LOCALAPPDATA%\Programs\Ollama` to browse the binaries (The installer adds this to your user PATH)
 - `explorer %HOMEPATH%\.ollama` to browse where models and configuration is stored
+- `explorer %TEMP%` where temporary executable files are stored in one or more `ollama*` directories
 
 To enable additional debug logging to help troubleshoot problems, first **Quit the running app from the tray menu** then in a powershell terminal
 
@@ -38,14 +43,26 @@ Join the [Discord](https://discord.gg/ollama) for help interpreting the logs.
 
 ## LLM libraries
 
-Ollama includes multiple LLM libraries compiled for different GPU libraries and versions. Ollama tries to pick the best one based on the capabilities of your system. If this autodetection has problems, or you run into other problems (e.g. crashes in your GPU) you can workaround this by forcing a specific LLM library.
+Ollama includes multiple LLM libraries compiled for different GPUs and CPU vector features. Ollama tries to pick the best one based on the capabilities of your system. If this autodetection has problems, or you run into other problems (e.g. crashes in your GPU), you can work around this by forcing a specific LLM library. `cpu_avx2` will perform the best, followed by `cpu_avx`, and the slowest but most compatible is `cpu`. Rosetta emulation under macOS will work with the `cpu` library.
+
+In the server log, you will see a message that looks something like this (varies from release to release):
+
+```
+Dynamic LLM libraries [rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11 rocm_v5]
+```
 
 **Experimental LLM Library Override**
 
-You can set OLLAMA_LLM_LIBRARY to any of the available LLM libraries to limit autodetection, so for example, if you have both CUDA and AMD GPUs, but want to force the CUDA v13 only, use:
+You can set `OLLAMA_LLM_LIBRARY` to any of the available LLM libraries to bypass autodetection. For example, if you have a CUDA card but want to force the CPU LLM library with AVX2 vector support, use:
 
 ```shell
-OLLAMA_LLM_LIBRARY="cuda_v13" ollama serve
+OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve
+```
+
+You can see what features your CPU has with the following command:
+
+```shell
+cat /proc/cpuinfo| grep flags | head -1
 ```
 
 ## Installing older or pre-release versions on Linux
@@ -56,13 +73,17 @@ If you run into problems on Linux and want to install an older version, or you'd
 curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.7 sh
 ```
 
+## Linux tmp noexec
+
+If the location where Ollama stores its temporary executable files is mounted with the `noexec` flag, you can specify an alternate location by setting `OLLAMA_TMPDIR` to a directory writable by the user Ollama runs as. For example, `OLLAMA_TMPDIR=/usr/share/ollama/`.
+
 ## Linux docker
 
-If Ollama initially works on the GPU in a docker container, but then switches to running on CPU after some period of time with errors in the server log reporting GPU discovery failures, this can be resolved by disabling systemd cgroup management in Docker.  Edit `/etc/docker/daemon.json` on the host and add `"exec-opts": ["native.cgroupdriver=cgroupfs"]` to the docker configuration.
+If Ollama initially works on the GPU in a docker container, but then switches to running on CPU after some period of time with errors in the server log reporting GPU discovery failures, this can be resolved by disabling systemd cgroup management in Docker. Edit `/etc/docker/daemon.json` on the host and add `"exec-opts": ["native.cgroupdriver=cgroupfs"]` to the docker configuration.
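
A minimal sketch of that change, assuming the host has no other Docker daemon settings yet: the snippet writes the setting to a scratch file (the `example` variable is only for this illustration) so it can be reviewed before being merged into `/etc/docker/daemon.json`. If the real file already contains other keys, add the `exec-opts` entry to the existing JSON object rather than replacing the file.

```shell
# Write the example configuration to a scratch file instead of touching the host config.
example="$(mktemp)"
cat > "$example" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF

# Review the result before copying or merging it into /etc/docker/daemon.json.
cat "$example"

# Docker only reads daemon.json at startup, so restart it afterwards,
# for example on systemd hosts: sudo systemctl restart docker
```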
 
 ## NVIDIA GPU Discovery
 
-When Ollama starts up, it takes inventory of the GPUs present in the system to determine compatibility and how much VRAM is available.  Sometimes this discovery can fail to find your GPUs.  In general, running the latest driver will yield the best results.
+When Ollama starts up, it takes inventory of the GPUs present in the system to determine compatibility and how much VRAM is available. Sometimes this discovery can fail to find your GPUs. In general, running the latest driver will yield the best results.
 
 ### Linux NVIDIA Troubleshooting
 
@@ -70,28 +91,26 @@ If you are using a container to run Ollama, make sure you've set up the containe
 
 Sometimes Ollama can have difficulties initializing the GPU. When you check the server logs, this can show up as various error codes, such as "3" (not initialized), "46" (device unavailable), "100" (no device), "999" (unknown), or others. The following troubleshooting techniques may help resolve the problem:
 
-- If you are using a container, is the container runtime working?  Try `docker run --gpus all ubuntu nvidia-smi` - if this doesn't work, Ollama won't be able to see your NVIDIA GPU.
+- If you are using a container, is the container runtime working? Try `docker run --gpus all ubuntu nvidia-smi` - if this doesn't work, Ollama won't be able to see your NVIDIA GPU.
 - Is the uvm driver loaded? `sudo nvidia-modprobe -u`
 - Try reloading the nvidia_uvm driver - `sudo rmmod nvidia_uvm` then `sudo modprobe nvidia_uvm`
 - Try rebooting
 - Make sure you're running the latest nvidia drivers
 
 If none of those resolve the problem, gather additional information and file an issue:
+
 - Set `CUDA_ERROR_LEVEL=50` and try again to get more diagnostic logs
 - Check dmesg for any errors `sudo dmesg | grep -i nvrm` and `sudo dmesg | grep -i nvidia`
 
-You may get more details for initialization failures by enabling debug prints in the uvm driver.  You should only use this temporarily while troubleshooting
-- `sudo rmmod nvidia_uvm` then `sudo modprobe nvidia_uvm uvm_debug_prints=1`
-
-
 ## AMD GPU Discovery
 
-On linux, AMD GPU access typically requires `video` and/or `render` group membership to access the `/dev/kfd` device.  If permissions are not set up correctly, Ollama will detect this and report an error in the server log.
+On Linux, AMD GPU access typically requires `video` and/or `render` group membership to access the `/dev/kfd` device. If permissions are not set up correctly, Ollama will detect this and report an error in the server log.
 
-When running in a container, in some Linux distributions and container runtimes, the ollama process may be unable to access the GPU.  Use `ls -lnd /dev/kfd /dev/dri /dev/dri/*` on the host system to determine the **numeric** group IDs on your system, and pass additional `--group-add ...` arguments to the container so it can access the required devices.   For example, in the following output `crw-rw---- 1 0  44 226,   0 Sep 16 16:55 /dev/dri/card0` the group ID column is `44`
+When running in a container, in some Linux distributions and container runtimes, the ollama process may be unable to access the GPU. Use `ls -lnd /dev/kfd /dev/dri /dev/dri/*` on the host system to determine the **numeric** group IDs on your system, and pass additional `--group-add ...` arguments to the container so it can access the required devices. For example, in the following output `crw-rw---- 1 0  44 226,   0 Sep 16 16:55 /dev/dri/card0` the group ID column is `44`.
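
The lookup described above can be sketched as a small script that prints one `--group-add` argument per distinct numeric group ID. The `group_add_args` helper is hypothetical and relies on GNU `stat`; the remaining `docker run` arguments (image, devices, ports) are left for you to fill in.

```shell
# Hypothetical helper: print "--group-add <gid>" for each distinct numeric
# group ID owning the given paths. Paths that do not exist are skipped.
group_add_args() {
  stat -c '%g' "$@" 2>/dev/null | sort -un | sed 's/^/--group-add /' | tr '\n' ' '
}

# Show the arguments to add to your own "docker run" command line.
echo "Extra docker run arguments: $(group_add_args /dev/kfd /dev/dri/*)"
```

On a host without these device nodes the list is simply empty.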
 
 If you are experiencing problems getting Ollama to correctly discover or use your GPU for inference, the following may help isolate the failure.
-- `AMD_LOG_LEVEL=3` Enable info log levels in the AMD HIP/ROCm libraries.  This can help show more detailed error codes that can help troubleshoot problems
+
+- `AMD_LOG_LEVEL=3` Enable info log levels in the AMD HIP/ROCm libraries. This can help show more detailed error codes that can help troubleshoot problems
 - `OLLAMA_DEBUG=1` During GPU discovery additional information will be reported
 - Check dmesg for any errors from amdgpu or kfd drivers `sudo dmesg | grep -i amdgpu` and `sudo dmesg | grep -i kfd`
 
@@ -103,4 +122,4 @@ If you experience gibberish responses when models load across multiple AMD GPUs
 
 ## Windows Terminal Errors
 
-Older versions of Windows 10 (e.g., 21H1) are known to have a bug where the standard terminal program does not display control characters correctly.  This can result in a long string of strings like `←[?25h←[?25l` being displayed, sometimes erroring with `The parameter is incorrect`  To resolve this problem, please update to Win 10 22H1 or newer.
+Older versions of Windows 10 (e.g., 21H1) are known to have a bug where the standard terminal program does not display control characters correctly. This can result in long runs of escape sequences like `←[?25h←[?25l` being displayed, sometimes erroring with `The parameter is incorrect`. To resolve this problem, please update to Windows 10 22H2 or newer.
diff --git a/docs/windows.mdx b/docs/windows.mdx
index eb067ed04..b76e37c88 100644
--- a/docs/windows.mdx
+++ b/docs/windows.mdx
@@ -1,4 +1,6 @@
-# Ollama Windows
+---
+title: Windows
+---
 
 Welcome to Ollama for Windows.
 
@@ -7,20 +9,20 @@ No more WSL required!
 Ollama now runs as a native Windows application, including NVIDIA and AMD Radeon GPU support.
 After installing Ollama for Windows, Ollama will run in the background and
 the `ollama` command line is available in `cmd`, `powershell` or your favorite
-terminal application. As usual the Ollama [api](./api.md) will be served on
+terminal application. As usual the Ollama [API](/api) will be served on
 `http://localhost:11434`.
 
 ## System Requirements
 
-* Windows 10 22H2 or newer, Home or Pro
-* NVIDIA 452.39 or newer Drivers if you have an NVIDIA card
-* AMD Radeon Driver https://www.amd.com/en/support if you have a Radeon card
+- Windows 10 22H2 or newer, Home or Pro
+- NVIDIA 452.39 or newer Drivers if you have an NVIDIA card
+- AMD Radeon Driver https://www.amd.com/en/support if you have a Radeon card
 
 Ollama uses unicode characters for progress indication, which may render as unknown squares in some older terminal fonts in Windows 10. If you see this, try changing your terminal font settings.
 
 ## Filesystem Requirements
 
-The Ollama install does not require Administrator, and installs in your home directory by default.  You'll need at least 4GB of space for the binary install.  Once you've installed Ollama, you'll need additional space for storing the Large Language models, which can be tens to hundreds of GB in size.  If your home directory doesn't have enough space, you can change where the binaries are installed, and where the models are stored.
+The Ollama install does not require Administrator, and installs in your home directory by default. You'll need at least 4GB of space for the binary install. Once you've installed Ollama, you'll need additional space for storing the Large Language models, which can be tens to hundreds of GB in size. If your home directory doesn't have enough space, you can change where the binaries are installed, and where the models are stored.
 
 ### Changing Install Location
 
@@ -30,6 +32,20 @@ To install the Ollama application in a location different than your home directo
 OllamaSetup.exe /DIR="d:\some\location"
 ```
 
+### Changing Model Location
+
+To change where Ollama stores the downloaded models instead of using your home directory, set the environment variable `OLLAMA_MODELS` in your user account.
+
+1. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for _environment variables_.
+
+2. Click on _Edit environment variables for your account_.
+
+3. Edit or create the `OLLAMA_MODELS` variable for your user account and set it to the directory where you want the models stored.
+
+4. Click OK/Apply to save.
+
+If Ollama is already running, quit the tray application and relaunch it from the Start menu, or from a new terminal started after you saved the environment variables.
+
 ## API Access
 
 Here's a quick example showing API access from `powershell`
@@ -40,22 +56,24 @@ Here's a quick example showing API access from `powershell`
 
 ## Troubleshooting
 
-Ollama on Windows stores files in a few different locations.  You can view them in
+Ollama on Windows stores files in a few different locations. You can view them in
 the explorer window by hitting `+R` and type in:
+
 - `explorer %LOCALAPPDATA%\Ollama` contains logs, and downloaded updates
-    - *app.log* contains most resent logs from the GUI application
-    - *server.log* contains the most recent server logs
-    - *upgrade.log* contains log output for upgrades
+  - _app.log_ contains the most recent logs from the GUI application
+  - _server.log_ contains the most recent server logs
+  - _upgrade.log_ contains log output for upgrades
 - `explorer %LOCALAPPDATA%\Programs\Ollama` contains the binaries (The installer adds this to your user PATH)
 - `explorer %HOMEPATH%\.ollama` contains models and configuration
+- `explorer %TEMP%` contains temporary executable files in one or more `ollama*` directories
 
 ## Uninstall
 
-The Ollama Windows installer registers an Uninstaller application.  Under `Add or remove programs` in Windows Settings, you can uninstall Ollama.
-
-> [!NOTE]
-> If you have [changed the OLLAMA_MODELS location](#changing-model-location), the installer will not remove your downloaded models
+The Ollama Windows installer registers an Uninstaller application. Under `Add or remove programs` in Windows Settings, you can uninstall Ollama.
 
+<Note>
+  If you have [changed the OLLAMA_MODELS location](#changing-model-location), the installer will not remove your downloaded models.
+</Note>
 
 ## Standalone CLI
 
@@ -66,11 +84,12 @@ help you keep up to date.
 
 If you'd like to install or integrate Ollama as a service, a standalone
 `ollama-windows-amd64.zip` zip file is available containing only the Ollama CLI
-and GPU library dependencies for Nvidia.  If you have an AMD GPU, also download
+and GPU library dependencies for NVIDIA. If you have an AMD GPU, also download
 and extract the additional ROCm package `ollama-windows-amd64-rocm.zip` into the
-same directory.  Both zip files are necessary for a complete AMD installation.
-This allows for embedding Ollama in existing applications, or running it as a
-system service via `ollama serve` with tools such as [NSSM](https://nssm.cc/). 
+same directory. This allows for embedding Ollama in existing applications, or
+running it as a system service via `ollama serve` with tools such as
+[NSSM](https://nssm.cc/).
 
-> [!NOTE]  
-> If you are upgrading from a prior version, you should remove the old directories first.
+<Note>
+  If you are upgrading from a prior version, you should remove the old directories first.
+</Note>