Commit Graph

200 Commits

Author SHA1 Message Date
Jeffrey Morgan 15c2d8fe14
server: parallelize embeddings in API web handler instead of in subprocess runner (#6220)
For simplicity, perform parallelization of embedding requests in the API handler instead of offloading this to the subprocess runner. This keeps the scheduling story simpler as it builds on existing parallel requests, similar to existing text completion functionality.
2024-08-11 11:57:10 -07:00
Daniel Hiltgen 25906d72d1
llm: prevent loading too large models on windows (#5926)
Don't allow loading models that would lead to memory exhaustion (across vram, system memory and disk paging). This check was already applied on Linux but should also be applied on Windows as well.
2024-08-11 11:30:20 -07:00
Jeffrey Morgan de4fc29773
llm: reserve required number of slots for embeddings (#6219) 2024-08-06 23:20:49 -04:00
Daniel Hiltgen f457d63400 Implement linux NUMA detection
If the system has multiple numa nodes, enable numa support in llama.cpp
If we detect numactl in the path, use that, else use the basic "distribute" mode.
2024-08-05 12:56:20 -07:00
Michael Yang b732beba6a lint 2024-08-01 17:06:06 -07:00
Michael Yang 5c1912769e
Merge pull request #5473 from ollama/mxyng/environ
fix: environ lookup
2024-07-31 10:18:05 -07:00
royjhan 1b44d873e7
Add Metrics to `api\embed` response (#5709)
* add prompt tokens to embed response

* rm slog

* metrics

* types

* prompt n

* clean up

* reset submodule

* update tests

* test name

* list metrics
2024-07-30 13:12:21 -07:00
Tibor Schmidt f3d7a481b7
feat: add support for min_p (resolve #1142) (#1825) 2024-07-27 14:37:40 -07:00
Daniel Hiltgen e12fff8810 Enable windows error dialog for subprocess startup
Make sure if something goes wrong spawning the process, the user gets
enough info to be able to try to self correct, or at least file a bug
with details so we can fix it.  Once the process starts, we immediately
change back to the recommended setting to prevent the blocking dialog.
This ensures if the model fails to load (OOM, unsupported model type,
etc.) the process will exit quickly and we can scan the stdout/stderr
of the subprocess for the reason to report via API.
2024-07-22 14:07:27 -07:00
Michael Yang e2c3f6b3e2 string 2024-07-22 11:27:52 -07:00
Michael Yang 55cd3ddcca bool 2024-07-22 11:27:21 -07:00
Michael Yang 35b89b2eab rfc: dynamic environ lookup 2024-07-22 11:25:30 -07:00
Daniel Hiltgen a3c20e3f18 Refine error reporting for subprocess crash
On windows, the exit status winds up being the search term many
users search for and end up piling in on issues that are unrelated.
This refines the reporting so that if we have a more detailed message
we'll suppress the exit status portion of the message.
2024-07-22 08:52:16 -07:00
Daniel Hiltgen 283948c83b Adjust windows ROCm discovery
The v5 hip library returns unsupported GPUs which wont enumerate at
inference time in the runner so this makes sure we align discovery.  The
gfx906 cards are no longer supported so we shouldn't compile with that
GPU type as it wont enumerate at runtime.
2024-07-20 15:17:50 -07:00
royjhan b9f5e16c80
Introduce `/api/embed` endpoint supporting batch embedding (#5127)
* Initial Batch Embedding

* Revert "Initial Batch Embedding"

This reverts commit c22d54895a.

* Initial Draft

* mock up notes

* api/embed draft

* add server function

* check normalization

* clean up

* normalization

* playing around with truncate stuff

* Truncation

* Truncation

* move normalization to go

* Integration Test Template

* Truncation Integration Tests

* Clean up

* use float32

* move normalize

* move normalize test

* refactoring

* integration float32

* input handling and handler testing

* Refactoring of legacy and new

* clear comments

* merge conflicts

* touches

* embedding type 64

* merge conflicts

* fix hanging on single string

* refactoring

* test values

* set context length

* clean up

* testing clean up

* testing clean up

* remove function closure

* Revert "remove function closure"

This reverts commit 55d48c6ed1.

* remove function closure

* remove redundant error check

* clean up

* more clean up

* clean up
2024-07-15 12:14:24 -07:00
Jeffrey Morgan ef98803d63
llm: looser checks for minimum memory (#5677) 2024-07-13 09:20:05 -07:00
Jeffrey Morgan c4cf8ad559
llm: avoid loading model if system memory is too small (#5637)
* llm: avoid loading model if system memory is too small

* update log

* Instrument swap free space

On linux and windows, expose how much swap space is available
so we can take that into consideration when scheduling models

* use `systemSwapFreeMemory` in check

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2024-07-11 16:42:57 -07:00
Jeffrey Morgan 791650ddef
sched: only error when over-allocating system memory (#5626) 2024-07-11 00:53:12 -07:00
Daniel Hiltgen 22c81f62ec Remove duplicate merge glitch 2024-07-10 09:01:33 -07:00
Michael Yang 9bbddc37a7
Merge pull request #5126 from ollama/mxyng/messages
update message processing
2024-07-09 09:20:44 -07:00
Jeffrey Morgan 53da2c6965
llm: remove ambiguous comment when putting upper limit on predictions to avoid infinite generation (#5535) 2024-07-07 14:32:05 -04:00
Michael Yang ac7a842e55 fix model reloading
ensure runtime model changes (template, system prompt, messages,
options) are captured on model updates without needing to reload the
server
2024-07-05 13:17:25 -07:00
Daniel Hiltgen ccd7785859
Merge pull request #5243 from dhiltgen/modelfile_use_mmap
Fix use_mmap for modefiles
2024-07-03 13:59:42 -07:00
Daniel Hiltgen 0e982bc1f4 Fix corner cases on tmp cleaner on mac
When ollama is running a long time, tmp cleaners can remove the
runners.  This tightens up a few corner cases on arm macs where
we failed with "server cpu not listed in available servers map[]"
2024-07-03 13:10:14 -07:00
Josh Yan 33a65e3ba3 error 2024-07-01 16:04:13 -07:00
Daniel Hiltgen 97c9e11768 Switch use_mmap to a pointer type
This uses nil as undefined for a cleaner implementation.
2024-07-01 08:44:59 -07:00
Daniel Hiltgen 3518aaef33
Merge pull request #4218 from dhiltgen/auto_parallel
Enable concurrency by default
2024-07-01 08:32:29 -07:00
Blake Mizerany cb42e607c5
llm: speed up gguf decoding by a lot (#5246)
Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in a
    not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
2024-06-24 21:47:52 -07:00
Daniel Hiltgen 17b7186cd7 Enable concurrency by default
This adjusts our default settings to enable multiple models and parallel
requests to a single model.  Users can still override these by the same
env var settings as before.  Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small VRAM GPUs
so this change also refines the algorithm so that when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s).  As before, multiple models will only load
concurrently if they fully fit in VRAM.
2024-06-21 15:45:05 -07:00
Daniel Hiltgen 5bf5aeec01 Refine mmap default logic on linux
If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.
2024-06-20 11:07:04 -07:00
Daniel Hiltgen 96624aa412
Merge pull request #5072 from dhiltgen/windows_path
Move libraries out of users path
2024-06-19 09:13:39 -07:00
Daniel Hiltgen 7784ca33ce Tighten up memory prediction logging
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
2024-06-18 09:15:35 -07:00
Daniel Hiltgen 171796791f Adjust mmap logic for cuda windows for faster model load
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off.  This also implements a tri-state for
use_mmap so we can detect the difference between a user provided
value of true/false, or unspecified.
2024-06-17 16:54:30 -07:00
Daniel Hiltgen b2799f111b Move libraries out of users path
We update the PATH on windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
2024-06-17 13:12:18 -07:00
Daniel Hiltgen da3bf23354 Workaround gfx900 SDMA bugs
Implement support for GPU env var workarounds, and leverage
this for the Vega RX 56 which needs
HSA_ENABLE_SDMA=0 set to work properly
2024-06-14 15:38:13 -07:00
Daniel Hiltgen 6f351bf586 review comments and coverage 2024-06-14 14:55:50 -07:00
Daniel Hiltgen fc37c192ae Refine CPU load behavior with system memory visibility 2024-06-14 14:51:40 -07:00
Daniel Hiltgen 6fd04ca922 Improve multi-gpu handling at the limit
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
2024-06-14 14:51:40 -07:00
Craig Hughes b84aea1685
Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. (#3782) 2024-06-09 10:57:09 -07:00
Michael Yang e40145a39d lint 2024-06-04 11:13:30 -07:00
Michael Yang c895a7d13f some gocritic 2024-06-04 11:13:30 -07:00
Michael Yang 829ff87bd1
revert tokenize ffi (#4761)
* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65dbb.

* Revert "vocab only"

This reverts commit bf54c845e9.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a0410.
2024-05-31 18:54:21 -07:00
Jeffrey Morgan a50a87a7b8
partial offloading: allow flash attention and disable mmap (#4734)
* partial offloading: allow flash attention and disable mmap

* allow mmap with num_gpu=0
2024-05-30 16:58:01 -07:00
Michael Yang 26a00a0410 use ffi for tokenizing/detokenizing 2024-05-29 11:26:47 -07:00
Daniel Hiltgen 92c81e8117 Give the final model loading more time
On some systems, 1 minute isn't sufficient to finish the load after it
hits 100% This creates 2 distinct timers, although they're both set to
the same value for now so we can refine the timeouts further.
2024-05-28 09:08:10 -07:00
Lei Jitang 7487229c34
llm/server.go: Fix 2 minor typos (#4661)
Signed-off-by: Lei Jitang <leijitang@outlook.com>
2024-05-27 17:21:10 -07:00
Daniel Hiltgen 0165ba1651
Merge pull request #4638 from dhiltgen/better_error
Report better warning on client closed abort of load
2024-05-25 14:32:28 -07:00
Daniel Hiltgen c4209d6d21 Report better warning on client closed abort of load
If the client closes the connection before we finish loading the model
we abort, so lets make the log message clearer why to help users
understand this failure mode
2024-05-25 09:23:28 -07:00
Patrick Devine 4cc3be3035
Move envconfig and consolidate env vars (#4608) 2024-05-24 14:57:15 -07:00
Daniel Hiltgen b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Jeffrey Morgan 38255d2af1
Use flash attention flag for now (#4580)
* put flash attention behind flag for now

* add test

* remove print

* up timeout for sheduler tests
2024-05-22 21:52:09 -07:00
Sam e15307fdf4
feat: add support for flash_attn (#4120)
* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Patrick Devine d1692fd3e0
fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461) 2024-05-15 15:43:16 -07:00
Daniel Hiltgen 853ae490e1 Sanitize the env var debug log
Only dump env vars we care about in the logs
2024-05-15 14:42:57 -07:00
Patrick Devine 6845988807
Ollama `ps` command for showing currently loaded models (#4327) 2024-05-13 17:17:36 -07:00
jmorganca 92ca2cca95 Revert "only forward some env vars"
This reverts commit ce3b212d12.
2024-05-10 22:53:21 -07:00
Daniel Hiltgen c4014e73a2 Fall back to CPU runner with zero layers 2024-05-10 15:09:48 -07:00
Jeffrey Morgan bb6fd02298
Don't clamp ctx size in `PredictServerFit` (#4317)
* dont clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
2024-05-10 10:17:12 -07:00
Michael Yang cf442cd57e fix typo 2024-05-09 16:23:37 -07:00
Michael Yang ce3b212d12 only forward some env vars 2024-05-09 15:16:09 -07:00
Michael Yang 58876091f7 log clean up 2024-05-09 14:55:36 -07:00
Daniel Hiltgen d0425f26cf
Merge pull request #4294 from dhiltgen/harden_subprocess_reaping
Harden subprocess reaping
2024-05-09 14:02:16 -07:00
Bruce MacDonald cfa84b8470
add done_reason to the api (#4235) 2024-05-09 13:30:14 -07:00
Daniel Hiltgen 84ac7ce139 Refine subprocess reaping 2024-05-09 11:21:31 -07:00
Daniel Hiltgen 920a4b0794 Merge remote-tracking branch 'upstream/main' into pr3702 2024-05-08 16:44:35 -07:00
Daniel Hiltgen ee49844d09
Merge pull request #4153 from dhiltgen/gpu_verbose_response
Add GPU usage
2024-05-08 16:39:11 -07:00
Daniel Hiltgen bee2f4a3b0 Record GPU usage information
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Daniel Hiltgen 72700279e2 Detect noexec and report a better error
This will bubble up a much more informative error message if noexec
is preventing us from running the subprocess
2024-05-07 16:46:15 -07:00
Daniel Hiltgen 380378cc80 Use our libraries first
Trying to live off the land for cuda libraries was not the right strategy.  We need to use the version we compiled against to ensure things work properly
2024-05-06 14:23:29 -07:00
Jeffrey Morgan ed740a2504
Fix `no slots available` error with concurrent requests (#4160) 2024-05-06 14:22:53 -07:00
Jeffrey Morgan 1b0e6c9c0e
Fix llava models not working after first request (#4164)
* fix llava models not working after first request

* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
2024-05-05 16:49:50 -07:00
Mark Ward 321d57e1a0 Removing go routine calling .wait from load. 2024-05-01 18:51:10 +00:00
Mark Ward ba26c7aa00 it will always return an error due to Kill() discarding Wait() errors 2024-05-01 18:51:10 +00:00
Mark Ward 63c763685f log when the waiting for the process to stop to help debug when other tasks execute during this wait.
expire timer clear the timer reference because it will not be reused.
close will clean up expireTimer if calling code has not already done this.
2024-05-01 18:51:10 +00:00
Mark Ward 948114e3e3 fix sched to wait for the runner to terminate to ensure following vram check will be more accurate 2024-05-01 18:51:10 +00:00
Jeffrey Morgan 7aa08a77ca
llm: dont cap context window limit to training context window (#3988) 2024-04-29 10:07:30 -04:00
Jeffrey Morgan bb31def011
return code `499` when user cancels request while a model is loading (#3955) 2024-04-26 17:38:29 -04:00
Jeffrey Morgan 993cf8bf55
llm: limit generation to 10x context size to avoid run on generations (#3918)
* llm: limit generation to 10x context size to avoid run on generations

* add comment

* simplify condition statement
2024-04-25 19:02:30 -04:00
Daniel Hiltgen 6e76348df7
Merge pull request #3834 from dhiltgen/not_found_in_path
Report errors on server lookup instead of path lookup failure
2024-04-24 10:50:48 -07:00
Daniel Hiltgen 58888a74bc Detect and recover if runner removed
Tmp cleaners can nuke the file out from underneath us.  This detects the missing
runner, and re-initializes the payloads.
2024-04-23 10:05:26 -07:00
Daniel Hiltgen 34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
Daniel Hiltgen 8711d03df7 Report errors on server lookup instead of path lookup failure 2024-04-22 19:08:47 -07:00
Daniel Hiltgen aa72281eae Trim spaces and quotes from llm lib override 2024-04-22 17:11:14 -07:00
ManniX-ITA c496967e56
Merge branch 'ollama:main' into mannix-server 2024-04-18 18:45:15 +02:00
Michael Yang 3cf483fe48 add stablelm graph calculation 2024-04-17 13:57:19 -07:00
Michael Yang a8b9b930b4 account for all non-repeating layers 2024-04-17 11:21:21 -07:00
ManniX-ITA bd54b08261
Streamlined WaitUntilRunning 2024-04-17 17:39:52 +02:00
Michael Yang 26df674785 scale graph based on gpu count 2024-04-16 14:44:13 -07:00
Michael Yang 41a272de9f darwin: no partial offloading if required memory greater than system 2024-04-16 11:22:38 -07:00
Jeffrey Morgan a0b8a32eb4
Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading (#3653)
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading

* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
Michael Yang 7e33a017c0 partial offloading 2024-04-10 11:37:20 -07:00
Michael Yang 8b2c10061c refactor tensor query 2024-04-10 11:37:20 -07:00
Daniel Hiltgen c5ff443b9f Handle very slow model loads
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
Michael Yang be517e491c no rope parameters 2024-04-05 18:05:27 -07:00
Michael Yang 12e923e158 update graph size estimate 2024-04-03 13:34:12 -07:00
Daniel Hiltgen 464d817824
Merge pull request #3464 from dhiltgen/subprocess
Fix numgpu opt miscomparison
2024-04-02 20:10:17 -07:00
Daniel Hiltgen 6589eb8a8c Revert options as a ref in the server 2024-04-02 16:44:10 -07:00
Michael Yang 80163ebcb5 fix metal gpu 2024-04-02 16:06:45 -07:00
Daniel Hiltgen 58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems.  This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00