mirror of https://github.com/ollama/ollama.git
server: num_predict==-2 fills context buffer
parent 76e903cf9d
commit bf42297350
@@ -159,7 +159,7 @@ PARAMETER <parameter> <parametervalue>
 | temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 |
 | seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
 | stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" |
-| num_predict | Maximum number of tokens to predict when generating text. (Default: -1, infinite generation) | int | num_predict 42 |
+| num_predict | Maximum number of tokens to predict when generating text. (Default: -1, infinite generation, -2 = fill context) | int | num_predict 42 |
 | top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
 | top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9) | float | top_p 0.9 |
 | min_p | Alternative to the top_p, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with *p*=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. (Default: 0.0) | float | min_p 0.05 |
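For reference, a minimal Modelfile sketch using the value documented above; `llama3` is only an assumed example base model, and the temperature line simply reuses the sample value from the table:

```
FROM llama3
# -2 asks the server to keep generating until the context window is full
PARAMETER num_predict -2
PARAMETER temperature 0.7
```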
@@ -382,6 +382,10 @@ func (s *Server) processBatch(tokenBatch *llama.Batch, embedBatch *llama.Batch)
 			s.removeSequence(seqIdx, "limit")
 			continue
 		}
+		if seq.numPredict == -2 && len(seq.cache.Inputs) >= s.cache.numCtx {
+			s.removeSequence(seqIdx, "limit")
+			continue
+		}
 
 		for i, input := range seq.inputs {
 			if len(seq.cache.Inputs)+len(seq.pendingInputs)+1 > s.cache.numCtx {
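The four added lines stop a sequence once its cached inputs fill the context window. Below is a minimal, self-contained Go sketch of the resulting stop conditions; `shouldStopAtLimit` and its parameter names are illustrative only, the positive-cap branch is assumed from the truncated check just above the addition, and the -1/-2 semantics follow the parameter table above.

```go
package main

import "fmt"

// shouldStopAtLimit sketches the limit checks in processBatch:
// a positive numPredict caps the number of generated tokens (assumed from the
// truncated check above), numPredict == -2 stops once the cached inputs fill
// the context window, and -1 keeps generating until another stop condition fires.
func shouldStopAtLimit(numPredict, numPredicted, cachedInputs, numCtx int) bool {
	if numPredict > 0 && numPredicted >= numPredict {
		return true // explicit token limit reached
	}
	if numPredict == -2 && cachedInputs >= numCtx {
		return true // -2: context buffer is full
	}
	return false
}

func main() {
	fmt.Println(shouldStopAtLimit(-2, 500, 2048, 2048)) // true: context filled
	fmt.Println(shouldStopAtLimit(-1, 500, 2048, 2048)) // false: infinite generation
	fmt.Println(shouldStopAtLimit(128, 128, 700, 2048)) // true: hit num_predict
}
```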
@@ -327,6 +327,10 @@ func (s *Server) processBatch() error {
 			s.removeSequence(seqIdx, "limit")
 			continue
 		}
+		if seq.numPredict == -2 && int32(len(seq.cache.Inputs)) >= s.cache.numCtx {
+			s.removeSequence(seqIdx, "limit")
+			continue
+		}
 
 		if !s.cache.enabled {
 			seq.inputs = append(seq.cache.Inputs, seq.inputs...)
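This second hunk applies the same -2 check in the other runner, where numCtx is an int32 (hence the cast). The surrounding context also shows the cache-disabled branch, where previously cached inputs are prepended back onto the pending inputs. A small illustrative Go sketch of that rebuild follows; `rebuildInputs` is a made-up name and tokens are simplified to int32 IDs, whereas the runner uses its own input type:

```go
package main

import "fmt"

// rebuildInputs mirrors the cache-disabled branch above as a sketch: inputs
// that would otherwise be served from the cache are prepended to the pending
// inputs, so the sequence carries its full history into the next batch.
func rebuildInputs(cached, pending []int32) []int32 {
	out := make([]int32, 0, len(cached)+len(pending))
	out = append(out, cached...)
	return append(out, pending...)
}

func main() {
	cached := []int32{101, 7592, 2088}
	pending := []int32{999}
	fmt.Println(rebuildInputs(cached, pending)) // [101 7592 2088 999]
}
```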