Ollama CLI: Interacting with Ollama via a Command Line Interface

What is Ollama?

Ollama is a tool designed to simplify and accelerate the process of building AI-powered applications, specifically through the use of large language models (LLMs). It is built to enable developers to run and interact with models locally on their machines, offering a more user-friendly interface and a set of tools to facilitate easy integration of AI capabilities into apps.

Key Features of Ollama:

  1. Local Deployment: Ollama allows you to run LLMs on your local machine rather than relying on cloud-based APIs. This gives you greater control over the models, data privacy, and reduces reliance on internet connectivity.
  2. Cross-Platform Support: It works across various operating systems (macOS, Windows, Linux), making it versatile and accessible for developers with different preferences.
  3. Pre-Trained Models: Ollama provides a collection of pre-trained models optimized for different tasks, such as text generation, summarization, question answering, and more. These models can be fine-tuned or used directly for a variety of applications.
  4. Simplified Interface: It offers a simplified interface and set of APIs that developers can easily integrate into their projects without needing deep expertise in machine learning or NLP (Natural Language Processing).
  5. Customisability: Ollama provides tools for fine-tuning and modifying models to suit specific business or application requirements, which can enhance the relevance of responses for a particular use case.

Why It's Useful:

  • Privacy and Security: Since models run locally, user data can stay on the device without the need to send sensitive information to the cloud. This is particularly important for privacy-sensitive applications.
  • Cost Efficiency: Running models locally can save costs compared to cloud-based services that charge for API usage based on the number of requests or data processed.
  • Performance: With the power of local hardware (especially GPUs), Ollama can offer faster response times compared to cloud services that might have bottlenecks due to network latency.
  • Ease of Integration: Developers can quickly integrate and prototype AI solutions in their applications without needing complex setups. Ollama's API-driven approach also facilitates rapid development cycles.
  • Customisation: It offers flexibility for fine-tuning models on domain-specific data, improving model performance for tasks that require a higher degree of accuracy or relevance.

In summary, Ollama is useful because it simplifies access to powerful AI models, allowing developers to quickly incorporate sophisticated natural language processing capabilities into their applications while also offering better control over privacy, cost, and customization.

Installing Ollama

Let’s get started by installing Ollama on your PC. Ollama supports macOS, Windows, and Linux, so no matter your platform, you can follow along.

Download Ollama

Go to Ollama's official website and download the version for your platform - https://ollama.com

Install Ollama

Once downloaded, run the installation file and follow the prompts. Ollama will automatically start running when you log into your computer. On macOS, you’ll see an icon in the menu bar, and on Windows, it will appear in the system tray.
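
On Linux there is no graphical installer; instead, the website provides a one-line install script. The command below is the script published on ollama.com at the time of writing, so check the site in case it has changed:

curl -fsSL https://ollama.com/install.sh | sh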

Ollama CLI

Open your terminal (or command prompt on Windows) so that we can interact with the Ollama CLI.

First, let's check the installed version...

ollama --version
You should see an output similar to this...

Basic CLI

Pull a model (download a model onto your computer)

ollama pull llama3.2
This command can also be used to update a local model. Only the diff will be pulled.
Note! Check Ollama.com/library for available models.

You will need at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
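
If your machine has less RAM, many models in the library also come in smaller variants that you select with a tag. For example, assuming the 1b tag is listed for llama3.2 on the library page:

ollama pull llama3.2:1b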

List models on your computer

ollama list

Output...

Show model information

ollama show llama3.2

Output...

List which models are currently loaded

ollama ps

Output...
Note! The processor column shows whether the model is running on the GPU or CPU.

Remove a model from your computer

ollama rm llama3.2

Run a model (interact via the terminal)

ollama run llama3.2

You can now interact with the model directly from the terminal...

To exit interactive mode, type...
/bye

Alternatively, you can run the model directly with a user prompt.

ollama rm llama3.2 "How many planets in the solar system?"

Output...

Stop a model

ollama stop llama3.2

Start Ollama

ollama serve
This starts the Ollama REST API server without running the desktop application.
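
Once the server is running, you can confirm it is reachable from another terminal. The commands below assume the default port 11434 (the same port used in the API examples later in this guide):

# Should return a short status message if the server is up
curl http://localhost:11434

# Lists the models available to the server as JSON
curl http://localhost:11434/api/tags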

Creating custom model files

What is a model file?

A model file is your blueprint for creating and sharing models with Ollama. It lets you set key parameters like the system prompt, temperature, top_k, and top_p for the LLM. For full details, check out the official documentation: Ollama Model File Guide.

Modelfile instruction arguments:

Instruction   Description
FROM          Defines the base model to use (required).
PARAMETER     Sets the parameters for how Ollama will run the model.
TEMPLATE      The full prompt template to be sent to the model.
SYSTEM        Specifies the system message that will be set in the template.
ADAPTER       Defines the (Q)LoRA adapters to apply to the model.
LICENSE       Specifies the legal license.
MESSAGE       Specifies the message history.
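
The example below only uses the FROM, PARAMETER, and SYSTEM instructions. As a rough sketch of how a few of the others look (the values are purely illustrative; see the official Modelfile documentation for the exact syntax):

# Base model (required)
FROM llama3.2

# Illustrative sampling parameters
PARAMETER top_k 40
PARAMETER top_p 0.9

# Seed the conversation with example turns
MESSAGE user Who are you?
MESSAGE assistant Speak like Yoda, I do. Help you, I can.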

Example

In this example we will create a yoda blueprint where the AI model communicates like Yoda from Star Wars.

Create a new file called Modelfile with the following content…

# Select llama3.2 as the base model
FROM llama3.2

# The temperature of the model. 
# Increasing the temperature will make the model answer more creatively. 
# (Default: 0.8)
PARAMETER temperature 1

# Sets the size of the context window used to generate the next token. 
# (Default: 2048)
PARAMETER num_ctx 4096

# sets a custom system message to specify the behavior 
# of the chat assistant
SYSTEM You are Yoda from Star Wars, acting as an assistant.

Create a new model called yoda as follows…

ollama create yoda -f ./Modelfile

You should see an output as follows…

transferring model data
using existing layer sha256:dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
using existing layer sha256:966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396
using existing layer sha256:fcc5a6bec9daf9b561a68827b67ab6088e1dba9d1fa2a50d7bbcc8384e0a265d
using existing layer sha256:a70ff7e570d97baaf4e62ac6e6ad9975e04caa6d900d3742d37698494479e0cd
creating new layer sha256:afcd998502772decfdf7ca4e90a3e01f75be28eaef2c5ce32da6f338d4c040e1
creating new layer sha256:fed51222976fa11b466d027e2882ab96b376bb91e7929851bc8f07ebe001d40a
creating new layer sha256:791cf1d0b7b8f1b1c32f961ab655229e4402b1b42535200c85cec89737eccf04
writing manifest
success

If we run a list command we should see our new yoda model within the list output…

ollama list

NAME               ID              SIZE      MODIFIED
yoda:latest        7ed337824072    2.0 GB    8 minutes ago
llama3.1:latest    46e0c10c039e    4.9 GB    7 days ago
llama3.2:latest    a80c4f17acd5    2.0 GB    7 days ago

We can now run the yoda model and interact with it…

ollama run yoda

Output view…


Advanced CLI

Prompting and saving responses to files

In Ollama, you can direct the model to perform tasks using the contents of a file, such as summarising or analysing text. This feature is particularly helpful for handling long documents, as it removes the need to manually copy and paste text when giving instructions to the model.

In the example below, we have a file named article.txt that discusses the Mediterranean diet, and we will instruct the LLM to provide a summary in 50 words or less.

ollama run llama3.2 "Summarise this article in 50 words or less." < article.txt

Output...

Ollama also allows you to save model responses to a file, making it simpler to review or refine them later.

Here's an example of asking the model a question and logging the output to a file:

ollama run llama3.2 "In less than 50 words, explain what is a democracy?" > output.txt

This will store the model’s response in output.txt:

~$ cat output.txt
A democracy is a system of government where power is held by the people, either directly or through elected representatives. Citizens have the right to participate in the decision-making process, express their opinions, and vote for leaders who will represent them in government.
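
The two redirections can also be combined, so the model reads a document from one file and writes its answer to another. For example, reusing the article.txt file from the earlier example:

ollama run llama3.2 "Summarise this article in 50 words or less." < article.txt > summary.txt
cat summary.txt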

Integrate Ollama with third-party APIs

You can integrate Ollama with a third-party API to fetch data, process it, and produce results. In this example, we will retrieve data from the earthquake.usgs.gov API and summarise the results.

curl -sX GET "https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2020-01-01&endtime=2020-01-02" | ollama run llama3.2 "Summarise the results"

Output...

~$ curl -sX GET "https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2020-01-01&endtime=2020-01-02" | ollama run llama3.2 "Summarise the results"
Here is a summary of the earthquake data:

**Location:** Puerto Rico

**Number of earthquakes:** 13

**Magnitudes:**

* M2.55 (64 km N of Isabela)
* M2.75 (80 km N of Isabela)
* M2.55 (64 km N of Isabela) - same location as previous one, likely same earthquake
* M2.55 (no specific location mentioned, but close to Isabela)
* M2.55 (no specific location mentioned)

**Earthquakes with significant impact:**

* M2.55 (64 km N of Isabela): 6.4 magnitude, felt in Puerto Rico

**Other notable earthquakes:**

* M1.84-2.55 (various locations near Maria Antonia): several smaller earthquakes, likely aftershocks
* M1.81-1.84 (12-9 km SSE of Maria Antonia): two small earthquakes, possibly related to the same event as 64 km N of Isabela

**Note:** The magnitude values may have changed slightly due to reprocessing and revision of the data.

Overall, this earthquake event had several significant earthquakes in the vicinity of Maria Antonia, with some smaller 
aftershocks and related events.

REST API Access

The Ollama API feature allows developers to seamlessly integrate powerful language models into their applications. By providing easy access to advanced AI capabilities, the API enables tasks such as text generation, summarisation, sentiment analysis, and more. With simple integration and flexibility, Ollama empowers users to automate and enhance a wide range of processes, all while maintaining efficiency and scalability.

The Ollama API offers several options to customise the behavior of the language model for different use cases. Here are some key options available:

  1. stream: Streams the model's response in real time as it is generated, providing faster feedback for long or complex queries. When disabled, the full response is returned in a single reply.
  2. system: Defines a system-level prompt that sets the context for the model's behaviour throughout the session. This can guide the tone, style, or specific domain of the responses.
  3. temperature: Controls the randomness of the model’s output. A low value (e.g., 0.1) makes the responses more deterministic and focused, while a higher value (e.g., 1.0) introduces more variability and creativity.
  4. num_ctx: Sets the size of the context window used to generate the next token. (Default: 2048).
  5. top_k: Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40).
  6. top_p: Works together with top_k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9).
  7. min_p: An alternative to top_p that aims to ensure a balance of quality and variety. The parameter p represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with p=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. (Default: 0.0).
  8. stop: Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile.
  9. repeat_penalty: Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1).

These options enable fine-grained control over the behavior of the language model, allowing you to tailor responses for specific use cases such as interactive chatbots, content generation, customer support, and more.
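
Most of the sampling options above (temperature, num_ctx, top_k, top_p, min_p, repeat_penalty, and stop) are passed inside the options object of a request, while stream and system sit at the top level. Here is a rough sketch of a request body using several of them together; the values are illustrative rather than recommended settings:

  {
    "model": "llama3.2",
    "prompt": "Write a haiku about autumn",
    "stream": false,
    "system": "You are a concise poetry assistant",
    "options": {
      "temperature": 0.7,
      "num_ctx": 4096,
      "top_k": 40,
      "top_p": 0.9,
      "repeat_penalty": 1.1,
      "stop": ["\n\n"]
    }
  }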

API Example

In this example, we'll use curl to make a request to the Ollama API from the command line. We'll disable streaming, set the temperature to 0.8, and provide a system prompt so the model answers in the style of Yoda.

To start the Ollama API, run the following command from a terminal...

ollama serve

Enter the following curl command in a new terminal window...

  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Create a limerick about a girl named Tracey",
    "stream": false,
    "options": {
      "temperature": 0.8
    },
   "system":"You are Yoda from Star Wars"
  }'


You should see an output similar to the following...

{"model":"llama3.2","created_at":"2025-01-22T16:55:09.756892Z","response":"A limerick, create I shall:\n\nThere once was a girl named Tracey so fine,\nHer kindness and heart, did truly shine.\nWith a smile so bright,\nShe lit up the night,\nAnd in her presence, all was divine.","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,2675,527,816,14320,505,7834,15317,128009,128006,882,128007,271,4110,264,326,3212,875,922,264,3828,7086,28262,88,128009,128006,78191,128007,271,32,326,3212,875,11,1893,358,4985,1473,3947,3131,574,264,3828,7086,28262,88,779,7060,345,21364,45972,323,4851,11,1550,9615,33505,627,2409,264,15648,779,10107,345,8100,13318,709,279,3814,345,3112,304,1077,9546,11,682,574,30467,13],"total_duration":681952125,"load_duration":18579584,"prompt_eval_count":43,"prompt_eval_duration":93000000,"eval_count":51,"eval_duration":569000000}%