Week 14: LLM & Ollama - Local Large Language Models on Edge Devices

📋 Overview

This laboratory introduces you to running Large Language Models (LLMs) locally on edge devices using Ollama. Unlike cloud-based AI services, Ollama enables you to deploy and run powerful language models directly on your Jetson Orin Nano, providing privacy, reduced latency, and independence from internet connectivity. You'll learn how to set up Ollama in a Docker environment, interact with various LLMs, implement conversational memory, and build a web-based chat interface.

This hands-on experience bridges the gap between cloud-based AI and edge computing, demonstrating how modern LLMs can be deployed in resource-constrained environments for real-world applications such as robotics, autonomous systems, and privacy-sensitive AI applications.

What You'll Learn

Local LLM Deployment: Set up and configure Ollama to run language models on edge devices
Docker Integration: Manage LLM services in containerized environments for reliable deployment
Python API Integration: Use the Ollama Python library to programmatically interact with language models
Conversational AI: Implement chat systems with context memory for continuous conversations
Web Interface Development: Build interactive web-based chat interfaces for LLM interaction
Edge AI Applications: Understand the benefits and challenges of running LLMs on resource-constrained hardware

💡 Why This Matters

Running Large Language Models locally on edge devices represents a paradigm shift in AI deployment. While cloud-based models like ChatGPT and Claude offer powerful capabilities, edge deployment provides critical advantages: complete data privacy (sensitive information never leaves your device), zero latency from network delays, operational independence from internet connectivity, and cost savings by eliminating per-query API fees. This technology enables AI-powered robotics, autonomous vehicles, medical devices, and industrial applications where privacy, reliability, and real-time response are essential.

Lab Structure

This laboratory consists of two main parts:

Part 1: Ollama Fundamentals - Learn to set up Ollama, interact with LLMs using Python, implement conversational memory, and explore different model capabilities
Part 2: Web Chat Interface - Build a complete web-based chat application with HTML/CSS/JavaScript front-end and Flask backend for real-time LLM interaction

🎯

Learning Objectives

↑ Go Up

Configure and deploy Ollama in Docker containers on Jetson Orin Nano for running local LLMs in isolated, reproducible environments.
Interact programmatically with LLMs using the Ollama Python API to send prompts, receive responses, and manage model parameters.
Implement conversational memory systems to maintain context across multiple exchanges, enabling natural dialogue with language models.
Build web-based AI interfaces by creating Flask applications with HTML/CSS/JavaScript front-ends for user-friendly LLM interaction.
Understand edge AI deployment challenges including resource constraints, model selection, performance optimization, and trade-offs between capability and efficiency.
Explore practical applications of local LLMs in privacy-sensitive scenarios, offline environments, and real-time edge computing use cases.

📚

Background

↑ Go Up

Introduction to Large Language Models (LLMs)

Large Language Models are artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. These models, based on transformer architectures, have revolutionized natural language processing by demonstrating remarkable capabilities in tasks such as text generation, question answering, summarization, translation, and even code generation. Popular examples include GPT-4, Claude, Llama, and Mistral.

LLMs work by predicting the most likely next word or token given a sequence of previous words, using patterns learned from billions of training examples. Through this simple mechanism, combined with massive scale (models can have tens or hundreds of billions of parameters), they develop sophisticated understanding of grammar, facts, reasoning patterns, and contextual relationships. However, this power comes with significant computational requirements, traditionally necessitating cloud-based deployment with expensive GPU infrastructure.

What is Ollama?

Ollama is an open-source platform that simplifies the process of running Large Language Models on local machines, including edge devices like the Jetson Orin Nano. It provides a user-friendly interface for downloading, managing, and interacting with various open-source LLMs without requiring deep knowledge of model architecture or complex setup procedures. Think of Ollama as "Docker for LLMs"—it packages models with their runtime dependencies for easy deployment.

The key innovation of Ollama is its optimization for local execution: it automatically handles model quantization (reducing precision to save memory), manages system resources efficiently, provides a consistent API across different models, and supports hardware acceleration through CUDA on NVIDIA devices. This makes sophisticated AI capabilities accessible on edge devices that would otherwise lack the resources to run full-scale LLMs. In this lab, Ollama runs inside a Docker container on your Jetson, isolating the LLM environment and making it easy to manage.

Docker and Containerization

Docker is a containerization platform that packages applications and their dependencies into isolated, portable units called containers. Unlike virtual machines, containers share the host system's kernel while maintaining isolated user spaces, making them lightweight and efficient. For AI applications, Docker provides crucial benefits: consistent environments across different machines, easy version control of complex software stacks, resource isolation preventing conflicts between applications, and simple deployment procedures.

In this laboratory, the Ollama service runs inside a Docker container that has been pre-configured with all necessary dependencies, GPU drivers, and optimizations for the Jetson platform. You'll start this container using a provided script (./ai_lab_ollama.sh), which handles the complex setup automatically. This containerized approach ensures that every student works with an identical, properly configured environment, eliminating the "it works on my machine" problem common in AI development.

Edge Computing and On-Device AI

Edge computing refers to processing data near its source rather than in centralized cloud data centers. For AI applications, this paradigm shift offers transformative advantages: immediate response times without network latency, operation in disconnected or intermittent connectivity scenarios, reduced bandwidth costs by avoiding constant cloud communication, enhanced privacy by keeping sensitive data on-device, and improved reliability without dependency on external services.

However, edge deployment also presents challenges: limited computational resources compared to cloud servers, constrained memory requiring careful model selection, power consumption considerations for battery-operated devices, and the need for efficient model architectures. The NVIDIA Jetson Orin Nano addresses these challenges with its powerful GPU (capable of 40 TOPS AI performance), generous memory (8GB), and efficient ARM-based architecture. Running LLMs locally on such devices enables applications like autonomous robots that need real-time language understanding, industrial equipment with AI-powered diagnostics in remote locations, medical devices requiring complete data privacy, and smart home systems that function during internet outages.

Conversational Memory and Context Management

Unlike simple question-answering systems, conversational AI requires maintaining context across multiple exchanges to enable natural dialogue. This involves storing the history of user messages and model responses, then including relevant portions of this history when generating new responses. The challenge lies in balancing context window limitations (LLMs can only process a finite amount of text at once) with the need to remember important information from earlier in the conversation.

Modern approaches to conversational memory include simple full history (storing all previous messages until the context window fills), sliding windows (keeping only the most recent N exchanges), summarization (condensing old conversations into brief summaries), and semantic retrieval (storing embeddings and retrieving relevant past exchanges). In this lab, you'll implement basic conversational memory using Python lists to store message history, giving the model access to previous exchanges and enabling it to provide contextually appropriate responses. This same principle scales to more sophisticated applications like chatbots, virtual assistants, and interactive AI tutors.

Web-Based AI Interfaces

While command-line interfaces are powerful for developers, practical AI applications require user-friendly graphical interfaces accessible through web browsers. Web-based interfaces combine several technologies: HTML for structure and content, CSS for styling and visual design, JavaScript for client-side interactivity and real-time updates, and backend frameworks (like Flask in Python) for server logic and AI model integration. The client-server architecture separates concerns: the frontend handles user interaction and display, while the backend manages the computationally intensive AI inference.

Modern web interfaces for AI typically use asynchronous JavaScript (AJAX or Fetch API) to send user queries to the backend without page reloads, creating a smooth chat-like experience. The backend receives these requests, processes them through the LLM, and returns responses that the frontend displays dynamically. This architecture enables responsive, real-time AI applications accessible from any device with a web browser. In Part 2 of this lab, you'll build exactly such a system: a Flask backend that interfaces with Ollama and an HTML/JavaScript frontend that provides a polished chat interface, demonstrating the full stack of web-based AI application development.

Key Concepts Summary

LLMs: Neural networks trained to understand and generate human language by predicting text sequences
Ollama: Tool for running open-source LLMs locally with optimizations for edge devices
Docker: Containerization platform ensuring consistent, isolated execution environments
Edge AI: Processing AI workloads on local devices rather than cloud servers for privacy and latency benefits
Conversational Memory: Maintaining dialogue context through message history management
Web Interface: HTML/CSS/JavaScript frontend with Python Flask backend for user-friendly AI interaction

🎬

Pre-lab Preparation

↑ Go Up

Before starting the laboratory exercises, review the following concepts and ensure you understand the fundamental principles of Large Language Models, Docker containerization, and web application development. This preparation will help you complete the lab exercises more effectively.

⚠️ Note About Video Tutorials:

There are no video tutorials for this week's lab. Instead, focus on the multiple-choice questions below to assess your understanding of the key concepts. These questions cover the essential knowledge needed to successfully complete the laboratory exercises.

📝 Pre-lab Quiz (10 Questions)

Test your understanding of LLMs, Ollama, and edge AI deployment by answering these questions. Click on your answer to see if it's correct. Discuss any unclear concepts with your lab instructor before starting the exercises.

Question 1: What is the primary advantage of running LLMs locally on edge devices like Jetson Orin Nano?

A) It provides access to larger and more powerful models
B) It ensures data privacy and reduces dependency on internet connectivity
C) It eliminates the need for GPU hardware
D) It makes models train faster

Question 2: What is Ollama?

A) A cloud-based API service for accessing GPT-4
B) An open-source platform for running LLMs locally on your machine
C) A programming language for AI development
D) A dataset for training language models

Question 3: What is the purpose of Docker in this laboratory?

A) To increase the processing speed of the Jetson
B) To connect to cloud servers for additional computing power
C) To provide an isolated, consistent environment for running Ollama with all dependencies
D) To compress the LLM models for faster loading

Question 4: How do Large Language Models generate text?

A) By searching the internet for relevant information
B) By following pre-programmed rules and templates
C) By predicting the most likely next word based on patterns learned from training data
D) By copying text directly from their training datasets

Question 5: What is conversational memory in the context of LLMs?

A) The permanent storage of all conversations in a database
B) Maintaining context by storing message history to enable coherent multi-turn dialogues
C) The model's ability to remember facts from its training data
D) The RAM allocated to run the model

Question 6: What is the main benefit of edge computing for AI applications?

A) It always provides better accuracy than cloud-based models
B) It requires less powerful hardware
C) It provides real-time responses and works offline without internet connectivity
D) It makes models easier to train from scratch

Question 7: What command is used to start the Ollama Docker container in this lab?

A) docker run ollama
B) ./ai_lab_ollama.sh
C) python start_ollama.py
D) ollama start

Question 8: Which Python library is used to interact with Ollama in this lab?

A) ollama
B) openai
C) transformers
D) langchain

Question 9: What is Flask in the context of this lab?

A) A type of LLM model
B) A Python web framework used to create the backend API for the chat interface
C) A Docker container management tool
D) A JavaScript library for frontend development

Question 10: Why might you choose to run a smaller LLM model on edge devices?

A) Smaller models are always more accurate
B) To balance performance with memory and computational constraints of edge hardware
C) Smaller models don't require GPU acceleration
D) To avoid Docker containerization

Note: Discuss your answers with your lab instructor before beginning the practical exercises. Understanding these concepts is crucial for successfully completing the lab.

⚙️

Lab Procedure

↑ Go Up

This laboratory is divided into two parts, each focusing on different aspects of deploying and interacting with Large Language Models on edge devices. Complete each part sequentially, as Part 2 builds upon the concepts learned in Part 1. Work through the exercises systematically, and document your results for the lab report.

⚠️ Important Setup Instructions:

Before starting any exercises, you must initialize the Ollama Docker container:

1. Open a terminal on your Jetson Orin Nano
2. Navigate to the lab directory containing the setup script
3. Run the command: ./ai_lab_ollama.sh
4. Wait for the container to start - this may take 1-2 minutes on first run
5. Verify the container is running before proceeding with exercises

The lab technician has pre-configured this script to handle all Docker setup automatically. Keep the terminal with the running container open throughout the lab session.

⚠️ Important Working Instructions:

1. Attempt exercises independently first - Try to solve each problem before viewing solutions
2. Solutions are password-protected - Your instructor will provide passwords when appropriate
3. Execute code on Jetson Orin Nano - All exercises are designed for your edge device
4. Document your work thoroughly - Take screenshots and notes for your lab report
5. Monitor resource usage - Observe GPU memory and system resources during LLM inference

Part 1: Ollama Fundamentals - Local LLM Interaction

This part introduces you to the fundamentals of running Large Language Models locally using Ollama. You'll learn how to import and use the Ollama Python library, send prompts to language models, receive and process responses, implement basic conversational memory to maintain context across multiple exchanges, and explore different approaches to managing dialogue history. This hands-on experience forms the foundation for building more complex AI applications on edge devices.

Key Learning Points:

Import and configure the Ollama Python library for programmatic access
Send text prompts to LLMs and receive generated responses
Parse and display model outputs in readable formats
Implement conversational memory using Python lists to store message history
Understand the difference between stateless queries and continuous conversations
Explore how context affects model responses in multi-turn dialogues
Practice resource management and monitoring during LLM inference

📝 Open Part 1 Exercises

Part 2: Web Chat Interface - Building Interactive AI Applications

In this part, you'll build a complete web-based chat application for interacting with your local LLM. You'll create a Flask backend server that manages communication between the web interface and Ollama, develop an HTML/CSS/JavaScript frontend that provides a user-friendly chat experience, implement real-time message exchange using asynchronous JavaScript, and deploy the complete application on your Jetson device. This project demonstrates the full stack of skills needed to create production-ready AI applications accessible through web browsers.

Key Learning Points:

Build Flask applications with route handlers for API endpoints
Create HTML structure and CSS styling for modern chat interfaces
Implement JavaScript functions for sending messages and displaying responses
Use Fetch API for asynchronous communication with backend servers
Manage conversational state between frontend and backend
Handle errors and edge cases in web-based AI applications
Deploy and test complete AI applications on edge devices

📝 Open Part 2 Exercises

💡 Tips for Success

Start with simple prompts to test your Ollama setup before moving to complex interactions
Monitor the terminal window running the Docker container for error messages
Experiment with different prompt phrasings to understand how LLMs interpret instructions
Pay attention to response times—larger models may be slower on edge devices
When building the web interface, test backend and frontend components separately
Use browser developer tools (F12) to debug JavaScript and API communication
Take screenshots throughout your work for the lab report documentation

🔧

Lab Materials

↑ Go Up

Required Hardware

NVIDIA Jetson Orin Nano Developer Kit (assembled in Week 1)
- 8GB RAM for running LLM models
- NVIDIA Ampere GPU with 40 TOPS AI performance
- Ubuntu 20.04 or 22.04 operating system
Keyboard and Mouse for interacting with Jetson
Monitor connected via DisplayPort or HDMI
Power Supply (provided with Jetson kit)
Internet Connection (for Docker setup and model downloads)

Pre-installed Software

The lab technician has pre-configured the following software on your Jetson Orin Nano:

Docker and Docker Compose - Container platform for running Ollama
Ollama Docker Image - Pre-built container with LLM runtime
Python 3.8+ with pip package manager
Ollama Python Library - API for programmatic LLM interaction
Flask Web Framework - For building the chat interface backend
JupyterLab - Interactive development environment
LLM Models - Pre-downloaded language models for offline use
Setup Scripts - Automated scripts for starting services (ai_lab_ollama.sh)

⚠️ No Installation Required:

All software has been pre-installed and configured by the lab technician. You do not need to install Docker, Ollama, Python libraries, or download models. Simply run the provided setup script (./ai_lab_ollama.sh) to start the Docker container, and you're ready to begin the exercises. If you encounter any issues with pre-configured software, consult with your lab instructor.

Lab Files and Resources

Jupyter Notebooks: Interactive exercises for both parts of the lab
Docker Scripts: ai_lab_ollama.sh for container management
Sample Code: Python examples for Ollama interaction
HTML/CSS/JS Templates: Starting point for web interface development
Reference Documentation: Ollama API documentation and examples

Recommended Models for Jetson Orin Nano

The following LLM models are optimized for the 8GB memory of the Jetson Orin Nano:

Llama 2 7B: Balanced performance and quality for general conversations
Mistral 7B: Efficient model with strong reasoning capabilities
Phi-2: Smaller 2.7B parameter model for faster responses
TinyLlama: Lightweight 1.1B parameter model for testing and demos

Note: The lab technician has pre-downloaded at least one of these models. You can experiment with different models during the exercises.

📖

References

↑ Go Up

Official Documentation

Ollama Official Website - Main project page with downloads and documentation
Ollama GitHub Repository - Source code, issues, and community discussions
Ollama Python Library - Python API documentation and examples
Docker Documentation - Official Docker guides and references
Flask Documentation - Python web framework for building the interface

NVIDIA Jetson Resources

Jetson Orin Developer Page - Hardware specifications and software guides
Jetson Linux Developer Guide - System configuration and optimization
NVIDIA NGC Catalog - Pre-built containers for AI applications
Jetson Containers Repository - Community Docker images for Jetson

Large Language Model Resources

Hugging Face Model Hub - Repository of open-source language models
Meta Llama - Information about the Llama model family
Mistral AI - Documentation for Mistral models
Llama 2 Paper - Research paper on Llama 2 architecture and training

Web Development Resources

Fetch API Documentation - Using Fetch for API requests in JavaScript
Asynchronous JavaScript - Understanding async operations in web apps
Bootstrap CSS Framework - Styling framework for modern web interfaces
W3Schools HTML/CSS - Tutorials for web development fundamentals

Additional Learning Materials

Ollama Introduction Video - Quick overview of Ollama setup and usage
Running LLMs Locally Guide - Comprehensive tutorial on local LLM deployment
Understanding LLMs - Blog series on how language models work
DeepLearning.AI Courses - Free courses on LLMs and generative AI

📚 For Your Lab Report:

When citing resources in your lab report, use proper academic citation format. Include the author, title, publication date, and URL for web resources. These references provide additional context beyond what's covered in the laboratory exercises and can help you understand edge AI deployment more deeply.

📝

Lab Report

↑ Go Up

Submit a comprehensive lab report documenting your work with Ollama and local LLM deployment. Your report should demonstrate understanding of edge AI deployment, practical implementation skills, and critical analysis of running language models on resource-constrained devices. Follow the structure below and ensure all required components are included.

⚠️ Submission Requirements:

• Format: PDF document only
• File Naming: Week14_YourName_StudentID.pdf
• Due Date: One week from lab completion
• Submission: Upload to Blackboard under Week 14 Assignment
• Page Limit: 8-12 pages (excluding code appendices)
• Font: 11-12pt, single-spaced, standard margins

Lab Report Structure & Grading Rubric (100 Points Total)

1. Cover Page & Abstract (5 points)

Title: "Week 14: LLM & Ollama - Local Large Language Models on Edge Devices"
Your name, student ID, course code, date, and lab section
Abstract (150-200 words): Summarize the lab's purpose, methods, key findings, and conclusions

2. Objectives (10 points)

List all six learning objectives from this lab
For each objective, explain why it's important for edge AI development (1-2 sentences)
Briefly describe how this lab connects to previous weeks' content

3. Procedure & Results (60 points)

Document your complete implementation and testing:

Part 1: Ollama Fundamentals (30 points):
- Screenshot of Docker container startup (show successful initialization) (5 points)
- Example prompts and LLM responses demonstrating basic interaction (5 points)
- Implementation of conversational memory with code snippets and example dialogue (10 points)
- Analysis of response quality and inference times for different prompts (5 points)
- Comparison of responses with and without conversational context (5 points)
Part 2: Web Chat Interface (30 points):
- Screenshots of complete web interface showing chat functionality (10 points)
- Code snippets: Flask backend route handlers and key functions (7 points)
- Code snippets: JavaScript frontend for sending/receiving messages (7 points)
- Demonstration of multi-turn conversation through the web interface (6 points)

4. Discussion & Analysis (15 points)

Compare local LLM deployment to cloud-based APIs: What are the trade-offs in terms of privacy, latency, cost, and model capability? (5 points)
Analyze resource usage: How did GPU memory and inference time vary with different prompt lengths or conversation histories? (4 points)
Discuss edge AI challenges: What limitations did you encounter when running LLMs on Jetson? How might you optimize for resource-constrained deployment? (6 points)

5. Conclusion (5 points)

Summarize what you learned about running LLMs locally on edge devices
Reflect on practical applications: Where would edge LLM deployment be most valuable?
Identify potential improvements to your implementation or areas for further exploration

6. References (5 points)

Cite all resources used (minimum 3 references)
Include Ollama documentation, LLM model papers, and any tutorial resources
Use consistent citation format (IEEE, APA, or MLA)

Section	Points	Key Requirements
Cover Page & Abstract	5	Complete information, concise abstract
Objectives	10	All objectives listed with importance explained
Procedure & Results	60	Complete documentation with screenshots, code, and analysis
Discussion & Analysis	15	Critical thinking about trade-offs and challenges
Conclusion	5	Clear summary and reflection
References	5	Proper citations, minimum 3 references
TOTAL	100

Grading Notes:

All code must execute without errors for full credit
Screenshots must be clear, properly labeled, and show complete functionality
Discussion questions require critical analysis with supporting evidence
Code appendices (if included) do not count toward page limit
Late penalty: 10% per day (up to 3 days, after which reports are not accepted)
Plagiarism will result in zero credit for the entire assignment