Local Large Language Models on Edge Devices
← Back to Course HomeThis laboratory introduces you to running Large Language Models (LLMs) locally on edge devices using Ollama. Unlike cloud-based AI services, Ollama enables you to deploy and run powerful language models directly on your Jetson Orin Nano, providing privacy, reduced latency, and independence from internet connectivity. You'll learn how to set up Ollama in a Docker environment, interact with various LLMs, implement conversational memory, and build a web-based chat interface.
This hands-on experience bridges the gap between cloud-based AI and edge computing, demonstrating how modern LLMs can be deployed in resource-constrained environments for real-world applications such as robotics, autonomous systems, and privacy-sensitive AI applications.
Running Large Language Models locally on edge devices represents a paradigm shift in AI deployment. While cloud-based models like ChatGPT and Claude offer powerful capabilities, edge deployment provides critical advantages: complete data privacy (sensitive information never leaves your device), zero latency from network delays, operational independence from internet connectivity, and cost savings by eliminating per-query API fees. This technology enables AI-powered robotics, autonomous vehicles, medical devices, and industrial applications where privacy, reliability, and real-time response are essential.
This laboratory consists of two main parts:
Large Language Models are artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. These models, based on transformer architectures, have revolutionized natural language processing by demonstrating remarkable capabilities in tasks such as text generation, question answering, summarization, translation, and even code generation. Popular examples include GPT-4, Claude, Llama, and Mistral.
LLMs work by predicting the most likely next word or token given a sequence of previous words, using patterns learned from billions of training examples. Through this simple mechanism, combined with massive scale (models can have tens or hundreds of billions of parameters), they develop sophisticated understanding of grammar, facts, reasoning patterns, and contextual relationships. However, this power comes with significant computational requirements, traditionally necessitating cloud-based deployment with expensive GPU infrastructure.
Ollama is an open-source platform that simplifies the process of running Large Language Models on local machines, including edge devices like the Jetson Orin Nano. It provides a user-friendly interface for downloading, managing, and interacting with various open-source LLMs without requiring deep knowledge of model architecture or complex setup procedures. Think of Ollama as "Docker for LLMs"—it packages models with their runtime dependencies for easy deployment.
The key innovation of Ollama is its optimization for local execution: it automatically handles model quantization (reducing precision to save memory), manages system resources efficiently, provides a consistent API across different models, and supports hardware acceleration through CUDA on NVIDIA devices. This makes sophisticated AI capabilities accessible on edge devices that would otherwise lack the resources to run full-scale LLMs. In this lab, Ollama runs inside a Docker container on your Jetson, isolating the LLM environment and making it easy to manage.
Docker is a containerization platform that packages applications and their dependencies into isolated, portable units called containers. Unlike virtual machines, containers share the host system's kernel while maintaining isolated user spaces, making them lightweight and efficient. For AI applications, Docker provides crucial benefits: consistent environments across different machines, easy version control of complex software stacks, resource isolation preventing conflicts between applications, and simple deployment procedures.
In this laboratory, the Ollama service runs inside a Docker container that has been pre-configured with all necessary dependencies, GPU drivers, and optimizations for the Jetson platform. You'll start this container using a provided script (./ai_lab_ollama.sh), which handles the complex setup automatically. This containerized approach ensures that every student works with an identical, properly configured environment, eliminating the "it works on my machine" problem common in AI development.
Edge computing refers to processing data near its source rather than in centralized cloud data centers. For AI applications, this paradigm shift offers transformative advantages: immediate response times without network latency, operation in disconnected or intermittent connectivity scenarios, reduced bandwidth costs by avoiding constant cloud communication, enhanced privacy by keeping sensitive data on-device, and improved reliability without dependency on external services.
However, edge deployment also presents challenges: limited computational resources compared to cloud servers, constrained memory requiring careful model selection, power consumption considerations for battery-operated devices, and the need for efficient model architectures. The NVIDIA Jetson Orin Nano addresses these challenges with its powerful GPU (capable of 40 TOPS AI performance), generous memory (8GB), and efficient ARM-based architecture. Running LLMs locally on such devices enables applications like autonomous robots that need real-time language understanding, industrial equipment with AI-powered diagnostics in remote locations, medical devices requiring complete data privacy, and smart home systems that function during internet outages.
Unlike simple question-answering systems, conversational AI requires maintaining context across multiple exchanges to enable natural dialogue. This involves storing the history of user messages and model responses, then including relevant portions of this history when generating new responses. The challenge lies in balancing context window limitations (LLMs can only process a finite amount of text at once) with the need to remember important information from earlier in the conversation.
Modern approaches to conversational memory include simple full history (storing all previous messages until the context window fills), sliding windows (keeping only the most recent N exchanges), summarization (condensing old conversations into brief summaries), and semantic retrieval (storing embeddings and retrieving relevant past exchanges). In this lab, you'll implement basic conversational memory using Python lists to store message history, giving the model access to previous exchanges and enabling it to provide contextually appropriate responses. This same principle scales to more sophisticated applications like chatbots, virtual assistants, and interactive AI tutors.
While command-line interfaces are powerful for developers, practical AI applications require user-friendly graphical interfaces accessible through web browsers. Web-based interfaces combine several technologies: HTML for structure and content, CSS for styling and visual design, JavaScript for client-side interactivity and real-time updates, and backend frameworks (like Flask in Python) for server logic and AI model integration. The client-server architecture separates concerns: the frontend handles user interaction and display, while the backend manages the computationally intensive AI inference.
Modern web interfaces for AI typically use asynchronous JavaScript (AJAX or Fetch API) to send user queries to the backend without page reloads, creating a smooth chat-like experience. The backend receives these requests, processes them through the LLM, and returns responses that the frontend displays dynamically. This architecture enables responsive, real-time AI applications accessible from any device with a web browser. In Part 2 of this lab, you'll build exactly such a system: a Flask backend that interfaces with Ollama and an HTML/JavaScript frontend that provides a polished chat interface, demonstrating the full stack of web-based AI application development.
Before starting the laboratory exercises, review the following concepts and ensure you understand the fundamental principles of Large Language Models, Docker containerization, and web application development. This preparation will help you complete the lab exercises more effectively.
There are no video tutorials for this week's lab. Instead, focus on the multiple-choice questions below to assess your understanding of the key concepts. These questions cover the essential knowledge needed to successfully complete the laboratory exercises.
Test your understanding of LLMs, Ollama, and edge AI deployment by answering these questions. Click on your answer to see if it's correct. Discuss any unclear concepts with your lab instructor before starting the exercises.
Note: Discuss your answers with your lab instructor before beginning the practical exercises. Understanding these concepts is crucial for successfully completing the lab.
This laboratory is divided into two parts, each focusing on different aspects of deploying and interacting with Large Language Models on edge devices. Complete each part sequentially, as Part 2 builds upon the concepts learned in Part 1. Work through the exercises systematically, and document your results for the lab report.
Before starting any exercises, you must initialize the Ollama Docker container:
1. Open a terminal on your Jetson Orin Nano
2. Navigate to the lab directory containing the setup script
3. Run the command: ./ai_lab_ollama.sh
4. Wait for the container to start - this may take 1-2 minutes on first run
5. Verify the container is running before proceeding with exercises
The lab technician has pre-configured this script to handle all Docker setup automatically. Keep the terminal with the running container open throughout the lab session.
1. Attempt exercises independently first - Try to solve each problem before viewing solutions
2. Solutions are password-protected - Your instructor will provide passwords when appropriate
3. Execute code on Jetson Orin Nano - All exercises are designed for your edge device
4. Document your work thoroughly - Take screenshots and notes for your lab report
5. Monitor resource usage - Observe GPU memory and system resources during LLM inference
This part introduces you to the fundamentals of running Large Language Models locally using Ollama. You'll learn how to import and use the Ollama Python library, send prompts to language models, receive and process responses, implement basic conversational memory to maintain context across multiple exchanges, and explore different approaches to managing dialogue history. This hands-on experience forms the foundation for building more complex AI applications on edge devices.
In this part, you'll build a complete web-based chat application for interacting with your local LLM. You'll create a Flask backend server that manages communication between the web interface and Ollama, develop an HTML/CSS/JavaScript frontend that provides a user-friendly chat experience, implement real-time message exchange using asynchronous JavaScript, and deploy the complete application on your Jetson device. This project demonstrates the full stack of skills needed to create production-ready AI applications accessible through web browsers.
The lab technician has pre-configured the following software on your Jetson Orin Nano:
All software has been pre-installed and configured by the lab technician. You do not need to install Docker, Ollama, Python libraries, or download models. Simply run the provided setup script (./ai_lab_ollama.sh) to start the Docker container, and you're ready to begin the exercises. If you encounter any issues with pre-configured software, consult with your lab instructor.
The following LLM models are optimized for the 8GB memory of the Jetson Orin Nano:
Note: The lab technician has pre-downloaded at least one of these models. You can experiment with different models during the exercises.
When citing resources in your lab report, use proper academic citation format. Include the author, title, publication date, and URL for web resources. These references provide additional context beyond what's covered in the laboratory exercises and can help you understand edge AI deployment more deeply.
Submit a comprehensive lab report documenting your work with Ollama and local LLM deployment. Your report should demonstrate understanding of edge AI deployment, practical implementation skills, and critical analysis of running language models on resource-constrained devices. Follow the structure below and ensure all required components are included.
• Format: PDF document only
• File Naming: Week14_YourName_StudentID.pdf
• Due Date: One week from lab completion
• Submission: Upload to Blackboard under Week 14 Assignment
• Page Limit: 8-12 pages (excluding code appendices)
• Font: 11-12pt, single-spaced, standard margins
Document your complete implementation and testing:
| Section | Points | Key Requirements |
|---|---|---|
| Cover Page & Abstract | 5 | Complete information, concise abstract |
| Objectives | 10 | All objectives listed with importance explained |
| Procedure & Results | 60 | Complete documentation with screenshots, code, and analysis |
| Discussion & Analysis | 15 | Critical thinking about trade-offs and challenges |
| Conclusion | 5 | Clear summary and reflection |
| References | 5 | Proper citations, minimum 3 references |
| TOTAL | 100 |