Week 2: Fundamentals of ML and AI

📋 Laboratory Overview

This laboratory introduces gradient descent, the fundamental optimization algorithm behind most modern machine learning and deep learning systems. By implementing it from scratch, students will see how a model learns by iteratively adjusting its parameters to minimize error. This foundation is essential for understanding training processes, debugging models, and building effective AI systems.

What You'll Learn

Gradient Descent Algorithm: Master the core optimization technique used in nearly all neural network training
Cost Function Minimization: Understand how to measure and reduce prediction errors
Learning Rate Effects: Discover how learning rate impacts convergence speed and stability
Decision Boundary Visualization: See how optimization shapes the decision boundary over time
From-Scratch Implementation: Build gradient descent without high-level libraries to understand the mechanics
Binary Classification: Apply gradient descent to solve real classification problems

💡 Why This Matters

Gradient descent, or one of its variants, is the engine behind nearly every neural network training process. Whether you're training a simple logistic regression model or a massive transformer with billions of parameters, a gradient-based optimizer is at work, searching for weights that reduce prediction error. Understanding gradient descent from first principles will help you debug training issues, select appropriate hyperparameters, design better architectures, and grasp more advanced optimizers such as momentum-based methods, RMSprop, and Adam. This is your foundation for all future machine learning work.

Lab Structure

This laboratory consists of one comprehensive part that builds your understanding step-by-step:

Part 1: Gradient Descent Implementation
- Implement the sigmoid activation function
- Calculate model outputs (forward pass)
- Compute the cross-entropy error function
- Update weights using gradient descent (backward pass)
- Train a model and visualize decision boundaries
- Experiment with different learning rates and epochs

🎯 Learning Objectives

By the end of this laboratory session, you will be able to:

Understand gradient descent optimization fundamentals including how gradients indicate the direction of steepest descent and how iterative updates minimize cost functions.
Implement the gradient descent algorithm from scratch, including forward propagation, error calculation, gradient computation, and weight updates, without using high-level machine learning libraries.
Visualize the learning process and decision boundaries to understand how the model evolves during training and how the decision boundary adapts to separate classes.
Understand cost/loss functions and their role in measuring prediction error, particularly the cross-entropy loss for classification problems.
Apply gradient descent to classification problems by training a logistic regression model on real datasets and achieving accurate predictions.
Analyze the effect of learning rates on convergence by experimenting with different learning rate values and observing their impact on training speed and stability.
Debug and optimize gradient descent implementations by identifying common issues such as divergence, slow convergence, and numerical instability, and applying appropriate solutions.

📚 Background Theory

Introduction to Gradient Descent

Gradient descent is an iterative optimization algorithm used to minimize a cost function by adjusting model parameters in the direction of steepest descent. In machine learning, we use gradient descent to find the optimal weights that minimize the difference between predicted and actual outputs. The algorithm follows a simple principle: calculate the gradient (slope) of the cost function with respect to each parameter, then adjust parameters in the opposite direction of the gradient.

The Gradient Descent Algorithm

The general gradient descent update rule is:

w_new = w_old - α × ∇J(w)

Where:

w: Model parameters (weights and biases)
α (alpha): Learning rate - controls the step size
∇J(w): Gradient of the cost function with respect to weights
J(w): Cost function that measures prediction error
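
To make the update rule concrete, the sketch below applies it to a one-dimensional cost. The cost J(w) = (w - 3)^2, the helper name gradient_descent_1d, and the values of α and the step count are illustrative assumptions, not part of the lab notebook.

```python
def gradient_descent_1d(grad, w0, alpha, steps):
    """Repeat the update w <- w - alpha * grad(w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Illustrative cost J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
def grad_J(w):
    return 2.0 * (w - 3.0)

w_final = gradient_descent_1d(grad_J, w0=0.0, alpha=0.1, steps=100)
print(w_final)  # very close to 3
```

With α = 0.1, each step shrinks the distance to the minimum by a factor of 0.8, so the iterate settles near w = 3 well within 100 steps.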

Key Components

1. The Sigmoid Activation Function

For binary classification, we use the sigmoid function to convert linear combinations into probabilities between 0 and 1:

σ(z) = 1 / (1 + e^-z)

The sigmoid function has several desirable properties: it is differentiable everywhere, its outputs lie in the range (0, 1) and can be interpreted as probabilities, and it has a simple derivative: σ'(z) = σ(z) × (1 - σ(z)).
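
A minimal NumPy sketch of the sigmoid and its derivative follows. The function names sigmoid and sigmoid_prime are assumptions chosen for illustration. The finite-difference comparison at the end is one way to confirm the derivative identity numerically.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, using sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])

# Central finite difference as an independent estimate of the derivative.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(sigmoid(z))
print(np.allclose(numeric, sigmoid_prime(z), atol=1e-8))
```

For inputs that are very negative, np.exp(-z) can overflow and raise a warning. A clipped or piecewise formulation avoids this if it becomes a problem.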

2. Forward Propagation

In forward propagation, we compute the model's prediction by passing input through the network:

z = w₁x₁ + w₂x₂ + ... + b
ŷ = σ(z)

Where x represents input features, w represents weights, b is the bias term, and ŷ is the predicted output.
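
The two equations above can be sketched in vectorized form, where each row of X is one example. The function name forward and the sample values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w, b):
    """Compute y_hat = sigmoid(X @ w + b) for every row of X."""
    z = X @ w + b          # linear combination w1*x1 + w2*x2 + ... + b
    return sigmoid(z)      # squash into (0, 1)

# Two examples with two features each (made-up values).
X = np.array([[1.0, 2.0],
              [0.0, -1.0]])
w = np.array([0.5, -0.25])
b = 0.0

y_hat = forward(X, w, b)
print(y_hat)
```

For the first row, z = 0.5 × 1 - 0.25 × 2 = 0, so the prediction is exactly 0.5. For the second row, z = 0.25, which gives a prediction slightly above 0.5.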

3. Cross-Entropy Loss Function

For binary classification, we use cross-entropy loss to measure prediction error:

J = -1/m × Σ [y × log(ŷ) + (1-y) × log(1-ŷ)]

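Here m is the number of training examples and the sum runs over all of them. This loss function penalizes confident wrong predictions heavily and approaches zero as the predictions approach the true labels. It is convex for logistic regression, so any local minimum is also a global minimum, and gradient descent with a suitably small learning rate converges toward it.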
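
A short sketch of the loss in NumPy is shown below. The function name cross_entropy and the clipping constant eps are assumptions. Clipping is one common way to avoid log(0) when a prediction reaches exactly 0 or 1.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy between labels y and predictions y_hat."""
    # Keep predictions strictly inside (0, 1) so the logarithms stay finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0])

good = cross_entropy(y, np.array([0.9, 0.1]))   # confident and correct
bad = cross_entropy(y, np.array([0.1, 0.9]))    # confident and wrong

print(good, bad)
```

The confident wrong predictions produce a loss of -ln(0.1), roughly 2.30, while the confident correct ones produce -ln(0.9), roughly 0.105.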

4. Gradient Computation (Backward Pass)

The gradient tells us how to adjust each weight to reduce error. For logistic regression, the gradient of the loss with respect to weights is:

∂J/∂w_i = 1/m × Σ (ŷ - y) × x_i
∂J/∂b = 1/m × Σ (ŷ - y)

These gradients are computed for every weight and for the bias, then used to update the parameters in the direction that reduces the loss.
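
The sketch below computes these gradients in vectorized form and compares them with a finite-difference estimate, which is a useful sanity check when debugging. The function names, the random data, and the parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    """Mean binary cross-entropy of a logistic regression model."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def gradients(X, y, w, b):
    """Analytic gradients: dJ/dw = X^T (y_hat - y) / m, dJ/db = mean(y_hat - y)."""
    error = sigmoid(X @ w + b) - y
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()
    return grad_w, grad_b

# Small synthetic dataset (made up for this check).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.array([0.1, -0.2])
b = 0.05

grad_w, grad_b = gradients(X, y, w, b)

# Central finite differences, one parameter at a time.
eps = 1e-6
numeric_w = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric_w[i] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * eps)
numeric_b = (loss(X, y, w, b + eps) - loss(X, y, w, b - eps)) / (2 * eps)

print(np.allclose(grad_w, numeric_w, atol=1e-6), abs(grad_b - numeric_b) < 1e-6)
```

If the analytic and numeric values disagree, the usual suspects are a sign error in (ŷ - y) or a missing division by m.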

5. Learning Rate Selection

The learning rate α controls how large each update step is. Too large, and the algorithm may overshoot the minimum and diverge. Too small, and training will be extremely slow. Typical values range from 0.001 to 0.1, but the optimal value depends on the specific problem and dataset scale. In this lab, you'll experiment with different learning rates to understand their impact.
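
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes J(w) = w^2 with three learning rates. The cost function, the helper name run, and the chosen values of α are assumptions for illustration and are not the values used in the lab notebook.

```python
def run(alpha, steps=50, w0=1.0):
    """Minimize J(w) = w^2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

alphas = [0.01, 0.1, 1.1]
finals = [run(a) for a in alphas]

for a, w in zip(alphas, finals):
    print(f"alpha={a}: |w| after 50 steps = {abs(w):.3g}")
```

Each update multiplies w by (1 - 2α). With α = 0.01 the factor is 0.98, so progress is slow. With α = 0.1 it is 0.8, so w shrinks quickly. With α = 1.1 it is -1.2, so w flips sign on every step and grows in magnitude, which is divergence.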

🎓 Understanding Through Analogy

Imagine you're hiking down a mountain in thick fog, unable to see the bottom. Gradient descent is like feeling the slope beneath your feet and stepping in the steepest downward direction. The gradient points uphill, so you step the opposite way. The learning rate determines your step size: small steps are safe but slow, while large steps are faster but risk overshooting the valley. You keep walking until the ground is flat, which means you have reached a valley (a local minimum). In machine learning, the mountain is the cost function landscape, and the bottom of the valley represents the optimal weights.

Common Challenges in Gradient Descent

Local Minima: The algorithm may converge to suboptimal solutions (though convex problems like logistic regression don't have this issue)
Learning Rate Too High: Updates overshoot the minimum, causing oscillations or divergence
Learning Rate Too Low: Training becomes extremely slow, requiring many epochs to converge
Feature Scaling: Features with different scales can slow convergence; normalization helps
Vanishing/Exploding Gradients: In deep networks, gradients can become too small or too large
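
For the feature scaling issue listed above, one common remedy is standardization: subtract each feature's mean and divide by its standard deviation. The sketch below is a minimal version. The function name standardize and the sample data are assumptions.

```python
import numpy as np

def standardize(X):
    """Scale each column of X to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # leave constant columns unscaled
    return (X - mean) / std, mean, std

# Two features on very different scales (made-up values).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

X_scaled, mean, std = standardize(X)
print(X_scaled)
```

The returned mean and std should be reused to scale any new data, so that training and test inputs go through the same transformation.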

📹 Pre-lab Preparation

Before beginning the hands-on lab exercises, please watch the following video lectures from Udacity's "Introduction to Deep Learning with PyTorch" course. These videos provide essential background on neural network fundamentals, optimization, and gradient descent. The concepts covered will directly support your understanding of the lab implementation.

⚠️ Important: These pre-lab videos are required viewing. They cover fundamental concepts that are essential for successfully completing the lab exercises. Plan to spend approximately 2-3 hours watching these materials before your lab session.

📺 Video Lectures

Chapter 1: Introduction to Neural Networks

Chapter 2: Error Functions and Activation Functions

Chapter 3: Logistic Regression

▸ Logistic Regression 1
▸ Logistic Regression 2

⭐ Chapter 4: Gradient Descent (CORE TOPIC)

Chapter 5: Neural Network Architecture

Chapter 6: Feedforward and Backpropagation

✅ Pre-lab Quiz (MCQs)

Instructions: Test your understanding after watching the videos. You will also answer these questions in detail in your lab report.

Question 1: What is the primary purpose of gradient descent in machine learning?

A) To increase the dimensionality of the input features
B) To minimize the cost function by iteratively adjusting model parameters
C) To normalize the input data before training
D) To split the dataset into training and validation sets

Question 2: The sigmoid function σ(z) = 1/(1 + e^-z) outputs values in which range?

A) [-1, 1]
B) (0, 1)
C) [0, ∞)
D) (-∞, ∞)

Question 3: What happens if the learning rate (α) in gradient descent is too large?

A) The algorithm will converge faster with guaranteed accuracy
B) Training will be very slow but always reach the global minimum
C) The algorithm may overshoot the minimum and diverge or oscillate
D) The model will automatically stop training at the optimal point

Question 4: Why is cross-entropy loss preferred over mean squared error for classification problems?

A) It's computationally faster to calculate
B) It provides better gradient signals for probability-based outputs and penalizes confident wrong predictions more heavily
C) It always produces lower loss values than MSE
D) It doesn't require the use of activation functions

Question 5: In the gradient descent update rule w_new = w_old - α × ∇J(w), what does ∇J(w) represent?

A) The current weight values
B) The learning rate multiplied by the loss
C) The gradient (slope) of the cost function with respect to the weights
D) The final prediction of the model

📝 Lab Report Requirement

In your lab report, you must provide detailed written answers to the following questions (not just multiple choice):

What is the mathematical formula for gradient descent weight updates? Explain each component (w, α, ∇J(w)).
Why do we use the sigmoid function for binary classification? What range does it output and why is this useful?
What is cross-entropy loss and why is it preferred over mean squared error for classification?
How does learning rate affect the convergence of gradient descent? What happens if it's too large or too small?
Explain the difference between forward propagation and backpropagation in neural network training.

🔬 Lab Procedure

This laboratory consists of one comprehensive hands-on exercise where you'll implement gradient descent from scratch and apply it to a binary classification problem. You'll build the core components step-by-step and visualize how the model learns.

Part 1: Gradient Descent Implementation

In this exercise, you'll implement the complete gradient descent algorithm from scratch without using high-level machine learning libraries. You'll build each component—sigmoid function, forward propagation, error calculation, and weight updates—to understand how neural networks learn through optimization.

What You'll Implement:

Sigmoid Activation Function: Convert linear outputs to probabilities
Forward Pass: Calculate predictions from inputs and weights
Cross-Entropy Loss: Measure prediction error
Gradient Computation: Calculate how to adjust weights
Weight Updates: Apply gradient descent to improve the model
Training Loop: Iterate through epochs to minimize error
Visualization: Plot decision boundary evolution and error curves
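
As a rough picture of how these pieces fit together, the sketch below trains a logistic regression model on synthetic data. It is not the lab notebook's code: the function name train, the two-blob dataset, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=200, seed=0):
    """Full-batch gradient descent for logistic regression."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random start
    b = 0.0
    losses = []
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)                        # forward pass
        clipped = np.clip(y_hat, 1e-12, 1.0 - 1e-12)      # avoid log(0)
        losses.append(-np.mean(y * np.log(clipped)
                               + (1.0 - y) * np.log(1.0 - clipped)))
        error = y_hat - y                                 # backward pass
        w -= alpha * (X.T @ error) / len(y)               # weight update
        b -= alpha * error.mean()                         # bias update
    return w, b, losses

# Synthetic data: two Gaussian blobs, one per class.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1.5, 1.0, size=(50, 2)),
               rng.normal(1.5, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, losses = train(X, y)
predictions = sigmoid(X @ w + b) >= 0.5
accuracy = np.mean(predictions == (y == 1.0))
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

The decision boundary of the trained model is the line w₁x₁ + w₂x₂ + b = 0. Plotting that line at several points during training, along with the losses list, gives the two visualizations named in the list above.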

Learning Outcomes:

Understand the mathematical foundations of neural network training
Implement optimization algorithms from first principles
Visualize and interpret the learning process
Experiment with hyperparameters (learning rate, epochs)
Debug and optimize gradient descent implementations

📌 Important: Start with the exercise notebook and attempt all implementations yourself before viewing the solution. Use the "Exercise Code Explained" resource if you need help understanding the starter code. The password for solutions will be provided by your instructor after you've made a genuine attempt at the exercises. Learning happens through struggle and problem-solving!

🔧 Lab Materials

Software Requirements

Python: Version 3.8 or higher
NumPy: For numerical operations and array manipulation
Matplotlib: For visualization and plotting
Jupyter Notebook: For running interactive exercises (optional)

Included Files

All necessary code and exercises are included in the lab HTML files:

GradientDescent.html - Exercise file with tasks to complete
GradientDescentSolutions.html - Solution file (password-protected)
GradientDescentExplained.html - Detailed explanations of exercise code
GradientDescentSolutionsExplained.html - Detailed explanations of solution code (password-protected)
All required helper functions included within the exercise
Inline documentation and explanations

⚠️ Setup Verification:

Before starting the lab, verify your Python environment has NumPy and Matplotlib installed. You can install them using: pip install numpy matplotlib
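
One possible way to run this check from Python itself is sketched below. It only tests whether the packages can be found and whether the interpreter meets the 3.8 requirement. The variable names are assumptions.

```python
import importlib.util
import sys

required = ["numpy", "matplotlib"]

# find_spec returns None when a top-level package is not installed.
missing = [name for name in required if importlib.util.find_spec(name) is None]
python_ok = sys.version_info >= (3, 8)

print("Python version OK:", python_ok)
print("Missing packages:", missing if missing else "none")
```

If any package is reported as missing, install it with the pip command above and run the check again.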

📄 Lab Report Requirements

Students must submit a comprehensive lab report demonstrating their understanding of gradient descent optimization and its implementation. The report should showcase practical skills acquired through the laboratory exercises and include evidence of all completed tasks.

⚠️ Submission Deadline:

Submit your completed lab report by [Insert Deadline - Typically 1 week after lab session]. Late submissions will be penalized according to course policy (10% per day, maximum 3 days).

Report Structure

Your lab report must include the following sections:

1. Title Page & Formatting (5 points)

Lab title, your name, student ID, date, course name, and instructor name
Professional formatting with clear headers and page numbers

2. Objectives (10 points)

List all learning objectives
Briefly explain why each is important (1-2 sentences each)

3. Procedure & Results (50 points)

For the gradient descent implementation:

Include code snippets with clear outputs
Provide screenshots of key results (training output, plots)
Add plots: decision boundary evolution and error curves
Explain what each component demonstrates

4. Discussion (20 points)

Analyze your experimental results
Compare different learning rates and their effects
Discuss the impact of hyperparameters (learning rate, epochs)
Support all statements with evidence from your experiments

5. Challenges & Solutions (10 points)

Describe problems you encountered
Explain your debugging process
Reflect on what you learned from solving these challenges

6. Conclusion (5 points)

Summarize key learnings
Reflect on the most challenging concepts
Discuss potential applications of this knowledge

📋 Submission Checklist

Before submitting, ensure you have:

✓ Completed gradient descent implementation with working code
✓ Included clear screenshots of all outputs, plots, and visualizations
✓ Answered all discussion questions thoroughly with supporting evidence
✓ Documented challenges and solutions in detail
✓ Checked all code for errors and verified all functions execute correctly
✓ Formatted report professionally with clear section headers and page numbers
✓ Referenced all sources and datasets used
✓ Proofread for grammar, spelling, and technical accuracy
✓ Verified all images are clear, properly labeled, and referenced in text
✓ Included your name and student ID on all pages

📤 Submission Format

File Format: Submit report as PDF document (required)
Code Files: Include Jupyter notebooks (.ipynb) in a separate ZIP file
File Naming Convention:
- Report: Week2_[YourLastName]_[StudentID].pdf
- Code: Week2_[YourLastName]_[StudentID]_Code.zip
- Example: Week2_Ahmed_202012345.pdf
Submission Method: Upload to University LMS (Blackboard/Moodle)
File Size Limit: Maximum 50MB total
- If exceeded, compress images or use PDF compression tools
- Ensure PDF is searchable text, not scanned images
Required Components:
- 1. Main PDF lab report
- 2. ZIP file containing all code files with outputs

Important:

Ensure PDF is searchable and not password-protected
All code must be properly commented and executable
Include all necessary imports and dependencies
Test that your code runs completely from top to bottom
Make sure all plots and visualizations are clearly visible

📊 Grading Rubric

Component	Points	Criteria
Title Page & Formatting	5	Complete, professional presentation
Objectives	10	Clear, comprehensive understanding demonstrated
Procedure & Results	50	Complete implementation with correct outputs
Discussion	20	Thoughtful analysis, supported by results
Challenges & Solutions	10	Detailed problem-solving process
Conclusion	5	Reflective, insightful
Total	100

Grading Notes:

All code must execute without errors for full credit
Screenshots must be clear, properly labeled, and referenced
All experiments must be completed with comparative analysis
Mathematical explanations must be accurate and well-written
Late penalty: 10% per day (up to 3 days)
Plagiarism will result in zero credit

📖 References & Additional Resources

Primary Course Material

Udacity: Introduction to Deep Learning with PyTorch - Chapter: Introduction to Neural Networks

Recommended Textbooks

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Chapter 4: Numerical Computation (Gradient Descent)
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Chapter 5: Neural Networks
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.
Free online book: neuralnetworksanddeeplearning.com

Online Resources

3Blue1Brown: Neural Networks Series (YouTube)
Excellent visual explanations of gradient descent and backpropagation
Andrew Ng: Machine Learning Course (Coursera)
Weeks 1-2 cover gradient descent fundamentals
Stanford CS229: Machine Learning Course Notes
Mathematical foundations of gradient descent
Towards Data Science: Gradient Descent Optimization Algorithms
Comparison of SGD, Adam, RMSprop, and other variants

Python Libraries Documentation

NumPy: numpy.org - Array operations and mathematical functions
Matplotlib: matplotlib.org - Data visualization and plotting
Scikit-learn: scikit-learn.org - Machine learning algorithms reference

Research Papers (Optional Advanced Reading)

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

💡 Tips for Success

Start early - gradient descent can be tricky to implement correctly
Test each function independently before combining them
Use small datasets first to verify your implementation
Print intermediate values to debug issues
Visualize your results - plots reveal problems that numbers hide
Experiment beyond the required tests - try edge cases
Ask questions in office hours if you're stuck
Review the pre-lab videos if concepts are unclear

Getting Help

If you encounter difficulties:

Office Hours: Attend instructor office hours for one-on-one help
Discussion Forum: Post questions on the course LMS discussion board
Study Groups: Collaborate with classmates (but submit individual work)
Lab TAs: Ask teaching assistants during lab sessions
Email: Contact instructor for specific questions or clarifications
