This laboratory introduces students to gradient descent, the fundamental optimization algorithm behind most modern machine learning and deep learning systems. Through hands-on implementation from scratch, students will develop a deep understanding of how neural networks learn by iteratively adjusting their parameters to minimize error. This foundational knowledge is essential for understanding training processes, debugging models, and building effective AI systems.
What You'll Learn
- Gradient Descent Algorithm: Master the core optimization technique used in nearly all neural network training
- Cost Function Minimization: Understand how to measure and reduce prediction errors
- Learning Rate Effects: Discover how learning rate impacts convergence speed and stability
- Decision Boundary Visualization: See how optimization shapes the decision boundary over time
- From-Scratch Implementation: Build gradient descent without high-level libraries to understand the mechanics
- Binary Classification: Apply gradient descent to solve real classification problems
💡 Why This Matters
Gradient descent is the engine behind nearly every neural network training process. Whether you're training a simple logistic regression model or a massive transformer with billions of parameters, gradient descent or one of its variants is at work, searching for weights that minimize prediction error. Understanding gradient descent from first principles will help you debug training issues, select appropriate hyperparameters, design better architectures, and grasp advanced optimization techniques like Adam, RMSprop, and momentum-based methods. This is your foundation for all future machine learning work.
Lab Structure
This laboratory consists of one comprehensive part that builds your understanding step-by-step:
- Part 1: Gradient Descent Implementation
  - Implement the sigmoid activation function
  - Calculate model outputs (forward pass)
  - Compute the cross-entropy error function
  - Update weights using gradient descent (backward pass)
  - Train a model and visualize decision boundaries
  - Experiment with different learning rates and epochs
By the end of this laboratory session, you will be able to:
- Understand gradient descent optimization fundamentals including how gradients indicate the direction of steepest descent and how iterative updates minimize cost functions.
- Implement gradient descent algorithm from scratch including forward propagation, error calculation, gradient computation, and weight updates without using high-level machine learning libraries.
- Visualize the learning process and decision boundaries to understand how the model evolves during training and how the decision boundary adapts to separate classes.
- Understand cost/loss functions and their role in measuring prediction accuracy, particularly the cross-entropy loss for classification problems.
- Apply gradient descent to classification problems by training a logistic regression model on real datasets and achieving accurate predictions.
- Analyze the effect of learning rates on convergence by experimenting with different learning rate values and observing their impact on training speed and stability.
- Debug and optimize gradient descent implementations by identifying common issues such as divergence, slow convergence, and numerical instability, and applying appropriate solutions.
Introduction to Gradient Descent
Gradient descent is an iterative optimization algorithm used to minimize a cost function by adjusting model parameters in the direction of steepest descent. In machine learning, we use gradient descent to find the optimal weights that minimize the difference between predicted and actual outputs. The algorithm follows a simple principle: calculate the gradient (slope) of the cost function with respect to each parameter, then adjust parameters in the opposite direction of the gradient.
The Gradient Descent Algorithm
The general gradient descent update rule is:
$$w_{\text{new}} = w_{\text{old}} - \alpha \, \nabla J(w)$$
Where:
- w: Model parameters (weights and biases)
- α (alpha): Learning rate - controls the step size
- ∇J(w): Gradient of the cost function with respect to weights
- J(w): Cost function that measures prediction error
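To make the update rule concrete, here is a minimal sketch that applies it to a one-parameter toy cost. The cost J(w) = (w - 3)^2, the function name `gradient_descent`, and the values of `alpha` and `steps` are illustrative assumptions, not part of the lab's starter code.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeatedly apply w_new = w_old - alpha * grad(w_old)."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


# Toy cost (assumed for illustration): J(w) = (w - 3)^2.
# Its gradient is 2 * (w - 3), and its minimum sits at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, alpha=0.1, steps=100)
print(w_final)  # should be very close to 3
```

Each step shrinks the distance to the minimum by a constant factor here, which is why a fixed number of steps is enough for this toy problem.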
Key Components
1. The Sigmoid Activation Function
For binary classification, we use the sigmoid function to convert linear combinations into probabilities between 0 and 1:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid function has several desirable properties: it is differentiable everywhere, it outputs values in the range (0, 1), which can be interpreted as probabilities, and it has a simple derivative: σ'(z) = σ(z) × (1 - σ(z)).
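The following sketch shows one way to write the sigmoid and its derivative with NumPy, and checks the derivative formula against a finite-difference estimate. The function names and test points are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = np.array([-2.0, 0.0, 2.0])

# Central finite difference as an independent check of the derivative formula.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(sigmoid(z))
print(np.allclose(numeric, sigmoid_prime(z)))
```

For very large negative inputs, `np.exp(-z)` can overflow and emit a warning; this simple form is usually fine for the small, well-scaled inputs used in an introductory exercise.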
2. Forward Propagation
In forward propagation, we compute the model's prediction by passing input through the network:
$$z = w_1 x_1 + w_2 x_2 + \dots + b$$
$$\hat{y} = \sigma(z)$$
Where x represents input features, w represents weights, b is the bias term, and ŷ is the predicted output.
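As a rough illustration, the forward pass for a whole batch of inputs can be written as a single matrix-vector product. The array shapes, the function name `forward`, and the sample values below are assumptions for this sketch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, w, b):
    """Compute y_hat = sigmoid(X @ w + b) for every row of X."""
    z = X @ w + b          # one linear combination per example
    return sigmoid(z)


# Three examples with two features each (made-up values).
X = np.array([[0.5, 1.0],
              [-1.0, 2.0],
              [0.0, 0.0]])
w = np.array([1.0, -1.0])
b = 0.0

y_hat = forward(X, w, b)
print(y_hat)
```

With `b = 0`, the all-zero input gives z = 0 and therefore a prediction of exactly 0.5, which is a handy sanity check.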
3. Cross-Entropy Loss Function
For binary classification, we use cross-entropy loss to measure prediction error:
$$J = -\frac{1}{m} \sum \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$
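This loss function heavily penalizes confident wrong predictions and approaches zero as predictions become both confident and correct. Here m is the number of training examples, and the sum runs over all of them. The loss is convex for logistic regression, so it has no suboptimal local minima, and gradient descent with a suitably small learning rate makes steady progress toward the global minimum.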
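A minimal sketch of the cross-entropy computation is shown below. Clipping the predictions away from exactly 0 and 1 is an added safeguard (an assumption, not something the formula requires) so that `log(0)` never occurs; the `eps` value is arbitrary.

```python
import numpy as np


def cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy between labels y and predictions y_hat."""
    # Assumed safeguard: keep predictions strictly inside (0, 1) to avoid log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


y = np.array([1.0, 0.0])

good = cross_entropy(y, np.array([0.9, 0.1]))  # confident and correct
bad = cross_entropy(y, np.array([0.1, 0.9]))   # confident and wrong

print(good, bad)
```

The confidently wrong predictions produce a much larger loss than the confidently correct ones, which is the behavior described above.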
4. Gradient Computation (Backward Pass)
The gradient tells us how to adjust each weight to reduce error. For logistic regression, the gradient of the loss with respect to weights is:
$$\frac{\partial J}{\partial w_i} = \frac{1}{m} \sum (\hat{y} - y) \, x_i$$
This gradient is computed for each weight. The bias gradient follows the same pattern with the input replaced by 1: ∂J/∂b = 1/m × Σ (ŷ - y). Each parameter is then updated in the direction that reduces the loss.
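One way to gain confidence in a gradient implementation is to compare it with a finite-difference estimate of the loss. The sketch below does this for a single weight; the synthetic data, starting weights, and function names are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(X, y, w, b):
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def gradients(X, y, w, b):
    """Analytic gradients: (1/m) * sum((y_hat - y) * x_i) and (1/m) * sum(y_hat - y)."""
    error = sigmoid(X @ w + b) - y        # one entry per example
    grad_w = X.T @ error / len(y)
    grad_b = np.mean(error)
    return grad_w, grad_b


# Small synthetic dataset (assumed, not the lab's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.array([0.3, -0.2])
b = 0.1

grad_w, grad_b = gradients(X, y, w, b)

# Finite-difference estimate of dJ/dw_0 for comparison.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
numeric = (loss(X, y, w_plus, b) - loss(X, y, w_minus, b)) / (2 * eps)

print(grad_w[0], numeric)  # the two values should agree closely
```

If the two numbers disagree noticeably, the usual suspects are a sign error in `error` or a missing division by the number of examples.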
5. Learning Rate Selection
The learning rate α controls how large each update step is. Too large, and the algorithm may overshoot the minimum and diverge. Too small, and training will be extremely slow. Typical values range from 0.001 to 0.1, but the optimal value depends on the specific problem and dataset scale. In this lab, you'll experiment with different learning rates to understand their impact.
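The effect of the learning rate can be seen even on a one-parameter toy cost. The sketch below uses J(w) = w^2 (gradient 2w) with three learning rates; the cost and the specific rates are assumptions picked to show slow convergence, fast convergence, and divergence, and the thresholds apply only to this toy problem.

```python
def run(alpha, steps=20, w0=1.0):
    """Run gradient descent on J(w) = w^2, whose gradient is 2 * w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w


# Each update multiplies w by (1 - 2 * alpha), so:
#   alpha = 0.01 -> factor 0.98 (slow), alpha = 0.4 -> factor 0.2 (fast),
#   alpha = 1.1  -> factor -1.2 (oscillates with growing magnitude).
results = {alpha: run(alpha) for alpha in (0.01, 0.4, 1.1)}

for alpha, w in results.items():
    print(f"alpha={alpha}: w after 20 steps = {w:.3g}")
```

For this particular cost, any learning rate above 1.0 makes the multiplier's magnitude exceed 1, so the iterates move away from the minimum instead of toward it.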
🎓 Understanding Through Analogy
Imagine you're hiking down a mountain in thick fog (you can't see the bottom). Gradient descent is like feeling the slope beneath your feet and taking steps in the steepest downward direction. The learning rate determines your step size—small steps are safe but slow, while large steps are faster but risk overshooting valleys. The gradient tells you which direction is steepest, and you keep walking until you reach a valley (local minimum). In machine learning, the mountain is the cost function landscape, and the valley represents optimal weights.
Common Challenges in Gradient Descent
- Local Minima: The algorithm may converge to suboptimal solutions (though convex problems like logistic regression don't have this issue)
- Learning Rate Too High: Updates overshoot the minimum, causing oscillations or divergence
- Learning Rate Too Low: Training becomes extremely slow, requiring many epochs to converge
- Feature Scaling: Features with different scales can slow convergence; normalization helps
- Vanishing/Exploding Gradients: In deep networks, gradients can become too small or too large
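For the feature-scaling issue listed above, a common remedy is to standardize each feature to zero mean and unit variance before training. The sketch below is one simple way to do that; the function name `standardize` and the sample values are assumptions.

```python
import numpy as np


def standardize(X):
    """Scale each column of X to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std, mean, std


# Two features on very different scales (made-up values).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

X_scaled, mean, std = standardize(X)
print(X_scaled)
```

The returned `mean` and `std` should be reused to transform any new data, so that training and prediction see inputs on the same scale.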
Before beginning the hands-on lab exercises, please watch the following video lectures from Udacity's "Introduction to Deep Learning with PyTorch" course. These videos provide essential background on neural network fundamentals, optimization, and gradient descent. The concepts covered will directly support your understanding of the lab implementation.
⚠️ Important: These pre-lab videos are required viewing. They cover fundamental concepts that are essential for successfully completing the lab exercises. Plan to spend approximately 2-3 hours watching these materials before your lab session.
📺 Video Lectures
Chapter 1: Introduction to Neural Networks
Chapter 2: Error Functions and Activation Functions
Chapter 3: Logistic Regression
⭐ Chapter 4: Gradient Descent (CORE TOPIC)
Chapter 5: Neural Network Architecture
Chapter 6: Feedforward and Backpropagation
✅ Pre-lab Quiz (MCQs)
Instructions: Test your understanding after watching the videos. You will also answer these questions in detail in your lab report.
Question 1: What is the primary purpose of gradient descent in machine learning?
- A) To increase the dimensionality of the input features
- B) To minimize the cost function by iteratively adjusting model parameters
- C) To normalize the input data before training
- D) To split the dataset into training and validation sets
Question 2: The sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ outputs values in which range?
- A) [-1, 1]
- B) (0, 1)
- C) [0, ∞)
- D) (-∞, ∞)
Question 3: What happens if the learning rate (α) in gradient descent is too large?
- A) The algorithm will converge faster with guaranteed accuracy
- B) Training will be very slow but always reach the global minimum
- C) The algorithm may overshoot the minimum and diverge or oscillate
- D) The model will automatically stop training at the optimal point
Question 4: Why is cross-entropy loss preferred over mean squared error for classification problems?
- A) It's computationally faster to calculate
- B) It provides better gradient signals for probability-based outputs and penalizes confident wrong predictions more heavily
- C) It always produces lower loss values than MSE
- D) It doesn't require the use of activation functions
Question 5: In the gradient descent update rule $w_{\text{new}} = w_{\text{old}} - \alpha \, \nabla J(w)$, what does ∇J(w) represent?
- A) The current weight values
- B) The learning rate multiplied by the loss
- C) The gradient (slope) of the cost function with respect to the weights
- D) The final prediction of the model
📝 Lab Report Requirement
In your lab report, you must provide detailed written answers to the following questions (not just multiple choice):
- What is the mathematical formula for gradient descent weight updates? Explain each component (w, α, ∇J(w)).
- Why do we use the sigmoid function for binary classification? What range does it output and why is this useful?
- What is cross-entropy loss and why is it preferred over mean squared error for classification?
- How does learning rate affect the convergence of gradient descent? What happens if it's too large or too small?
- Explain the difference between forward propagation and backpropagation in neural network training.
This laboratory consists of one comprehensive hands-on exercise where you'll implement gradient descent from scratch and apply it to a binary classification problem. You'll build the core components step-by-step and visualize how the model learns.
Part 1: Gradient Descent Implementation
In this exercise, you'll implement the complete gradient descent algorithm from scratch without using high-level machine learning libraries. You'll build each component—sigmoid function, forward propagation, error calculation, and weight updates—to understand how neural networks learn through optimization.
What You'll Implement:
- Sigmoid Activation Function: Convert linear outputs to probabilities
- Forward Pass: Calculate predictions from inputs and weights
- Cross-Entropy Loss: Measure prediction error
- Gradient Computation: Calculate how to adjust weights
- Weight Updates: Apply gradient descent to improve the model
- Training Loop: Iterate through epochs to minimize error
- Visualization: Plot decision boundary evolution and error curves
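To show how the components listed above fit together, here is a compact end-to-end sketch on synthetic data. It is not the lab's solution: the dataset, the function name `train`, the initialization, and the hyperparameter values are all assumptions, and plotting is left out to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, y, alpha=0.1, epochs=200, seed=0):
    """Fit weights and bias with batch gradient descent; return them and the loss history."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random start (assumed choice)
    b = 0.0
    history = []
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)                       # forward pass
        clipped = np.clip(y_hat, 1e-12, 1.0 - 1e-12)     # avoid log(0)
        history.append(-np.mean(y * np.log(clipped) + (1.0 - y) * np.log(1.0 - clipped)))
        error = y_hat - y                                # gradient computation
        w -= alpha * (X.T @ error) / len(y)              # weight update
        b -= alpha * np.mean(error)                      # bias update
    return w, b, history


# Synthetic two-class data: two Gaussian blobs (hypothetical, not the lab's dataset).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
               rng.normal(1.0, 0.5, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, history = train(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1))
print(f"first loss {history[0]:.3f}, last loss {history[-1]:.3f}, accuracy {accuracy:.2f}")
```

With two input features, the decision boundary is the line where w1·x1 + w2·x2 + b = 0, so plotting that line at several epochs shows how the boundary moves during training.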
Learning Outcomes:
- Understand the mathematical foundations of neural network training
- Implement optimization algorithms from first principles
- Visualize and interpret the learning process
- Experiment with hyperparameters (learning rate, epochs)
- Debug and optimize gradient descent implementations
📌 Important: Start with the exercise notebook and attempt all implementations yourself before viewing the solution. Use the "Exercise Code Explained" resource if you need help understanding the starter code. The password for solutions will be provided by your instructor after you've made a genuine attempt at the exercises. Learning happens through struggle and problem-solving!
Software Requirements
- Python: Version 3.8 or higher
- NumPy: For numerical operations and array manipulation
- Matplotlib: For visualization and plotting
- Jupyter Notebook: For running interactive exercises (optional)
Included Files
All necessary code and exercises are included in the lab HTML files:
- GradientDescent.html - Exercise file with tasks to complete
- GradientDescentSolutions.html - Solution file (password-protected)
- GradientDescentExplained.html - Detailed explanations of exercise code
- GradientDescentSolutionsExplained.html - Detailed explanations of solution code (password-protected)
- All required helper functions included within the exercise
- Inline documentation and explanations
⚠️ Setup Verification:
Before starting the lab, verify your Python environment has NumPy and Matplotlib installed. You can install them using: pip install numpy matplotlib
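As one possible way to perform this check from Python itself, the sketch below reports which packages are importable and, where available, their installed versions. The helper name `check_packages` is an assumption; it uses only the standard library.

```python
import importlib.util
from importlib import metadata


def check_packages(names):
    """Return {name: version string, or None if the package is not installed}."""
    report = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            report[name] = None
        else:
            try:
                report[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                report[name] = "unknown"  # importable, but no version metadata found
    return report


for name, version in check_packages(["numpy", "matplotlib"]).items():
    status = f"version {version}" if version else "NOT INSTALLED"
    print(f"{name}: {status}")
```

If either package is reported as not installed, run the pip command above and repeat the check.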
Students must submit a comprehensive lab report demonstrating their understanding of gradient descent optimization and its implementation. The report should showcase practical skills acquired through the laboratory exercises and include evidence of all completed tasks.
⚠️ Submission Deadline:
Submit your completed lab report by [Insert Deadline - Typically 1 week after lab session]. Late submissions will be penalized according to course policy (10% per day, maximum 3 days).
Report Structure
Your lab report must include the following sections:
1. Title Page & Formatting (5 points)
- Lab title, your name, student ID, date, course name, and instructor name
- Professional formatting with clear headers and page numbers
2. Objectives (10 points)
- List all learning objectives
- Briefly explain why each is important (1-2 sentences each)
3. Procedure & Results (50 points)
For the gradient descent implementation:
- Include code snippets with clear outputs
- Provide screenshots of key results (training output, plots)
- Add plots: decision boundary evolution and error curves
- Explain what each component demonstrates
4. Discussion (20 points)
- Analyze your experimental results
- Compare different learning rates and their effects
- Discuss the impact of hyperparameters (learning rate, epochs)
- Support all statements with evidence from your experiments
5. Challenges & Solutions (10 points)
- Describe problems you encountered
- Explain your debugging process
- Reflect on what you learned from solving these challenges
6. Conclusion (5 points)
- Summarize key learnings
- Reflect on the most challenging concepts
- Discuss potential applications of this knowledge
📋 Submission Checklist
Before submitting, ensure you have:
- ✓ Completed gradient descent implementation with working code
- ✓ Included clear screenshots of all outputs, plots, and visualizations
- ✓ Answered all discussion questions thoroughly with supporting evidence
- ✓ Documented challenges and solutions in detail
- ✓ Checked all code for errors and verified all functions execute correctly
- ✓ Formatted report professionally with clear section headers and page numbers
- ✓ Referenced all sources and datasets used
- ✓ Proofread for grammar, spelling, and technical accuracy
- ✓ Verified all images are clear, properly labeled, and referenced in text
- ✓ Included your name and student ID on all pages
📤 Submission Format
- File Format: Submit report as PDF document (required)
- Code Files: Include Jupyter notebooks (.ipynb) in a separate ZIP file
- File Naming Convention:
  - Report: Week2_[YourLastName]_[StudentID].pdf
  - Code: Week2_[YourLastName]_[StudentID]_Code.zip
  - Example: Week2_Ahmed_202012345.pdf
- Submission Method: Upload to University LMS (Blackboard/Moodle)
- File Size Limit: Maximum 50MB total
  - If exceeded, compress images or use PDF compression tools
  - Ensure PDF is searchable text, not scanned images
- Required Components:
  - 1. Main PDF lab report
  - 2. ZIP file containing all code files with outputs
Important:
- Ensure PDF is searchable and not password-protected
- All code must be properly commented and executable
- Include all necessary imports and dependencies
- Test that your code runs completely from top to bottom
- Make sure all plots and visualizations are clearly visible
📊 Grading Rubric
| Component | Points | Criteria |
|---|---|---|
| Title Page & Formatting | 5 | Complete, professional presentation |
| Objectives | 10 | Clear, comprehensive understanding demonstrated |
| Procedure & Results | 50 | Complete implementation with correct outputs |
| Discussion | 20 | Thoughtful analysis, supported by results |
| Challenges & Solutions | 10 | Detailed problem-solving process |
| Conclusion | 5 | Reflective, insightful |
| Total | 100 | |
Grading Notes:
- All code must execute without errors for full credit
- Screenshots must be clear, properly labeled, and referenced
- All experiments must be completed with comparative analysis
- Mathematical explanations must be accurate and well-written
- Late penalty: 10% per day (up to 3 days)
- Plagiarism will result in zero credit
Primary Course Material
- Udacity: Introduction to Deep Learning with PyTorch - Chapter: Introduction to Neural Networks
Recommended Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4: Numerical Computation (Gradient Descent).
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 5: Neural Networks.
- Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press. Free online book: neuralnetworksanddeeplearning.com
Online Resources
- 3Blue1Brown: Neural Networks Series (YouTube). Excellent visual explanations of gradient descent and backpropagation.
- Andrew Ng: Machine Learning Course (Coursera). Weeks 1-2 cover gradient descent fundamentals.
- Stanford CS229: Machine Learning Course Notes. Mathematical foundations of gradient descent.
- Towards Data Science: Gradient Descent Optimization Algorithms. Comparison of SGD, Adam, RMSprop, and other variants.
Python Libraries Documentation
- NumPy: numpy.org - Array operations and mathematical functions
- Matplotlib: matplotlib.org - Data visualization and plotting
- Scikit-learn: scikit-learn.org - Machine learning algorithms reference
Research Papers (Optional Advanced Reading)
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
💡 Tips for Success
- Start early - gradient descent can be tricky to implement correctly
- Test each function independently before combining them
- Use small datasets first to verify your implementation
- Print intermediate values to debug issues
- Visualize your results - plots reveal problems that numbers hide
- Experiment beyond the required tests - try edge cases
- Ask questions in office hours if you're stuck
- Review the pre-lab videos if concepts are unclear
Getting Help
If you encounter difficulties:
- Office Hours: Attend instructor office hours for one-on-one help
- Discussion Forum: Post questions on the course LMS discussion board
- Study Groups: Collaborate with classmates (but submit individual work)
- Lab TAs: Ask teaching assistants during lab sessions
- Email: Contact instructor for specific questions or clarifications