1. Data

Deadline and Submission

05.Sep (Friday)

Commits until 23:59

Individual

Submit the GitHub Pages link (yes, only the link to the pages) via insper.blackboard.com.

Activity: Data Preparation and Analysis for Neural Networks

This activity is designed to test your skills in generating synthetic datasets, handling real-world data challenges, and preparing data to be fed into neural networks.

Exercise 1

Exploring Class Separability in 2D

Understanding how data is distributed is the first step before designing a network architecture. In this exercise, you will generate and visualize a two-dimensional dataset to explore how data distribution affects the complexity of the decision boundaries a neural network would need to learn.

Instructions

Generate the Data: Create a synthetic dataset with a total of 400 samples, divided equally among 4 classes (100 samples each). Use a Gaussian distribution to generate the points for each class based on the following parameters:
- Class 0: Mean = \([2, 3]\), Standard Deviation = \([0.8, 2.5]\)
- Class 1: Mean = \([5, 6]\), Standard Deviation = \([1.2, 1.9]\)
- Class 2: Mean = \([8, 1]\), Standard Deviation = \([0.9, 0.9]\)
- Class 3: Mean = \([15, 4]\), Standard Deviation = \([0.5, 2.0]\)
Plot the Data: Create a 2D scatter plot showing all the data points. Use a different color for each class to make them distinguishable.
Analyze and Draw Boundaries:
1. Examine the scatter plot carefully. Describe the distribution and overlap of the four classes.
2. Based on your visual inspection, could a simple, linear boundary separate all classes?
3. On your plot, sketch the decision boundaries that you think a trained neural network might learn to separate these classes.
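The generation and plotting steps above can be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the two standard deviation values are per-axis spreads with no correlation between axes, and the names `CLASS_PARAMS`, `generate_classes`, `plot_classes` and the seed value are illustrative choices that do not come from the assignment.

```python
import numpy as np

# Class parameters copied from the instructions: (mean, per-axis standard deviation).
CLASS_PARAMS = {
    0: ([2.0, 3.0], [0.8, 2.5]),
    1: ([5.0, 6.0], [1.2, 1.9]),
    2: ([8.0, 1.0], [0.9, 0.9]),
    3: ([15.0, 4.0], [0.5, 2.0]),
}


def generate_classes(n_per_class=100, seed=42):
    """Draw n_per_class points per class from axis-aligned 2D Gaussians."""
    rng = np.random.default_rng(seed)  # arbitrary seed, only for reproducibility
    points, labels = [], []
    for label, (mean, std) in CLASS_PARAMS.items():
        # loc and scale broadcast across the rows, giving an (n_per_class, 2) array
        points.append(rng.normal(loc=mean, scale=std, size=(n_per_class, 2)))
        labels.append(np.full(n_per_class, label))
    return np.vstack(points), np.concatenate(labels)


def plot_classes(X, y):
    """Scatter plot with one color per class."""
    # Imported here so that data generation works without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 5))
    for label in np.unique(y):
        mask = y == label
        ax.scatter(X[mask, 0], X[mask, 1], s=12, label=f"Class {label}")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.legend()
    return fig


X, y = generate_classes()
```

Calling `plot_classes(X, y)` returns a figure on which the boundaries can then be sketched by hand or with additional `ax.plot` calls.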

Exercise 2

Non-Linearity in Higher Dimensions

Simple neural networks (like a single-layer Perceptron) can only learn linear decision boundaries, whereas deep networks excel when the data is not linearly separable. This exercise challenges you to create and visualize such a dataset.

Instructions

Generate the Data: Create a dataset with 500 samples for Class A and 500 samples for Class B. Use a multivariate normal distribution with the following parameters:
- Class A:
  
  Mean vector:
  
  \[\mu_A = [0, 0, 0, 0, 0]\]
  
  Covariance matrix:
  
  \[ \Sigma_A = \begin{pmatrix} 1.0 & 0.8 & 0.1 & 0.0 & 0.0 \\ 0.8 & 1.0 & 0.3 & 0.0 & 0.0 \\ 0.1 & 0.3 & 1.0 & 0.5 & 0.0 \\ 0.0 & 0.0 & 0.5 & 1.0 & 0.2 \\ 0.0 & 0.0 & 0.0 & 0.2 & 1.0 \end{pmatrix} \]
- Class B:
  
  Mean vector:
  
  \[\mu_B = [1.5, 1.5, 1.5, 1.5, 1.5]\]
  
  Covariance matrix:
  
  \[ \Sigma_B = \begin{pmatrix} 1.5 & -0.7 & 0.2 & 0.0 & 0.0 \\ -0.7 & 1.5 & 0.4 & 0.0 & 0.0 \\ 0.2 & 0.4 & 1.5 & 0.6 & 0.0 \\ 0.0 & 0.0 & 0.6 & 1.5 & 0.3 \\ 0.0 & 0.0 & 0.0 & 0.3 & 1.5 \end{pmatrix} \]
Visualize the Data: Since you cannot plot 5D data directly, you must reduce its dimensionality.
- Use a technique like Principal Component Analysis (PCA) to project the 5D data down to 2 dimensions.
- Create a scatter plot of this 2D representation, coloring the points by their class (A or B).
Analyze the Plots:
1. Based on your 2D projection, describe the relationship between the two classes.
2. Discuss the linear separability of the data. Explain why this type of data structure poses a challenge for simple linear models and would likely require a multi-layer neural network with non-linear activation functions to be classified accurately.
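One way to carry out the sampling and projection steps above is sketched here. It is an illustration under stated assumptions: PCA is done by hand with a NumPy SVD of the centered data (a library implementation such as scikit-learn's `PCA` would be an equivalent alternative), and the names `generate_5d`, `pca_project` and the seed are hypothetical choices.

```python
import numpy as np

# Means and covariance matrices copied from the instructions.
MU_A = np.zeros(5)
SIGMA_A = np.array([
    [1.0, 0.8, 0.1, 0.0, 0.0],
    [0.8, 1.0, 0.3, 0.0, 0.0],
    [0.1, 0.3, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.2],
    [0.0, 0.0, 0.0, 0.2, 1.0],
])
MU_B = np.full(5, 1.5)
SIGMA_B = np.array([
    [1.5, -0.7, 0.2, 0.0, 0.0],
    [-0.7, 1.5, 0.4, 0.0, 0.0],
    [0.2, 0.4, 1.5, 0.6, 0.0],
    [0.0, 0.0, 0.6, 1.5, 0.3],
    [0.0, 0.0, 0.0, 0.3, 1.5],
])


def generate_5d(n_per_class=500, seed=0):
    """Sample both classes and return features X and labels y (0 = A, 1 = B)."""
    rng = np.random.default_rng(seed)  # arbitrary seed, only for reproducibility
    xa = rng.multivariate_normal(MU_A, SIGMA_A, size=n_per_class)
    xb = rng.multivariate_normal(MU_B, SIGMA_B, size=n_per_class)
    X = np.vstack([xa, xb])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y


def pca_project(X, n_components=2):
    """Project X onto its top principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # fraction of total variance per component
    return Xc @ Vt[:n_components].T, explained[:n_components]


X, y = generate_5d()
Z, explained = pca_project(X)
```

`Z` holds the 2D coordinates to scatter-plot by class, and `explained` shows how much of the total variance the two retained components capture, which is worth reporting next to the plot.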

Exercise 3

Preparing Real-World Data for a Neural Network

This exercise uses a real dataset from Kaggle. Your task is to perform the necessary preprocessing to make it suitable for a neural network that uses the hyperbolic tangent (tanh) activation function in its hidden layers.

Instructions

Get the Data: Download the Spaceship Titanic dataset from Kaggle.
Describe the Data:
- Briefly describe the dataset's objective (i.e., what does the Transported column represent?).
- List the features and identify which are numerical (e.g., Age, RoomService) and which are categorical (e.g., HomePlanet, Destination).
- Investigate the dataset for missing values. Which columns have them, and how many?
Preprocess the Data: Your goal is to clean and transform the data so it can be fed into a neural network. The tanh activation function is centered at zero and produces outputs in the range (-1, 1), so your input data should be scaled appropriately for stable training.
- Handle Missing Data: Devise and implement a strategy to handle the missing values in all the affected columns. Justify your choices.
- Encode Categorical Features: Convert categorical columns like HomePlanet, CryoSleep, and Destination into a numerical format. One-hot encoding is a good choice.
- Normalize/Standardize Numerical Features: Scale the numerical columns (e.g., Age, RoomService, etc.). Since the tanh activation function is centered at zero and outputs values in (-1, 1), Standardization (to mean 0, std 1) or Normalization to a [-1, 1] range are excellent choices. Implement one and explain why it is a good practice for training neural networks with this activation function.
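The three preprocessing steps above could look roughly like this. The sketch runs on a handful of made-up rows rather than the Kaggle file, and it picks median imputation for numerical columns, mode imputation for categorical columns, and standardization. These are just one defensible combination, not the required one, and the names `NUMERIC`, `CATEGORICAL` and `preprocess` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the Kaggle file; every value here is made up for illustration.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", None, "Mars", "Earth", "Europa"],
    "CryoSleep": [False, True, False, None, True, False],
    "Destination": ["TRAPPIST-1e", None, "55 Cancri e",
                    "TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e"],
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0, 58.0],
    "RoomService": [0.0, 109.0, 43.0, np.nan, 0.0, 303.0],
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0, np.nan, 70.0],
})

NUMERIC = ["Age", "RoomService", "FoodCourt"]
CATEGORICAL = ["HomePlanet", "CryoSleep", "Destination"]


def preprocess(frame):
    """Impute, standardize, and one-hot encode a copy of frame."""
    out = frame.copy()

    # Numerical columns: fill gaps with the column median (assumed strategy).
    for col in NUMERIC:
        out[col] = out[col].fillna(out[col].median())

    # Categorical columns: fill gaps with the most frequent value (assumed strategy),
    # then cast to str so booleans and strings are encoded uniformly.
    for col in CATEGORICAL:
        most_common = out[col].mode().iloc[0]
        out[col] = out[col].where(out[col].notna(), most_common).astype(str)

    # Standardize to mean 0 and standard deviation 1 (population std, ddof=0).
    out[NUMERIC] = (out[NUMERIC] - out[NUMERIC].mean()) / out[NUMERIC].std(ddof=0)

    # One-hot encode every categorical column into 0.0/1.0 indicator columns.
    return pd.get_dummies(out, columns=CATEGORICAL, dtype=float)


X = preprocess(df)
```

With the real dataset, `df` would instead come from `pd.read_csv` on the downloaded file, and the column lists would need to cover every feature you decide to keep.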
Visualize the Results:
- Create histograms for one or two numerical features (like FoodCourt or Age) before and after scaling to show the effect of your transformation.
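A before/after comparison can be sketched as below. The `age` array is randomly generated stand-in data, not the real Age column, and `standardize` and `plot_before_after` are hypothetical helper names. Because standardization is an affine map, the histogram keeps essentially the same shape and only the horizontal axis is shifted and rescaled, which is the effect the plot should make visible.

```python
import numpy as np


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def plot_before_after(raw, scaled, name):
    """Draw side-by-side histograms of a feature before and after scaling."""
    # Imported here so the numeric part works without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, (ax_raw, ax_scaled) = plt.subplots(1, 2, figsize=(10, 4))
    ax_raw.hist(raw, bins=30)
    ax_raw.set_title(f"{name} (raw)")
    ax_raw.set_xlabel(name)
    ax_scaled.hist(scaled, bins=30)
    ax_scaled.set_title(f"{name} (standardized)")
    ax_scaled.set_xlabel(f"{name}, z-score")
    return fig


# Hypothetical stand-in for a numerical column such as Age.
rng = np.random.default_rng(1)
age = rng.normal(30, 12, size=1000).clip(0, None)
age_scaled = standardize(age)

# Bin counts for both versions, computed without plotting.
counts_raw, _ = np.histogram(age, bins=30)
counts_scaled, _ = np.histogram(age_scaled, bins=30)
```

`plot_before_after(age, age_scaled, "Age")` then produces the two-panel figure for the report.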

Evaluation Criteria

The deliverable for this activity consists of a report that includes:

A brief description of your approach to each exercise.
The code used to generate the datasets, preprocess the data, and create the visualizations, with comments explaining each step.
The plots and visualizations requested in each exercise.
Your analysis and answers to the questions posed in each exercise.

Important Notes:

The deliverable must be submitted in the specified format: GitHub Pages. No other formats will be accepted. A course template is available that you can use to create your GitHub Pages site;
There is a strict policy against plagiarism. Any form of plagiarism will result in a zero grade for the activity and may lead to further disciplinary actions as per the university's academic integrity policies;
The deadline for each activity will not be extended, and you are expected to complete each activity within the timeframe provided in the course schedule. NO EXCEPTIONS will be made for late submissions;
AI Collaboration is allowed, but each student MUST UNDERSTAND and be able to explain all parts of the code and analysis submitted. Any use of AI tools must be properly cited in your report. ORAL EXAMS may require you to explain your work in detail.
All deliverables for individual activities should be submitted through the course platform insper.blackboard.com.

Grade Criteria:

Exercise 1 (3 points):

Points	Criteria
1 pt	Data is generated correctly and visualized in a clear scatter plot with proper labels and colors.
2 pts	The analysis of class separability is accurate, and the proposed decision boundaries are logical and well-explained in the context of what a network would learn.

Exercise 2 (3 points):

Points	Criteria
1 pt	Data is generated correctly using the specified multivariate parameters.
1 pt	Dimensionality reduction is applied correctly, and the resulting 2D projection is clearly plotted.
1 pt	The analysis correctly identifies the non-linear relationship and explains why a neural network would be a suitable model.

Exercise 3 (4 points):

Points	Criteria
1 pt	The data is correctly loaded, and its characteristics are accurately described.
2 pts	All preprocessing steps (handling missing data, encoding, and appropriate feature scaling for `tanh`) are implemented correctly and with clear justification for a neural network context.
1 pt	Visualizations effectively demonstrate the impact of the data preprocessing.