Artificial Intelligence (AI) was founded as a field in 1956 at Dartmouth College. Interest in AI boomed until 1974, after which funding dried up through the 1980s. AI resurged around 2012 with the shift from rule-based systems to data-driven learning, the evolution of deep learning algorithms (neural networks, transformers), and increased computing power.
Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence (where ⊂ is the symbol for subset)
Artificial Intelligence (AI) is the broad concept of machines learning and simulating human intelligence.
Machine Learning (ML) is a subset of AI where machines learn from data without explicit programming.
Supervised learning: learns from labeled training data (inputs paired with known outputs).
Unsupervised learning: finds structure in unlabeled data, for example by clustering or grouping it.
Reinforcement learning: learns by interacting with the environment and receiving rewards, for example, a robot vacuum cleaner.
Deep Learning (DL) is a subset of ML that uses complex neural networks to learn from vast amounts of data. It uses supervised and unsupervised learning to train deep neural networks.
Artificial Intelligence
|----- Behavior
|      |----- Robotics
|----- Cognition and Learning
|      |----- Fuzzy Logic
|      |----- Planning
|      |----- Knowledge representation
|      |----- Reason
|      |----- Probability
|      |----- Machine Learning
|             |----- Supervised Learning
|             |      |----- Classification
|             |      |      |----- SVM
|             |      |      |----- Decision Tree
|             |      |      |----- AdaBoost
|             |      |      |----- Naive Bayes
|             |      |----- Regression
|             |      |      |----- Logistic Regression
|             |      |----- Prediction
|             |      |      |----- Random Forests
|             |      |      |----- Boosting
|             |      |      |----- Bagging
|             |      |----- Recommendation
|             |             |----- Collaborative Filtering
|             |----- Unsupervised Learning
|             |      |----- Clustering
|             |             |----- K-Means/K-Medoids
|             |----- Reinforcement Learning
|             |----- Deep Learning
|             |----- Transfer Learning
|----- Perception
       |----- Natural Language Processing (NLP)
       |      |----- Natural Language Understanding (NLU)
       |      |      |----- Speech recognition
       |      |      |----- Machine translation
       |      |      |----- Text summarization
       |      |      |----- Text classification
       |      |      |----- Text proofreading
       |      |      |----- Information extraction
       |      |----- Natural Language Generation (NLG)
       |             |----- Speech synthesis
       |----- Computer Vision
              |----- Image classification
              |----- Object detection
              |----- Target tracking
              |----- Image segmentation
Weak AI: specializes in single tasks, for example, most ML applications including ChatGPT, semi-autonomous robots and cars (Tesla FSD), etc.
Strong AI: matches human intelligence and can solve unseen problems; it does not exist yet outside of fiction, for example, the fully autonomous robots and cars in movies.
Super AI: super-intelligent, surpassing all human abilities; it does not exist yet and remains a hypothetical concept.
A neural network (NN) is a type of machine learning model inspired by the human brain, consisting of interconnected nodes arranged in layers that process data to find patterns and make predictions.
NN has an input layer, one or more hidden layers, and an output layer, where the connections between nodes have varying strengths (weights) that are adjusted during a learning process to minimize errors and improve accuracy.
Input layer: receives raw data.
Hidden layers: perform computations and extract features using weights and activation functions.
Output layer: Produces predictions or classifications based on learned patterns.
The "deep" in Deep Learning refers to the depth of layers in a neural network, which consists of 3 or more hidden layers.
Forward propagation moves data through the layers to generate an output. Backpropagation, short for "backward propagation of errors," adjusts weights based on errors to improve accuracy. It trains neural networks by minimizing the difference between predicted and actual outputs. It works by propagating errors backward through the network, using the chain rule of calculus to compute gradients, and then iteratively updating the weights and biases.
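The forward and backward passes can be sketched in a few lines of numpy. This is a toy 2-4-1 network trained on XOR with squared-error loss; the sizes, seed, and learning rate are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)            # 2 -> 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)            # 4 -> 1 output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # forward propagation: data flows input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(np.mean((out - y) ** 2))
    # backpropagation: the chain rule pushes the error backward
    d_out = (out - y) * out * (1 - out)      # error at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated to hidden layer
    # gradient-descent updates to weights and biases
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)
```

With enough iterations, the loss shrinks as the weights adjust to the errors.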
An activation function is a mathematical function in a neural network that determines a neuron's (node's) output based on its input. The neuron multiplies its inputs by weights, adds a bias, applies the activation function, and passes the result to the next layer.
The value produced by the activation function is the neuron's "output score." This score can vary depending on the type of activation function used:
Sigmoid: Outputs values between 0 and 1, often interpreted as probabilities in binary classification.
Tanh (Hyperbolic Tangent): Outputs values between -1 and 1, providing a zero-centered output.
ReLU (Rectified Linear Unit): Outputs 0 for negative inputs and the input value itself for positive inputs, ranging from 0 to infinity.
Softmax: Used in the output layer for multi-class classification, converting raw scores into a probability distribution where the sum of all outputs equals 1.
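The four functions above can be sketched in numpy; the implementations are standard and the sample input is arbitrary.

```python
import numpy as np

def sigmoid(z):                    # squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                       # zero-centered output in (-1, 1)
    return np.tanh(z)

def relu(z):                       # 0 for negatives, identity for positives
    return np.maximum(0.0, z)

def softmax(z):                    # probability distribution summing to 1
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
probs = softmax(z)                 # three class probabilities, summing to 1
```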
Score is the loss value during training or the confidence score of a prediction for new data. The loss score indicates the model's error (lower is better) and is a result of the loss function. Confidence scores, often derived from a final layer like a softmax function, represent the model's certainty for a specific prediction, typically in the range of 0 to 1.
Features, also known as attributes, are the individual characteristics or pieces of information that describe the input data. These are the raw inputs fed into the neural network.
Model parameters are the internal variables of a model. They are learned or estimated purely from the data during training; every ML algorithm has mechanisms to optimize these parameters.
In a simple Linear Regression model, y=ax+b, the variables a and b are the parameters of the model.
In a Neural Network model, the weights and biases are the parameters of the model.
In a Clustering model, the centroids of the clusters are the parameters of the model.
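The linear regression case can be checked in a couple of lines: np.polyfit is used here as one convenient way to estimate the parameters a and b from synthetic data generated with a = 2, b = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)  # noisy y = 2x + 1

a, b = np.polyfit(x, y, deg=1)   # the learned parameters of the model
```

The features here are the x values; a and b are what training estimates.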
Classes apply only to classification tasks, where we want to learn a mapping function from our input features to discrete output variables. These output variables are referred to as classes (or labels).
In a grad school application scenario:
features are GPAs, recommendation letters, test scores, essay, publication, etc.
parameters are admission requirements set in the model
classes are admitted (yes) or rejected (no).
DNNs identify complex data patterns using interconnected layers of nodes, working with diverse inputs and adapting to unpredictable data (e.g., stock prices).
A DNN is simply an ANN with multiple hidden layers—“deep” meaning more layers stacked between the input and output. A shallow NN learns simple patterns, a deep NN learns complex patterns by combining simple ones across many layers. The depth allows the network to learn complex patterns, hierarchies, and abstractions in data that shallow networks cannot.
Applications:
Computer vision
Self-driving cars
Speech recognition
Natural language processing
Fraud detection
Recommendation systems
Medical imaging
Game-playing agents (AlphaGo).
Limitations:
Require large amounts of data
Computationally expensive
Can be hard to interpret (“black box”)
Risk of overfitting if not trained carefully
A DNN isn’t one specific model. It’s a family of deep architectures, including:
CNNs (Convolutional NNs) → images
RNNs (Recurrent NNs) → sequences
Transformers → language & multimodal AI
Autoencoders → compression & generation
GANs → synthetic data generation
Designed for sequential or time-dependent data, information where order matters.
They have loops (connections loop back on themselves), allowing the model to remember previous inputs while processing the current one. Because of this loop, RNNs can understand: Time sequences, ordered data, context that develops over time.
Typical neural networks process each input independently. RNNs are different — they maintain a hidden state, which is updated at every step. This lets the model keep track of what happened before.
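The hidden-state recurrence can be sketched as a single update rule; the sizes and random weights here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(4, 4))   # hidden -> hidden (the "loop")
b_h = np.zeros(4)

def rnn_step(x_t, h_prev):
    # the new state depends on the current input AND the previous state
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(4)                              # empty memory at the start
sequence = rng.normal(size=(5, 3))           # 5 time steps, 3 features each
for x_t in sequence:
    h = rnn_step(x_t, h)                     # hidden state updated every step
```

After the loop, h summarizes the whole sequence, which is what lets an RNN use context that develops over time.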
Applications:
Time-series forecasting
Speech recognition
Text generation
Language modeling
Machine translation
Music generation.
Limitations: RNNs struggle with -
Long-term dependencies
Vanishing gradients (the memory fades too quickly)
Slow training
Designed to process grid-like data, most commonly images.
Extremely powerful for tasks involving visual inputs because they can automatically learn spatial patterns like edges, textures, shapes, and objects - directly from the raw pixels.
Applications:
Image classification
Object detection (e.g., self-driving cars)
Face recognition
Medical image analysis
Video analysis
Image segmentation
Style transfer / AI art
Transformers are a neural network architecture that learns to understand and generate human-like text by analyzing patterns in large amounts of text data. Unlike traditional models, transformers do not process data sequentially; they analyze all words at once using self-attention. They are used for machine learning tasks particularly in natural language processing (NLP) and computer vision.
A transformer model architecture consists of two main components, which work together to capture long-term dependencies and improve translation accuracy: the encoder and the decoder. The encoder extracts features from the given input data, creating a numerical, context-aware representation of the input. It uses bidirectional self-attention, meaning each token can attend to all other tokens in the input sequence. The decoder uses the contextual representation produced by the encoder to generate an output sequence, typically one token at a time; it uses a masked self-attention layer, so each token can only attend to previous tokens in the output sequence, not future ones. The encoder is for understanding the input, while the decoder is for producing the output.
Attention mechanism enables transformers to process and understand sequences without relying on recurrent or convolutional structures. It helps transformers to capture relationships between distant elements in a long sequence of data and focus on the most important parts of input data when making predictions.
A differentiable attention method.
The model assigns weights (via softmax) to all input positions.
Key idea:
Looks at everything, but looks at some things more than others.
Pros:
Trainable end-to-end with gradient descent.
Most commonly used (Transformers, seq2seq).
Use case:
Widely used in NLP tasks, text translation, summarization.
Selects one part (or a small subset) of the input.
Non-differentiable - uses sampling, not smooth weights.
It is trained using reinforcement learning.
Key idea:
The model "chooses" a single location rather than blending all.
Training:
Requires reinforcement learning.
Use case:
Image captioning
Object detection
Enables each input element to attend to other elements in the same sequence.
Attends to different positions of the input sequence.
Uses multiple attention heads to capture diverse features from different representation subspaces.
See "Attention Architecture" paragraph below.
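A minimal numpy sketch of single-head scaled dot-product self-attention; multi-head attention runs several of these in parallel with different projection matrices. The shapes and random weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # every token produces a query, key, and value vector
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between all positions
    weights = softmax(scores)           # soft attention: each row sums to 1
    return weights @ V, weights         # blend values by attention weight

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))             # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` shows how much one token "looks at" every token in the sequence, including itself.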
A modern type of deep neural network designed for document classification. It captures the hierarchical structure of documents by applying attention at both the word level and the sentence level, incorporating bidirectional encoders and attention mechanisms for classification.
Uses a feed-forward neural network to calculate attention scores instead of dot products.
BERT = Bidirectional Encoder Representations from Transformers.
BERT is a language model developed by Google that excels at understanding word relationships and context. It achieves this through pre-training on two unsupervised tasks: masked language modeling (predicting masked words) and next sentence prediction. This enables it to be fine-tuned for a wide variety of natural language processing (NLP) tasks, such as search, question answering, and text classification.
Unlike traditional models that process text from left to right, BERT considers the context of a word from both directions (left and right) simultaneously (Bidirectional training).
During pre-training, BERT masks a percentage of words in a sentence and learns to predict the original words based on the surrounding context (Masked Language Model or MLM).
BERT is also trained to predict whether a second sentence logically follows a first sentence (Next Sentence Prediction or NSP), which helps it understand sentence relationships.
BERT is based on the Transformer architecture, which uses self-attention to weigh the importance of different words in a sentence relative to each other.
Input text is first preprocessed (mostly light cleaning, such as removing extra spaces or converting characters). Then it is tokenized, which means the text is broken into tokens and converted into numerical data. These numerical data are passed into an embedding layer, which turns tokens into meaningful vectors—basically, it gives words meaning in a mathematical form the model can understand. The processed numerical data then flows through the model’s neural layers to generate predictions or text outputs.
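That pipeline can be sketched with a toy example; the vocabulary, tokenizer, and embedding size are made up for illustration (real models use subword tokenizers and learned embeddings).

```python
import numpy as np

vocab = {"ai": 0, "is": 1, "smart": 2, "<unk>": 3}
# the embedding layer: one 5-dim vector per vocabulary entry
embedding = np.random.default_rng(0).normal(size=(len(vocab), 5))

def tokenize(text):
    # light cleaning plus whitespace tokenization
    return text.lower().strip().split()

def encode(tokens):
    # tokens -> numerical ids, unknown words map to <unk>
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

tokens = tokenize("AI is smart")    # broken into discrete units
ids = encode(tokens)                # converted to numerical data
vectors = embedding[ids]            # embedding lookup: meaning as vectors
```

The `vectors` array is what flows into the model's neural layers.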
Generative AI (GenAI) is a subset of AI that focuses on creating models capable of generating new content (e.g. text, image, audio, video, etc.). GenAI can help with the following:
Creativity and innovation: creates novel content, empowering content creators and designers with art, design and product development.
Automation and efficiency: automates various tasks, enhances efficiency, saves time, resources and cost; excels at content generation and data analysis in business, finance and research.
Personalization and problem solving: tailors content to individual preferences, enhances user experience in different use cases, assists in complex problem solving, learns and adapts refining its capabilities over time.
Industry applications: enrichment in scripts, music, visual actors (entertainment), assisting in drug discovery (healthcare), crafting personalized ads and contents (marketing), creating unique digital arts and designs (art), language translations (language), risk assessment, fraud detection, algorithmic trading (finance), etc.
Generative AI:
Function: Creates new content like text, code, or images based on a prompt.
Behavior: Reactive and operates in a request-response cycle. It doesn't have its own goals or adapt dynamically to its environment.
Example: ChatGPT generating a poem or a user asking an image generator to create a picture.
Agentic AI:
Function: Manages dynamic and complex tasks by planning, executing, and adapting its actions to achieve a specific objective.
Behavior: Proactive and autonomous. It can handle unpredictable situations and use reasoning to make decisions to achieve a goal.
Example: An AI system that autonomously identifies what needs testing, creates a test strategy, executes tests, and self-heals when the application changes.
AI Agents:
Function: A type of autonomous system designed to perform specific tasks and workflows with a high degree of autonomy.
Behavior: Proactive and goal-oriented, but can range in complexity. They are often specialized in their domain and can be more limited than a full agentic system unless powered by an agentic core.
Example: A customer support agent that can authenticate users, access account information, and process transactions without human intervention.
How they work together:
Agentic AI systems often use Generative AI as a component to perform specific parts of a task, such as generating a summary of a conversation.
An agentic system can use generative AI to create a customer-facing response while it handles the underlying business logic and decision-making process.
NLP is the capability of machines to understand and generate human languages.
It is used for -
Machine translation
Speech recognition
Sentiment analysis
Rule-based systems: earlier form of NLP, relies on predefined linguistic rules.
Statistical models: Markov chains and n-grams, which utilize statistical methods to predict language patterns.
Deep learning based models: recurrent neural networks and transformers, which utilize deep learning techniques to enhance language processing capabilities.
A language model is a machine learning model that learns to predict and generate sequences of language.
Large Language Models (LLMs) are trained on large data sets and can generate human-like text, images, etc.
Pretrained LLMs are available in the market for GenAI solutions.
Converts unstructured text into a structured format that a machine can process.
Breaks down text into smaller, discrete units like words, sub-words, or characters. For example, the sentence "AI is smart" can be tokenized into ["AI", "is", "smart"].
Maps each token (word or subword) into a numerical vector that represents its semantic meaning (the similarity in meaning). These vectors allow the model to understand relationships between words based on their positions in high-dimensional space, for example, the vector for "dog" is closer to the vector for "puppy" than to "banana".
Adds information about the position of tokens in a sequence to their input embeddings.
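One common scheme is the sinusoidal positional encoding used in the original Transformer: each position gets a unique vector of sines and cosines that is added to the token embedding. A sketch, with illustrative sizes:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]            # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=10, d_model=16)  # one row per position
```

Because every position gets a distinct pattern, the model can tell token order apart even though it processes all tokens at once.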
Analyzes and transforms the input sequence into a contextualized representation, thus creating memories to remember what it has read.
Decoder: Generates an output sequence based on the encoder-generated contextualized representation and previously generated tokens.
Self-Attention Mechanism: Allows the model to weigh the importance of different tokens in the input sequence when processing a specific token.
Multi-Head Attention: A variation where attention is calculated multiple times in parallel, enabling the model to focus on different relationships (e.g., syntax, semantics) simultaneously.
Feed-Forward Networks: Applied to each token's representation after the attention mechanism to perform further processing and capture non-linear patterns.
Layer Normalization: keeps everything in check and makes sure the network learns well by normalizing its intermediate representations.
Data collection
Data cleaning and pre-processing
Corpus preparation
Tokenization
Neural network training
Embedding generation
Commonly available LLMs in the market:
GPT 3.5/4
PaLM
Next-generation LLM developed by Google AI.
It can process both texts and image inputs.
Claude
Developed by AI research company Anthropic.
Performs better, provides
Cohere
Developed by Cohere Technologies.
Manages various tasks and can be fine-tuned for specific areas.
Falcon
Developed by Technology Innovation Institute (TII) in UAE.
Performance stands out, boasting
LLaMA
Developed by Meta.
Outstanding performance in various NLP tasks
Understands context rich information, enhancing its effectiveness in complex tasks.
Might unintentionally produce biased or inaccurate content.
Diverse reasoning
LLM explores varied reasonings, including common sense, math, adapting to diverse contexts.
Eliciting reasoning
Methods like chain-of-thought prompting guide LLMs to produce step-by-step reasoning.
Reasoning contribution enigma
The challenge lies in understanding reasoning's role and impact, and differentiating it from recalled factual information.
Performance, architecture and computational requirements
impact of context length and model size, practical factors for inference speed and precision
Scalability and performance
Task specific vs general purpose
Deployment cost
Licensing and commercial use
Data security and privacy
Non-technical aspects like ethics and biases
An autoregressive LLM capable of generating text in 46 natural languages and 13 programming languages.
Unsupervised learning models
Used for dimensionality reduction, data compression and feature extraction
Consists of 3 main components
Encoder, which maps input data to lower dimensional representation
Latent space, where data is in its most compressed form
Decoder, which maps the lower-dimensional representation back to the original input data
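The three components can be sketched as a minimal linear autoencoder trained by gradient descent; the toy data, sizes, and learning rate are illustrative (real autoencoders use nonlinear deep networks).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                # toy data: 200 samples, 8 features

W_enc = rng.normal(scale=0.1, size=(8, 3))   # encoder: 8 -> 3 dimensions
W_dec = rng.normal(scale=0.1, size=(3, 8))   # decoder: 3 -> 8 dimensions

for _ in range(500):
    Z = X @ W_enc                # latent space: the most compressed form
    X_hat = Z @ W_dec            # reconstruction of the original input
    err = X_hat - X
    # gradient descent on the reconstruction error
    W_dec -= 0.001 * Z.T @ err
    W_enc -= 0.001 * X.T @ (err @ W_dec.T)
```

Training shrinks the reconstruction error, which is how the 3-dim latent space is forced to keep the most useful features.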
Challenges
Data generation
Robustness in learning
Handling variability
Unsupervised learning models
AE with generative capabilities
VAE Architecture
Roles of VAE Components
Encoder
Latent space
Decoder
Reconstruction loss
KL Divergence term
VAE Generative Training Process
Example code:
VAE with TensorFlow for Image Generation
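A framework-free numpy sketch of the two VAE-specific pieces, the reparameterization trick and the combined loss (reconstruction loss + KL divergence term); the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # sample z = mu + sigma * eps, so gradients can flow through mu and log_var
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    recon = np.sum((x - x_hat) ** 2)                           # reconstruction loss
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))  # KL divergence term
    return recon + kl

mu, log_var = np.zeros(3), np.zeros(3)   # encoder outputs for one sample
z = reparameterize(mu, log_var)          # a point in the latent space
```

The KL term pushes the latent distribution toward a standard normal, which is what gives the VAE its generative capability: sampling z from N(0, 1) and decoding it yields new data.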
Example tool: StyleGAN
Unsupervised learning models
Known for producing high-quality, sharp, and realistic images of things that do not exist.
Uses a generator and a discriminator in an adversarial, zero-sum game during training to create samples with sharp and intricate features.
Excel at capturing high-frequency details, unlike VAEs, which tend to produce blurry outputs.
Use cases
Virtual clothing try-on
Customized shopping
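The adversarial, zero-sum game can be caricatured in numpy: the "data" are samples from N(4, 1), the generator is a single shift parameter applied to noise, and the discriminator is a logistic regression D(x) = sigmoid(w*x + b). This is a one-parameter illustration of the feedback loop, not a practical GAN; the learning rates and step counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta_g = 0.0          # generator parameter: shifts noise toward the data
w, b = 0.0, 0.0        # discriminator parameters

for _ in range(4000):
    real = rng.normal(4.0, 1.0, size=64)
    fake = rng.normal(0.0, 1.0, size=64) + theta_g

    # discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += 0.1 * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    b += 0.1 * (np.mean(1 - d_real) - np.mean(d_fake))

    # generator step: ascend log D(fake), i.e. try to fool the discriminator
    d_fake = sigmoid(w * fake + b)
    theta_g += 0.01 * np.mean((1 - d_fake) * w)
```

As training proceeds, theta_g drifts toward the real mean (4), so generated samples become indistinguishable from real ones, which is the point of the zero-sum game.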