Artificial Intelligence (AI) was founded as a field in 1956 at Dartmouth College. Interest in AI boomed until 1974, after which funding dried up through the 1980s. AI resurged around 2012 with the shift from rule-based systems to data-driven learning, the evolution of deep learning algorithms (neural networks, transformers), and increased computing power.
Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence (where ⊂ is the symbol for subset)
Artificial Intelligence (AI) is the broad concept of machines learning and simulating human intelligence.
Machine Learning (ML) is a subset of AI where machines learn from data without explicit programming.
Supervised learning: learns from labeled training data (inputs paired with known outputs).
Unsupervised learning: finds structure in unlabeled data, for example by clustering or grouping it.
Reinforcement learning: learns by interacting with the environment and receiving rewards, for example, a robot vacuum cleaner.
Deep Learning (DL) is a subset of ML that uses complex neural networks to learn from vast amounts of data. It uses supervised and unsupervised learning to train deep neural networks.
Artificial Intelligence
|----- Behavior
|      |----- Robotics
|----- Cognition and Learning
|      |----- Fuzzy Logic
|      |----- Planning
|      |----- Knowledge representation
|      |----- Reason
|      |----- Probability
|      |----- Machine Learning
|             |----- Supervised Learning
|             |      |----- Classification
|             |      |      |----- SVM
|             |      |      |----- Decision Tree
|             |      |      |----- AdaBoost
|             |      |      |----- Naive Bayes
|             |      |----- Regression
|             |      |      |----- Logistic Regression
|             |      |----- Prediction
|             |      |      |----- Random Forests
|             |      |      |----- Boosting
|             |      |      |----- Bagging
|             |      |----- Recommendation
|             |             |----- Collaborative Filtering
|             |----- Unsupervised Learning
|             |      |----- Clustering
|             |             |----- K-Means/K-Medoids
|             |----- Reinforcement Learning
|             |----- Deep Learning
|             |----- Transfer Learning
|----- Perception
       |----- Natural Language Processing (NLP)
       |      |----- Natural Language Understanding (NLU)
       |      |      |----- Speech recognition
       |      |      |----- Machine translation
       |      |      |----- Text summarization
       |      |      |----- Text classification
       |      |      |----- Text proofreading
       |      |      |----- Information extraction
       |      |----- Natural Language Generation (NLG)
       |             |----- Speech synthesis
       |----- Computer Vision
              |----- Image classification
              |----- Object detection
              |----- Target tracking
              |----- Image segmentation
Weak AI: specializes in single tasks, for example, most ML applications including ChatGPT, semi-autonomous robots and cars (Tesla FSD), etc.
Strong AI: matches human intelligence and can solve unseen problems; it does not exist yet outside of fiction, for example, the fully autonomous robots and cars in movies.
Super AI: super-intelligent, surpassing all human abilities; it does not exist yet and remains a hypothetical concept.
A neural network (NN) is a type of machine learning model inspired by the human brain, consisting of interconnected nodes arranged in layers that process data to find patterns and make predictions.
NN has an input layer, one or more hidden layers, and an output layer, where the connections between nodes have varying strengths (weights) that are adjusted during a learning process to minimize errors and improve accuracy.
Input layer: receives raw data.
Hidden layers: perform computations and extract features using weights and activation functions.
Output layer: Produces predictions or classifications based on learned patterns.
The "deep" in Deep Learning refers to the depth of layers in a neural network, which consists of 3 or more hidden layers.
Forward propagation moves data through the layers to generate an output. Backpropagation, short for "backward propagation of errors," adjusts weights based on errors to improve accuracy. It trains neural networks by minimizing the difference between predicted and actual outputs. It works by propagating errors backward through the network, using the chain rule of calculus to compute gradients, and then iteratively updating the weights and biases.
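The forward and backward passes can be sketched in a few lines of numpy. This is a toy 2-4-1 network trained on XOR with squared-error loss; the sizes, seed, and learning rate are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)            # 2 -> 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)            # 4 -> 1 output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # forward propagation: data flows input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(np.mean((out - y) ** 2))
    # backpropagation: the chain rule pushes the error backward
    d_out = (out - y) * out * (1 - out)      # error at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated to hidden layer
    # gradient-descent updates to weights and biases
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)
```

With enough iterations, the loss shrinks as the weights adjust to the errors.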
An activation function is a mathematical function in a neural network that determines a neuron's (node's) output based on its input. The neuron multiplies its inputs by weights, adds a bias, applies the activation function, and passes the result to the next layer.
The value produced by the activation function is the neuron's "output score." This score can vary depending on the type of activation function used:
Sigmoid: Outputs values between 0 and 1, often interpreted as probabilities in binary classification.
Tanh (Hyperbolic Tangent): Outputs values between -1 and 1, providing a zero-centered output.
ReLU (Rectified Linear Unit): Outputs 0 for negative inputs and the input value itself for positive inputs, ranging from 0 to infinity.
Softmax: Used in the output layer for multi-class classification, converting raw scores into a probability distribution where the sum of all outputs equals 1.
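The four functions above can be sketched in numpy; the implementations are standard and the sample input is arbitrary.

```python
import numpy as np

def sigmoid(z):                    # squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                       # zero-centered output in (-1, 1)
    return np.tanh(z)

def relu(z):                       # 0 for negatives, identity for positives
    return np.maximum(0.0, z)

def softmax(z):                    # probability distribution summing to 1
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
probs = softmax(z)                 # three class probabilities, summing to 1
```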
Score is the loss value during training or the confidence score of a prediction for new data. The loss score indicates the model's error (lower is better) and is a result of the loss function. Confidence scores, often derived from a final layer like a softmax function, represent the model's certainty for a specific prediction, typically in the range of 0 to 1.
Features, also known as attributes, are the individual characteristics or pieces of information that describe the input data. These are the raw inputs fed into the neural network.
Model parameters are the internal variables of a model. They are learned or estimated purely from the data during training; every ML algorithm has mechanisms to optimize these parameters.
In a simple Linear Regression model, y=ax+b, the variables a and b are the parameters of the model.
In a Neural Network model, the weights and biases are the parameters of the model.
In a Clustering model, the centroids of the clusters are the parameters of the model.
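The linear regression case can be checked in a couple of lines: np.polyfit is used here as one convenient way to estimate the parameters a and b from synthetic data generated with a = 2, b = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)  # noisy y = 2x + 1

a, b = np.polyfit(x, y, deg=1)   # the learned parameters of the model
```

The features here are the x values; a and b are what training estimates.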
Classes apply only to classification tasks, where we want to learn a mapping function from our input features to discrete output variables. These output variables are referred to as classes (or labels).
In a grad school application scenario:
features are GPAs, recommendation letters, test scores, essay, publication, etc.
parameters are admission requirements set in the model
classes are admitted (yes) or rejected (no).
DNNs identify complex data patterns using interconnected layers of nodes, working with diverse inputs and adapting to unpredictable data (e.g., stock prices).
A DNN is simply an ANN with multiple hidden layers—“deep” meaning more layers stacked between the input and output. A shallow NN learns simple patterns, a deep NN learns complex patterns by combining simple ones across many layers. The depth allows the network to learn complex patterns, hierarchies, and abstractions in data that shallow networks cannot.
Applications:
Computer vision
Self-driving cars
Speech recognition
Natural language processing
Fraud detection
Recommendation systems
Medical imaging
Game-playing agents (AlphaGo).
Limitations:
Require large amounts of data
Computationally expensive
Can be hard to interpret (“black box”)
Risk of overfitting if not trained carefully
A DNN isn’t one specific model. It’s a family of deep architectures, including:
CNNs (Convolutional NNs) → images
RNNs (Recurrent NNs) → sequences
Transformers → language & multimodal AI
Autoencoders → compression & generation
GANs → synthetic data generation
Designed for sequential or time-dependent data, information where order matters.
They have loops (connections loop back on themselves), allowing the model to remember previous inputs while processing the current one. Because of this loop, RNNs can understand: Time sequences, ordered data, context that develops over time.
Typical neural networks process each input independently. RNNs are different — they maintain a hidden state, which is updated at every step. This lets the model keep track of what happened before.
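The hidden-state recurrence can be sketched as a single update rule; the sizes and random weights here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(4, 4))   # hidden -> hidden (the "loop")
b_h = np.zeros(4)

def rnn_step(x_t, h_prev):
    # the new state depends on the current input AND the previous state
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(4)                              # empty memory at the start
sequence = rng.normal(size=(5, 3))           # 5 time steps, 3 features each
for x_t in sequence:
    h = rnn_step(x_t, h)                     # hidden state updated every step
```

After the loop, h summarizes the whole sequence, which is what lets an RNN use context that develops over time.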
Applications:
Time-series forecasting
Speech recognition
Text generation
Language modeling
Machine translation
Music generation.
Limitations: RNNs struggle with -
Long-term dependencies
Vanishing gradients (the memory fades too quickly)
Slow training
Designed to process grid-like data, most commonly images.
Extremely powerful for tasks involving visual inputs because they can automatically learn spatial patterns like edges, textures, shapes, and objects - directly from the raw pixels.
Applications:
Image classification
Object detection (e.g., self-driving cars)
Face recognition
Medical image analysis
Video analysis
Image segmentation
Style transfer / AI art
Transformers are a neural network architecture that learns to understand and generate human-like text by analyzing patterns in large amounts of text data. Unlike traditional models, transformers do not process data sequentially; they analyze all words at once using self-attention. They are used for machine learning tasks particularly in natural language processing (NLP) and computer vision.
A transformer model architecture consists of two main components, which work together to capture long-term dependencies and improve translation accuracy: the encoder and the decoder. The encoder extracts features from the given input data, creating a numerical, context-aware representation of the input. It uses bidirectional self-attention, meaning each token can attend to all other tokens in the input sequence. The decoder uses the contextual representation produced by the encoder to generate an output sequence, typically one token at a time; it uses a masked self-attention layer, so each token can only attend to previous tokens in the output sequence, not future ones. The encoder is for understanding the input, while the decoder is for producing the output.
Attention mechanism enables transformers to process and understand sequences without relying on recurrent or convolutional structures. It helps transformers to capture relationships between distant elements in a long sequence of data and focus on the most important parts of input data when making predictions.
A differentiable attention method.
The model assigns weights (via softmax) to all input positions.
Key idea:
Looks at everything, but looks at some things more than others.
Pros:
Trainable end-to-end with gradient descent.
Most commonly used (Transformers, seq2seq).
Use case:
Widely used in NLP tasks, text translation, summarization.
Selects one part (or a small subset) of the input.
Non-differentiable - uses sampling, not smooth weights.
It is trained using reinforcement learning.
Key idea:
The model "chooses" a single location rather than blending all.
Training:
Requires reinforcement learning.
Use case:
Image captioning
Object detection
Enables each input element to attend to other elements in the same sequence.
Attends to different positions of the input sequence.
Uses multiple attention heads to capture diverse features from different representation subspaces.
See "Attention Architecture" paragraph below.
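A minimal numpy sketch of single-head scaled dot-product self-attention; multi-head attention runs several of these in parallel with different projection matrices. The shapes and random weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # every token produces a query, key, and value vector
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between all positions
    weights = softmax(scores)           # soft attention: each row sums to 1
    return weights @ V, weights         # blend values by attention weight

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))             # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` shows how much one token "looks at" every token in the sequence, including itself.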
A modern type of deep neural network designed for document classification. It captures the hierarchical structure of documents by applying attention at both the word level and the sentence level, incorporating bidirectional encoders and attention mechanisms for classification.
Uses a feed-forward neural network to calculate attention scores instead of dot products.
BERT = Bidirectional Encoder Representations from Transformers.
BERT is a language model developed by Google that excels at understanding word relationships and context. It achieves this through pre-training on two unsupervised tasks: masked language modeling (predicting masked words) and next sentence prediction. This enables it to be fine-tuned for a wide variety of natural language processing (NLP) tasks, such as search, question answering, and text classification.
Unlike traditional models that process text from left to right, BERT considers the context of a word from both directions (left and right) simultaneously (Bidirectional training).
During pre-training, BERT masks a percentage of words in a sentence and learns to predict the original words based on the surrounding context (Masked Language Model or MLM).
BERT is also trained to predict whether a second sentence logically follows a first sentence (Next Sentence Prediction or NSP), which helps it understand sentence relationships.
BERT is based on the Transformer architecture, which uses self-attention to weigh the importance of different words in a sentence relative to each other.
Input text is first preprocessed (mostly light cleaning, such as removing extra spaces or converting characters). Then it is tokenized, which means the text is broken into tokens and converted into numerical data. These numerical data are passed into an embedding layer, which turns tokens into meaningful vectors—basically, it gives words meaning in a mathematical form the model can understand. The processed numerical data then flows through the model’s neural layers to generate predictions or text outputs.
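That pipeline can be sketched with a toy example; the vocabulary, tokenizer, and embedding size are made up for illustration (real models use subword tokenizers and learned embeddings).

```python
import numpy as np

vocab = {"ai": 0, "is": 1, "smart": 2, "<unk>": 3}
# the embedding layer: one 5-dim vector per vocabulary entry
embedding = np.random.default_rng(0).normal(size=(len(vocab), 5))

def tokenize(text):
    # light cleaning plus whitespace tokenization
    return text.lower().strip().split()

def encode(tokens):
    # tokens -> numerical ids, unknown words map to <unk>
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

tokens = tokenize("AI is smart")    # broken into discrete units
ids = encode(tokens)                # converted to numerical data
vectors = embedding[ids]            # embedding lookup: meaning as vectors
```

The `vectors` array is what flows into the model's neural layers.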
Generative AI (GenAI) is a subset of AI that focuses on creating models capable of generating new content (e.g. text, image, audio, video, etc.). GenAI can help with the following:
Creativity and innovation: creates novel content, empowering content creators and designers with art, design and product development.
Automation and efficiency: automates various tasks, enhances efficiency, saves time, resources and cost; excels at content generation and data analysis in business, finance and research.
Personalization and problem solving: tailors content to individual preferences, enhances user experience in different use cases, assists in complex problem solving, learns and adapts refining its capabilities over time.
Industry applications: enrichment in scripts, music, visual actors (entertainment), assisting in drug discovery (healthcare), crafting personalized ads and contents (marketing), creating unique digital arts and designs (art), language translations (language), risk assessment, fraud detection, algorithmic trading (finance), etc.
Generative AI:
Function: Creates new content like text, code, or images based on a prompt.
Behavior: Reactive and operates in a request-response cycle. It doesn't have its own goals or adapt dynamically to its environment.
Example: ChatGPT generating a poem or a user asking an image generator to create a picture.
Agentic AI:
Function: Manages dynamic and complex tasks by planning, executing, and adapting its actions to achieve a specific objective.
Behavior: Proactive and autonomous. It can handle unpredictable situations and use reasoning to make decisions to achieve a goal.
Example: An AI system that autonomously identifies what needs testing, creates a test strategy, executes tests, and self-heals when the application changes.
AI Agents:
Function: A type of autonomous system designed to perform specific tasks and workflows with a high degree of autonomy.
Behavior: Proactive and goal-oriented, but can range in complexity. They are often specialized in their domain and can be more limited than a full agentic system unless powered by an agentic core.
Example: A customer support agent that can authenticate users, access account information, and process transactions without human intervention.
How they work together:
Agentic AI systems often use Generative AI as a component to perform specific parts of a task, such as generating a summary of a conversation.
An agentic system can use generative AI to create a customer-facing response while it handles the underlying business logic and decision-making process.
NLP is the capability of machines to understand and generate human languages.
It is used for -
Machine translation
Speech recognition
Sentiment analysis
Rule-based systems: earlier form of NLP, relies on predefined linguistic rules.
Statistical models: Markov chains and n-grams, which utilize statistical methods to predict language patterns.
Deep learning based models: recurrent neural networks and transformers, which utilize deep learning techniques to enhance language processing capabilities.
A language model is a machine learning model that learns to predict and generate sequences of language.
Large Language Models (LLMs) are trained on large data sets and can generate human-like text, images, etc.
Pretrained LLMs are available in the market for GenAI solutions.
Converts unstructured text into a structured format that a machine can process.
Breaks down text into smaller, discrete units like words, sub-words, or characters. For example, the sentence "AI is smart" can be tokenized into ["AI", "is", "smart"].
Maps each token (word or subword) into a numerical vector that represents its semantic meaning (the similarity in meaning). These vectors allow the model to understand relationships between words based on their positions in high-dimensional space, for example, the vector for "dog" is closer to the vector for "puppy" than to "banana".
Adds information about the position of tokens in a sequence to their input embeddings.
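One common scheme is the sinusoidal positional encoding used in the original Transformer: each position gets a unique vector of sines and cosines that is added to the token embedding. A sketch, with illustrative sizes:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]            # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=10, d_model=16)  # one row per position
```

Because every position gets a distinct pattern, the model can tell token order apart even though it processes all tokens at once.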
Analyzes and transforms the input sequence into a contextualized representation, thus creating memories to remember what it has read.
Decoder: Generates an output sequence based on the encoder-generated contextualized representation and previously generated tokens.
Self-Attention Mechanism: Allows the model to weigh the importance of different tokens in the input sequence when processing a specific token.
Multi-Head Attention: A variation where attention is calculated multiple times in parallel, enabling the model to focus on different relationships (e.g., syntax, semantics) simultaneously.
Feed-Forward Networks: Applied to each token's representation after the attention mechanism to perform further processing and capture non-linear patterns.
Layer Normalization: keeps everything in check and makes sure the network learns well by normalizing its intermediate representations.
Data collection
Data cleaning and pre-processing
Corpus preparation
Tokenization
Neural network training
Embedding generation
Commonly available LLMs in the market:
GPT 3.5/4
PaLM
Next-generation LLM developed by Google AI.
It can process both texts and image inputs.
Claude
Developed by AI research company Anthropic.
Performs better, provides
Cohere
Developed by Cohere Technologies.
Manages various tasks and can be fine-tuned for specific areas.
Falcon
Developed by Technology Innovation Institute (TII) in UAE.
Performance stands out, boasting
LLaMA
Developed by Meta.
Outstanding performance in various NLP tasks
Understands context rich information, enhancing its effectiveness in complex tasks.
Might unintentionally produce biased or inaccurate content.
Diverse reasoning
LLM explores varied reasonings, including common sense, math, adapting to diverse contexts.
Eliciting reasoning
Methods like chain-of-thought prompting guide LLMs to produce step-by-step reasoning.
Reasoning contribution enigma
The challenge lies in understanding reasoning's role and impact, and differentiating it from recalled factual information.
Performance, architecture and computational requirements
impact of context length and model size, practical factors for inference speed and precision
Scalability and performance
Task specific vs general purpose
Deployment cost
Licensing and commercial use
Data security and privacy
Non-technical aspects like ethics and biases
An autoregressive LLM capable of generating text in 46 natural languages and 13 programming languages.
Unsupervised learning models
Used for dimensionality reduction, data compression and feature extraction
Consists of 3 main components
Encoder, which maps input data to lower dimensional representation
Latent space, where data is in its most compressed form
Decoder, which maps the lower-dimensional representation back to the original input data
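The three components can be sketched as a minimal linear autoencoder trained by gradient descent; the toy data, sizes, and learning rate are illustrative (real autoencoders use nonlinear deep networks).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                # toy data: 200 samples, 8 features

W_enc = rng.normal(scale=0.1, size=(8, 3))   # encoder: 8 -> 3 dimensions
W_dec = rng.normal(scale=0.1, size=(3, 8))   # decoder: 3 -> 8 dimensions

for _ in range(500):
    Z = X @ W_enc                # latent space: the most compressed form
    X_hat = Z @ W_dec            # reconstruction of the original input
    err = X_hat - X
    # gradient descent on the reconstruction error
    W_dec -= 0.001 * Z.T @ err
    W_enc -= 0.001 * X.T @ (err @ W_dec.T)
```

Training shrinks the reconstruction error, which is how the 3-dim latent space is forced to keep the most useful features.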
Challenges
Data generation
Robustness in learning
Handling variability
Unsupervised learning models
AE with generative capabilities
VAE Architecture
Roles of VAE Components
Encoder
Latent space
Decoder
Reconstruction loss
KL Divergence term
VAE Generative Training Process
Example code:
VAE with TensorFlow for Image Generation
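A framework-free numpy sketch of the two VAE-specific pieces, the reparameterization trick and the combined loss (reconstruction loss + KL divergence term); the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # sample z = mu + sigma * eps, so gradients can flow through mu and log_var
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    recon = np.sum((x - x_hat) ** 2)                           # reconstruction loss
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))  # KL divergence term
    return recon + kl

mu, log_var = np.zeros(3), np.zeros(3)   # encoder outputs for one sample
z = reparameterize(mu, log_var)          # a point in the latent space
```

The KL term pushes the latent distribution toward a standard normal, which is what gives the VAE its generative capability: sampling z from N(0, 1) and decoding it yields new data.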
Example tool: StyleGAN
Unsupervised learning models
Known for producing high-quality, sharp, and realistic images of things that do not exist.
Uses a generator and a discriminator in an adversarial, zero-sum game during training to create samples with sharp and intricate features.
Excel at capturing high-frequency details, unlike VAEs, which tend to produce blurry outputs.
Use cases
Virtual clothing try-on
Customized shopping
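The adversarial, zero-sum game can be caricatured in numpy: the "data" are samples from N(4, 1), the generator is a single shift parameter applied to noise, and the discriminator is a logistic regression D(x) = sigmoid(w*x + b). This is a one-parameter illustration of the feedback loop, not a practical GAN; the learning rates and step counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta_g = 0.0          # generator parameter: shifts noise toward the data
w, b = 0.0, 0.0        # discriminator parameters

for _ in range(4000):
    real = rng.normal(4.0, 1.0, size=64)
    fake = rng.normal(0.0, 1.0, size=64) + theta_g

    # discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += 0.1 * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    b += 0.1 * (np.mean(1 - d_real) - np.mean(d_fake))

    # generator step: ascend log D(fake), i.e. try to fool the discriminator
    d_fake = sigmoid(w * fake + b)
    theta_g += 0.01 * np.mean((1 - d_fake) * w)
```

As training proceeds, theta_g drifts toward the real mean (4), so generated samples become indistinguishable from real ones, which is the point of the zero-sum game.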