SJSU - Summer 2024 - Big Data & Generative AI

Session 3 - Large Language Models Lab

Dr. Julien Pierret

Today's Agenda

  • Large Language Models
    • Using them
    • Setting up PyTorch
    • Hugging Face Setup
    • Running models!

Large Language Models - How do we use them?

  • Train our own? 💰

Varying stages of usage


Prompt Engineering

Retrieval-Augmented Generation (RAG) (1)

  • The model wasn't trained on all knowledge (e.g., events after its training cutoff)

  • Llama 3 prompt:

                {
                  "role": "user",
                  "content": "Who won the Super Bowl in 2023?"
                },
            
  • Llama 3 response:

                I'm happy to help! However, I need to correct you - Super Bowl 2023 has not yet occurred. The 2023 
                NFL season has not started, and the Super Bowl will take place in February 2024. I'll make sure 
                to provide you with the correct information once the game is played!
            

Retrieval-Augmented Generation (2)

  • Llama 3 prompt:

                {
                  "role": "system",
                  "content": "Super Bowl LVII was an American football game played to determine "
                              "the champion of the National Football League (NFL) for the 2022 "
                              "season. The American Football Conference (AFC) champion Kansas City "
                              "Chiefs defeated the National Football Conference (NFC) champion "
                              "Philadelphia Eagles, 38–35. The game was played on February 12, 2023, "
                              "at State Farm Stadium in Glendale, Arizona. It was the fourth Super "
                              "Bowl hosted by the Phoenix metropolitan area, and the third at this "
                              "venue, after Super Bowls XLII in 2008 and XLIX in 2015 when it was "
                              "known as University of Phoenix Stadium. ",
                },
                {
                    "role": "user", "content": "Who won the Super Bowl in 2023?"
                },
            
  • Llama 3 response:

                According to the text, the Kansas City Chiefs won 
                Super Bowl LVII in 2023 by defeating the Philadelphia
                Eagles 38-35.
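
The prompt above can be assembled programmatically. This is a minimal sketch, assuming the retrieved passage goes in the "system" message and the question in the "user" message, as on the slide. The helper name build_rag_messages is hypothetical, not part of any library.

```python
def build_rag_messages(context: str, question: str) -> list[dict[str, str]]:
    """Put the retrieved text in the system message and the question in the user message."""
    return [
        {"role": "system", "content": context},
        {"role": "user", "content": question},
    ]


# Context shortened from the slide's example passage.
context = (
    "Super Bowl LVII was an American football game played to determine "
    "the champion of the National Football League (NFL) for the 2022 season. "
    "The American Football Conference (AFC) champion Kansas City Chiefs defeated "
    "the National Football Conference (NFC) champion Philadelphia Eagles, 38–35."
)
messages = build_rag_messages(context, "Who won the Super Bowl in 2023?")
```

The resulting list has the same shape as the prompt shown on the slide, so it can be passed to whatever chat-style model call the lab uses.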
              

Retrieval-Augmented Generation (3)

  • The amount of context an LLM can accept is limited
    • Search in documents
    • Find the right context
    • Give it to the LLM
  • Can vectorize (embed) the prompt
    • Search for the closest matches in a vector database
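
The retrieval step above can be sketched with a toy example. It assumes a bag-of-words count vector stands in for a real embedding model and a plain Python list stands in for the vector database. The function names (vectorize, cosine, retrieve) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy 'embedding': word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]


# Hypothetical document store.
documents = [
    "Super Bowl LVII was played on February 12, 2023; the Kansas City Chiefs defeated the Philadelphia Eagles.",
    "PyTorch can run on a CPU or on an Nvidia GPU.",
    "A pangram uses every letter of the alphabet at least once.",
]
best = retrieve("Who won the Super Bowl in 2023?", documents, k=1)
print(best[0])
```

Only the top-ranked passage is then handed to the LLM as context, which keeps the prompt within the model's context limit.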

Retrieval-Augmented Generation (4)


Fine-tuning

  • BERT world
    • Adjust model weights
    • Train on domain problem
      • Text message: "My toilet exploded!" 📱
      • predict: ["window", "sink", "door", "toilet",...]
    • Use seq2seq to extract information
                        THE HOME DEPOT #4708 OLYMPIA US
                        COSTCO WHSE #0064 TUMWATER WA
                        WAL-MART #3531
                        DOLLAR TREE
                      
      • Merchant
      • Location
      • Phone number


Fine-tuning

  • GPT/Llama world
    • Instruction fine-tuning
      • Better responses
      • More scientific detail?
    • Full fine-tuning
      • Similar to the "BERT" fine-tuning
      • Adjust the weights of the model
      • Train on data never seen before
        • Company data

Pre-training

  • Building from scratch
  • Expensive 💸

PyTorch setup (1)


PyTorch setup (2)

  • Linux - Nvidia GPU
    • nvidia-smi
                    +---------------------------------------------------------------------------------------+
                    | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
                    |-----------------------------------------+----------------------+----------------------+
                    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
                    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
                    |                                         |                      |               MIG M. |
                    |=========================================+======================+======================|
                    |   0  NVIDIA GeForce RTX 3090        Off | 00000000:42:00.0 Off |                  N/A |
                    |  0%   37C    P8              34W / 420W |    281MiB / 24576MiB |      0%      Default |
                    |                                         |                      |                  N/A |
                    +-----------------------------------------+----------------------+----------------------+
                                                                                                            
                    +---------------------------------------------------------------------------------------+
                    | Processes:                                                                            |
                    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
                    |        ID   ID                                                             Usage      |
                    |=======================================================================================|
                    |    0   N/A  N/A    208284      G   /usr/lib/xorg/Xorg                          241MiB |
                    |    0   N/A  N/A    208477      G   /usr/bin/gnome-shell                         26MiB |
                    +---------------------------------------------------------------------------------------+
                        
  • CPU - Usual choice
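
Once PyTorch is installed, a script can check which device is available. This is a minimal sketch: pick_device is a hypothetical helper, the "mps" branch (Apple Silicon) is an assumption that only applies to recent PyTorch builds on macOS, and the function falls back to "cpu" when PyTorch is not installed at all.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch releases, hence the getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

On the machine shown in the nvidia-smi output, with a CUDA build of PyTorch, this would be expected to select "cuda".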

PyTorch setup (3)

  • Linux CPU - requirements.in
  • 
                  
  • Linux GPU - requirements.in
  • 
                  
  • OSX - requirements.in
  • 
                  

Hugging Face

  • Login
  • Top right icon
  • Settings
  • Access Tokens
  • Create new token
  • Write down (save!) that token!
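
One way to keep the saved token out of source code is to read it from an environment variable. This is a minimal sketch, assuming the token is stored in HF_TOKEN; the helper get_hf_token is hypothetical and the token value below is a placeholder.

```python
import os


def get_hf_token(env=None, var_name: str = "HF_TOKEN") -> str:
    """Read the access token from the environment instead of hard-coding it."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token")
    return token


# Demonstration with a fake environment; a real token should never appear in source code.
demo_token = get_hf_token(env={"HF_TOKEN": "hf_example_not_a_real_token"})
```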

Hugging Face - So Much Stuff


Python time!

  • PyCharm
    • New Project
  • requirements.in


                


                # Install pip-tools (it has "pip-compile" in it)
                pip3 install pip-tools
          
                # Create the requirements.txt from the requirements.in
                pip-compile --output-file=requirements.txt requirements.in
          
                # Install the libraries from the generated requirements.txt
                pip3 install -r requirements.txt

                # Install huggingface-cli
                pip3 install -U "huggingface_hub[cli]"
                

Hugging Face CLI login

  • Remember that token?
                    (.venv) julien@server:~/src/PythonProjcet$ huggingface-cli login
                  
                      _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
                      _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
                      _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
                      _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
                      _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
                  
                      To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
                    Enter your token (input will not be visible): 
                    
  • Let's start coding!

Imports & __main__


            

BERT - Code


            

BERT - Results


              ********************************************************************************
              BERT - Masked model
              ********************************************************************************
              Input string: What color is the sky? The sky is [MASK].

              ...

              token=blue, prob=0.4481164813041687
              token=black, prob=0.08515145629644394
              token=white, prob=0.06458993256092072
              token=red, prob=0.04340418055653572
              token=gray, prob=0.04022093862295151
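
Probabilities like the ones above are typically obtained by applying a softmax to the model's scores (logits) at the [MASK] position and keeping the highest-ranked tokens. The sketch below shows only that last step. The logits are made-up toy values, not output from BERT, and the helper names are hypothetical.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(probs: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k most probable tokens, highest first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical logits for the [MASK] position (a real vocabulary has tens of thousands of entries).
toy_logits = {"blue": 4.0, "black": 2.3, "white": 2.1, "red": 1.7, "gray": 1.6, "loud": -1.0}
probs = softmax(toy_logits)
for token, prob in top_k(probs):
    print(f"token={token}, prob={prob}")
```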
              
            

seq2seq - Code


            

seq2seq - Results


              ********************************************************************************
              FLAN-T5 - seq2seq
              ********************************************************************************
              Input string: What color is the sky?
              {'input_ids': tensor([[ 363,  945,   19,    8, 5796,   58,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
              ['bluere']
              
            

GPT2 - Code (1)


            

GPT2 - Results (1)


              ********************************************************************************
              GPT2 - Causal Model
              ********************************************************************************
              Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
              Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
              Input: What is the color of the sky?
              What is the color of the sky? White!
              What is the color of the sky? (4
              What is the color of the sky?


              What is the color of the sky? You will
              What is the color of the sky?


              What is the color of the sky? I know
              What is the color of the sky?


              What is the color of the sky?


              What is the color of the sky?


              What is the color of the sky?
              
            

GPT2 - Code (2)


            

GPT2 - Results (2)


              ********************************************************************************
              GPT2 - Causal Model
              ********************************************************************************

              ...

              Input: What is the color of the sky? The sky is
              Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
              What is the color of the sky? The sky is white.
              
              
              Are you sure? The
              What is the color of the sky? The sky is violet-orange!
              
              Well, it
              What is the color of the sky? The sky is blue, pink, pink, etc, etc
              What is the color of the sky? The sky is mostly green and violet.
              
              What is
              What is the color of the sky? The sky is pink. What is the color of your tongue
              What is the color of the sky? The sky is red, the stars are blue and the soil
              What is the color of the sky? The sky is orange on this morning.
              
              The sky
              What is the color of the sky? The sky is pinkish black. Why is it that when
              What is the color of the sky? The sky is red!
              
              
              Why is this happening?
              What is the color of the sky? The sky is blue, it's the blue sky that brings
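
The ten different continuations above suggest that the next token is being sampled rather than always taking the most likely one. That is an assumption, since the generation settings are not shown here. The sketch below contrasts greedy choice with temperature sampling on made-up logits; all names and values are hypothetical.

```python
import math
import random


def greedy_next_token(logits: dict[str, float]) -> str:
    """Always pick the highest-scoring token (deterministic)."""
    return max(logits, key=logits.get)


def sample_next_token(logits: dict[str, float], temperature: float = 1.0, rng=None) -> str:
    """Draw a token at random, weighted by softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())
    weights = [math.exp(v - m) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores for the word after "The sky is".
toy_logits = {"blue": 2.0, "white": 1.5, "pink": 1.2, "red": 1.0}
rng = random.Random(0)
samples = [sample_next_token(toy_logits, temperature=1.0, rng=rng) for _ in range(200)]
print(greedy_next_token(toy_logits), sorted(set(samples)))
```

Lower temperatures concentrate the draws on the top token; higher temperatures spread them out, which gives more varied but less reliable text.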
              
            

GPT Style LLM

GPT - imports & __main__


            

GPT - Llama 2 - Code (1)


            

GPT - Llama 2 - Results (1)


              Result: What is the color of the sky?

              Answer: The color of the sky is blue.
              
              Reasoning: The sky appears blue to us because of the way that light
              interacts with the Earth's atmosphere. When sunlight enters the atmosphere,
              it is scattered in all directions by the tiny molecules of gases that make
              up the atmosphere. This scattering effect is known as Rayleigh scattering,
              and it is responsible for the blue color of the sky. The shorter, blue
              wavelengths of light are scattered more than the longer, red wavelengths,
              which is why the sky appears blue to our eyes.
            

GPT - Llama 2 - Code (2)


            

GPT - Llama 2 - Result (2)


              Result: What is the most common English-language pangram?

              A pangram is a phrase or sentence that uses all the letters of the alphabet
              at least once. The most common English-language pangram is:
              
              "The quick brown fox jumps over the lazy dog."
              
              This sentence is often used as a demonstration of a font or keyboard's
              capabilities, as it includes all 26 letters of the alphabet. It was first
              used in the 19th century and has since become a well-known and widely used
              pangram.
            

GPT - Llama 3 - Code (1)


            

Llama 3 - Alignment

  • Instruct Models
    • Fine-tuned to follow specific instructions or prompts
    • “Expect” to be asked to perform a task based on given instructions
    • Good for
      • Chatbots
      • Implementing RAG
      • Agents
  • Non-Instruct Models
    • Generate output that continues the prompt, without being given specific instructions
    • Not tuned to follow instructions
    • Good at content generation with no specific task
    • Pre-trained only
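
Instruct models expect the conversation wrapped in the special tokens they were fine-tuned on. The sketch below renders role/content messages into the Llama 3 Instruct layout as I understand it from the published format. render_llama3_prompt is a hypothetical helper for illustration only; in practice the tokenizer's own chat template (for example apply_chat_template in the Transformers library) should be preferred.

```python
def render_llama3_prompt(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages into a single Llama 3 Instruct-style prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave an open assistant header so the model writes the reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_prompt([{"role": "user", "content": "What is the color of the sky?"}])
print(prompt)
```

A non-instruct (base) model has no such template: it is simply given the raw text and continues it.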

GPT - Llama 3 - Code (2)


            

GPT - Llama 3 - Code (3)


            

GPT - Llama 3 - Result (1)


              {
                'role': 'assistant',
                'content': "What a great question!
                
                            The color of the sky can look different  depending on the
                            time of day and the weather. But usually, the sky appears
                            blue to our eyes!
                            
                            When the sun is shining, the sky can look a bright blue,
                            almost like a clear blueberry. This is because the Earth's
                            atmosphere scatters the sunlight in all directions, and
                            the shorter, blue wavelengths are scattered more than the
                            longer, red wavelengths. This is called Rayleigh
                            scattering, and it's the reason why the sky often appears
                            blue.
                            
                            But, did you know that the sky can also look different at
                            sunrise and sunset? During these times, the sun's light has to
                            travel through more of the Earth's atmosphere, which scatters
                            the shorter wavelengths of light, making the sky appear more
                            orange or reddish. It's like the sky is painting a beautiful
                            sunset just for us!
                            
                            So, to answer your question, the color of the sky is usually
                            blue, but it can also be many other colors depending on the
                            time of day and the weather."
                }
            

GPT - Llama 3 - Code (4)


            

GPT - Llama 3 - Result (2)


              {
                'role': 'assistant',
                'content': "I'm sorry, but since the future is not yet known, I do not
                           have the answer to that question. The Super Bowl for 2024
                           has not yet taken place."
              }
            

GPT - Llama 3 - Code (5)


            

GPT - Llama 3 - Result (3)


              {
                'role': 'assistant',
                'content': 'According to the text, the Kansas City Chiefs won Super Bowl
                           LVIII in 2024 by defeating the San Francisco 49ers 25-22 in
                           overtime.'
              }
            
  • RAG

In Summary

  • Hugging Face Transformers library
  • Makes this easy

It was a pleasure

  • Questions?