Model Inference Flow on Virtual GPU
===================================

1. Storage and VRAM Setup
-------------------------
[HTTPGPUStorage]  
      │     ╲
      │      ╲    Zero-Copy
      │       ╲   Memory Mapping
      ▼        ▼
[Local Storage]──>[Virtual VRAM]
 (Memory Pages)     (Page Tables)
      │                  │
      └──────────────┐  │
                     ▼  ▼
                [vGPU Device]
                     │
                     ▼
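The storage-to-VRAM mapping above can be sketched in plain Python. This is a toy model, not the project's real HTTPGPUStorage API: `LocalStorage` holds physical pages in an anonymous `mmap` region, and `VirtualVRAM` holds only a page table, so "moving" data into VRAM copies nothing.

```python
import mmap

PAGE_SIZE = 4096

class LocalStorage:
    """Physical memory pages backed by an anonymous mmap region."""
    def __init__(self, num_pages):
        self.buf = mmap.mmap(-1, num_pages * PAGE_SIZE)

    def write_page(self, page_no, data):
        off = page_no * PAGE_SIZE
        self.buf[off:off + len(data)] = data

    def read_page(self, page_no, length=PAGE_SIZE):
        off = page_no * PAGE_SIZE
        return self.buf[off:off + length]

class VirtualVRAM:
    """Page tables only: VRAM page numbers resolve to LocalStorage pages."""
    def __init__(self, storage):
        self.storage = storage
        self.page_table = {}                  # vram page -> storage page

    def map_page(self, vram_page, storage_page):
        self.page_table[vram_page] = storage_page   # zero-copy: just a mapping

    def read(self, vram_page, length=PAGE_SIZE):
        return self.storage.read_page(self.page_table[vram_page], length)

storage = LocalStorage(num_pages=8)
storage.write_page(3, b"weights")
vram = VirtualVRAM(storage)
vram.map_page(0, 3)                           # vGPU sees page 0 -> physical page 3
print(vram.read(0, 7))                        # b'weights'
```

Mapping a page touches only the page table; the bytes stay where LocalStorage wrote them.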
2. Model Loading and Device Movement
------------------------------------
[Florence-2-Large] ---load---> [PyTorch Model]
         │                          │
         │                          ▼
         │                   [to_vgpu() conversion]
         │                          │
         └─────────────────┐       │
                          ▼       ▼
                    [Model on vGPU Device]
                           │
                           ▼
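A hedged sketch of what the `to_vgpu()` conversion might do, assuming the model is a name-to-array mapping; the real interface operates on PyTorch modules, and `VGPUTensor` is an invented stand-in. The key property shown is that each parameter is re-registered against the virtual device without copying its buffer.

```python
class VGPUTensor:
    """Stand-in for a tensor whose storage sits behind the vGPU page table."""
    def __init__(self, data, device="vgpu:0"):
        self.data = data          # same object as the host array: no copy
        self.device = device

def to_vgpu(params, device="vgpu:0"):
    """Re-register every named parameter on the virtual GPU device."""
    return {name: VGPUTensor(p, device) for name, p in params.items()}

host_params = {"encoder.weight": [0.1, 0.2], "decoder.bias": [0.0]}
vgpu_params = to_vgpu(host_params)
print(vgpu_params["encoder.weight"].device)                             # vgpu:0
print(vgpu_params["decoder.bias"].data is host_params["decoder.bias"])  # True
```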
3. Input Processing and Inference
---------------------------------
[Input Text] -----> [Tokenizer] -----> [Tensor]
                                         │
                                         ▼
                              [to_vgpu() conversion]
                                         │
                                         ▼
                               [Tensor on vGPU]
                                         │
                                         ▼
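The input path can be illustrated with a toy whitespace tokenizer; the real pipeline uses the Florence-2 processor and torch tensors, and `VGPUTensor` below is again an invented stand-in.

```python
class VGPUTensor:
    def __init__(self, data, device="vgpu:0"):
        self.data, self.device = data, device

def tokenize(text, vocab):
    """Map each whitespace token to an id, growing the vocab as needed."""
    return [[vocab.setdefault(tok, len(vocab)) for tok in text.split()]]

def to_vgpu(data):
    return VGPUTensor(data)       # wraps the host data; nothing is copied

vocab = {}
input_ids = tokenize("describe this image", vocab)
tensor = to_vgpu(input_ids)
print(tensor.device, tensor.data)   # vgpu:0 [[0, 1, 2]]
```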
4. Model Inference Flow
-----------------------
[Model Forward Pass]
       │
       ▼
[vGPU Computation]
       │
       ▼
[PyTorch Output Tensor]
       │
       ▼
[Last Hidden State]
(Shape: [batch_size, seq_length, hidden_size])
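The output contract above can be checked with a placeholder forward pass; the dimensions are illustrative, not Florence-2-Large's actual sizes.

```python
batch_size, seq_length, hidden_size = 1, 16, 1024

def forward(input_ids):
    # placeholder computation standing in for the vGPU forward pass
    return [[[0.0] * hidden_size for _ in range(seq_length)]
            for _ in range(batch_size)]

last_hidden_state = forward([[0] * seq_length])
shape = (len(last_hidden_state),
         len(last_hidden_state[0]),
         len(last_hidden_state[0][0]))
print(shape)    # (1, 16, 1024) -> [batch_size, seq_length, hidden_size]
```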

Data Flow and Memory Management:
--------------------------------
1. Storage Layer:   
   - HTTPGPUStorage ──> Local Storage (Memory Pages)
   - Local Storage ──> Virtual VRAM (Zero-Copy)
   - Virtual VRAM manages page tables pointing to local storage

2. Memory Architecture:
   - Local Storage: Physical memory pages
   - Virtual VRAM: Page tables and memory mappings
   - Zero-copy between Local Storage and VRAM
   - Direct memory access for GPU operations

3. Processing Flow:
   - Model Layer:   HF Model ──> PyTorch ──> vGPU
   - Input Layer:   Text ──> Tokens ──> Tensor ──> vGPU
   - Output Layer:  vGPU ──> PyTorch Tensor ──> Results

Key Components:
---------------
- HTTP Storage:  HTTPGPUStorage (Network interface)
- Local Store:   Memory pages (Physical storage)
- Virtual VRAM:  Page tables (Memory management)
- Device:        vGPU (Computation)
- Model:         Florence-2-Large (transformer)
- Framework:     PyTorch (ML operations)
- Interface:     to_vgpu() (Zero-copy transfer)

Memory Management Details:
--------------------------
1. Local Storage:
   - Manages physical memory pages
   - Direct mapping to virtual VRAM
   - Zero-copy access for GPU ops

2. Virtual VRAM:
   - Page table management
   - Memory mapping to local storage
   - No physical copying of data
   - Direct GPU access to memory
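The "no physical copying" property can be demonstrated with Python's `memoryview`, which is loosely analogous to how a Virtual VRAM page windows onto Local Storage: the view and the buffer share the same bytes.

```python
buf = bytearray(b"model-weights-page")
view = memoryview(buf)[6:13]        # zero-copy window onto buf ("weights")
buf[6:13] = b"WEIGHTS"              # write through the original buffer
print(bytes(view))                  # b'WEIGHTS' -- the view saw the change
```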


Model Load (.npy files)
    │
    ├── AIAccelerator (Manages distribution)
    │   │
    │   ├── MultiGPUSystem (8 chips)
    │   │   │
    │   │   ├── Each GPUChip (108 SMs each)
    │   │   │   │
    │   │   │   └── Each SM (3000 tensor cores)
    │   │   │       │
    │   │   │       └── Individual Tensor Cores
    │   │   │           (Direct hardware-level execution)
    │   │   │
    │   │   └── NVLink 4.0 between chips
    │   │
    │   └── LocalStorage (low-latency local data access)
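The simulated hierarchy in the diagram above (8 chips, 108 SMs per chip, 3000 tensor cores per SM) can be expressed directly as nested classes; the class names mirror the diagram, and the arithmetic shows the total core count being managed.

```python
class SM:
    """One streaming multiprocessor with its tensor cores."""
    def __init__(self, tensor_cores=3000):
        self.tensor_cores = tensor_cores

class GPUChip:
    """One chip holding 108 SMs."""
    def __init__(self, num_sms=108):
        self.sms = [SM() for _ in range(num_sms)]

class MultiGPUSystem:
    """Eight chips, linked (in the real design) by NVLink 4.0."""
    def __init__(self, num_chips=8):
        self.chips = [GPUChip() for _ in range(num_chips)]

    def total_tensor_cores(self):
        return sum(sm.tensor_cores for chip in self.chips for sm in chip.sms)

system = MultiGPUSystem()
print(system.total_tensor_cores())   # 8 * 108 * 3000 = 2592000
```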


    AI Model/Operation
    │
    ├── AIAccelerator
    │   │
    │   ├── GPUParallelDistributor (Splits work)
    │   │   │
    │   │   └── Distributes across GPUs
    │   │
    │   └── MultiGPUSystem (Manages hardware)
    │       │
    │       ├── 8 GPU Chips
    │       │   │
    │       │   ├── 108 SMs each
    │       │   │   │
    │       │   │   └── 3000 tensor cores each
    │       │   │
    │       │   └── Local Storage
    │       │
    │       └── NVLink Connections
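A minimal sketch of GPUParallelDistributor's job: splitting a batch of work items across the 8 chips of the MultiGPUSystem. The round-robin policy here is an assumption for illustration; the real distributor may shard differently.

```python
def distribute(work_items, num_gpus=8):
    """Round-robin work items into one shard per GPU chip."""
    shards = [[] for _ in range(num_gpus)]
    for i, item in enumerate(work_items):
        shards[i % num_gpus].append(item)
    return shards

shards = distribute(list(range(20)), num_gpus=8)
print([len(s) for s in shards])      # [3, 3, 3, 3, 2, 2, 2, 2]
```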

http_storage.py (LocalStorage)
    ↓
tensor_storage.py (TensorStorage)
    ↓
multithread_storage.py (MultithreadStorage)
    ↓
ai_http.py (AIAccelerator)




multi_gpu_system_http.py
    │
    ├──Uses──> LocalStorage (for state/tensor storage)
    └──Uses──> GPUChip (for individual GPU operations)
              │
              └──Uses──> MultiCoreSystem (for computation)
                        └──Uses──> ThreadedCore (for threads)
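The bottom of the "Uses" chain can be sketched as follows: MultiCoreSystem fans work out to one thread per ThreadedCore. The class names follow the diagram; the worker callable is an illustrative stand-in for the real per-core computation.

```python
import threading

class ThreadedCore:
    """One simulated core that runs a callable on its own thread."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.result = None

    def run(self, fn, arg):
        self.result = fn(arg)

class MultiCoreSystem:
    """Fans a batch of arguments out across threaded cores."""
    def __init__(self, num_cores=4):
        self.cores = [ThreadedCore(i) for i in range(num_cores)]

    def compute(self, fn, args):
        threads = [threading.Thread(target=core.run, args=(fn, arg))
                   for core, arg in zip(self.cores, args)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return [core.result for core in self.cores]

mcs = MultiCoreSystem()
print(mcs.compute(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
```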

gpu_arch.py (outdated)
    │
    ├──Uses──> MultiCoreSystem (old usage)
    ├──Uses──> CustomVRAM (outdated)
    ├──Uses──> GPUStateDB (outdated)
    └──Uses──> AIAccelerator (limited integration)