[00:00] (0.20s)
                    there's a new AI model in town Chinese
                 
            
                
                    [00:02] (2.68s)
                    AI company DeepSeek recently made waves
                 
            
                
                    [00:05] (5.52s)
                    when it announced R1 an open source
                 
            
                
                    [00:08] (8.48s)
                    reasoning model that it claimed achieved
                 
            
                
                    [00:10] (10.76s)
                    comparable performance to OpenAI o1 at
                 
            
                
                    [00:13] (13.72s)
                    a fraction of the cost the announcement
                 
            
                
                    [00:15] (15.80s)
                    unleashed a wave of social media panic
                 
            
                
                    [00:18] (18.08s)
                    and stock market chaos Nvidia losing
                 
            
                
                    [00:21] (21.20s)
                    nearly $600 billion in market cap
                 
            
                
                    [00:24] (24.00s)
                    today alone but for those following AI
                 
            
                
                    [00:27] (27.12s)
                    developments closely DeepSeek and R1
                 
            
                
                    [00:30] (30.40s)
                    didn't come out of nowhere the company
                 
            
                
                    [00:32] (32.80s)
                    has been publishing its research and
                 
            
                
                    [00:34] (34.88s)
                    releasing its model weights for months
                 
            
                
                    [00:37] (37.56s)
                    following a path similar to Meta's Llama
                 
            
                
                    [00:40] (40.20s)
                    model this is in contrast to other major
                 
            
                
                    [00:42] (42.44s)
                    AI labs like OpenAI Google DeepMind
                 
            
                
                    [00:45] (45.44s)
                    and Anthropic that have closed weights
                 
            
                
                    [00:48] (48.28s)
                    and publish more limited technical
                 
            
                
                    [00:50] (50.60s)
                    reports what's changed is just that now
                 
            
                
                    [00:53] (53.88s)
                    the broader public is actually paying
                 
            
                
                    [00:56] (56.20s)
                    attention so let's decode what the real
                 
            
                
                    [00:59] (59.64s)
                    developments here are where they come
                 
            
                
                    [01:01] (61.88s)
                    from and why they
                 
            
                
            
                
                    [01:08] (68.68s)
                    matter first of all it is important to
                 
            
                
                    [01:11] (71.92s)
                    distinguish between two relevant models
                 
            
                
                    [01:14] (74.40s)
                    here DeepSeek R1 and DeepSeek V3
                 
            
                
                    [01:19] (79.00s)
                    DeepSeek V3 which was actually released this
                 
            
                
                    [01:22] (82.16s)
                    past December is a general purpose base
                 
            
                
                    [01:25] (85.60s)
                    model that achieves comparable
                 
            
                
                    [01:28] (88.28s)
                    performance to other base models like
                 
            
                
                    [01:30] (90.24s)
                    OpenAI's GPT-4o Anthropic's Claude 3.5
                 
            
                
                    [01:33] (93.60s)
                    Sonnet and Google's Gemini 1.5 DeepSeek
                 
            
                
                    [01:36] (96.84s)
                    R1 which was released at the end of
                 
            
                
                    [01:39] (99.80s)
                    January is a reasoning model built on
                 
            
                
                    [01:42] (102.72s)
                    top of DeepSeek V3 in other words DeepSeek
                 
            
                
                    [01:46] (106.32s)
                    took V3 and applied various algorithmic
                 
            
                
                    [01:49] (109.56s)
                    improvements to it in order to optimize
                 
            
                
                    [01:52] (112.32s)
                    its reasoning ability resulting in R1 a
                 
            
                
                    [01:56] (116.60s)
                    model that's achieved comparable
                 
            
                
                    [01:58] (118.96s)
                    performance to OpenAI o1 and Google Flash
                 
            
                
                    [02:02] (122.68s)
                    2.0 on certain complex reasoning
                 
            
                
                    [02:04] (124.92s)
                    benchmarks but many of the algorithmic
                 
            
                
                    [02:07] (127.36s)
                    innovations responsible for R1's
                 
            
                
                    [02:09] (129.36s)
                    remarkable performance were actually
                 
            
                
                    [02:11] (131.32s)
                    discussed in this past December V3 paper
                 
            
                
                    [02:13] (133.96s)
                    or even before that in DeepSeek's V2
                 
            
                
                    [02:16] (136.60s)
                    paper which was published in May 2024 or
                 
            
                
                    [02:19] (139.92s)
                    the DeepSeekMath paper which came out
                 
            
                
                    [02:22] (142.16s)
                    February 2024 V3 stitches together many
                 
            
                
                    [02:26] (146.04s)
                    of these innovations which were designed
                 
            
                
                    [02:28] (148.28s)
                    primarily with compute and training
                 
            
                
                    [02:30] (150.64s)
                    efficiency in mind one way DeepSeek
                 
            
                
                    [02:33] (153.08s)
                    optimized for efficiency and got more
                 
            
                
                    [02:34] (154.96s)
                    floating point operations per second or
                 
            
                
                    [02:37] (157.56s)
                    FLOPS from the GPUs was by training V3
                 
            
                
                    [02:40] (160.92s)
                    natively in 8-bit floating point format
                 
            
                
                    [02:44] (164.48s)
                    rather than the usual 16-bit or 32-bit
                 
            
                
                    [02:47] (167.08s)
                    format this is not a new idea many other
                 
            
                
                    [02:50] (170.12s)
                    labs are doing it too but it was key for
                 
            
                
                    [02:53] (173.40s)
                    getting such massive memory savings
                 
            
                
                    [02:55] (175.56s)
                    without sacrificing performance a
                 
            
                
                    [02:57] (177.72s)
                    crucial enhancement is their FP8
                 
            
                
                    [02:59] (179.88s)
                    accumulation fix which periodically
                 
            
                
                    [03:02] (182.12s)
                    merges calculations back into a higher
                 
            
                
                    [03:04] (184.76s)
                    precision FP32 accumulator to prevent
                 
            
                
                    [03:08] (188.28s)
                    small numerical errors from compounding
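
The effect of periodically promoting partial sums can be sketched in plain Python. This is an illustration only, not DeepSeek's kernel: float16 (via `struct`) stands in for the limited-precision accumulator, a Python float stands in for the FP32 accumulator, and the promotion interval of 128 and all function names are assumptions made for the example.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest float16 value (stand-in for a low-precision accumulator)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def naive_low_precision_sum(values):
    """Accumulate everything in low precision; rounding error compounds."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)  # every partial sum is rounded to float16
    return acc


def promoted_sum(values, interval=128):
    """Accumulate in low precision, but flush into a high-precision total every `interval` adds."""
    total = 0.0    # high-precision accumulator (assumed stand-in for FP32)
    partial = 0.0  # low-precision running sum
    for i, v in enumerate(values, start=1):
        partial = to_half(partial + v)
        if i % interval == 0:
            total += partial  # promote the partial result
            partial = 0.0     # restart the low-precision sum
    return total + partial


values = [0.01] * 4096
exact = sum(values)
naive_error = abs(naive_low_precision_sum(values) - exact)
promoted_error = abs(promoted_sum(values) - exact)
print(naive_error, promoted_error)
```

In this toy setup the naive sum stalls once the running total grows large enough that each small addend rounds away, while the promoted version keeps the low-precision sum small and so loses far less.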
                 
            
                
                    [03:10] (190.92s)
                    the result far more efficient training
                 
            
                
                    [03:13] (193.52s)
                    across thousands of GPUs cutting costs
                 
            
                
                    [03:16] (196.40s)
                    while maintaining model quality but why
                 
            
                
                    [03:19] (199.68s)
                    does this efficiency matter given its
                 
            
                
                    [03:22] (202.52s)
                    hardware constraints and US export
                 
            
                
                    [03:24] (204.76s)
                    controls on the sale of GPUs to China
                 
            
                
                    [03:27] (207.68s)
                    DeepSeek needed to find a way to get
                 
            
                
                    [03:30] (210.48s)
                    more training and more bandwidth from
                 
            
                
                    [03:32] (212.88s)
                    their existing cluster of GPUs you see
                 
            
                
                    [03:35] (215.56s)
                    at AI labs these GPUs which do number
                 
            
                
                    [03:38] (218.76s)
                    crunching and matrix multiplication to
                 
            
                
                    [03:40] (220.80s)
                    train these models are actually sitting
                 
            
                
                    [03:43] (223.68s)
                    idle most of the time at FP8 it is
                 
            
                
                    [03:47] (227.16s)
                    typical to only see around 35% model
                 
            
                
                    [03:50] (230.60s)
                    FLOPs utilization or MFU meaning GPUs
                 
            
                
                    [03:54] (234.52s)
                    are only being utilized at peak
                 
            
                
                    [03:57] (237.04s)
                    potential about a third of the time the
                 
            
                
                    [03:59] (239.32s)
                    rest of the time these GPUs are waiting
                 
            
                
                    [04:01] (241.60s)
                    for data to be moved either between
                 
            
                
                    [04:04] (244.44s)
                    caches or other GPUs this is Nvidia's
                 
            
                
                    [04:07] (247.40s)
                    key advantage it is not just about GPUs
                 
            
                
                    [04:10] (250.12s)
                    it is about an integrated solution
                 
            
                
                    [04:12] (252.28s)
                    they've been building for over a decade
                 
            
                
                    [04:14] (254.84s)
                    that includes the networking with
                 
            
                
                    [04:16] (256.52s)
                    InfiniBand software with CUDA and
                 
            
                
                    [04:18] (258.96s)
                    developer experience essentially Nvidia
                 
            
                
                    [04:21] (261.96s)
                    provides a deeply integrated system that
                 
            
                
                    [04:24] (264.08s)
                    lets AI researchers program GPU clusters
                 
            
                
                    [04:27] (267.56s)
                    less as a distributed system and closer
                 
            
                
                    [04:30] (270.28s)
                    to what Jensen describes as one giant
                 
            
                
                    [04:33] (273.68s)
                    GPU another clever way DeepSeek makes
                 
            
                
                    [04:36] (276.64s)
                    the most out of their hardware is their
                 
            
                
                    [04:38] (278.72s)
                    particular implementation of a mixture
                 
            
                
                    [04:40] (280.72s)
                    of experts architecture DeepSeek V3 has
                 
            
                
                    [04:44] (284.84s)
                    671 billion model parameters but only
                 
            
                
                    [04:48] (288.48s)
                    37 billion are activated for a given
                 
            
                
                    [04:51] (291.00s)
                    token prediction by contrast the largest
                 
            
                
                    [04:53] (293.72s)
                    and most capable Llama 3 model doesn't
                 
            
                
                    [04:56] (296.60s)
                    use a mixture of experts architecture so
                 
            
                
                    [04:59] (299.72s)
                    it activates its full 405 billion for
                 
            
                
                    [05:03] (303.72s)
                    each token prediction in other words V3
                 
            
                
                    [05:07] (307.12s)
                    activates 11x fewer parameters for each
                 
            
                
                    [05:10] (310.64s)
                    forward pass saving tons of computation
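
The idea of activating only a few experts per token can be sketched as follows. This is a generic top-k softmax router with toy experts, not DeepSeek's actual gating scheme; `top_k_route`, `moe_forward`, the expert functions, and the scores are all hypothetical names and values chosen for illustration. The last line just reproduces the parameter ratio quoted above.

```python
import math


def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


def moe_forward(x, experts, scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(scores, k))


# Four toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 2.0]  # router scores for one token (made up)

y = moe_forward(1.0, experts, scores, k=2)  # only experts 1 and 3 run
active_ratio = 405 / 37  # dense Llama 3 405B vs. V3's 37B active parameters
print(y, round(active_ratio, 1))
```

Only two of the four experts are evaluated for this token, which is the source of the compute savings: the unselected experts cost memory but no FLOPs on this forward pass.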
                 
            
                
                    [05:14] (314.20s)
                    mixture of experts isn't a new concept
                 
            
                
                    [05:17] (317.24s)
                    but it's been challenging to train
                 
            
                
                    [05:19] (319.60s)
                    models with this architecture
                 
            
                
                    [05:21] (321.44s)
                    efficiently DeepSeek introduced novel
                 
            
                
                    [05:24] (324.04s)
                    techniques that stabilize performance
                 
            
                
                    [05:26] (326.28s)
                    and increase GPU utilization
                 
            
                
                    [05:29] (329.04s)
                    additionally to overcome key performance
                 
            
                
                    [05:31] (331.92s)
                    bottlenecks V3 makes use of multi-head
                 
            
                
                    [05:35] (335.96s)
                    latent attention or MLA which DeepSeek
                 
            
                
                    [05:39] (339.36s)
                    first revealed with its V2 paper which
                 
            
                
                    [05:42] (342.36s)
                    was published in May 2024 MLA is a
                 
            
                
                    [05:45] (345.48s)
                    solution designed to tackle KV cache
                 
            
                
                    [05:48] (348.52s)
                    storage limitations one of the biggest
                 
            
                
                    [05:51] (351.20s)
                    sources of VRAM overhead in large models
                 
            
                
                    [05:54] (354.60s)
                    instead of storing full key and value
                 
            
                
                    [05:56] (356.84s)
                    matrices MLA manages to compress
                 
            
                
                    [05:59] (359.88s)
                    them down into a latent representation
                 
            
                
                    [06:02] (362.40s)
                    reconstructing them only when needed
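
A minimal sketch of the compress-then-reconstruct idea, assuming a simple low-rank projection: each token's hidden state is projected down to a small latent vector, only that latent is cached, and keys and values are rebuilt from it on demand. The dimensions, weight matrices, and the `LatentKVCache` class are hypothetical, and details of the real MLA design (such as how positional information is handled) are deliberately left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


D_MODEL, D_LATENT = 8, 2  # toy sizes; real models are far larger

# Made-up projection weights, fixed so the example is deterministic.
W_DOWN = [[0.1 * ((i + j) % 3) for j in range(D_MODEL)] for i in range(D_LATENT)]
W_UP_K = [[0.2 * ((i + j) % 2) for j in range(D_LATENT)] for i in range(D_MODEL)]
W_UP_V = [[0.3 * ((i * j + 1) % 2) for j in range(D_LATENT)] for i in range(D_MODEL)]


class LatentKVCache:
    """Stores one small latent vector per token instead of full keys and values."""

    def __init__(self):
        self.latents = []

    def append(self, hidden):
        self.latents.append(matvec(W_DOWN, hidden))  # compress before caching

    def keys(self):
        return [matvec(W_UP_K, c) for c in self.latents]  # rebuild when needed

    def values(self):
        return [matvec(W_UP_V, c) for c in self.latents]

    def cached_floats(self):
        return sum(len(c) for c in self.latents)


cache = LatentKVCache()
for t in range(3):
    cache.append([float(t + 1)] * D_MODEL)

full_floats = 3 * 2 * D_MODEL  # what a plain KV cache would store for 3 tokens
saving = 1 - cache.cached_floats() / full_floats
print(cache.cached_floats(), full_floats, saving)
```

With these toy sizes the cache holds 6 numbers instead of 48; the trade-off is the extra matrix multiply needed to rebuild keys and values at attention time.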
                 
            
                
                    [06:05] (365.16s)
                    this helped the V2 model reduce its KV
                 
            
                
                    [06:07] (367.76s)
                    cache size by
                 
            
                
                    [06:09] (369.88s)
                    93.3% and boosted its maximum generation
                 
            
                
                    [06:12] (372.96s)
                    throughput to 5.76 times finally unlike
                 
            
                
                    [06:17] (377.32s)
                    traditional models that predict only the
                 
            
                
                    [06:19] (379.40s)
                    next token V3 makes use of multi-token
                 
            
                
                    [06:22] (382.52s)
                    prediction or MTP MTP enables V3 to
                 
            
                
                    [06:26] (386.88s)
                    anticipate multiple future tokens at
                 
            
                
                    [06:29] (389.36s)
                    each step this densifies training
                 
            
                
                    [06:32] (392.24s)
                    signals providing more feedback per step
                 
            
                
                    [06:34] (394.68s)
                    for better data efficiency and faster
                 
            
                
                    [06:36] (396.84s)
                    learning it also improves representation
                 
            
                
                    [06:39] (399.56s)
                    planning allowing the model to pre-plan
                 
            
                
                    [06:42] (402.16s)
                    sequences for smoother more coherent
                 
            
                
                    [06:45] (405.12s)
                    outputs during inference MTP modules can
                 
            
                
                    [06:48] (408.88s)
                    be repurposed for speculative decoding
                 
            
                
                    [06:51] (411.48s)
                    reducing sequential processing steps and
                 
            
                
                    [06:53] (413.88s)
                    significantly speeding up generation
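
Two pieces of this can be sketched in a few lines, under loudly stated assumptions: `mtp_targets` shows how predicting several future tokens yields more training targets per position, and `accept_draft` shows the basic accept-the-agreeing-prefix rule behind speculative decoding. Both function names are hypothetical, the token IDs are made up, and this is not a description of DeepSeek's exact MTP modules.

```python
def mtp_targets(tokens, depth):
    """For each position t, list the next `depth` tokens the model is trained to predict."""
    return [tokens[t + 1 : t + 1 + depth] for t in range(len(tokens) - depth)]


def accept_draft(draft, verified):
    """Keep the longest prefix of drafted tokens that the main model agrees with."""
    accepted = []
    for drafted, checked in zip(draft, verified):
        if drafted != checked:
            break
        accepted.append(drafted)
    return accepted


sequence = [1, 2, 3, 4, 5]
print(mtp_targets(sequence, depth=1))  # ordinary next-token targets
print(mtp_targets(sequence, depth=2))  # two targets per position: denser signal

# Draft three tokens cheaply, then keep only what the main model confirms.
print(accept_draft(draft=[7, 8, 9], verified=[7, 8, 3]))
```

When the draft is mostly right, several tokens are accepted per verification pass instead of one, which is where the generation speedup comes from.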
                 
            
                
                    [06:56] (416.20s)
                    taken all together this makes V3 one of
                 
            
                
                    [06:58] (418.68s)
                    the most impressive base models on the
                 
            
                
                    [07:00] (420.68s)
                    market and it's been out for some time
                 
            
                
                    [07:02] (422.84s)
                    now however the recent release of
                 
            
                
                    [07:05] (425.32s)
                    DeepSeek's R1 reasoning model is what really
                 
            
                
                    [07:08] (428.36s)
                    made waves most LLMs can be improved by
                 
            
                
                    [07:11] (431.96s)
                    being prompted to think step by step but
                 
            
                
                    [07:14] (434.40s)
                    what sets reasoning models apart is that
                 
            
                
                    [07:17] (437.00s)
                    they are specifically trained to break
                 
            
                
                    [07:19] (439.40s)
                    down hard problems and think about them
                 
            
                
                    [07:21] (441.76s)
                    for paragraphs at a time in September
                 
            
                
                    [07:24] (444.44s)
                    open AI showed the power of this new
                 
            
                
                    [07:26] (446.60s)
                    approach with o1 this achieved
                 
            
                
                    [07:28] (448.72s)
                    state-of-the-art results in math coding
                 
            
                
                    [07:31] (451.24s)
                    and science benchmarks with R1 DeepSeek
                 
            
                
                    [07:34] (454.52s)
                    took a similar approach and published
                 
            
                
                    [07:36] (456.28s)
                    the secret sauce OpenAI and DeepSeek
                 
            
                
                    [07:39] (459.08s)
                    achieve their impressive results through
                 
            
                
                    [07:41] (461.12s)
                    reinforcement learning a technique to
                 
            
                
                    [07:43] (463.76s)
                    shape an LLM's behavior based on feedback
                 
            
                
                    [07:46] (466.40s)
                    and reward signals modern LLMs use some
                 
            
                
                    [07:49] (469.52s)
                    variation of reinforcement learning with
                 
            
                
                    [07:51] (471.56s)
                    human feedback AKA RLHF or
                 
            
                
                    [07:55] (475.84s)
                    reinforcement learning from AI feedback
                 
            
                
                    [07:57] (477.96s)
                    AKA RLAIF to improve their models'
                 
            
                
                    [08:01] (481.48s)
                    usefulness and alignment but reasoning
                 
            
                
                    [08:04] (484.04s)
                    models apply RL specifically towards the
                 
            
                
                    [08:07] (487.00s)
                    task of thinking step by step through
                 
            
                
                    [08:09] (489.28s)
                    complex problems so how did DeepSeek
                 
            
                
                    [08:12] (492.36s)
                    apply RL to get a reasoning model at a
                 
            
                
                    [08:15] (495.36s)
                    high level they assemble a bunch of
                 
            
                
                    [08:18] (498.04s)
                    problems with verifiable outputs
                 
            
                
                    [08:20] (500.56s)
                    especially in math and coding problems
                 
            
                
                    [08:23] (503.00s)
                    and then design a training pipeline to
                 
            
                
                    [08:25] (505.80s)
                    get the model to think for a bit and
                 
            
                
                    [08:28] (508.04s)
                    output the correct answers
                 
            
                
                    [08:29] (509.96s)
                    but they don't give the model any
                 
            
                
                    [08:32] (512.12s)
                    external examples of how to think
                 
            
                
                    [08:34] (514.72s)
                    whether from humans or AI and their grading
                 
            
                
                    [08:37] (517.48s)
                    process was extremely simple rather than
                 
            
                
                    [08:40] (520.12s)
                    using a complex AI to give the model
                 
            
                
                    [08:42] (522.80s)
                    fine-grained feedback DeepSeek uses
                 
            
                
                    [08:46] (526.28s)
                    simple rules to evaluate the model's
                 
            
                
                    [08:48] (528.40s)
                    final output on accuracy and formatting
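
A rule-based grader of this kind can be very small. The sketch below assumes the model is asked to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags; those tag names, the exact-match check, and the one-point-per-rule scoring are assumptions made for illustration, not DeepSeek's published reward code.

```python
import re

# Assumed output template: reasoning first, then a single final answer.
TEMPLATE = re.compile(r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 if the output follows the expected think/answer template, else 0.0."""
    return 1.0 if TEMPLATE.match(output.strip()) else 0.0


def accuracy_reward(output: str, expected: str) -> float:
    """1.0 if the extracted final answer exactly matches the known-correct one."""
    match = TEMPLATE.match(output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == expected.strip() else 0.0


def total_reward(output: str, expected: str) -> float:
    """Score only the final output; the reasoning itself is never graded."""
    return accuracy_reward(output, expected) + format_reward(output)


good = "<think>2 + 2 is 4</think><answer>4</answer>"
wrong = "<think>2 + 2 is 5</think><answer>5</answer>"
unformatted = "4"
print(total_reward(good, "4"), total_reward(wrong, "4"), total_reward(unformatted, "4"))
```

Because the check only needs a verifiable final answer, it works for math and for coding problems with test cases, and it needs no second model to act as a judge.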
                 
            
                
                    [08:51] (531.16s)
                    they use these output scores to update
                 
            
                
                    [08:53] (533.24s)
                    their model through a novel technique
                 
            
                
                    [08:55] (535.32s)
                    they published in February 2024 called
                 
            
                
                    [08:58] (538.68s)
                    group relative policy optimization
                 
            
                
                    [09:01] (541.40s)
                    or GRPO remarkably with this process
                 
            
                
                    [09:05] (545.12s)
                    alone DeepSeek saw reasoning emerge over
                 
            
                
                    [09:08] (548.56s)
                    thousands of RL steps the model learned
                 
            
                
                    [09:11] (551.68s)
                    skills like extended chain of thought
                 
            
                
                    [09:14] (554.32s)
                    and even experienced an aha moment where
                 
            
                
                    [09:17] (557.68s)
                    it recognized its own mistakes and
                 
            
                
                    [09:20] (560.36s)
                    backtracked to correct its reasoning
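
The "group relative" part of the policy optimization step mentioned earlier can be illustrated with a short sketch: sample several answers to the same problem, score each one, and measure every score against the group's own average rather than against a separately trained value model. The function name, the use of population standard deviation, and the small `eps` term are assumptions for this example; the full algorithm also involves a clipped policy update that is not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled answer's reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four sampled answers to one problem (made-up values).
rewards = [2.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced, answers below it are discouraged, and a group where every answer scores the same contributes no learning signal.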
                 
            
                
                    [09:22] (562.72s)
                    this model was R1-Zero one of the first
                 
            
                
                    [09:25] (565.80s)
                    large models to achieve top tier results
                 
            
                
                    [09:28] (568.60s)
                    purely through reinforcement learning pure
                 
            
                
                    [09:31] (571.48s)
                    RL has long been a subject of
                 
            
                
                    [09:33] (573.76s)
                    investigation in Western research labs
                 
            
                
                    [09:36] (576.12s)
                    such as DeepMind's AlphaGo which
                 
            
                
                    [09:38] (578.24s)
                    simulated thousands of random games of
                 
            
                
                    [09:40] (580.32s)
                    self-play to beat Lee Sedol the world's
                 
            
                
                    [09:43] (583.00s)
                    top Go player in 2016 in 2019 OpenAI
                 
            
                
                    [09:47] (587.20s)
                    achieved notable success using
                 
            
                
                    [09:48] (588.84s)
                    reinforcement learning to train a
                 
            
                
                    [09:50] (590.60s)
                    robotic hand to solve a Rubik's Cube
                 
            
                
                    [09:52] (592.72s)
                    and beat a top human team in competitive
                 
            
                
                    [09:55] (595.52s)
                    Dota 2 but unconstrained by human
                 
            
                
                    [09:58] (598.44s)
                    examples R1-Zero's thinking steps suffer
                 
            
                
                    [10:01] (601.88s)
                    from poor readability switching between
                 
            
                
                    [10:04] (604.60s)
                    English and Chinese at random so
                 
            
                
                    [10:08] (608.32s)
                    DeepSeek introduced a cold start phase
                 
            
                
                    [10:10] (610.84s)
                    fine-tuning on structured reasoning
                 
            
                
                    [10:12] (612.64s)
                    examples before RL to get R1 this
                 
            
                
                    [10:16] (616.52s)
                    eliminated the language mixing issues
                 
            
                
                    [10:19] (619.08s)
                    and made outputs far more comprehensible
                 
            
                
                    [10:22] (622.40s)
                    the results are impressive R1 achieves
                 
            
                
                    [10:25] (625.88s)
                    comparable performance to o1 on certain
                 
            
                
                    [10:28] (628.64s)
                    math and coding benchmarks but the pace
                 
            
                
                    [10:31] (631.56s)
                    of innovation is speeding up just 2
                 
            
                
                    [10:34] (634.32s)
                    weeks after R1 was released OpenAI
                 
            
                
                    [10:37] (637.04s)
                    released o3-mini which outperforms R1 and
                 
            
                
                    [10:40] (640.72s)
                    o1 on key benchmarks so if R1 didn't
                 
            
                
                    [10:44] (644.84s)
                    actually come out of nowhere what
                 
            
                
                    [10:47] (647.12s)
                    explains the hype cycle one explanation
                 
            
                
                    [10:50] (650.12s)
                    is the sheer accessibility of DeepSeek's
                 
            
                
                    [10:52] (652.56s)
                    model R1 is freely accessible through
                 
            
                
                    [10:55] (655.48s)
                    their website and app and it is free to
                 
            
                
                    [10:57] (657.48s)
                    download run locally and customize also
                 
            
                
                    [11:00] (660.96s)
                    because of all the efficiency
                 
            
                
                    [11:02] (662.80s)
                    improvements it offers near
                 
            
                
                    [11:04] (664.56s)
                    state-of-the-art performance at a
                 
            
                
                    [11:06] (666.28s)
                    fraction of the price of other reasoning
                 
            
                
                    [11:08] (668.12s)
                    models another explanation is that a lot
                 
            
                
                    [11:11] (671.28s)
                    of the hype cycle didn't actually have
                 
            
                
                    [11:13] (673.36s)
                    to do with the specific algorithmic
                 
            
                
                    [11:15] (675.64s)
                    improvements that we described but with
                 
            
                
                    [11:18] (678.12s)
                    misconceptions around V3's alleged $5.5
                 
            
                
                    [11:21] (681.80s)
                    million in training cost there's some
                 
            
                
                    [11:25] (685.04s)
                    important fine print here the 5.5
                 
            
                
                    [11:28] (688.00s)
                    million figure refers only to the cost
                 
            
                
                    [11:30] (690.44s)
                    of the final training run for V3 it
                 
            
                
                    [11:33] (693.48s)
                    doesn't include any of the training cost
                 
            
                
                    [11:36] (696.16s)
                    of R1 or the associated R&D or hardware
                 
            
                
                    [11:40] (700.16s)
                    operating expenses which are presumably
                 
            
                
                    [11:42] (702.92s)
                    in the hundreds of millions given the
                 
            
                
                    [11:45] (705.20s)
                    extreme algorithmic optimizations here
                 
            
                
                    [11:48] (708.12s)
                    that 5.5 million training run number
                 
            
                
                    [11:50] (710.68s)
                    actually seems perfectly possible and it
                 
            
                
                    [11:53] (713.96s)
                    is worth noting that this work is
                 
            
                
                    [11:56] (716.36s)
                    reproducible a UC Berkeley lab recently
                 
            
                
                    [11:59] (719.40s)
                    applied R1-Zero's key techniques to produce
                 
            
                
                    [12:02] (722.68s)
                    complex reasoning in a smaller model for
                 
            
                
                    [12:05] (725.56s)
                    just $30 what DeepSeek really proves is
                 
            
                
                    [12:09] (729.24s)
                    that there is still room for new players
                 
            
                
                    [12:11] (731.92s)
                    on the frontier in particular there's
                 
            
                
                    [12:14] (734.48s)
                    room for rebuilding the stack for
                 
            
                
                    [12:16] (736.60s)
                    optimizing GPU workloads improving
                 
            
                
                    [12:19] (739.20s)
                    software and inference layer tooling and
                 
            
                
                    [12:21] (741.32s)
                    developing AI-generated kernels
                 
            
                
                    [12:24] (744.00s)
                    ultimately this is fantastic news for AI
                 
            
                
                    [12:26] (746.84s)
                    applications in consumer or B2B since it
                 
            
                
                    [12:29] (749.96s)
                    means the cost of intelligence keeps
                 
            
                
                    [12:32] (752.72s)
                    going down so the big takeaway here this
                 
            
                
                    [12:36] (756.52s)
                    is the best possible time to be building
                 
            
                
            
                
                    [12:41] (761.86s)
                    [Music]
                 
            
                
                    [12:47] (767.20s)
                    startup the deadline to apply for the
                 
            
                
                    [12:49] (769.80s)
                    first YC spring batch is February 11th
                 
            
                
                    [12:53] (773.92s)
                    if you're accepted you'll receive
                 
            
                
                    [12:55] (775.80s)
                    $500,000 in investment plus access to
                 
            
                
                    [12:58] (778.56s)
                    the best startup community in the world
                 
            
                
                    [13:01] (781.32s)
                    so apply now and come build the future
                 
            
                
                    [13:04] (784.16s)
                    with us