[00:00] (0.20s)
there's a new AI model in town Chinese
[00:02] (2.68s)
AI company DeepSeek recently made waves
[00:05] (5.52s)
when it announced R1 an open source
[00:08] (8.48s)
reasoning model that it claimed achieves
[00:10] (10.76s)
comparable performance to OpenAI's o1 at
[00:13] (13.72s)
a fraction of the cost the announcement
[00:15] (15.80s)
unleashed a wave of social media panic
[00:18] (18.08s)
and stock market chaos Nvidia losing
[00:21] (21.20s)
nearly $600 billion in market cap
[00:24] (24.00s)
today alone but for those following AI
[00:27] (27.12s)
developments closely DeepSeek and R1
[00:30] (30.40s)
didn't come out of nowhere the company
[00:32] (32.80s)
has been publishing its research and
[00:34] (34.88s)
releasing its model weights for months
[00:37] (37.56s)
following a path similar to Meta's Llama
[00:40] (40.20s)
model this is in contrast to other major
[00:42] (42.44s)
AI labs like OpenAI Google DeepMind
[00:45] (45.44s)
and Anthropic that have closed weights
[00:48] (48.28s)
and publish more limited technical
[00:50] (50.60s)
reports what's changed is just that now
[00:53] (53.88s)
the broader public is actually paying
[00:56] (56.20s)
attention so let's decode what the real
[00:59] (59.64s)
developments here are where they come
[01:01] (61.88s)
from and why they
[01:08] (68.68s)
matter first of all it is important to
[01:11] (71.92s)
distinguish between two relevant models
[01:14] (74.40s)
here DeepSeek R1 and DeepSeek V3
[01:19] (79.00s)
DeepSeek V3 which was actually released this
[01:22] (82.16s)
past December is a general purpose base
[01:25] (85.60s)
model that achieves comparable
[01:28] (88.28s)
performance to other base models like
[01:30] (90.24s)
OpenAI's GPT-4o Anthropic's Claude 3.5
[01:33] (93.60s)
Sonnet and Google's Gemini 1.5 DeepSeek
[01:36] (96.84s)
R1 which was released at the end of
[01:39] (99.80s)
January is a reasoning model built on
[01:42] (102.72s)
top of DeepSeek V3 in other words DeepSeek
[01:46] (106.32s)
took V3 and applied various algorithmic
[01:49] (109.56s)
improvements to it in order to optimize
[01:52] (112.32s)
its reasoning ability resulting in R1 a
[01:56] (116.60s)
model that's achieved comparable
[01:58] (118.96s)
performance to OpenAI's o1 and Google's Gemini Flash
[02:02] (122.68s)
2.0 on certain complex reasoning
[02:04] (124.92s)
benchmarks but many of the algorithmic
[02:07] (127.36s)
innovations responsible for R1's
[02:09] (129.36s)
remarkable performance were actually
[02:11] (131.32s)
discussed in this past December's V3 paper
[02:13] (133.96s)
or even before that in DeepSeek's V2
[02:16] (136.60s)
paper which was published in May 2024 or
[02:19] (139.92s)
the DeepSeekMath paper which came out
[02:22] (142.16s)
February 2024 V3 stitches together many
[02:26] (146.04s)
of these innovations which were designed
[02:28] (148.28s)
primarily with compute and training
[02:30] (150.64s)
efficiency in Mind One Way de seek
[02:33] (153.08s)
optimized for efficiency and got more
[02:34] (154.96s)
floating Point operations per second or
[02:37] (157.56s)
flops from the gpus was by trading V3
[02:40] (160.92s)
natively in 8bit floating Point format
[02:44] (164.48s)
rather than the usual 16bit or 32-bit
[02:47] (167.08s)
format this is not a new idea many other
[02:50] (170.12s)
labs are doing it too but it was key for
[02:53] (173.40s)
getting such massive memory savings
[02:55] (175.56s)
without sacrificing performance a
[02:57] (177.72s)
crucial enhancement is their FP8
[02:59] (179.88s)
accumulation fix which periodically
[03:02] (182.12s)
merges calculations back into a higher
[03:04] (184.76s)
precision FP32 accumulator to prevent
[03:08] (188.28s)
small numerical errors from compounding
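To make that accumulation fix concrete, here is a minimal Python sketch: it crudely emulates a reduced-mantissa float format (a stand-in for FP8, not DeepSeek's actual CUDA kernels) and shows partial sums being periodically merged into a full-precision accumulator so rounding error stops compounding.

```python
import math

def quantize(x, mantissa_bits=3):
    """Crudely round x to a float with few mantissa bits (FP8-like stand-in)."""
    if x == 0.0:
        return 0.0
    exponent = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (exponent - mantissa_bits)
    return round(x / scale) * scale

def dot_low_precision(a, b, flush_every=4):
    """Dot product whose running partial sum lives in the low-precision
    format, but is merged into a full-precision accumulator every
    `flush_every` terms -- the accumulation-fix idea described above."""
    acc_full = 0.0   # full-precision accumulator (plays the role of FP32)
    acc_low = 0.0    # low-precision running partial sum (plays FP8)
    for i, (x, y) in enumerate(zip(a, b), start=1):
        acc_low = quantize(acc_low + quantize(x * y))
        if i % flush_every == 0:
            acc_full += acc_low   # merge, then restart the small sum
            acc_low = 0.0
    return acc_full + acc_low
```

Without the periodic merge, once the running sum grows large its coarse rounding step swallows each small product entirely; flushing keeps every partial sum small enough to be represented accurately.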
[03:10] (190.92s)
the result far more efficient training
[03:13] (193.52s)
across thousands of GPUs cutting costs
[03:16] (196.40s)
while maintaining model quality but why
[03:19] (199.68s)
does this efficiency matter given its
[03:22] (202.52s)
hardware constraints and US export
[03:24] (204.76s)
controls on the sale of GPUs to China
[03:27] (207.68s)
DeepSeek needed to find a way to get
[03:30] (210.48s)
more training throughput and more bandwidth from
[03:32] (212.88s)
their existing cluster of GPUs you see
[03:35] (215.56s)
at AI labs these GPUs which do number
[03:38] (218.76s)
crunching and matrix multiplication to
[03:40] (220.80s)
train these models are actually sitting
[03:43] (223.68s)
idle most of the time at FP8 it is
[03:47] (227.16s)
typical to only see around 35% model
[03:50] (230.60s)
FLOPS utilization or MFU meaning GPUs
[03:54] (234.52s)
are only being utilized at peak
[03:57] (237.04s)
potential about a third of the time the
[03:59] (239.32s)
rest of the time these GPUs are waiting
[04:01] (241.60s)
for data to be moved either between
[04:04] (244.44s)
caches or other GPUs this is Nvidia's
[04:07] (247.40s)
key advantage it is not just about GPUs
[04:10] (250.12s)
it is about an integrated solution
[04:12] (252.28s)
they've been building for over a decade
[04:14] (254.84s)
that includes the networking with
[04:16] (256.52s)
InfiniBand software with CUDA and
[04:18] (258.96s)
developer experience essentially Nvidia
[04:21] (261.96s)
provides a deeply integrated system that
[04:24] (264.08s)
lets AI researchers program GPU clusters
[04:27] (267.56s)
as a distributed system and get closer
[04:30] (270.28s)
to what Jensen describes as one giant
[04:33] (273.68s)
GPU another clever way DeepSeek makes
[04:36] (276.64s)
the most out of their hardware is their
[04:38] (278.72s)
particular implementation of a mixture
[04:40] (280.72s)
of experts architecture DeepSeek V3 has
[04:44] (284.84s)
671 billion total parameters but only
[04:48] (288.48s)
37 billion are activated for a given
[04:51] (291.00s)
token prediction by contrast the largest
[04:53] (293.72s)
and most capable Llama 3 model doesn't
[04:56] (296.60s)
use a mixture of experts architecture so
[04:59] (299.72s)
it activates its full 405 billion parameters for
[05:03] (303.72s)
each token prediction in other words V3
[05:07] (307.12s)
activates 11x fewer parameters for each
[05:10] (310.64s)
forward pass saving tons of computation
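As a quick sanity check on those numbers (simple arithmetic in Python, using the parameter counts quoted above):

```python
# Parameter counts quoted above, in billions.
v3_total, v3_active = 671, 37   # DeepSeek V3: sparse mixture of experts
llama3_active = 405             # Llama 3 405B: dense, everything active

sparsity = v3_active / v3_total       # fraction of V3 active per token
ratio = llama3_active / v3_active     # dense active params vs. V3's

print(f"V3 activates {sparsity:.1%} of its parameters per token")
print(f"a dense 405B model activates ~{ratio:.0f}x more per token")
```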
[05:14] (314.20s)
mixture of experts isn't a new concept
[05:17] (317.24s)
but it's been challenging to train
[05:19] (319.60s)
models with this architecture
[05:21] (321.44s)
efficiently DeepSeek introduced novel
[05:24] (324.04s)
techniques that stabilize performance
[05:26] (326.28s)
and increase GPU utilization
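Here is a toy sketch of that top-k routing idea; the gate weights, experts, and sizes are hypothetical stand-ins for learned networks, but the key property holds: only the selected experts ever run.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, top_k=2):
    """Toy mixture-of-experts layer: score every expert with a linear gate,
    then run only the top_k highest-scoring experts on this token."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Weighted combination of just the chosen experts' outputs; the other
    # experts' parameters are never touched for this token.
    return sum(probs[i] * experts[i](x) for i in chosen), chosen
```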
[05:29] (329.04s)
additionally to overcome key performance
[05:31] (331.92s)
bottlenecks V3 makes use of multi-head
[05:35] (335.96s)
latent attention or MLA which DeepSeek
[05:39] (339.36s)
first revealed with its V2 paper which
[05:42] (342.36s)
was published in May 2024 MLA is a
[05:45] (345.48s)
solution designed to tackle KV cache
[05:48] (348.52s)
storage limitations one of the biggest
[05:51] (351.20s)
sources of memory overhead in large models
[05:54] (354.60s)
instead of storing full key and value
[05:56] (356.84s)
matrices MLA manages to compress
[05:59] (359.88s)
them down into a latent representation
[06:02] (362.40s)
reconstructing them only when needed
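The caching idea can be sketched like this; the sizes and the random matrices are purely illustrative stand-ins for learned projections, and real MLA has extra details (such as its handling of rotary position embeddings) that this toy omits.

```python
import random

random.seed(0)
D, LATENT = 8, 2   # per-head dim vs. compressed latent dim (toy sizes)

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Random matrices standing in for the learned down/up projections.
W_down = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(LATENT)]
W_up_k = [[random.gauss(0, 0.5) for _ in range(LATENT)] for _ in range(D)]
W_up_v = [[random.gauss(0, 0.5) for _ in range(LATENT)] for _ in range(D)]

def cache_token(hidden):
    """Cache only a small latent vector instead of the full K and V."""
    return matvec(W_down, hidden)

def reconstruct_kv(latent):
    """Rebuild (approximate) K and V from the cached latent when attending."""
    return matvec(W_up_k, latent), matvec(W_up_v, latent)

hidden = [random.gauss(0, 1) for _ in range(D)]
latent = cache_token(hidden)
k, v = reconstruct_kv(latent)
# Cached floats per token: LATENT instead of 2 * D for full K + V.
memory_saving = 1 - LATENT / (2 * D)
```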
[06:05] (365.16s)
this helped the V2 model reduce its KV
[06:07] (367.76s)
cache size by
[06:09] (369.88s)
93.3% and boosted its maximum generation
[06:12] (372.96s)
throughput to 5.76 times finally unlike
[06:17] (377.32s)
traditional models that predict only the
[06:19] (379.40s)
next token V3 makes use of multi-token
[06:22] (382.52s)
prediction or MTP MTP enables V3 to
[06:26] (386.88s)
anticipate multiple future tokens at
[06:29] (389.36s)
each step this densifies training
[06:32] (392.24s)
signals providing more feedback per step
[06:34] (394.68s)
for better data efficiency and faster
[06:36] (396.84s)
learning it also improves representation
[06:39] (399.56s)
planning allowing the model to pre-plan
[06:42] (402.16s)
sequences for smoother more coherent
[06:45] (405.12s)
outputs during inference MTP modules can
[06:48] (408.88s)
be repurposed for speculative decoding
[06:51] (411.48s)
reducing sequential processing steps and
[06:53] (413.88s)
significantly speeding up generation
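Both ideas can be sketched in a few lines of toy Python (function names are illustrative, not DeepSeek's code): densified targets for training, and draft-token acceptance for speculative decoding at inference.

```python
def mtp_targets(tokens, depth=2):
    """For each position, emit targets for the next `depth` future tokens,
    so every position yields `depth` training signals instead of one."""
    return [
        (tokens[i], tuple(tokens[i + 1 : i + 1 + depth]))
        for i in range(len(tokens) - depth)
    ]

def accept_draft(draft, verified):
    """Speculative decoding core: keep the drafted tokens up to the first
    point where the main model's own predictions disagree."""
    kept = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        kept.append(d)
    return kept
```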
[06:56] (416.20s)
taken all together this makes V3 one of
[06:58] (418.68s)
the most impressive base models on the
[07:00] (420.68s)
market and it's been out for some time
[07:02] (422.84s)
now however the recent release of
[07:05] (425.32s)
DeepSeek's R1 reasoning model is what really
[07:08] (428.36s)
made waves most LLMs can be improved by
[07:11] (431.96s)
being prompted to think step by step but
[07:14] (434.40s)
what sets reasoning models apart is that
[07:17] (437.00s)
they are specifically trained to break
[07:19] (439.40s)
down hard problems and think about them
[07:21] (441.76s)
for paragraphs at a time in September
[07:24] (444.44s)
OpenAI showed the power of this new
[07:26] (446.60s)
approach with o1 which achieved
[07:28] (448.72s)
state-of-the-art results in math coding
[07:31] (451.24s)
and science benchmarks with R1 DeepSeek
[07:34] (454.52s)
took a similar approach and published
[07:36] (456.28s)
the secret sauce both OpenAI and DeepSeek
[07:39] (459.08s)
achieve their impressive results through
[07:41] (461.12s)
reinforcement learning a technique to
[07:43] (463.76s)
shape an LLM's behavior based on feedback
[07:46] (466.40s)
and reward signals modern LLMs use some
[07:49] (469.52s)
variation of reinforcement learning with
[07:51] (471.56s)
human feedback aka RLHF or
[07:55] (475.84s)
reinforcement learning from AI feedback
[07:57] (477.96s)
aka RLAIF to improve their model's
[08:01] (481.48s)
usefulness and alignment but reasoning
[08:04] (484.04s)
models apply RL specifically towards the
[08:07] (487.00s)
task of thinking step by step through
[08:09] (489.28s)
complex problems so how did DeepSeek
[08:12] (492.36s)
apply RL to get a reasoning model at a
[08:15] (495.36s)
high level they assemble a bunch of
[08:18] (498.04s)
problems with verifiable outputs
[08:20] (500.56s)
especially in math and coding problems
[08:23] (503.00s)
and then design a training pipeline to
[08:25] (505.80s)
get the model to think for a bit and
[08:28] (508.04s)
output the correct answers
[08:29] (509.96s)
but they don't give the model any
[08:32] (512.12s)
external examples of how to think
[08:34] (514.72s)
whether from humans or AI and the grading
[08:37] (517.48s)
process was extremely simple rather than
[08:40] (520.12s)
using a complex AI to give the model
[08:42] (522.80s)
fine-grained feedback DeepSeek uses
[08:46] (526.28s)
simple rules to evaluate the model's
[08:48] (528.40s)
final output on accuracy and formatting
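A toy version of such a rule-based grader (the tags, patterns, and point values here are illustrative guesses, not DeepSeek's actual rules):

```python
import re

def reward(completion, gold_answer):
    """Rule-based grading in the spirit described above: a small bonus for
    following the expected <think>...</think> format, and full credit only
    when the final answer matches the verifiable ground truth."""
    r = 0.0
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        r += 0.2   # formatting rule (weight is illustrative)
    m = re.search(r"answer:\s*(\S+)\s*$", completion.strip(), re.IGNORECASE)
    if m and m.group(1) == gold_answer:
        r += 1.0   # accuracy rule
    return r
```

No learned reward model is involved; the signal comes entirely from checkable rules, which is what makes math and coding problems such a good fit.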
[08:51] (531.16s)
they use these output scores to update
[08:53] (533.24s)
their model through a novel technique
[08:55] (535.32s)
they published in February 2024 called
[08:58] (538.68s)
group relative policy optimization
[09:01] (541.40s)
or GRPO remarkably with this process
[09:05] (545.12s)
alone DeepSeek saw reasoning emerge over
[09:08] (548.56s)
thousands of RL steps the model learned
[09:11] (551.68s)
skills like extended chain of thought
[09:17] (557.68s)
and even experienced an aha moment where
[09:17] (557.68s)
it recognized its own mistakes and
[09:20] (560.36s)
backtracked to correct its reasoning
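The group-relative scoring at the heart of GRPO can be sketched in a few lines; this shows only the advantage computation, while the published objective also includes a clipped policy ratio and a KL penalty.

```python
import statistics

def group_advantages(rewards):
    """GRPO's core idea: sample a group of answers to the same prompt,
    grade each one, and use its reward relative to the group mean
    (normalized by the group's spread) as its advantage -- avoiding a
    separately trained value model."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards) or 1.0   # guard: identical rewards
    return [(r - mean) / spread for r in rewards]
```

Answers that beat their groupmates get positive advantage and are reinforced; below-average answers are pushed down, with no critic network in the loop.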
[09:22] (562.72s)
this model was R1-Zero one of the first
[09:25] (565.80s)
large models to achieve top tier results
[09:28] (568.60s)
purely through reinforcement learning pure
[09:31] (571.48s)
RL has long been a subject of
[09:33] (573.76s)
investigation in Western research labs
[09:36] (576.12s)
such as DeepMind's AlphaGo which
[09:38] (578.24s)
simulated thousands of random games of
[09:40] (580.32s)
self-play to beat Lee Sedol the world's
[09:43] (583.00s)
top Go player in 2016 in 2019 OpenAI
[09:47] (587.20s)
achieved notable success using
[09:48] (588.84s)
reinforcement learning to train a
[09:50] (590.60s)
robotic hand to solve a Rubik's Cube
[09:52] (592.72s)
and beat a top human team in competitive
[09:55] (595.52s)
Dota 2 but unconstrained by human
[09:58] (598.44s)
examples R1-Zero's thinking steps suffer
[10:01] (601.88s)
from poor readability switching between
[10:04] (604.60s)
English and Chinese at random so
[10:08] (608.32s)
DeepSeek introduced a cold start phase
[10:10] (610.84s)
fine-tuning on structured reasoning
[10:12] (612.64s)
examples before RL to get R1 this
[10:16] (616.52s)
eliminated the language mixing issues
[10:19] (619.08s)
and made outputs far more comprehensible
[10:22] (622.40s)
the results are impressive R1 achieves
[10:25] (625.88s)
comparable performance to o1 on certain
[10:28] (628.64s)
math and coding benchmarks but the pace
[10:31] (631.56s)
of innovation is speeding up just 2
[10:34] (634.32s)
weeks after R1 was released OpenAI
[10:37] (637.04s)
released o3-mini which outperforms R1 and
[10:40] (640.72s)
o1 on key benchmarks so if R1 didn't
[10:44] (644.84s)
actually come out of nowhere what
[10:47] (647.12s)
explains the hype cycle one explanation
[10:50] (650.12s)
is the sheer accessibility of DeepSeek's
[10:55] (655.48s)
model R1 is freely accessible through
[10:55] (655.48s)
their website and app and it is free to
[10:57] (657.48s)
download run locally and customize also
[11:00] (660.96s)
because of all the efficiency
[11:02] (662.80s)
improvements it offers near
[11:04] (664.56s)
state-of-the-art performance at a
[11:06] (666.28s)
fraction of the price of other reasoning
[11:08] (668.12s)
models another explanation is that a lot
[11:11] (671.28s)
of the hype cycle didn't actually have
[11:13] (673.36s)
to do with the specific algorithmic
[11:15] (675.64s)
improvements that we described but with
[11:18] (678.12s)
misconceptions around V3's alleged $5.5
[11:21] (681.80s)
million in training cost there's some
[11:25] (685.04s)
important fine print here the $5.5
[11:28] (688.00s)
million figure refers only to the cost
[11:30] (690.44s)
of the final training run for V3 it
[11:33] (693.48s)
doesn't include any of the training cost
[11:36] (696.16s)
of R1 or the associated R&D or hardware
[11:40] (700.16s)
operating expenses which are presumably
[11:42] (702.92s)
in the hundreds of millions given the
[11:45] (705.20s)
extreme algorithmic optimizations here
[11:48] (708.12s)
that $5.5 million training run number
[11:50] (710.68s)
actually seems perfectly possible and it
[11:53] (713.96s)
is worth noting that this work is
[11:56] (716.36s)
reproducible a UC Berkeley lab recently
[11:59] (719.40s)
applied R1-Zero's key techniques to produce
[12:02] (722.68s)
complex reasoning in a smaller model for
[12:05] (725.56s)
just $30 what DeepSeek really proves is
[12:09] (729.24s)
that there is still room for new players
[12:11] (731.92s)
on the frontier in particular there's
[12:14] (734.48s)
room for rebuilding the stack for
[12:16] (736.60s)
optimizing GPU workloads improving
[12:19] (739.20s)
software at the inference layer tooling and
[12:21] (741.32s)
developing AI-generated kernels
[12:24] (744.00s)
ultimately this is fantastic news for AI
[12:26] (746.84s)
applications in consumer or B2B since it
[12:29] (749.96s)
means the cost of intelligence keeps
[12:32] (752.72s)
going down so the big takeaway here this
[12:36] (756.52s)
is the best possible time to be building
[12:41] (761.86s)
[Music]
[12:47] (767.20s)
a startup the deadline to apply for the
[12:49] (769.80s)
first YC spring batch is February 11th
[12:53] (773.92s)
if you're accepted you'll receive
[12:55] (775.80s)
$500,000 in investment plus access to
[12:58] (778.56s)
the best startup community in the world
[13:01] (781.32s)
so apply now and come build the future
[13:04] (784.16s)
with us