[00:00] (0.20s)
there's a new AI model in town Chinese
[00:02] (2.68s)
AI company DeepSeek recently made waves
[00:05] (5.52s)
when it announced R1 an open source
[00:08] (8.48s)
reasoning model that it claimed achieves
[00:10] (10.76s)
comparable performance to OpenAI's o1 at
[00:13] (13.72s)
a fraction of the cost the announcement
[00:15] (15.80s)
unleashed a wave of social media panic
[00:18] (18.08s)
and stock market chaos Nvidia losing
[00:21] (21.20s)
nearly $600 billion in market cap
[00:24] (24.00s)
today alone but for those following AI
[00:27] (27.12s)
developments closely DeepSeek and R1
[00:30] (30.40s)
didn't come out of nowhere the company
[00:32] (32.80s)
has been publishing its research and
[00:34] (34.88s)
releasing its model weights for months
[00:37] (37.56s)
following a path similar to Meta's Llama
[00:40] (40.20s)
model this is in contrast to other major
[00:42] (42.44s)
AI labs like OpenAI Google DeepMind
[00:45] (45.44s)
and Anthropic that have closed weights
[00:48] (48.28s)
and publish more limited technical
[00:50] (50.60s)
reports what's changed is just that now
[00:53] (53.88s)
the broader public is actually paying
[00:56] (56.20s)
attention so let's decode what the real
[00:59] (59.64s)
developments here are where they come
[01:01] (61.88s)
from and why they
[01:08] (68.68s)
matter first of all it is important to
[01:11] (71.92s)
distinguish between two relevant models
[01:14] (74.40s)
here DeepSeek R1 and DeepSeek V3
[01:19] (79.00s)
DeepSeek V3 which was actually released this
[01:22] (82.16s)
past December is a general purpose base
[01:25] (85.60s)
model that achieves comparable
[01:28] (88.28s)
performance to other base models like
[01:30] (90.24s)
OpenAI's GPT-4o Anthropic's Claude 3.5
[01:33] (93.60s)
Sonnet and Google's Gemini 1.5 DeepSeek
[01:36] (96.84s)
R1 which was released at the end of
[01:39] (99.80s)
January is a reasoning model built on
[01:42] (102.72s)
top of DeepSeek V3 in other words DeepSeek
[01:46] (106.32s)
took V3 and applied various algorithmic
[01:49] (109.56s)
improvements to it in order to optimize
[01:52] (112.32s)
its reasoning ability resulting in R1 a
[01:56] (116.60s)
model that's achieved comparable
[01:58] (118.96s)
performance to OpenAI's o1 and Google's Gemini Flash
[02:02] (122.68s)
2.0 on certain complex reasoning
[02:04] (124.92s)
benchmarks but many of the algorithmic
[02:07] (127.36s)
innovations responsible for R1's
[02:09] (129.36s)
remarkable performance were actually
[02:11] (131.32s)
discussed in this past December's V3 paper
[02:13] (133.96s)
or even before that in DeepSeek's V2
[02:16] (136.60s)
paper which was published in May 2024 or
[02:19] (139.92s)
the DeepSeekMath paper which came out
[02:22] (142.16s)
February 2024 V3 stitches together many
[02:26] (146.04s)
of these innovations which were designed
[02:28] (148.28s)
primarily with compute and training
[02:30] (150.64s)
efficiency in Mind One Way de seek
[02:33] (153.08s)
optimized for efficiency and got more
[02:34] (154.96s)
floating Point operations per second or
[02:37] (157.56s)
flops from the gpus was by trading V3
[02:40] (160.92s)
natively in 8bit floating Point format
[02:44] (164.48s)
rather than the usual 16bit or 32-bit
[02:47] (167.08s)
format this is not a new idea many other
[02:50] (170.12s)
labs are doing it too but it was key for
[02:53] (173.40s)
getting such massive memory savings
[02:55] (175.56s)
without sacrificing performance a
[02:57] (177.72s)
crucial enhancement is their FP8
[02:59] (179.88s)
accumulation fix which periodically
[03:02] (182.12s)
merges calculations back into a higher
[03:04] (184.76s)
precision FP32 accumulator to prevent
[03:08] (188.28s)
small numerical errors from compounding
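To make that accumulation fix concrete, here is a minimal Python sketch: it crudely emulates a reduced-mantissa float format (a stand-in for FP8, not DeepSeek's actual CUDA kernels) and shows partial sums being periodically merged into a full-precision accumulator so rounding error stops compounding.

```python
import math

def quantize(x, mantissa_bits=3):
    """Crudely round x to a float with few mantissa bits (FP8-like stand-in)."""
    if x == 0.0:
        return 0.0
    exponent = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (exponent - mantissa_bits)
    return round(x / scale) * scale

def dot_low_precision(a, b, flush_every=4):
    """Dot product whose running partial sum lives in the low-precision
    format, but is merged into a full-precision accumulator every
    `flush_every` terms -- the accumulation-fix idea described above."""
    acc_full = 0.0   # full-precision accumulator (plays the role of FP32)
    acc_low = 0.0    # low-precision running partial sum (plays FP8)
    for i, (x, y) in enumerate(zip(a, b), start=1):
        acc_low = quantize(acc_low + quantize(x * y))
        if i % flush_every == 0:
            acc_full += acc_low   # merge, then restart the small sum
            acc_low = 0.0
    return acc_full + acc_low
```

Without the periodic merge, once the running sum grows large its coarse rounding step swallows each small product entirely; flushing keeps every partial sum small enough to be represented accurately.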
[03:10] (190.92s)
the result far more efficient training
[03:13] (193.52s)
across thousands of GPUs cutting costs
[03:16] (196.40s)
while maintaining model quality but why
[03:19] (199.68s)
does this efficiency matter given its
[03:22] (202.52s)
hardware constraints and US export
[03:24] (204.76s)
controls on the sale of GPUs to China
[03:27] (207.68s)
DeepSeek needed to find a way to get
[03:30] (210.48s)
more training throughput and more bandwidth from
[03:32] (212.88s)
their existing cluster of GPUs you see
[03:35] (215.56s)
at AI labs these GPUs which do number
[03:38] (218.76s)
crunching and matrix multiplication to
[03:40] (220.80s)
train these models are actually sitting
[03:43] (223.68s)
idle most of the time at FP8 it is
[03:47] (227.16s)
typical to only see around 35% model
[03:50] (230.60s)
FLOPS utilization or MFU meaning GPUs
[03:54] (234.52s)
are only being utilized at peak
[03:57] (237.04s)
potential about a third of the time the
[03:59] (239.32s)
rest of the time these GPUs are waiting
[04:01] (241.60s)
for data to be moved either between
[04:04] (244.44s)
caches or other GPUs this is Nvidia's
[04:07] (247.40s)
key advantage it is not just about GPUs
[04:10] (250.12s)
it is about an integrated solution
[04:12] (252.28s)
they've been building for over a decade
[04:14] (254.84s)
that includes the networking with
[04:16] (256.52s)
InfiniBand software with CUDA and
[04:18] (258.96s)
developer experience essentially Nvidia
[04:21] (261.96s)
provides a deeply integrated system that
[04:24] (264.08s)
lets AI researchers program GPU clusters
[04:27] (267.56s)
as a distributed system and get closer
[04:30] (270.28s)
to what Jensen describes as one giant
[04:33] (273.68s)
GPU another clever way DeepSeek makes
[04:36] (276.64s)
the most out of their hardware is their
[04:38] (278.72s)
particular implementation of a mixture
[04:40] (280.72s)
of experts architecture DeepSeek V3 has
[04:44] (284.84s)
671 billion total parameters but only
[04:48] (288.48s)
37 billion are activated for a given
[04:51] (291.00s)
token prediction by contrast the largest
[04:53] (293.72s)
and most capable Llama 3 model doesn't
[04:56] (296.60s)
use a mixture of experts architecture so
[04:59] (299.72s)
it activates its full 405 billion parameters for
[05:03] (303.72s)
each token prediction in other words V3
[05:07] (307.12s)
activates 11x fewer parameters for each
[05:10] (310.64s)
forward pass saving tons of computation
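As a quick sanity check on those numbers (simple arithmetic in Python, using the parameter counts quoted above):

```python
# Parameter counts quoted above, in billions.
v3_total, v3_active = 671, 37   # DeepSeek V3: sparse mixture of experts
llama3_active = 405             # Llama 3 405B: dense, everything active

sparsity = v3_active / v3_total       # fraction of V3 active per token
ratio = llama3_active / v3_active     # dense active params vs. V3's

print(f"V3 activates {sparsity:.1%} of its parameters per token")
print(f"a dense 405B model activates ~{ratio:.0f}x more per token")
```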
[05:14] (314.20s)
mixture of experts isn't a new concept
[05:17] (317.24s)
but it's been challenging to train
[05:19] (319.60s)
models with this architecture
[05:21] (321.44s)
efficiently DeepSeek introduced novel
[05:24] (324.04s)
techniques that stabilize performance
[05:26] (326.28s)
and increase GPU utilization
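Here is a toy sketch of that top-k routing idea; the gate weights, experts, and sizes are hypothetical stand-ins for learned networks, but the key property holds: only the selected experts ever run.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, top_k=2):
    """Toy mixture-of-experts layer: score every expert with a linear gate,
    then run only the top_k highest-scoring experts on this token."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Weighted combination of just the chosen experts' outputs; the other
    # experts' parameters are never touched for this token.
    return sum(probs[i] * experts[i](x) for i in chosen), chosen
```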
[05:29] (329.04s)
additionally to overcome key performance
[05:31] (331.92s)
bottlenecks V3 makes use of multi-head
[05:35] (335.96s)
latent attention or MLA which DeepSeek
[05:39] (339.36s)
first revealed with its V2 paper which
[05:42] (342.36s)
was published in May 2024 MLA is a
[05:45] (345.48s)
solution designed to tackle KV cache
[05:48] (348.52s)
storage limitations one of the biggest
[05:51] (351.20s)
sources of memory overhead in large models
[05:54] (354.60s)
instead of storing full key and value
[05:56] (356.84s)
matrices MLA manages to compress
[05:59] (359.88s)
them down into a latent representation
[06:02] (362.40s)
reconstructing them only when needed
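The caching idea can be sketched like this; the sizes and the random matrices are purely illustrative stand-ins for learned projections, and real MLA has extra details (such as its handling of rotary position embeddings) that this toy omits.

```python
import random

random.seed(0)
D, LATENT = 8, 2   # per-head dim vs. compressed latent dim (toy sizes)

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Random matrices standing in for the learned down/up projections.
W_down = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(LATENT)]
W_up_k = [[random.gauss(0, 0.5) for _ in range(LATENT)] for _ in range(D)]
W_up_v = [[random.gauss(0, 0.5) for _ in range(LATENT)] for _ in range(D)]

def cache_token(hidden):
    """Cache only a small latent vector instead of the full K and V."""
    return matvec(W_down, hidden)

def reconstruct_kv(latent):
    """Rebuild (approximate) K and V from the cached latent when attending."""
    return matvec(W_up_k, latent), matvec(W_up_v, latent)

hidden = [random.gauss(0, 1) for _ in range(D)]
latent = cache_token(hidden)
k, v = reconstruct_kv(latent)
# Cached floats per token: LATENT instead of 2 * D for full K + V.
memory_saving = 1 - LATENT / (2 * D)
```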
[06:05] (365.16s)
this helped the V2 model reduce its KV
[06:07] (367.76s)
cache size by
[06:09] (369.88s)
93.3% and boosted its maximum generation
[06:12] (372.96s)
throughput to 5.76 times finally unlike
[06:17] (377.32s)
traditional models that predict only the
[06:19] (379.40s)
next token V3 makes use of multi-token
[06:22] (382.52s)
prediction or MTP MTP enables V3 to
[06:26] (386.88s)
anticipate multiple future tokens at
[06:29] (389.36s)
each step this densifies training
[06:32] (392.24s)
signals providing more feedback per step
[06:34] (394.68s)
for better data efficiency and faster
[06:36] (396.84s)
learning it also improves representation
[06:39] (399.56s)
planning allowing the model to pre-plan
[06:42] (402.16s)
sequences for smoother more coherent
[06:45] (405.12s)
outputs during inference MTP modules can
[06:48] (408.88s)
be repurposed for speculative decoding
[06:51] (411.48s)
reducing sequential processing steps and
[06:53] (413.88s)
significantly speeding up generation
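Both ideas can be sketched in a few lines of toy Python (function names are illustrative, not DeepSeek's code): densified targets for training, and draft-token acceptance for speculative decoding at inference.

```python
def mtp_targets(tokens, depth=2):
    """For each position, emit targets for the next `depth` future tokens,
    so every position yields `depth` training signals instead of one."""
    return [
        (tokens[i], tuple(tokens[i + 1 : i + 1 + depth]))
        for i in range(len(tokens) - depth)
    ]

def accept_draft(draft, verified):
    """Speculative decoding core: keep the drafted tokens up to the first
    point where the main model's own predictions disagree."""
    kept = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        kept.append(d)
    return kept
```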
[06:56] (416.20s)
taken all together this makes V3 one of
[06:58] (418.68s)
the most impressive base models on the
[07:00] (420.68s)
market and it's been out for some time
[07:02] (422.84s)
now however the recent release of
[07:05] (425.32s)
DeepSeek's R1 reasoning model is what really
[07:08] (428.36s)
made waves most LLMs can be improved by
[07:11] (431.96s)
being prompted to think step by step but
[07:14] (434.40s)
what sets reasoning models apart is that
[07:17] (437.00s)
they are specifically trained to break
[07:19] (439.40s)
down hard problems and think about them
[07:21] (441.76s)
for paragraphs at a time in September
[07:24] (444.44s)
OpenAI showed the power of this new
[07:26] (446.60s)
approach with o1 which achieved
[07:28] (448.72s)
state-of-the-art results in math coding
[07:31] (451.24s)
and science benchmarks with R1 DeepSeek
[07:34] (454.52s)
took a similar approach and published
[07:36] (456.28s)
the secret sauce both OpenAI and DeepSeek
[07:39] (459.08s)
achieve their impressive results through
[07:41] (461.12s)
reinforcement learning a technique to
[07:43] (463.76s)
shape an LLM's behavior based on feedback
[07:46] (466.40s)
and reward signals modern LLMs use some
[07:49] (469.52s)
variation of reinforcement learning with
[07:51] (471.56s)
human feedback aka RLHF or
[07:55] (475.84s)
reinforcement learning from AI feedback
[07:57] (477.96s)
aka RLAIF to improve their model's
[08:01] (481.48s)
usefulness and alignment but reasoning
[08:04] (484.04s)
models apply RL specifically towards the
[08:07] (487.00s)
task of thinking step by step through
[08:09] (489.28s)
complex problems so how did DeepSeek
[08:12] (492.36s)
apply RL to get a reasoning model at a
[08:15] (495.36s)
high level they assemble a bunch of
[08:18] (498.04s)
problems with verifiable outputs
[08:20] (500.56s)
especially in math and coding problems
[08:23] (503.00s)
and then design a training pipeline to
[08:25] (505.80s)
get the model to think for a bit and
[08:28] (508.04s)
output the correct answers
[08:29] (509.96s)
but they don't give the model any
[08:32] (512.12s)
external examples of how to think
[08:34] (514.72s)
whether from humans or AI and the grading
[08:37] (517.48s)
process was extremely simple rather than
[08:40] (520.12s)
using a complex AI to give the model
[08:42] (522.80s)
fine-grained feedback DeepSeek uses
[08:46] (526.28s)
simple rules to evaluate the model's
[08:48] (528.40s)
final output on accuracy and formatting
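A toy version of such a rule-based grader (the tags, patterns, and point values here are illustrative guesses, not DeepSeek's actual rules):

```python
import re

def reward(completion, gold_answer):
    """Rule-based grading in the spirit described above: a small bonus for
    following the expected <think>...</think> format, and full credit only
    when the final answer matches the verifiable ground truth."""
    r = 0.0
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        r += 0.2   # formatting rule (weight is illustrative)
    m = re.search(r"answer:\s*(\S+)\s*$", completion.strip(), re.IGNORECASE)
    if m and m.group(1) == gold_answer:
        r += 1.0   # accuracy rule
    return r
```

No learned reward model is involved; the signal comes entirely from checkable rules, which is what makes math and coding problems such a good fit.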
[08:51] (531.16s)
they use these output scores to update
[08:53] (533.24s)
their model through a novel technique
[08:55] (535.32s)
they published in February 2024 called
[08:58] (538.68s)
group relative policy optimization
[09:01] (541.40s)
or GRPO remarkably with this process
[09:05] (545.12s)
alone DeepSeek saw reasoning emerge over
[09:08] (548.56s)
thousands of RL steps the model learned
[09:11] (551.68s)
skills like extended chain of thought
[09:17] (557.68s)
and even experienced an aha moment where
[09:17] (557.68s)
it recognized its own mistakes and
[09:20] (560.36s)
backtracked to correct its reasoning
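The group-relative scoring at the heart of GRPO can be sketched in a few lines; this shows only the advantage computation, while the published objective also includes a clipped policy ratio and a KL penalty.

```python
import statistics

def group_advantages(rewards):
    """GRPO's core idea: sample a group of answers to the same prompt,
    grade each one, and use its reward relative to the group mean
    (normalized by the group's spread) as its advantage -- avoiding a
    separately trained value model."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards) or 1.0   # guard: identical rewards
    return [(r - mean) / spread for r in rewards]
```

Answers that beat their groupmates get positive advantage and are reinforced; below-average answers are pushed down, with no critic network in the loop.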
[09:22] (562.72s)
this model was R1-Zero one of the first
[09:25] (565.80s)
large models to achieve top tier results
[09:28] (568.60s)
purely through reinforcement learning pure
[09:31] (571.48s)
RL has long been a subject of
[09:33] (573.76s)
investigation in Western research labs
[09:36] (576.12s)
such as DeepMind's AlphaGo which
[09:38] (578.24s)
simulated thousands of random games of
[09:40] (580.32s)
self-play to beat Lee Sedol the world's
[09:43] (583.00s)
top Go player in 2016 in 2019 OpenAI
[09:47] (587.20s)
achieved notable success using
[09:48] (588.84s)
reinforcement learning to train a
[09:50] (590.60s)
robotic hand to solve a Rubik's Cube
[09:52] (592.72s)
and beat a top human team in competitive
[09:55] (595.52s)
Dota 2 but unconstrained by human
[09:58] (598.44s)
examples R1-Zero's thinking steps suffer
[10:01] (601.88s)
from poor readability switching between
[10:04] (604.60s)
English and Chinese at random so
[10:08] (608.32s)
DeepSeek introduced a cold start phase
[10:10] (610.84s)
fine-tuning on structured reasoning
[10:12] (612.64s)
examples before RL to get R1 this
[10:16] (616.52s)
eliminated the language mixing issues
[10:19] (619.08s)
and made outputs far more comprehensible
[10:22] (622.40s)
the results are impressive R1 achieves
[10:25] (625.88s)
comparable performance to o1 on certain
[10:28] (628.64s)
math and coding benchmarks but the pace
[10:31] (631.56s)
of innovation is speeding up just 2
[10:34] (634.32s)
weeks after R1 was released OpenAI
[10:37] (637.04s)
released o3-mini which outperforms R1 and
[10:40] (640.72s)
o1 on key benchmarks so if R1 didn't
[10:44] (644.84s)
actually come out of nowhere what
[10:47] (647.12s)
explains the hype cycle one explanation
[10:50] (650.12s)
is the sheer accessibility of DeepSeek's
[10:55] (655.48s)
model R1 is freely accessible through
[10:55] (655.48s)
their website and app and it is free to
[10:57] (657.48s)
download run locally and customize also
[11:00] (660.96s)
because of all the efficiency
[11:02] (662.80s)
improvements it offers near
[11:04] (664.56s)
state-of-the-art performance at a
[11:06] (666.28s)
fraction of the price of other reasoning
[11:08] (668.12s)
models another explanation is that a lot
[11:11] (671.28s)
of the hype cycle didn't actually have
[11:13] (673.36s)
to do with the specific algorithmic
[11:15] (675.64s)
improvements that we described but with
[11:18] (678.12s)
misconceptions around V3's alleged $5.5
[11:21] (681.80s)
million in training cost there's some
[11:25] (685.04s)
important fine print here the $5.5
[11:28] (688.00s)
million figure refers only to the cost
[11:30] (690.44s)
of the final training run for V3 it
[11:33] (693.48s)
doesn't include any of the training cost
[11:36] (696.16s)
of R1 or the associated R&D or hardware
[11:40] (700.16s)
operating expenses which are presumably
[11:42] (702.92s)
in the hundreds of millions given the
[11:45] (705.20s)
extreme algorithmic optimizations here
[11:48] (708.12s)
that $5.5 million training run number
[11:50] (710.68s)
actually seems perfectly possible and it
[11:53] (713.96s)
is worth noting that this work is
[11:56] (716.36s)
reproducible a UC Berkeley lab recently
[11:59] (719.40s)
applied R1-Zero's key techniques to produce
[12:02] (722.68s)
complex reasoning in a smaller model for
[12:05] (725.56s)
just $30 what DeepSeek really proves is
[12:09] (729.24s)
that there is still room for new players
[12:11] (731.92s)
on the frontier in particular there's
[12:14] (734.48s)
room for rebuilding the stack for
[12:16] (736.60s)
optimizing GPU workloads improving
[12:19] (739.20s)
software at the inference layer tooling and
[12:21] (741.32s)
developing AI-generated kernels
[12:24] (744.00s)
ultimately this is fantastic news for AI
[12:26] (746.84s)
applications in consumer or B2B since it
[12:29] (749.96s)
means the cost of intelligence keeps
[12:32] (752.72s)
going down so the big takeaway here this
[12:36] (756.52s)
is the best possible time to be building
[12:41] (761.86s)
[Music]
[12:47] (767.20s)
a startup the deadline to apply for the
[12:49] (769.80s)
first YC spring batch is February 11th
[12:53] (773.92s)
if you're accepted you'll receive
[12:55] (775.80s)
$500,000 in investment plus access to
[12:58] (778.56s)
the best startup community in the world
[13:01] (781.32s)
so apply now and come build the future
[13:04] (784.16s)
with us