YouTube Deep Summary



Vision and Vibe Coding | Mindcraft Update

Emergent Garden • 2025-04-05 • 14:40 • YouTube

📚 Chapter Summaries (5)

🤖 AI-Generated Summary:

Overview

The video provides an update on Mindcraft, a project that runs AI agents (bots) inside Minecraft. It showcases the bots' new visual capabilities: they can now "see" the game world through simplified screenshots. The video also discusses their building and reasoning abilities, demonstrates AI-generated animated structures, and explores the potential and limitations of vision-enabled AI models in Minecraft.

Main Topics Covered

  • Introduction of vision capabilities for Minecraft AI bots
  • Explanation of how vision works via simplified screenshots
  • Demonstration of AI bots analyzing and building in Minecraft
  • Use of external AI tools (Tripo AI) for generating 3D models and importing them into Minecraft
  • Comparative performance of different AI models (Claude 3.7, Gemini 2.5, GPT-4.5, DeepSeek V3)
  • Challenges and limitations of AI vision in spatial reasoning and building accuracy
  • Overview of the Minecraft benchmark project and an upcoming Mindcraft research paper
  • Concept and examples of "vibe coding" to automate complex in-game building and animations
  • Creative examples including pixel art, animated sine waves, Conway's Game of Life, and a Snake game built by AI bots

Key Takeaways & Insights

  • Minecraft AI bots can now process visual information by analyzing simplified screenshots of the game world, enabling them to "see" and describe their environment.
  • Vision capabilities help bots notice and potentially fix errors in their builds but are not yet fully reliable or transformative for complex spatial reasoning tasks.
  • Different AI models exhibit varying strengths, with Gemini 2.5 showing particularly strong performance in building consistency and creativity.
  • Vision integration is a significant step forward but introduces new challenges, such as handling perspective, positioning, and rendering bugs (e.g., blocks below zero not displaying).
  • External AI tools like Tripo AI can accelerate building by generating complex 3D models that can be converted and imported into Minecraft, making large-scale builds easier and more fun.
  • The "vibe coding" approach allows AI to handle all coding aspects autonomously, facilitating dynamic in-game animations and simulations like cellular automata and games.
  • Collaborative multi-agent tasks, such as cooking and crafting within Minecraft, are being formalized and benchmarked in research, enhancing understanding of AI cooperation.

Actionable Strategies

  • Update to Minecraft version 1.21.1 and enable vision and text-to-speech features in the settings file to utilize the new AI vision commands (see the sketch after this list).
  • Use the "look at player" and "look at position" commands to have AI bots capture screenshots and describe their surroundings for better situational awareness.
  • Experiment with AI-generated 3D models using Tripo AI and convert them to Minecraft schematics via Object2Schematic for efficient building.
  • Leverage AI bots to build and continuously improve structures by instructing them to observe, identify issues, and make adjustments.
  • Employ vibe coding to create animations and interactive simulations in Minecraft by letting AI run step-by-step JavaScript programs autonomously.
  • Participate in or follow the Minecraft benchmark project to compare and evaluate AI model performance in building and collaborative tasks.
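
The video doesn't show the settings file itself, so here is only a minimal sketch of what enabling these features might look like in a Mindcraft-style settings file; the exact key names below are assumptions based on the video, not verified against the actual repository.

```javascript
// Hypothetical excerpt of a Mindcraft-style settings file. Key names are
// assumptions based on what the video describes -- check the real repo.
export default {
  "minecraft_version": "1.21.1", // vision requires this exact version
  "allow_vision": true,          // enables the look-at screenshot commands
  "speak": true,                 // text-to-speech through the system voice
};
```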

Specific Details & Examples

  • The AI vision uses Prismarine Viewer to render simplified images, with known limitations such as all players having a Steve skin and pink cubes representing items.
  • Example builds include a pixel art Mario (with some recognition mistakes), a Gothic cathedral generated by Tripo AI, and a statue of a creeper constructed by different AI models.
  • Vision-enabled bots sometimes misinterpret scenes (e.g., confusing a dog for a sheep) or miss spatial issues due to rendering bugs or perspective problems.
  • The video demonstrates an animated sine wave created by building and rebuilding blocks frame-by-frame (see the sketch after this list), and a 3D version of Conway's Game of Life stacked upwards.
  • Gemini 2.5 created a fully functioning game of Snake with automatic pathfinding inside Minecraft.
  • The upcoming research paper on the Mindcraft framework, co-authored by the video creator and researchers at UCSD, focuses on multi-agent collaboration and task completion.
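
To show how simple the frame-by-frame trick is, here is a minimal sketch in plain Mineflayer JavaScript. It drives the animation with /fill and /setblock chat commands (so the bot must be opped), standing in for whatever building functions the bots actually write; coordinates and block choices are illustrative, and real servers may rate-limit this much chat.

```javascript
// Sketch of the frame-by-frame sine wave animation: place a column of
// blocks, pause, clear, rebuild shifted by one phase step, and repeat.
const mineflayer = require('mineflayer');

const bot = mineflayer.createBot({ host: 'localhost', username: 'waveBot' });
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

bot.once('spawn', async () => {
  const p = bot.entity.position;
  const ox = Math.floor(p.x) + 2, oy = Math.floor(p.y), oz = Math.floor(p.z);
  const width = 32;

  for (let frame = 0; frame < 200; frame++) {
    for (let i = 0; i < width; i++) {
      const h = Math.round(4 * Math.sin((i + frame) * 0.4)) + 5;
      // Clear the whole column, then place one block at the new height.
      bot.chat(`/fill ${ox + i} ${oy} ${oz} ${ox + i} ${oy + 10} ${oz} air`);
      bot.chat(`/setblock ${ox + i} ${oy + h} ${oz} red_wool`);
    }
    await sleep(500); // small pause between frames, as in the video
  }
});
```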

Warnings & Common Mistakes

  • Vision rendering is buggy: blocks below zero don't render, items are simplified as pink cubes, and the sky is always blue, which can confuse bots and users.
  • AI models often struggle with spatial reasoning, positioning, and perspective, which can lead to incorrect assessments of builds or failure to recognize certain issues.
  • Screenshots are fed to the model once and then discarded to save API costs; only the generated text description stays in chat history, so bots do not retain continuous visual context.
  • Bots may misidentify animals or objects in the simplified visual renderings, leading to inaccurate descriptions.
  • Relying solely on vision may not be beneficial for survival gameplay, where textual information remains more actionable.

Resources & Next Steps

  • Official paper benchmarking the Mindcraft framework (link to be provided once published).
  • Prismarine Viewer (open-source tool for rendering Minecraft screenshots).
  • Tripo AI for generating 3D models from prompts.
  • Object2Schematic website to convert 3D models to Minecraft schematics.
  • Litematica mod for loading schematic files into Minecraft worlds.
  • Minecraft benchmark project website for voting and evaluating AI building capabilities.
  • Update Minecraft to version 1.21.1 and adjust bot settings to enable vision and text-to-speech features.
  • Follow the creator's channel for upcoming large-scale Minecraft survival experiments and further updates on AI capabilities.

πŸ“ Transcript Chapters (5 chapters):

πŸ“ Transcript (365 entries):

Hello. Don't mind my voice. I'm getting over a cold. Look at this spiraling rainbow. And this is not a mod. It's vanilla Minecraft. The structure is being built and rebuilt over and over, frame by frame, by the AI Claude 3.7, who's standing beneath it. It's essentially an animated building. I'll explain later how this is possible, and how actually it was always possible. But first, let's see what my friend Gemini thinks of it. Okay, he's taking a look using the new vision command. And yes, they can now speak automatically using your system's ugly robot voice. This is the new Gemini 2.5 model, by the way. Whoa, cool rainbow thingy behind you. And is that a big tower way over there across the water? Looks like some pink stuff is on the ground, too. We're on some rocks by the water. I don't know what the pink stuff was, but yes, that is indeed a rainbow thingy with a tower and water in the background. The bots can now actually see by processing screenshots of the world. Welcome to an episode of Mindcraft. This is an update video, a devlog for the project. I'll be showing off their new visual abilities. There's an official paper for the project, and we'll be doing some so-called vibe coding with several new impressive models. Also, I just reached 200,000 subscribers. Crazy. Even crazier is that a whole six of them are female. Wow. Thanks to everyone who subscribes. You make this stuff possible. And to everyone watching who is not a subscriber: what's up? We got a problem. Why don't we fix the problem? Okay, I've finally updated to the new Minecraft version, 1.21.1 specifically. You actually have to use that version for vision to work properly. You must explicitly enable vision in the settings file, as well as the text-to-speech thing. The bots have two new visual commands: look at player and look at position. They can look at specific coordinates, at a specific player, or with that player, meaning look in the same direction they're looking. These commands take a simplified screenshot of the world, which looks like this, and ask the language model to describe it in the context of the current conversation. I see a nice grassy plains biome with some flowers around. There's a small wooden house with lanterns and a tall unfinished scaffolding structure in the background. Oh, and there's a cow and a sheep nearby, too. It thought the dog was a sheep. They can make lots of mistakes. You can find these screenshots in the bot's individual folder. Vision is actually an open-source contribution. It was largely written by this guy. So, thank you very much for your work. The image is rendered by something called Prismarine Viewer, which is not perfect. Players always have the Steve skin. The sky is always blue. Items are pink cubes. Blocks below zero don't render. It's buggy. I'll be honest, it throws lots of errors and sometimes causes crashes, but it's a hell of a lot better than nothing, and it's working better than it was in my survival competition video. Internally, the model is asked to summarize and describe the image. It is also automatically given information about the block in the center of its view, the block's type and coordinates, which should help ground the image in the 3D world, so it can reason more spatially about where it's looking. The image does not stay in context forever. It is fed once to the model, and only the generated description is maintained in chat history. This helps cut down on API costs, and it can always take another screenshot.
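Here is a minimal sketch of that describe-once flow, written against the openai npm client; the prompt wording, the gpt-4o model choice, and the history handling are assumptions for illustration, not Mindcraft's actual code.

```javascript
// Sketch of the vision flow described above: read a rendered screenshot,
// ask a vision model to describe it once, keep only the text in history.
const fs = require('fs');
const OpenAI = require('openai');

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function describeScreenshot(pngPath, centerBlock, history) {
  const imageB64 = fs.readFileSync(pngPath).toString('base64');
  const res = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      ...history, // prior conversation, text only
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text:
              `Describe this screenshot in the context of our conversation. ` +
              `The block at the center of view is ${centerBlock.type} at ` +
              `(${centerBlock.x}, ${centerBlock.y}, ${centerBlock.z}).`,
          },
          {
            type: 'image_url',
            image_url: { url: `data:image/png;base64,${imageB64}` },
          },
        ],
      },
    ],
  });
  const description = res.choices[0].message.content;
  // Only the text description is retained; the image itself is discarded,
  // which is what keeps API costs down.
  history.push({ role: 'assistant', content: description });
  return description;
}
```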
Now, vision is cool, but it isn't really that useful, especially for survival. You're standing on a diamond ore block! Gemstone, shiny, choice! Text usually provides more actionable information about the environment. However, it should be helpful for building things, since it can actually see structures in the world. Look with me. What do you see? Wow. I see a stunning Gothic cathedral. It's got tall spires and pointed arches with huge stained glass windows that catch the light beautifully. The afternoon sun really highlights those architectural details. This giant, impressive cathedral is AI generated, but not by my bots. It was generated with Tripo AI, the sponsor of this video. Tripo is a tool that can generate high-quality 3D models from any given prompt, similar to how AI images are generated. It is extremely useful for 3D artists, but can also be directly plugged into Minecraft using the Litematica mod. Once you've generated a cool model on Tripo, you can download the file, then upload it to this free website, Object2Schematic, which turns it into Minecraft blocks. Neat. Download the schematic file, move it into your schematics folder, and you can load it into your Minecraft world with the Litematica mod. Tripo can drastically speed up your builds and save you time by giving you a great starting point or an end product by itself. And it's just fun to play with. I really enjoy generating weird things. You can make some genuinely amazing massive things with just a few clicks. And it's definitely much easier to use than my bots. Bring your Minecraft world to life with a simple prompt. Try Tripo now. You can support my channel by clicking their link in the pinned comment below. Have fun building. So again, one of the main potential advantages of vision is that it allows the bots to see what they've made, notice any issues or room for improvement, and make adjustments. Let's have Claude 3.7 make a pixel art image and then take a look at it. She chose to do Mario. It's almost perfect, but it's not that impressive, since there are probably a million examples of Mario in the training set. And the face is wrong. Let's see if she can notice that problem. Ah, I see the issue. It's built vertically, which makes it hard to see. Let me rebuild it horizontally so we can view it better. Yes, it's hard to see, but that's because you're too close. It's not because it's built vertically. It's still vertical. Okay, let's just move her back to get a better view. Sure, I'll back up for a better view. Okay, that's a good screenshot. She should be able to tell that there is a problem with the face. Perfect view from here. The colors really pop against the sky, and all the details are clear. His red cap, blue overalls, and that classic Mario mustache. What else would you like to see built? And she didn't notice it. What if I tell her to look for issues? Ah, I see two issues. It's floating in the air, and Mario looks a bit too wide. Let me fix both. No, no, those are not real issues. Well, let's see if we can get better performance with different models. We'll do a little building-with-vision competition. We have GPT-4.5, Gemini 2.5, Claude 3.7 Thinking (this is a reasoning model), and DeepSeek V3. Actually, DeepSeek does not have vision, but we can mix and match different models in the same bot for different roles. So, DeepSeek here actually uses their chat model for conversation, their R1 reasoning model for coding, and Gemini's model for image processing.
Likewise, Claude is only using the thinking model for coding, not for conversation. Let's give this prompt a shot: build a statue of a creeper in front of you to the west, then take a look at it and fix any issues or inaccuracies. This prompt should hopefully make it easier for them to get a good view, but I'm not very confident it will work well. They're already jumping around. Okay, so Claude's is pretty bad, but it has a good view and is trying to fix it. Unfortunately, the viewer has a bug where blocks below zero don't render, so it looks like the builds are floating, which can be confusing. Hey, Gemini's is not bad at all, but he didn't actually look at it. Okay, GPT's is not great, but it is trying to look at it. And DeepSeek's is not great either. Now, look at this. GPT keeps looking at it and can't see the face. So, it thinks there's an issue with the face placement. It just needs to move around to see the other side, or build on the side that it's facing, but it seems incapable of realizing that by itself. Vision, I think, has a lot of potential, but it is not a magic bullet. It is extremely hard for all of these models to reason about positioning and angling, which they already struggle with. Adding vision does not automatically resolve all of the issues with building cohesively or acting sequentially or thinking spatially. In fact, it actually seems to be more confusing than helpful. So, here's one final build with no vision, just blind coding as usual. I asked them to make the Hagia Sophia from Istanbul. I mean Constantinople. I mean Byzantium. It's one of my favorite real-world pieces of architecture. For some reason, the Claude API died out, so I'll show another attempt after this. Gemini's is something else. To say that there's been a leap in performance is a vast understatement. Gemini 2.5 might be my favorite model right now. It is so good, especially at maintaining consistency and minimizing slop. Look: clear, usable stairways with zero human help. I've seen GPT-4.5 do this once before, but it is still super impressive, and it's a beautiful structure. Those final domes kind of ruin it, though. Here is Claude's second attempt, which is also very good. Claude Thinking is the first reasoning model that actually seems pretty creative, unlike OpenAI's reasoning models, which I think are pretty boring. I wanted to give a quick shout-out to the Minecraft benchmark project, which basically formalizes this process of judging the building capabilities of different models by letting people vote on the ones they prefer. You can go vote right now. I am not involved in this project. It was originally inspired by Mindcraft, but it has since moved on to using its own implementation. It's a pretty unique way of benchmarking the creativity of different models. And speaking of benchmarks, we are working on an official paper for Mindcraft to benchmark some of their abilities. I thought it would be published by this video, but it's not quite ready. Turns out it's a lot of work. I will post a link in the description once it's published. Now, I am involved in this one. I'm a co-author, but most of the work was done by some fine folks at UCSD, especially Izzy and Aush. Good work, y'all. The focus of the paper is on the collaborative abilities of multiple agents working together to craft recipes, collect items, or build predefined structures.
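To make that concrete, here is a purely hypothetical sketch of what one of those task definitions could look like; every field name below is invented for illustration, since the video doesn't show the paper's actual format.

```javascript
// Hypothetical shape of a collaborative cooking task: agents start with
// split inventories and a shared goal, and completion is checked
// continuously. All field names here are invented for illustration.
const cakeTask = {
  name: 'cooperative_cake',
  agents: [
    { name: 'chef_a', inventory: { wheat: 9, sugar: 2 } },
    { name: 'chef_b', inventory: { egg: 1, milk_bucket: 3 } },
  ],
  goal: { item: 'cake', count: 1 },
  // Polled every few ticks: the task succeeds once any agent holds a cake.
  isComplete: (agents) =>
    agents.some((a) => (a.inventory['cake'] ?? 0) >= 1),
};
```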
They've built out a suite of tasks that automatically start up agents with goals and inventories, and constantly check for task completion. It then evaluates the success rates of different models. Maybe my favorite tasks are these cooking ones, which automatically construct a cute little cooking environment. Different agents then have to talk to each other and work together to combine inventories and ingredients to make whatever it is they're tasked to make. It's so cute, and it's fun to watch. This paper officially introduces the Mindcraft framework as something that researchers can cite if you want to build off of our work. Okay, to finish off, I want to do some vibe coding, which is a new buzzword that basically means let the AI do all the coding for you, and you don't worry about the details. That is essentially how Mindcraft has always worked. They write JavaScript programs under the hood that let them perform complex behaviors like building things, and you don't have to know anything about coding. But it really struck me recently that this means they can execute programs step by step in Minecraft in real time, which is much more powerful than I originally realized. So here's a basic example. I've asked them to build a sine wave animation by building, clearing, rebuilding, clearing, rebuilding over and over, with a small pause in between each frame. This is actually relatively easy to do. It's a simple mathematical function that just requires a little timing. Oh, look at how Gemini interpreted it. It's squiggling off forever. That is technically what I asked for. Now, GPT's didn't do anything for some reason, but DeepSeek got it just fine. We can do more with this, like implementing Conway's Game of Life on a screen made of blocks. This is a famous cellular automaton that can be implemented with a simple piece of code. Or, rather than updating the same screen every frame, you can stack each frame on top of the previous one, creating a 3D Game of Life that grows upward endlessly. Only the living cells show; the dead cells are just air. You can even build fully functioning games. Gemini made this game of Snake with automatic pathfinding. The fact that you can simulate a screen with blocks, I think, opens up some major possibilities. It did eventually trap itself. A game of Snake is a pretty basic, boring thing to build with AI, but what about a 3D game of Snake, complete with 3D pathfinding? Something about this just blows my mind. And it was always possible, though probably very difficult to do with older models. I bet there is so much more you can do with this that we all have yet to imagine. Okay, well, that's it for today. Thanks to my wonderful patrons. I believe I will focus on another large-scale survival experiment for my next Minecraft video, which will be quite an undertaking. So, be patient with that. I'll see you later.
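For a sense of how small that "simple piece of code" can be, here is a minimal sketch of the stacking 3D Game of Life in plain Mineflayer JavaScript. It places blocks with /setblock through chat (an opped bot is assumed) rather than through Mindcraft's real skill API, and the grid size, block type, and coordinates are illustrative.

```javascript
// Sketch: Conway's Game of Life on a grid of blocks, stacking each new
// generation one layer higher so the structure grows upward endlessly.
// Only living cells are placed; dead cells are left as air.
const mineflayer = require('mineflayer');

const bot = mineflayer.createBot({ host: 'localhost', username: 'lifeBot' });
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
const SIZE = 16;

// Count the live neighbors of cell (x, z), wrapping at the grid edges.
function neighbors(grid, x, z) {
  let n = 0;
  for (let dx = -1; dx <= 1; dx++)
    for (let dz = -1; dz <= 1; dz++)
      if ((dx || dz) && grid[(x + dx + SIZE) % SIZE][(z + dz + SIZE) % SIZE])
        n++;
  return n;
}

bot.once('spawn', async () => {
  const p = bot.entity.position;
  const ox = Math.floor(p.x) + 2, oz = Math.floor(p.z) + 2;
  let y = Math.floor(p.y);

  // Random initial generation.
  let grid = Array.from({ length: SIZE }, () =>
    Array.from({ length: SIZE }, () => Math.random() < 0.3)
  );

  for (let gen = 0; gen < 64; gen++) {
    // Place only the living cells of this generation.
    for (let x = 0; x < SIZE; x++)
      for (let z = 0; z < SIZE; z++)
        if (grid[x][z])
          bot.chat(`/setblock ${ox + x} ${y} ${oz + z} lime_wool`);

    // Standard Life rules: survive on 2-3 neighbors, born on exactly 3.
    grid = grid.map((row, x) =>
      row.map((alive, z) => {
        const n = neighbors(grid, x, z);
        return alive ? n === 2 || n === 3 : n === 3;
      })
    );

    y++;              // stack the next generation one layer up
    await sleep(400); // small pause between generations
  }
});
```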