[00:00] (0.16s)
Hello. Don't mind my voice. I'm getting
[00:02] (2.00s)
over a cold. Look at this spiraling
[00:04] (4.04s)
rainbow. And this is not a mod. It's
[00:06] (6.24s)
vanilla Minecraft. The structure is
[00:08] (8.16s)
being built and rebuilt over and over,
[00:10] (10.56s)
frame by frame, by the AI Claude 3.7
[00:13] (13.68s)
who's standing beneath it. It's
[00:15] (15.60s)
essentially an animated building. I'll
[00:17] (17.92s)
explain later how this is possible and
[00:19] (19.76s)
how it was actually always possible. But
[00:22] (22.16s)
first, let's see what my friend Gemini
[00:23] (23.92s)
thinks of it. Okay, he's taking a look
[00:26] (26.72s)
using the new vision command. And yes,
[00:29] (29.20s)
they can now speak automatically using
[00:31] (31.12s)
your system's ugly robot voice. This is
[00:33] (33.68s)
the new Gemini 2.5 model, by the way.
[00:37] (37.04s)
Whoa, cool rainbow thingy behind you.
[00:39] (39.52s)
And is that a big tower way over there
[00:41] (41.12s)
across the water? Looks like some pink
[00:43] (43.36s)
stuff is on the ground, too. We're on
[00:45] (45.36s)
some rocks by the water. I don't know
[00:47] (47.44s)
what the pink stuff was, but yes, that
[00:49] (49.28s)
is indeed a rainbow thingy with a tower
[00:51] (51.92s)
and water in the background. The bots
[00:54] (54.40s)
can now actually see by processing
[00:56] (56.96s)
screenshots of the world. Welcome to an
[00:59] (59.52s)
episode of
[01:02] (62.20s)
Mindcraft. This is an update video, a
[01:04] (64.80s)
devlog for the project. I'll be showing
[01:06] (66.72s)
off their new visual abilities. There's
[01:08] (68.72s)
an official paper for the project, and
[01:10] (70.64s)
we'll be doing some so-called vibe
[01:12] (72.48s)
coding with several new impressive
[01:14] (74.60s)
models. Also, I just reached 200,000
[01:18] (78.20s)
subscribers. Crazy. Even crazier is that
[01:21] (81.04s)
a whole six of them are female. Wow.
[01:23] (83.84s)
Thanks to everyone who subscribes. You
[01:25] (85.68s)
make this stuff possible. And to
[01:27] (87.28s)
everyone watching who is not a
[01:28] (88.64s)
subscriber, what's up? We got a problem.
[01:31] (91.44s)
Why don't we fix the problem? Okay, I've
[01:34] (94.72s)
finally updated to the new Minecraft
[01:37] (97.96s)
1.21.1 specifically. You actually have
[01:40] (100.48s)
to use that version for vision to work
[01:42] (102.32s)
properly. You must explicitly enable
[01:44] (104.88s)
vision in the settings file as well as
[01:46] (106.96s)
the text-to-speech thing. The bots have two
new visual commands: look at player and
look at position. It can look at
[01:54] (114.40s)
specific coordinates or at a specific
[01:57] (117.04s)
player or with that player, meaning look
[01:59] (119.68s)
in the same direction they're looking.
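As a quick reference for the settings mentioned a moment ago, here is roughly what flipping those flags looks like. The key names below are my recollection of Mindcraft's settings.js and may not match your copy exactly, so treat them as assumptions and check the file itself.

```javascript
// settings.js (sketch; key names are assumptions, verify against your own file)
const settings = {
  // ...other settings...
  "allow_vision": true,  // enables the look-at-player / look-at-position commands
  "speak": true,         // reads bot chat aloud with the system's text-to-speech voice
};
export default settings;
```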
[02:01] (121.76s)
These commands take a simplified
[02:03] (123.60s)
screenshot of the world, which looks
[02:05] (125.36s)
like this, and ask the language model
[02:07] (127.36s)
to describe it in the context of the
[02:09] (129.28s)
current conversation. I see a nice
[02:11] (131.20s)
grassy plains biome with some flowers
[02:12] (132.88s)
around. There's a small wooden house
[02:14] (134.88s)
with lanterns and a tall unfinished
[02:16] (136.80s)
scaffolding structure in the background.
[02:19] (139.12s)
Oh, and there's a cow and a sheep
[02:20] (140.72s)
nearby, too. It thought the dog was a
[02:22] (142.96s)
sheep. They can make lots of mistakes.
[02:25] (145.60s)
You can find these screenshots in the
[02:27] (147.04s)
bot's individual folder. Vision is
[02:29] (149.76s)
actually an open-source contribution.
[02:31] (151.76s)
It was largely written by this guy. So,
[02:34] (154.00s)
thank you very much for your work. The
[02:36] (156.24s)
image is rendered by something called
[02:37] (157.84s)
Prismarine Viewer, which is not perfect.
[02:40] (160.80s)
Players always have the Steve skin. The
[02:42] (162.80s)
sky is always blue. Items are pink
[02:44] (164.88s)
cubes. Blocks below zero don't render.
[02:47] (167.44s)
It's buggy. I'll be honest, it throws
[02:49] (169.44s)
lots of errors and sometimes causes
[02:51] (171.20s)
crashes, but it's a hell of a lot better
[02:53] (173.36s)
than nothing, and it's working better
[02:54] (174.88s)
than it was in my survival competition
[02:56] (176.96s)
video. Internally, the model is asked to
[02:59] (179.68s)
summarize and describe the image. It is
[03:02] (182.32s)
also automatically given information
[03:04] (184.16s)
about the block in the center of its
[03:06] (186.24s)
view, the block's type and coordinates,
[03:08] (188.64s)
which should help ground the image in
[03:10] (190.24s)
the 3D world, so it can reason more
[03:12] (192.56s)
spatially about where it's looking.
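To give a rough idea of how that grounding information can be pulled out of mineflayer, here is a minimal sketch. It is not the project's exact code, just the underlying idea; bot.blockAtCursor is a real mineflayer call, while the wording of the returned string is made up.

```javascript
// Sketch: collect the grounding info that gets attached alongside a screenshot.
function describeCenterOfView(bot) {
  // block the bot is directly looking at, within 256 blocks, or null if none
  const block = bot.blockAtCursor(256);
  if (!block) return 'Center of view: open air or sky.';
  const { x, y, z } = block.position;
  return `Center of view: ${block.name} at (${x}, ${y}, ${z}).`;
}
```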
[03:14] (194.88s)
The image does not stay in context
[03:16] (196.64s)
forever. It is fed once to the model and
[03:19] (199.12s)
only the generated description is
[03:20] (200.96s)
maintained in chat history. This helps
[03:23] (203.44s)
cut down on API costs and it can always
[03:25] (205.68s)
take another screenshot. Now vision is
[03:28] (208.24s)
cool, but it isn't really that useful,
[03:31] (211.44s)
especially for survival. You're standing
on a diamond ore block. Gemstone! Shiny!
Choice! Text usually provides more
[03:38] (218.40s)
actionable information about the
[03:40] (220.24s)
environment. However, it should be
[03:42] (222.16s)
helpful for building things since it can
[03:44] (224.32s)
actually see structures in the world.
[03:46] (226.32s)
Look with me. What do you see? Wow. I
[03:49] (229.12s)
see a stunning Gothic cathedral. It's
[03:51] (231.04s)
got tall spires and pointed arches with
[03:53] (233.12s)
huge stained glass windows that catch
[03:54] (234.72s)
the light beautifully. The afternoon sun
[03:56] (236.72s)
really highlights those architectural
[03:58] (238.16s)
details. This giant impressive cathedral
[04:00] (240.80s)
is AI generated, but not by my bots. It
[04:03] (243.68s)
was generated with Tripo AI, the
sponsor of this video. Tripo is a tool
[04:08] (248.56s)
that can generate high-quality 3D models
[04:10] (250.80s)
from any given prompt. Similar to how AI
[04:13] (253.12s)
images are generated, it is extremely
[04:15] (255.28s)
useful for 3D artists, but can also be
[04:17] (257.36s)
directly plugged into Minecraft using
[04:19] (259.20s)
the Litematica mod. Once you've
[04:21] (261.44s)
generated a cool model on Tripo, you
[04:23] (263.44s)
can download the file, then upload it to
[04:25] (265.44s)
this free website, ObjToSchematic,
[04:27] (267.52s)
which turns it into Minecraft blocks.
[04:29] (269.84s)
Neat. Download the schematic file, move
[04:32] (272.08s)
it into your schematics folder, and you
[04:33] (273.92s)
can load it into your Minecraft world
[04:35] (275.68s)
with the Litematica mod. Tripo can
[04:38] (278.24s)
drastically speed up your builds and
[04:39] (279.92s)
save you time by giving you a great
[04:41] (281.60s)
starting point or an end product by
[04:43] (283.52s)
itself. And it's just fun to play with.
[04:45] (285.84s)
I really enjoy generating weird things.
[04:48] (288.32s)
You can make some genuinely amazing
[04:50] (290.32s)
massive things with just a few clicks.
[04:52] (292.16s)
And it's definitely much easier to use
[04:53] (293.84s)
than my bots. Bring your Minecraft world
[04:56] (296.32s)
to life with a simple prompt. Try Tripo
[04:58] (298.40s)
now. You can support my channel by
[05:00] (300.16s)
clicking their link in the pinned
[05:01] (301.44s)
comment below. Have fun building. So
[05:04] (304.24s)
again, one of the main potential
[05:05] (305.84s)
advantages of vision is that it allows
[05:07] (307.92s)
the bots to see what they've made,
[05:09] (309.76s)
notice any issues or room for
[05:11] (311.44s)
improvement, and make adjustments. Let's
[05:14] (314.00s)
have Claude 3.7 make a pixel art image,
[05:17] (317.04s)
and then take a look at it. She chose to
[05:19] (319.92s)
do Mario. It's almost perfect, but it's
[05:22] (322.32s)
not that impressive since there are
[05:23] (323.68s)
probably a million examples of Mario in
[05:25] (325.68s)
the training set. And the face is wrong.
[05:28] (328.16s)
Let's see if she can notice that
[05:29] (329.60s)
problem.
[05:32] (332.16s)
Ah, I see the issue. It's built
[05:34] (334.08s)
vertically, which makes it hard to see.
[05:36] (336.08s)
Let me rebuild it horizontally so we can
[05:38] (338.00s)
view it better. Yes, it's hard to see,
[05:40] (340.48s)
but that's because you're too close.
[05:42] (342.24s)
It's not because it's built
[05:44] (344.52s)
vertically. It's still vertical. Okay,
[05:47] (347.76s)
let's just move her back to get a better
[05:49] (349.36s)
view. Sure, I'll back up for a better
view. Okay, that's a good screenshot. She
[05:56] (356.00s)
should be able to tell that there is a
[05:57] (357.60s)
problem with the face. Perfect view from
[06:00] (360.08s)
here. The colors really pop against the
[06:02] (362.32s)
sky and all the details are clear. His
[06:04] (364.32s)
red cap, blue overalls, and that classic
[06:06] (366.32s)
Mario mustache. What else would you like
[06:08] (368.72s)
to see built? And she didn't notice it.
[06:11] (371.52s)
What if I tell her to look for issues?
[06:14] (374.72s)
Ah, I see two issues. It's floating in
[06:17] (377.04s)
the air and Mario looks a bit too wide.
[06:19] (379.68s)
Let me fix both. No, no, those are not
[06:22] (382.56s)
real issues. Well, let's see if we can
[06:24] (384.80s)
get better performance with different
[06:26] (386.24s)
models. We'll do a little building-with-vision
competition. We have GPT-4.5,
[06:32] (392.24s)
Gemini 2.5, Claude 3.7 thinking. This is
[06:36] (396.08s)
a reasoning model, and DeepSeek V3.
[06:39] (399.44s)
Actually, DeepSeek does not have vision,
[06:41] (401.60s)
but we can mix and match different
[06:43] (403.76s)
models in the same bot for different
[06:45] (405.60s)
roles. So, DeepSeek here actually uses
[06:47] (407.92s)
their chat model for conversation, their
[06:50] (410.16s)
R1 reasoning model for coding, and
[06:52] (412.72s)
Gemini's model for image processing.
[06:55] (415.76s)
Likewise, Claude is only using the
[06:57] (417.92s)
thinking model for coding, not for
[07:00] (420.12s)
conversation. Let's give this prompt a
[07:02] (422.40s)
shot. Build a statue of a creeper in
[07:04] (424.80s)
front of you to the west. Then take a
[07:06] (426.96s)
look at it and fix any issues or
[07:09] (429.32s)
inaccuracies. This prompt should
[07:11] (431.20s)
hopefully make it easier for them to get
[07:12] (432.96s)
a good view, but I'm not very confident
[07:15] (435.20s)
it will work well. They're already
[07:16] (436.80s)
jumping around.
[07:20] (440.88s)
Okay, so Claude's is pretty bad, but it
[07:23] (443.04s)
has a good view and is trying to fix it.
[07:25] (445.52s)
Unfortunately, the viewer has a bug
[07:27] (447.36s)
where blocks below zero don't render, so
[07:29] (449.92s)
it looks like the builds are floating,
[07:31] (451.68s)
which can be
[07:35] (455.72s)
confusing. Hey, Gemini's is not bad at
[07:38] (458.24s)
all, but he didn't actually look at it.
[07:45] (465.60s)
Okay, GPT's is not great, but it is
[07:48] (468.16s)
trying to look at
[07:52] (472.20s)
it. And DeepSeek's is not great
[07:55] (475.74s)
[Music]
[08:07] (487.48s)
either. Now, look at this. GPT keeps
[08:10] (490.48s)
looking at it and can't see the face.
[08:12] (492.56s)
So, it thinks there's an issue with the
[08:14] (494.40s)
face placement. It just needs to move
[08:16] (496.48s)
around to see the other side or build on
[08:18] (498.96s)
the side that it's facing, but it seems
[08:21] (501.12s)
incapable of realizing that by
[08:23] (503.64s)
itself. Vision, I think, has a lot of
[08:26] (506.56s)
potential, but it is not a magic bullet.
[08:29] (509.36s)
It is extremely hard for all of these
[08:31] (511.28s)
models to reason about positioning and
[08:33] (513.60s)
angling, which they already struggle
[08:35] (515.36s)
with. Adding vision does not
[08:37] (517.28s)
automatically resolve all of the issues
[08:39] (519.44s)
with building cohesively or acting
[08:41] (521.68s)
sequentially or thinking spatially. In
[08:44] (524.56s)
fact, it actually seems to be more
[08:46] (526.48s)
confusing than helpful. So, here's one
[08:49] (529.12s)
final build with no vision, just blind
[08:51] (531.36s)
coding as usual. I asked them to make
[08:53] (533.84s)
the Hagia Sophia from Istanbul. I mean
[08:56] (536.56s)
Constantinople, I mean Byzantium. It's
[08:59] (539.04s)
one of my favorite real-world pieces of
[09:00] (540.92s)
architecture. For some reason, the
[09:02] (542.88s)
Claude API died out. So, I'll show
[09:05] (545.04s)
another attempt after this.
[09:11] (551.97s)
[Music]
[09:42] (582.40s)
Gemini's is something else. To say that
[09:44] (584.96s)
there's been a leap in performance is a
[09:47] (587.20s)
vast understatement. Gemini 2.5 might be
[09:50] (590.48s)
my favorite model right now. It is so
[09:53] (593.12s)
good, especially at maintaining
[09:54] (594.88s)
consistency and minimizing
[09:58] (598.83s)
[Music]
[10:04] (604.04s)
slop. Look, clear, usable stairways with
[10:07] (607.52s)
zero human help. I've seen GPT-4.5 do
[10:10] (610.72s)
this once before, but it is still super
[10:12] (612.88s)
impressive, and it's a beautiful
[10:18] (618.12s)
structure. Those final domes kind of
[10:20] (620.40s)
ruin it, though.
[10:22] (622.92s)
Here is Claude's second attempt, which
[10:25] (625.68s)
is also very good. Claude thinking is
[10:28] (628.32s)
the first reasoning model that actually
[10:30] (630.32s)
seems pretty creative, unlike OpenAI's
[10:32] (632.72s)
reasoning models, which I think are
[10:35] (635.40s)
boring. I wanted to give a quick shout
[10:37] (637.60s)
out to the Minecraft benchmark project,
[10:40] (640.00s)
which basically formalizes this process
[10:42] (642.24s)
of judging the building capabilities of
[10:44] (644.40s)
different models by letting people vote
[10:46] (646.56s)
on the ones they prefer. You can go vote
[10:48] (648.88s)
right now. I am not involved in this
[10:51] (651.20s)
project. It was originally inspired by
[10:53] (653.52s)
Mindcraft, but it has since moved on to
[10:55] (655.68s)
using its own implementation. It's a
[10:57] (657.84s)
pretty unique way of benchmarking the
[10:59] (659.36s)
creativity of different models. And
[11:01] (661.76s)
speaking of benchmarks, we are working
[11:03] (663.44s)
on an official paper for Mindcraft to
[11:05] (665.84s)
benchmark some of their abilities. I
[11:08] (668.00s)
thought it would be published by this
[11:09] (669.20s)
video, but it's not quite ready. Turns
[11:10] (670.80s)
out it's a lot of work. I will post a
[11:12] (672.64s)
link in the description once it's
[11:14] (674.00s)
published. Now, I am involved in this
[11:16] (676.00s)
one. I'm a co-author, but most of the
[11:18] (678.08s)
work was done by some fine folks at
[11:19] (679.68s)
UCSD, especially Izzy and Aush. Good
[11:22] (682.72s)
work, y'all. The focus of the paper is
[11:25] (685.20s)
on the collaborative abilities of
[11:26] (686.88s)
multiple agents working together to
[11:28] (688.88s)
craft recipes, collect items, or build
[11:31] (691.52s)
predefined structures. They've built out
[11:33] (693.76s)
a suite of tasks that automatically
[11:35] (695.92s)
start up agents with goals and
[11:37] (697.68s)
inventories, and constantly check for
[11:39] (699.68s)
task completion. It then evaluates the
[11:42] (702.56s)
success rates of different models. Maybe
[11:44] (704.80s)
my favorite tasks are these cooking
[11:46] (706.40s)
ones, which automatically construct a
[11:48] (708.40s)
cute little cooking environment.
[11:50] (710.24s)
Different agents then have to talk to
[11:51] (711.76s)
each other and work together to combine
[11:53] (713.60s)
inventories and ingredients to make
[11:55] (715.52s)
whatever it is they're tasked to make.
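To make the setup a little more concrete, here is a purely hypothetical sketch of what one of these collaborative cooking tasks might specify. Every field name here is invented for illustration; the benchmark's real task files are structured differently, so don't treat this as the paper's format.

```javascript
// Hypothetical task spec, for illustration only (not the benchmark's actual schema).
const exampleCookingTask = {
  goal: 'Work together to bake a cake',
  agents: ['andy', 'jill'],
  startingInventories: {
    andy: { wheat: 3, sugar: 2 },
    jill: { egg: 1, milk_bucket: 3 },
  },
  // the harness would repeatedly check inventories until the target item appears
  successCondition: { item: 'cake', count: 1 },
};
```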
[11:57] (717.52s)
It's so cute and it's fun to watch. This
[11:59] (719.84s)
paper officially introduces the
[12:01] (721.52s)
Mindcraft framework as something that
[12:03] (723.28s)
researchers can cite if they want to
[12:04] (724.96s)
build off of our work.
[12:07] (727.28s)
Okay, to finish off, I want to do some
[12:09] (729.12s)
vibe coding, which is a new buzzword
[12:11] (731.12s)
that basically means let the AI do all
[12:13] (733.04s)
the coding for you and you don't worry
[12:14] (734.80s)
about the details. That is essentially
[12:16] (736.80s)
how Mindcraft has always worked. They
[12:18] (738.96s)
write JavaScript programs under the hood
[12:20] (740.88s)
that let them perform complex behaviors
[12:22] (742.96s)
like building things and you don't have
[12:24] (744.64s)
to know anything about coding. But it
[12:26] (746.96s)
really struck me recently that this
[12:28] (748.48s)
means that they can execute programs
[12:30] (750.48s)
step by step in Minecraft in real time,
[12:33] (753.68s)
which is much more powerful than I
[12:35] (755.44s)
originally realized. So here's a basic
[12:38] (758.00s)
example. I've asked them to build a
[12:39] (759.92s)
sine wave animation by building,
[12:42] (762.08s)
clearing, rebuilding, clearing,
[12:43] (763.92s)
rebuilding over and over with a small
[12:46] (766.24s)
pause in between each frame. This is
[12:48] (768.88s)
actually relatively easy to do. It's a
[12:51] (771.12s)
simple mathematical function that just
[12:52] (772.80s)
requires a little timing.
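Here is a minimal sketch of that idea, assuming the bot has operator permissions so it can place and clear blocks with /setblock. The bots' real generated code goes through Mindcraft's own skills functions instead, so this is just the shape of the loop.

```javascript
// Sketch: animate a sine wave by repeatedly drawing and erasing one frame of blocks.
// Assumes cheats are enabled so /setblock works; coordinates are illustrative.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sineWaveAnimation(bot, ox, oy, oz, width = 32, frames = 100) {
  for (let t = 0; t < frames; t++) {
    const placed = [];
    // draw the current frame of the wave
    for (let x = 0; x < width; x++) {
      const y = oy + Math.round(3 * Math.sin((x + t) * 0.5));
      bot.chat(`/setblock ${ox + x} ${y} ${oz} minecraft:lime_wool`);
      placed.push([ox + x, y, oz]);
    }
    await sleep(500); // short pause so the frame is visible
    // clear the frame before drawing the next one
    for (const [x, y, z] of placed) {
      bot.chat(`/setblock ${x} ${y} ${z} minecraft:air`);
    }
  }
}
```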
[12:55] (775.20s)
Oh, look at how Gemini interpreted it.
[12:57] (777.60s)
It's squiggling off forever. That is
[13:00] (780.56s)
technically what I asked
[13:02] (782.60s)
for. Now, GPT's didn't do anything for
[13:05] (785.44s)
some reason, but DeepSeek got it just
[13:08] (788.60s)
fine. We can do more with this, like
[13:11] (791.04s)
implementing Conway's game of life on a
[13:13] (793.36s)
screen made of blocks. This is a famous
[13:16] (796.08s)
cellular automaton that can be
[13:17] (797.68s)
implemented with a simple piece of code.
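The update rule really is just a few lines. Here is a minimal sketch of one generation on a 2D grid of 0s and 1s; a bot would then render the live cells as a wall of blocks (rendering omitted here).

```javascript
// Sketch: compute one Game of Life generation on a 2D grid of 0s and 1s.
function nextGeneration(grid) {
  const h = grid.length;
  const w = grid[0].length;
  const next = grid.map((row) => row.slice());
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let neighbors = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          if (dx === 0 && dy === 0) continue;
          const ny = y + dy;
          const nx = x + dx;
          if (ny >= 0 && ny < h && nx >= 0 && nx < w) neighbors += grid[ny][nx];
        }
      }
      // live cells survive with 2 or 3 neighbors; dead cells are born with exactly 3
      next[y][x] = grid[y][x]
        ? (neighbors === 2 || neighbors === 3 ? 1 : 0)
        : (neighbors === 3 ? 1 : 0);
    }
  }
  return next;
}
```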
[13:20] (800.88s)
Or rather than updating the same screen
[13:23] (803.12s)
every frame, you can stack each frame on
[13:26] (806.00s)
top of the previous one, creating a 3D
[13:28] (808.88s)
game of life that grows upward
[13:30] (810.84s)
endlessly. Only the living cells show.
[13:33] (813.52s)
The dead cells are just
[13:38] (818.92s)
air. You can even build fully
[13:41] (821.36s)
functioning games. Gemini made this game
[13:43] (823.76s)
of Snake with automatic pathfinding.
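For the pathfinding part, something as simple as a breadth-first search over the grid is enough. This sketch is not Gemini's actual code, just one straightforward way the snake could route itself to the food while avoiding its own body.

```javascript
// Sketch: BFS from the snake's head to the food on a width x height grid.
// isBlocked([x, y]) should return true for cells occupied by the snake's body.
function findPathToFood(head, food, isBlocked, width, height) {
  const key = ([x, y]) => `${x},${y}`;
  const cameFrom = new Map([[key(head), null]]);
  const queue = [head];
  while (queue.length > 0) {
    const [x, y] = queue.shift();
    if (x === food[0] && y === food[1]) {
      // walk parents backwards to reconstruct the path from head to food
      const path = [];
      let cur = [x, y];
      while (cur) {
        path.unshift(cur);
        cur = cameFrom.get(key(cur));
      }
      return path;
    }
    for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
      const next = [x + dx, y + dy];
      if (next[0] < 0 || next[0] >= width || next[1] < 0 || next[1] >= height) continue;
      if (cameFrom.has(key(next)) || isBlocked(next)) continue;
      cameFrom.set(key(next), [x, y]);
      queue.push(next);
    }
  }
  return null; // no safe path to the food this tick
}
```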
[13:46] (826.32s)
The fact that you can simulate a screen
[13:48] (828.40s)
with blocks, I think, opens up some
[13:50] (830.40s)
major possibilities. It did eventually
[13:56] (836.28s)
run into itself. A game of Snake is a pretty
[13:58] (838.72s)
basic, boring thing to build with AI,
[14:01] (841.04s)
but what about a 3D game of snake
[14:03] (843.44s)
complete with 3D
[14:05] (845.40s)
pathfinding? Something about this just
[14:07] (847.44s)
blows my mind. And it was always
[14:09] (849.68s)
possible, though probably very difficult
[14:11] (851.60s)
to do with older models. I bet there is
[14:14] (854.64s)
so much more you can do with this that
[14:16] (856.48s)
we all have yet to
[14:19] (859.72s)
imagine. Okay, well that's it for today.
[14:22] (862.40s)
Thanks to my wonderful patrons. I
[14:24] (864.24s)
believe I will focus on another large
[14:26] (866.00s)
scale survival experiment for my next
[14:28] (868.24s)
Minecraft video, which will be quite an
[14:30] (870.00s)
undertaking. So, be patient with that.
[14:32] (872.00s)
I'll see you later.
[14:35] (875.27s)
[Music]