Nobel Laureate John Jumper: AI is Revolutionizing Scientific Discovery

Y Combinator • 27:26 minutes • Published 2025-07-15 • YouTube

📚 Chapter Summaries (14)

• Personal Background - 0:00
• Transition to Computational Biology - 1:26
• Journey into Machine Learning - 2:01
• Joining Google DeepMind - 2:59
• AlphaFold and Its Impact - 3:47
• The Complexity of Cells and Proteins - 4:54
• Challenges in Protein Structure Determination - 7:44
• Building the AlphaFold AI System - 10:28
• The Importance of Research in AI - 11:29
• AlphaFold's Breakthrough and Public Data - 13:28
• Making AlphaFold Accessible - 18:09
• Real-World Applications and Success Stories - 21:20
• Engineering New Proteins with AlphaFold - 22:33
• Future of AI in Structural Biology - 25:23

🤖 AI-Generated Summary:

📹 Video Information:

Title: Nobel Laureate John Jumper: AI is Revolutionizing Scientific Discovery
Channel: Y Combinator
Duration: 27:26
Views: 13,726

Overview

This video chronicles John Jumper’s journey from physicist to AI scientist, culminating in the creation and impact of AlphaFold, an AI system for predicting protein structures. The chapters progress from personal motivation and career pivots to the technical challenges of structure determination, the development and public release of AlphaFold, its scientific and societal ramifications, and finally a forward-looking perspective on AI’s role in accelerating scientific discovery. Each chapter builds on the last, showing how interdisciplinary expertise, open research, and thoughtful dissemination of technology can amplify science.

Chapter-by-Chapter Deep Dive

Personal Background (00:00)

Core Concepts & Main Points:
- The speaker introduces themselves as a physicist-turned-AI researcher passionate about leveraging AI to accelerate science, particularly for improving health outcomes.
- They recount their initial ambitions in physics, aiming for textbook-defining discoveries, but ultimately felt unfulfilled and left their PhD program.

Key Insights & Takeaways:
- Personal fulfillment and alignment with impactful goals can drive significant career pivots.
- The speaker’s journey is marked by a desire to apply technical skills toward practical, life-improving outcomes.

Actionable Advice:
- Reflect on whether your current trajectory is truly meaningful; don’t be afraid to change directions if not.

Connection to Overall Theme:
- Sets the tone for a narrative about finding purpose at the intersection of science, technology, and societal benefit.

Transition to Computational Biology (01:26)

Core Concepts & Main Points:
- After leaving physics, the speaker joined a computational biology company, discovering a passion for using computational tools to solve biological problems.

Key Insights & Takeaways:
- Computational biology provided a way to apply mathematical and coding strengths to real-world challenges, especially drug discovery.

Actionable Advice:
- Identify domains where your current skills can have outsized impact, especially in interdisciplinary fields.

Connection to Overall Theme:
- Introduces the convergence of computation and biology, foreshadowing the AI-for-science focus.

Journey into Machine Learning (02:01)

Core Concepts & Main Points:
- Limited by lack of computational resources in graduate school, the speaker pivoted to statistical methods and early machine learning, aiming to learn from data rather than brute computational force.

Key Insights & Takeaways:
- Constraints can inspire creative problem-solving and skill development (here, in statistics and machine learning).
- Machine learning has evolved from a niche, even disreputable, label (the speaker described his work as statistical physics instead) into a powerful tool for scientific discovery.

Actionable Advice:
- Embrace constraints as opportunities for growth and innovation.

Connection to Overall Theme:
- Illustrates the evolution of technical approaches toward AI-driven science.

Joining Google DeepMind (02:59)

Core Concepts & Main Points:
- The move to DeepMind allowed the speaker to work at the intersection of cutting-edge AI and scientific advancement, with robust resources and talented colleagues.

Key Insights & Takeaways:
- Industrial research settings can accelerate progress, especially when combining top talent, resources, and ambitious goals.

Actionable Advice:
- Seek environments that push you to achieve more, especially those with a fast pace and high expectations.

Connection to Overall Theme:
- A platform like DeepMind enables the large-scale, high-impact projects that follow.

AlphaFold and Its Impact (03:47)

Core Concepts & Main Points:
- AlphaFold’s guiding principle is to build tools that empower scientists to make discoveries impossible for any one individual.
- AlphaFold has about 35,000 citations, and within them are tens of thousands of examples of scientists using it to make discoveries (vaccines, drug development, understanding how the body works).

Key Insights & Takeaways:
- The true value of foundational tools lies in their ripple effect—enabling discoveries by many others.
- Impact is measured not just in citations, but in the diversity and scale of downstream applications.

Actionable Advice:
- Focus on building tools that amplify the capabilities of others.

Connection to Overall Theme:
- Highlights the societal and scientific leverage provided by AI-driven solutions.

The Complexity of Cells and Proteins (04:54)

Core Concepts & Main Points:
- Biology is far more complex than textbook diagrams; cells are densely crowded environments, and humans have about 20,000 different types of proteins that assemble into nanomachines.
- DNA encodes the order of amino acids, which then fold into intricate 3D protein structures that are essential for life.

Key Insights & Takeaways:
- The process from DNA to functional protein is non-trivial and central to biological function.
- Understanding protein structures is crucial for predicting disease and developing drugs.

Actionable Advice:
- Appreciate the complexity of biological systems before attempting computational solutions.

Connection to Overall Theme:
- Lays the biological groundwork for why protein structure prediction is both important and challenging.
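
The DNA-to-protein step described in this chapter can be illustrated with a minimal Python sketch. It reads a DNA coding sequence three letters at a time and looks up each codon in the standard genetic code to produce the ordered chain of amino acids. The function name `translate` and the example sequence are illustrative assumptions, not from the talk. Real cells go through an mRNA intermediate and further processing that this sketch ignores, and it says nothing about how the chain folds, which is the hard part AlphaFold addresses.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Stops at the first stop codon or when fewer than three bases remain.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example sequence, chosen only for illustration.
print(translate("ATGGCCAAATGGTAA"))  # MAKW
```

The output is still a one-dimensional string of letters. Going from that string to a three-dimensional structure is the prediction problem the rest of the talk is about.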

Challenges in Protein Structure Determination (07:44)

Core Concepts & Main Points:
- Traditional experimental methods for determining protein structure are slow, complex, and require both ingenuity and patience (e.g., crystallization can take a year or more).
- The Protein Data Bank (PDB) holds about 200,000 structures and grows by about 12,000 a year, while billions of protein sequences are known; sequences are being learned about 3,000 times faster than structures.

Key Insights & Takeaways:
- Experimental bottlenecks severely limit our ability to understand proteins at scale.
- The availability of public data (like PDB) is crucial for computational advances.

Actionable Advice:
- Leverage public datasets and recognize the value of collective data infrastructure in science.

Connection to Overall Theme:
- Establishes the pressing need for computational methods to bridge the structure-sequence gap.
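
The size of the structure-sequence gap can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses only the approximate figures quoted in the talk (about 200,000 known structures, about 12,000 new structures a year, sequences learned about 3,000 times faster). The constant growth rate, the one-billion target (a stand-in for "billions"), and the function names are simplifying assumptions for illustration.

```python
KNOWN_STRUCTURES = 200_000        # approximate PDB size quoted in the talk
NEW_STRUCTURES_PER_YEAR = 12_000  # approximate yearly growth quoted in the talk
SEQUENCE_SPEEDUP = 3_000          # sequences are learned ~3,000x faster (talk)


def implied_sequences_per_year(
    structures_per_year: int = NEW_STRUCTURES_PER_YEAR,
    speedup: int = SEQUENCE_SPEEDUP,
) -> int:
    """Sequence discovery rate implied by the quoted speedup factor."""
    return structures_per_year * speedup


def years_to_reach(
    target: int,
    current: int = KNOWN_STRUCTURES,
    per_year: int = NEW_STRUCTURES_PER_YEAR,
) -> float:
    """Years of experimental work to reach `target` structures at a constant rate."""
    return max(0, target - current) / per_year


print(f"Implied new sequences per year: {implied_sequences_per_year():,}")
print(f"Years to one billion structures: {years_to_reach(1_000_000_000):,.0f}")
```

Under these assumptions, experimental determination alone would need tens of thousands of years to cover a billion sequences, which is the gap computational prediction is meant to close.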

Building the AlphaFold AI System (10:28)

Core Concepts & Main Points:
- The AlphaFold project aimed to predict 3D protein structures from amino acid sequences, focusing on practical outcomes over technological purity.
- Success required three elements: data, compute, and research.

Key Insights & Takeaways:
- It’s less about the specific technology (AI or otherwise) and more about achieving the objective efficiently.
- The “triangle” of data, compute, and research is essential for any machine learning breakthrough.

Actionable Advice:
- Be technology-agnostic when solving problems; focus on outcomes and leverage all available resources.

Connection to Overall Theme:
- Marks the transition from problem identification to solution development.

The Importance of Research in AI (11:29)

Core Concepts & Main Points:
- While data and compute are often highlighted, research (novel ideas and experimentation) is the critical, differentiating factor.
- AlphaFold’s breakthroughs were driven by “midscale” ideas—many small advances, not just headline-grabbing innovations like transformers.

Key Insights & Takeaways:
- Research amplifies the value of data and compute, sometimes by orders of magnitude.
- Most machine learning breakthroughs come from small, focused teams.

Actionable Advice:
- Invest in original research and iterative experimentation, not just scaling data or compute.

Connection to Overall Theme:
- Underscores the human, creative aspect of scientific and technical progress.
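
The talk quotes the final AlphaFold training run as 128 TPU v3 cores for two weeks and stresses that the real cost of compute is the ideas that did not work. A rough sketch of that accounting follows. The final-run figures come from the talk; the number and size of exploratory runs are hypothetical placeholders, since the talk gives no figures for them.

```python
TPU_CORES = 128     # final model, as quoted in the talk
TRAINING_DAYS = 14  # "two weeks"


def final_run_core_hours(cores: int = TPU_CORES, days: int = TRAINING_DAYS) -> int:
    """Core-hours consumed by the final training run alone."""
    return cores * days * 24


def total_core_hours(
    final_hours: float,
    n_exploratory_runs: int,
    avg_fraction_of_final: float,
) -> float:
    """Final run plus exploratory runs, each costing a fraction of the final run.

    Both exploratory parameters are hypothetical; the talk does not state them.
    """
    return final_hours + n_exploratory_runs * avg_fraction_of_final * final_hours


final_hours = final_run_core_hours()
print(f"Final run: {final_hours:,} TPU core-hours")

# Hypothetical scenario: 50 exploratory runs at a quarter of the final cost each.
print(f"With exploration: {total_core_hours(final_hours, 50, 0.25):,.0f} core-hours")
```

Even with modest placeholder values the exploratory total dwarfs the final run, which is the speaker's point about not being distracted by the headline number.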

AlphaFold's Breakthrough and Public Data (13:28)

Core Concepts & Main Points:
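- AlphaFold 2 dramatically improved over previous systems. An independent experiment showed that the AlphaFold 2 architecture trained on 1% of the available data was as accurate as or more accurate than AlphaFold 1, the previous state of the art, so the new ideas were worth roughly a hundredfold increase in data.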
- External, blind assessments (like CASP) are crucial for measuring real progress and avoiding overfitting to known benchmarks.

Key Insights & Takeaways:
- Real-world, independent evaluation is essential for credible scientific claims.
- Scientific progress often depends on incremental, cumulative ideas rather than “magic bullet” solutions.

Actionable Advice:
- Use external benchmarks and blind tests to evaluate your work rigorously.

Connection to Overall Theme:
- Validates the importance of both research innovation and scientific transparency.
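
The "research was worth a hundredfold of data" claim follows directly from the experiment quoted in the talk: if a new architecture matches the old one while using a fraction f of the data, the ideas are worth roughly 1/f in data. A minimal sketch of that arithmetic is below. The function name is illustrative, and the 200,000 figure is the approximate dataset size quoted in the talk, not the exact sample used in the experiment.

```python
AVAILABLE_STRUCTURES = 200_000  # approximate dataset size quoted in the talk
FRACTION_USED = 0.01            # AlphaFold 2 trained on 1% of the data


def data_equivalent_multiplier(fraction_of_data: float) -> float:
    """How much extra data the old approach would need to match the new one,

    assuming the new approach reaches equal accuracy on `fraction_of_data`.
    """
    if not 0 < fraction_of_data <= 1:
        raise ValueError("fraction_of_data must be in (0, 1]")
    return 1.0 / fraction_of_data


structures_used = round(AVAILABLE_STRUCTURES * FRACTION_USED)
print(f"Structures in the 1% subset: about {structures_used:,}")
print(f"Data-equivalent value of the ideas: {data_equivalent_multiplier(FRACTION_USED):.0f}x")
```

This is a lower bound in spirit: the talk says the 1% model was as accurate or more accurate than AlphaFold 1.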

Making AlphaFold Accessible (18:09)

Core Concepts & Main Points:
- AlphaFold was made accessible via open-source code and a massive public database of predictions (eventually covering nearly every known protein sequence).
- The release catalyzed adoption as biologists everywhere could instantly validate AlphaFold’s predictions against their own unpublished data.

Key Insights & Takeaways:
- Accessibility and ease of use are essential for real-world impact.
- Social proof and word-of-mouth within the scientific community drive trust and adoption.

Actionable Advice:
- Pair technical breakthroughs with thoughtful, user-friendly dissemination strategies.

Connection to Overall Theme:
- Shows the critical role of distribution and user engagement in amplifying scientific impact.

Real-World Applications and Success Stories (21:20)

Core Concepts & Main Points:
- Users quickly applied AlphaFold in unanticipated ways, such as predicting protein complexes by “prompting” the system with multiple proteins.
- The tool’s flexibility led to emergent capabilities and a proliferation of new scientific approaches.

Key Insights & Takeaways:
- Powerful tools will often be used in ways their creators never imagined.
- Community-driven innovation can unlock additional value from foundational technologies.

Actionable Advice:
- Design with flexibility and openness in mind to encourage creative re-use.

Connection to Overall Theme:
- Demonstrates how open scientific tools can drive unexpected breakthroughs.

Engineering New Proteins with AlphaFold (22:33)

Core Concepts & Main Points:
- AlphaFold enabled new protein engineering feats, such as re-engineering a “molecular syringe” for targeted drug delivery.
- The case study from MIT’s Zhang Lab illustrates how AlphaFold predictions inform hypotheses and rapid experimental iteration.

Key Insights & Takeaways:
- AlphaFold accelerates science by making hypothesis generation and testing more efficient, not by replacing experiments but by guiding them.
- The tool is being used to make fundamental discoveries (e.g., fertilization mechanisms) by narrowing experimental focus.

Actionable Advice:
- Use computational predictions to prioritize and design more effective experiments.

Connection to Overall Theme:
- Highlights the synergy between AI predictions and experimental science.

Future of AI in Structural Biology (25:23)

Core Concepts & Main Points:
- AlphaFold has made structural biology significantly faster, serving as an “amplifier” for experimentalists.
- Foundational models trained on broad datasets will continue to generalize, with the potential to unlock accelerating discoveries in other scientific fields.

Key Insights & Takeaways:
- The future lies in building ever more general AI systems capable of extracting and applying scientific knowledge across domains.
- The key question: will AI’s impact remain in a few narrow fields, or will it become truly broad?

Actionable Advice:
- Look for foundational data and opportunities to generalize AI capabilities for scientific progress.

Connection to Overall Theme:
- Concludes with an optimistic vision for AI as a universal accelerator for science.

Cross-Chapter Synthesis

Recurring Themes and Strategies:
- Interdisciplinary Integration: The journey from physics to biology to AI (Chapters 1–3) demonstrates the value of cross-domain expertise.
- Amplification over Replacement: AI is positioned as an amplifier for experimental science, not a replacement (Chapters 5, 13, 14).
- Open Access and Community Impact: Democratizing tools and data (Chapters 10–12) catalyzes broad and deep scientific advances.
- Iterative, Idea-Driven Progress: Success comes from a multitude of midscale research innovations, rigorous testing, and adaptation (Chapters 8–9).
- Measurable, Real-World Validation: Blind assessments and user feedback ensure true impact, not just theoretical advancement (Chapters 10–12).
- Emergent, Unanticipated Applications: Users will find creative ways to leverage foundational tools (Chapters 12–13).

Video’s Learning Journey:
- The narrative starts with personal motivation, builds a foundation in biology and computation, frames the core challenge, then describes AlphaFold’s development, validation, distribution, and impact. The story culminates in a vision for the future, empowering viewers to see how AI, when thoughtfully applied and openly shared, can transform entire scientific disciplines.

Most Important Points and Their Chapters:
- The importance of research and novel ideas over brute force data/compute (Ch. 9).
- The critical role of open data and accessibility for real-world adoption (Ch. 11).
- The necessity of rigorous, independent validation (Ch. 10).
- AI’s greatest impact is as an amplifier for experimental science (Ch. 5, 14).
- Foundational tools drive unexpected, community-driven innovation (Ch. 12).

Actionable Strategies by Chapter

Personal Background (Ch. 1)
- Reflect on your motivation and be open to changing fields for greater impact.

Transition to Computational Biology (Ch. 2)
- Leverage your strengths in new, interdisciplinary applications.

Journey into Machine Learning (Ch. 3)
- Use constraints as catalysts for learning new skills or approaches.

Joining Google DeepMind (Ch. 4)
- Seek environments that combine resources, talent, and ambitious goals to maximize your impact.

AlphaFold and Its Impact (Ch. 5)
- Build tools that multiply the capabilities of others.

The Complexity of Cells and Proteins (Ch. 6)
- Ground your solutions in a deep understanding of the problem domain.

Challenges in Protein Structure Determination (Ch. 7)
- Make use of public, communal datasets and recognize the value of shared scientific infrastructure.

Building the AlphaFold AI System (Ch. 8)
- Remain agnostic to technology and focus on solving the core problem.

The Importance of Research in AI (Ch. 9)
- Prioritize research and experimentation, not just scaling data/compute.

AlphaFold's Breakthrough and Public Data (Ch. 10)
- Rely on rigorous, blinded assessments to validate breakthroughs.

Making AlphaFold Accessible (Ch. 11)
- Prioritize accessibility and user experience in distributing technical tools.

Real-World Applications and Success Stories (Ch. 12)
- Design for flexibility and encourage creative community usage.

Engineering New Proteins with AlphaFold (Ch. 13)
- Use AI predictions to guide and accelerate experimental science.

Future of AI in Structural Biology (Ch. 14)
- Pursue foundational models and seek to generalize AI’s impact across scientific disciplines.

Warnings/Pitfalls Mentioned:
- Don’t focus solely on data or compute at the expense of new ideas (Ch. 9).
- Be wary of overfitting to benchmarks; real-world validation is essential (Ch. 10).
- Science is not just validation—hypothesis generation and creative experimentation are equally important (Ch. 13).

Resources, Tools, Next Steps:
- Public datasets like the PDB (Ch. 7).
- Open-source AlphaFold code and public prediction databases (Ch. 11).
- External, blind assessment competitions like CASP (Ch. 10).


📝 Transcript (701 entries):

## Personal Background [00:00]
[00:00] This is something of a nice change. I've given a lot of scientific talks and no one claps and cheers when I come on. Not normally even when I come on. It's really exciting. It's really

[00:15] wonderful to be here. I guess I should start off assuming that not everyone in this cavernous hall knows who I am. Who am I? I'm I'm someone who has done some work in AI for science who really believes that we can use the AI systems, these technologies, these ideas to change the world in a very specific way to make science go faster to enable new discoveries. I think it's really really

[00:42] wonderful. We have the opportunity to take these tools, these ideas and aim them toward the question of how can we build the right AI systems so that sick people can become healthy and go home from the hospital. And it's been kind of a a really wonderful and winding journey for me to end up here. I was originally trained as a physicist. I

[01:04] thought I was going to be a laws of the universe physicist. If I was very very lucky, I could do something that would end up one sentence in a textbook. And I did physics and I went to actually do a PhD in physics. And then kind of what I was working on didn't really grab me. I just it didn't feel like what I

[01:24] wanted to do. So I dropped out. I didn't

## Transition to Computational Biology [01:26]
[01:26] start a startup. That would have been very on point for this event, but I uh dropped out and I ended up working at a company that was doing computational biology. How do we get computers to say something smart about biology? And I loved it. I loved it not just because it

[01:43] was fun, but it was something that would let me do what I thought I was good at. Write code, manipulate equations, think hard thoughts about the nature of the world and use it toward this very applied purpose that at the end we want to make medicines or we want to enable others to make medicines.

## Journey into Machine Learning [02:01]
[02:01] Then I really kind of became a biologist and a machine learner. Actually a machine learner because I left that job and I went back to grad school in biophysics and chemistry and uh I no longer had access to this incredible computer hardware that I had when I was working at my previous job and in fact they had custom ASICs for simulating how proteins, this part of your body that I'll talk about, move. And since I didn't have that anymore but I still wanted to work on the same problems. Well, I didn't want to just do the same thing with less compute. And so I started to

[02:33] learn and I was getting very interested in statistics, in machine learning. We didn't call it AI back then. In fact, we didn't even call it machine learning. That was a bit disreputable. I said, I'm

[02:43] working in statistical physics. But you know, how are we going to develop algorithms? How are we going to learn from data and do that instead of very large compute? And I guess it turns out in terms of AI in addition to very large compute to answer new problems. And

## Joining Google DeepMind [02:59]
[02:59] after this I joined uh Google DeepMind and really joining a company that wanted to say how are we going to take these powerful technologies and all kind of these ideas and we they were becoming very very readily apparent how powerful these technologies were with applications uh to especially games but also to things like data centers and others. How are we going to take these technologies and use them to advance science and really push forward scientific frontier? And how can we do this in an industrial setting with an incredibly fast pace working with some really smart people working with great computer resources and with all that you darn well better make some progress and it's been really really fun and the fact that I'm on this stage indicates that we made some progress and I think it really the

## AlphaFold and Its Impact [03:47]
[03:49] guiding principle for me has been that when we do this work that ultimately we are building tools that will enable scientists to make discoveries. And what I think is really heartening about the work we've done and the part that really I think still just resonates with me at my core is there about I think 35,000 citations of AlphaFold. But within that is there are tens of thousands of examples of people using our tools to do science that I couldn't do on my own but are using it to make discoveries, be it vaccines, be it drug development, be it how the body works.

[04:30] And I think that's really really exciting. And the part I want to talk to you about today and the story I want to tell you is a bit about the problem, a bit about how we did it. And I think especially the role of research and machine learning research and the fact that it isn't just off-the-shelf machine learning and then I want to tell you a little bit about what happens when you make something great and how people use it and what it does for the world. So,

## The Complexity of Cells and Proteins [04:54]
[04:54] I'll start with the world's shortest biology lesson. The cell is complex. Um, for people who have only studied biology in high school or in college, you might have this idea that the cell is a couple parts that have labels attached to them. And it's kind of simple, but really it looks much more like what you see on the screen. It's

[05:16] dense. It's complex. Uh, in terms of crowding, it's like the swimming pool on the 4th of July and it's full of enormous complexity. Humans have about 20,000 different types of proteins.

[05:30] Those are some of the blobs you see on the screen. They come together to do practically every function in your cell. You can see that uh kind of green tail is the flagellum of uh an E. coli. That's how it moves around. And you can see in

[05:44] fact how it moves around. And you can see that thing that looks like it turns and in fact it turns and drives this motor. All of this is made of proteins. When people say that DNA is the instruction manual for life, well, this is what it's telling you how to do. It's

[05:59] telling you how to build these tiny machines. And biology has evolved an incredible mechanism to build the machines it needs, literal nanomachines, and build them out of atoms. And so your DNA gives you instructions that say build a protein. Now you might say your DNA is a line and so are proteins in a certain sense. It's

[06:20] instructions on how to attach one bead after another where each bead is a specific kind of molecular arrangement of atoms. And you should wonder if my DNA is a line and I am very much not one-dimensional, what happens in between? And the answer is after you make this protein and assemble it one piece at a time, it will fold up spontaneously into a shape like you've opened your IKEA bookshelf and instead of having to do the hard work, it simply builds itself and you get this quite complex structure. You can see quite typical protein, a kinase for those of you who are biologists in the audience over there. And you can see this very complex

[07:00] arrangement of atoms and that arrangement is functional and the majority, not every one, of the proteins uh in your body undergo this transformation and that is what functions and that is incredibly small. So light itself is a few hundred nanometers in size and that's a few nanometers in size. So it's smaller than you can see in a microscope. And for a long time scientists have wanted to understand this structure because they use it to predict how changes in that protein might affect disease. How does

[07:37] that work? How does biology work? Often if you make a drug it is to interrupt the function of a certain protein like this one.

## Challenges in Protein Structure Determination [07:44]
[07:44] Now scientists have through an incredible amount of cleverness figured out the structure of lots of proteins and it remains to this day exceptionally difficult. Right? You shouldn't imagine this as I want to determine the structure of a protein. So I shall open the lab protocol for protein structure determination. I shall follow the steps.

[08:06] It consists of cleverness of ideas of finding many ways. In this case, I'm describing one type of protein structure prediction, or protein structure, sorry, determination, experimental measurement, where you convince that big ugly molecule I just showed you to form a regular crystal kind of like table salt. No one has an easy recipe for this. So, they try many things. They

[08:28] have ideas and it's exceptionally difficult and filled with failure like many things in science. And you're really looking at kind of one way to get an idea of how difficult this is. Just one kind of ordinary paper that we were using. I flipped to the back and it said, you know, in their protocol, after more than a year, crystals began to form. Right?

[08:52] So, not only did they do all these hard experiments, but they had to wait about a year to find out if it worked. And probably that year wasn't spent waiting. It was trying a thousand other things that didn't work as well. Once you do that, you can take this to a uh synchrotron, a modest thing. You can

[09:09] see the cars ringing the outside of this instrument so that you can shine incredibly bright X-rays on it and get what is called a diffraction pattern and you can solve that and you can deposit it in what's called the PDB or the Protein Data Bank. And one of the things that enabled the work we did is that scientists 50 years ago had the foresight to say these are important, these are hard. We should collect them all in one place. So there's a data set that represents essentially all the academic output of protein structures in the community and available to everyone.

[09:46] So our work was on very public data. About 200,000 protein structures are known. They pretty regularly increase at about 12,000 a year. But this is much much smaller than the need.

[10:01] Getting the kind of input information, the DNA that tells you about a protein is much much much much easier. So billions of protein sequences are being discovered. About 3,000 times faster are we learning about protein sequence than protein structure. Okay, that's all scientific content, but I should talk to you about the little thing we did which has this kind of schematic diagram.

## Building the AlphaFold AI System [10:28]
[10:28] We wanted to build an AI system. In fact, we didn't even care if it was an AI system. That's one of the nice things about uh working in AI for science is you don't care how you solve it. If it ended up being a computer program, if it ended up being anything else, we want to find some way to get from the left where each of those letters represents a specific building block of the protein considered in order. We want to put

[10:51] something in the middle, AlphaFold, and we want to end up with something on the right. And you'll see uh two structures there if you look closely where the blue is our prediction and the green is the experimental structure that took someone a year or two of effort. If you want to put an economic value on it on the order of $100,000 and you can see we were able to do this and I want to tell you how and there were really three components to doing this or to do any machine learning problem and you can say you have data and you have compute and you have research

## The Importance of Research in AI [11:29]
[11:29] and I feel like we tell too many stories about the first two and not enough about the third. In data, we had 200,000 protein structures. Everyone has the same data. In terms of compute, this isn't LLM scale. It's the final model itself was

128 TPU v3 cores, roughly equivalent to a GPU per core for two weeks. This is again within the scope of say academic resources but it's worth saying really most of your compute when you think about how much compute you need don't get distracted by the number for the final model the real cost of compute is the cost of ideas that didn't work all the things you had to do to get there and then finally research and I would say this is all but about two people that worked on this it's a small group of people that end up doing this. So really when you look at these machine learning breakthroughs they're probably fewer people than you imagine and really this is where our work was differentiated. We came up with a new set of ideas on how do we bring machine learning to this problem and I can say earlier systems largely based on convolutional neural networks did okay. They certainly made progress. If you

[12:48] replace that with a transformer you're honestly about the same. If you take the ideas of a transformer and much experimentation and many more ideas, then that's when you start to get real change. And in almost all the AI systems you can see today, a tremendous amount of research and ideas and what I would call midscale ideas are involved. It isn't just about the headlines where people will say transformers, you know, scaling, test time inference.

[13:18] These are all important but they're one of many ingredients in a really powerful system and in fact we can measure how much our research was worth. So someone

## AlphaFold's Breakthrough and Public Data [13:28]
[13:29] AlphaFold 2 is the system that is quite famous, the one that was quite a large improvement. AlphaFold 1 was the best in the world. But the AlQuraishi lab did a very careful experiment where they took the AlphaFold 2 architecture and trained it on 1% of the available data, and they could show that AlphaFold 2 trained on 1% of the data was as accurate as or more accurate than AlphaFold 1, which was the state-of-the-art system previously. So there's a very clean result that says that the third of these ingredients, research, was worth a hundredfold of the first of these ingredients, data. And I think this is generally really, really important. As you're all in startups or thinking about startups, think about the amount to which ideas, research, and discoveries amplify data and amplify compute. They work together with it. We wouldn't want to use less data than we have, and we wouldn't want to use less compute than we have available, but ideas are a core component when you're doing machine learning research, and they really helped to transform the world.
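
As a rough sanity check on the numbers quoted in this part of the talk, the sketch below works through the arithmetic in Python. It assumes only the figures stated by the speaker (128 TPU v3 cores for about two weeks, roughly one GPU per core, and the experiment that trained on 1% of the data). The function names and the `failed_to_final_ratio` parameter are hypothetical, introduced purely for illustration; the talk gives no value for how much compute went into ideas that did not work.

```python
HOURS_PER_DAY = 24


def core_hours(cores: int, days: int) -> int:
    """Accelerator core-hours for a single training run."""
    return cores * days * HOURS_PER_DAY


def total_compute(final_run_hours: int, failed_to_final_ratio: float) -> float:
    """Final run plus the experiments that did not make it into the system.

    failed_to_final_ratio is a hypothetical knob: the talk only says that
    failed ideas dominate the real cost, and gives no number for it.
    """
    return final_run_hours * (1.0 + failed_to_final_ratio)


def data_equivalent_multiplier(fraction_of_data_used: float) -> float:
    """Lower bound on how much data the new ideas are worth.

    If a new architecture trained on a fraction f of the data matches the
    old one trained on all of it, the ideas are worth at least 1/f in data.
    """
    if not 0.0 < fraction_of_data_used <= 1.0:
        raise ValueError("fraction_of_data_used must be in (0, 1]")
    return 1.0 / fraction_of_data_used


# Figures quoted in the talk.
final_run = core_hours(cores=128, days=14)     # 128 TPU v3 cores, two weeks
multiplier = data_equivalent_multiplier(0.01)  # the 1% training experiment

print(f"final run: {final_run:,} core-hours (roughly GPU-hours, per the talk)")
print(f"research worth at least about {multiplier:.0f}x data")
```

The final run comes to 43,008 core-hours, and the 1% experiment corresponds to the "hundredfold" figure. Calling `total_compute(final_run, r)` with any guessed ratio `r` shows how quickly the exploratory work, rather than the final model, comes to dominate the bill.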

[14:41] YC's next batch is now taking applications. Got a startup in you? Apply at ycombinator.com/apply.

[14:49] It's never too early. And filling out the app will level up your idea. Okay, back to the video. We can even go back and we can do ablations and we can say what parts matter. And don't focus too

[15:00] much on the details. We pulled this from our paper. You can see here this is the difference compared to the baseline. And you take either of those and you can see each of the ideas that you might remove from our final system, discrete, identifiable ideas, some of which were incredibly popular research areas within the field. When this work came out, a part of it was equivariant, and people said, "Equivariance, that is the answer. AlphaFold is an equivariant system and it's great. We must do more research on equivariance to get even more great systems." Well, I was very confused by this, because the sixth row there, "no IPA" (invariant point attention), removes all the equivariance in AlphaFold, and it hurts a bit, but only a bit. AlphaFold itself

[15:48] on this GDT scale that you can see on the left graph, AlphaFold 2 was about 30 GDT better than AlphaFold 1, and equivariance explains two or three of this. It isn't about one idea. It's about many midscale ideas that add up to a transformative system. And it's very

[16:07] very important when you're building these systems to think about what we would call, in this context, biological relevance. We would have ideas that were better. We kind of got our system grinding 1% at a time. But what really mattered was when we crossed the accuracy at which it mattered to an experimental biologist who didn't care about machine learning. And you have to

[16:29] get there through a lot of work and a lot of effort. And when you do, it is incredibly transformative. And we can measure against this axis, where the dark blue is the other systems available at the time. And this was assessed. Protein structure prediction

[16:45] is in some ways far ahead of LLMs or the general machine learning space in having blind assessment. Since 1994, every two years, everyone interested in predicting the structure of proteins gets together and predicts the structure of a hundred proteins whose answer isn't known to anyone except the research group that just solved it, right? Unpublished. And so you really do know what works. And we had about a third of

[17:08] the error of any other group on this assessment. But it matters because once you are working on problems in which you don't know the answer, you get to really measure how good things are. And you can really find that a lot of systems don't live up to what people believe over the course of their research. Because even if you have a benchmark, we all overfit our ideas to the benchmark, right? Unless you have a held-out set. And in

[17:33] fact, the problems you have in the real world are almost always harder than the problems you train on, right? Because you have to learn from much data and you apply it to very important singular problems. So it is very, very important that you measure well, both as you're developing and when people are trying to decide whether they should use your system. External benchmarks are absolutely critical to figuring out what works, and that's what really helps drive the world forward. So just some

[18:02] wonderful examples. This is typical performance for us. These are blind predictions. You can see they're pretty darn good. Also important, we made it

## Making AlphaFold Accessible [18:09]
[18:10] available, and we did a lot of assessment, but we decided that it was very important to make it available in two ways. One is that we open-sourced the code, and we actually open-sourced the code about a week before we released a database of predictions, starting originally at 300,000 predictions and later going to 200 million, essentially every protein from an organism whose genome has been sequenced. And this made an enormous difference. And one of the most interesting sociological things is this huge difference between when we released a piece of code that specialists could use, and we got some information, and then when we made it available to the world in this database form. It was really interesting kind of

[18:51] you know, you release something and every day you check Twitter, or check X, to find out what's going on. And what we would really see is, even after that CASP assessment, I would say that the structure predictors were convinced this obviously was this enormous advance that solved the problem. But general biologists, the people we wanted to use it, the people who didn't care about structure prediction but cared about proteins to do their experiments, they weren't as sure. They said, "Well, maybe CASP was easy. I don't know." And then

[19:21] this database came out, and people got curious and they clicked in, and the amount to which the proof was social was extraordinary. People would look and say, "How did DeepMind get access to my unpublished structure?" You know, this was the moment at which they really believed it: everyone either had a protein that they hadn't solved or had a friend who had a protein that was unpublished, and they could compare, and that's what really made the difference. And having this database, this accessibility, this ease, led everyone to try it and figure out how it worked. Word of mouth is really how this trust is built. And you can kind of see some

[20:00] of these testimonials, right? "I wrestled for three to four months trying to do this scientific task. This morning I got an AlphaFold prediction and now it's much better. I want my time back." Right? You know, you really

[20:17] appreciate AlphaFold when you run it on a protein that for a year refused to get expressed and purified. Meaning for a year they couldn't even get the material to start experiments. These are really important. When you build the right tool, when you solve the right problem, it matters, and it changes the lives of people who are doing things, not that you would do, but building on top of your work. And I think it's just

[20:41] extraordinary to see these, and the number of people I talked to. The time that I really knew this tool mattered was this: there was a special issue of Science on the nuclear pore complex a few months after the tool came out. And the special issue was all about this particular very large system of several hundred proteins. And three out of

[21:02] the four papers in Science about this made extensive use of AlphaFold. I think I counted over a hundred mentions of the word AlphaFold in Science, and we had nothing to do with it. We didn't know it was happening. We weren't collaborating. It was just people doing

[21:16] new science on top of the tools we had built, and that is the greatest feeling in the world. And in fact, users do the

## Real-World Applications and Success Stories [21:20]
[21:22] darnedest things. They will use tools in ways you didn't know were possible. The tweet on the left from Yoshitaka Moriwaki came out two days after our code was available. We had predicted the structure of individual proteins, and we were working on building a system that would predict how proteins came together. But this researcher

[21:43] said, "Well, I have AlphaFold. Why don't I just put two proteins together and I'll put something in between?" You could think of this as prompt engineering, but for proteins. And suddenly they find out this is the best protein interaction prediction in the world, right? That when you train on

[21:58] these, a really, really powerful system, it will have additional, in some sense emergent, skills as long as they're aligned. People started to find all sorts of problems that AlphaFold would work on that we hadn't anticipated. It was so interesting to see the field of science in real time reacting to the existence of these tools, finding their limitations, finding their possibilities. And this continues, and people do all sorts of exciting work, be it in protein design or elsewhere, on top of either the ideas or, often, the systems we have

## Engineering New Proteins with AlphaFold [22:33]
[22:34] built. One application that I thought was really important is that people have started to learn how to use it to engineer big proteins, or to use it as part of that. And I want to tell this story for two reasons. One is I think it's a really cool application, but the second is how it really changes the work of science. Often people will say science is all about experiments and validation. So it's great that you have all these AlphaFold predictions. Now

[23:03] all we have to do is solve all the proteins the classic way so that we can tell whether your predictions are right or wrong. And they're right about one thing. Science is about experiments. Science is about doing these experiments.

[23:19] But they're wrong about another thing. Science is about making hypotheses and testing them, not about the structure of a particular protein. In this case, they took this protein on the left, called the contractile injection system, but that's a mouthful. They like to call it the molecular syringe. And what it does is

[23:40] it attaches to a cell and injects a protein into it. And the scientists at the Zhang Lab at MIT were saying, well, can we use this protein to do targeted drug delivery? Can we use it to get gene editors like Cas9 into the cell? They tried over a hundred methods to figure out how to take this protein, which they didn't have a structure of. This is just kind of a

[24:05] rendition after the fact, and say, how can we change what it recognizes? I think it's originally involved in plant defense or something like that. And they didn't know how to do it. And they ran an AlphaFold prediction. You can see the one on the left. I wouldn't even say

[24:18] it's a great AlphaFold prediction, but almost immediately they looked at that and said, "Wait a minute. Those legs at the bottom are how it must recognize and attach to cells. Why don't we just replace those with a designed protein?" And so almost immediately, as soon as they got the AlphaFold prediction, they re-engineered it to add this designed protein that you see in red, to target a new type of cell. And they take this system

[24:45] and then they show in fact that they can choose cells within a mouse and they can inject proteins, in this case fluorescent proteins. So there you'll see the color, and they can target the cells they want within a mouse brain. And so they are using this to develop a new type of system of targeted drug delivery. And we see many more examples. We see some in which

[25:07] scientists are using this tool to try thousands and thousands of interactions to figure out which ones are likely to be the case. In fact, they discovered a new component of how eggs and sperm come together in fertilization. Many, many of these discoveries are built on top

## Future of AI in Structural Biology [25:23]
[25:23] of this. And I like to think that our work made the whole field of what's called structural biology, biology that deals with structures, five or 10% faster. But the amount to which that matters for the world is enormous, and we will have more of these discoveries. And I think ultimately structure prediction, and larger AI for science, should be thought of as an incredible capability to be an amplifier for the work of experimentalists: we start from these scattered observations, these natural data. This is our equivalent of

[25:58] all the words on the internet. And then we train a general model that understands the rules underneath it and can fill in the rest of the picture. And I think that we will continue to see this pattern and it will get more general that we will find the right foundational data sources in order to do this. And I think the other thing that has really been a property is that you start where you have data but then you find what problems it can be applied to.

[26:25] And so we find enormous advance, enormous capability to understand interactions in the cell or others that are downstream of extracting the scientific content of these predictions, and then the rules they use can be adapted to new purposes. And I think this is really where we see the foundational model aspect of AlphaFold or other narrow systems. And in fact, I think we will start to see this on more general systems, be they LLMs or others, that we will find more and more scientific knowledge within them and we'll use them for important purposes. And I think this is really where this is going. And I think the

[27:04] most exciting question in AI for science is how general it will be. Will we find a couple of narrow places where we have transformative impact, or will we have very, very broad systems? And I expect it will ultimately be the latter as we figure it out. Thank you.
