Episode 21 — Transformers Explained: Attention Without Equations

When we speak of tool use in artificial intelligence, we are describing the ability of a model to extend beyond its role as a generator of words and engage with external processes that expand its usefulness. A language model on its own is like a skilled conversationalist: it can recall facts, explain ideas, and make predictions based on patterns it has seen, but it cannot step outside of that conversation. With tool use, the model becomes more than a conversational partner; it gains the ability to request help from calculators, databases, or search engines. This transforms the model into a kind of collaborator, one that recognizes when a task lies outside its internal knowledge and knows how to delegate to the right helper. Tool use, then, is not about replacing the model’s strengths but about supplementing them, ensuring that users receive answers that are accurate, current, and actionable.

The importance of tools in AI systems becomes clearer when we examine the limitations of models that work in isolation. No matter how much data a system has seen, its knowledge is frozen at the point of training. It cannot access today’s headlines, update itself with this morning’s financial reports, or check the most recent medical guidelines without external support. Similarly, even large models struggle with tasks like arithmetic or logic puzzles, often producing results that are confident but incorrect. Integrating tools fills these gaps. The calculator ensures that numbers add up correctly. A retrieval system provides the freshest information available. A scheduling tool makes sure that a proposed meeting time does not clash with other events. The model’s role shifts from doing everything itself to orchestrating interactions between specialized components, making it far more reliable and versatile.

At the heart of tool use is the idea of function calling, a term that may sound technical but can be explained in simple terms. Think of function calling as the model’s way of formally requesting an action to be carried out by something else. Just as a human might fill out a request form to order a book from a library or submit a ticket to an IT helpdesk, the model produces a structured request that clearly states what it needs, such as looking up the weather for a particular city or performing a multiplication. The critical part here is structure. Unlike casual conversation, these requests must follow predictable formats so that the external system can understand and respond. Function calling therefore serves as the bridge between language and action, translating a model’s understanding of a user’s intent into a precise request that another system can fulfill.
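To make this concrete, a function call is usually emitted as a small structured object rather than as free text. The sketch below is purely illustrative: the tool name get_weather and its fields are invented for this episode, and real platforms each define their own request format.

```python
import json

# A hypothetical function call the model might emit instead of prose.
# The tool name and fields are illustrative, not tied to any platform.
function_call = {
    "name": "get_weather",      # which tool to invoke
    "arguments": {
        "city": "Lisbon",       # what the user asked about
        "unit": "celsius",      # how the result should be expressed
    },
}

# The surrounding system, not the model, executes the request.
print(json.dumps(function_call, indent=2))
```

The model's only job here is to produce that structured request; everything after it is carried out by ordinary software.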

Schema design is what makes function calling reliable. A schema is essentially a template or blueprint that defines what kind of information must be provided for a tool to work correctly. Returning to the analogy of filling out a form, a schema specifies which fields are required, what type of content goes in each field, and how everything should be arranged. If the schema is well designed, the model can fill in the form correctly every time, making its tool requests predictable and usable. If the schema is vague or poorly structured, errors multiply, and the system that receives the request may misinterpret what is being asked. Good schema design is therefore not just a matter of neatness; it is the key to ensuring smooth collaboration between models and tools, enabling external systems to respond accurately and consistently.
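One common way to express such a blueprint is a JSON-Schema-style description of the tool's parameters. The example below is a sketch for the hypothetical weather request above: it states which fields exist, what type each must be, and which are required.

```python
# A hypothetical schema for the get_weather tool, written in a
# JSON-Schema-like style. Field names are invented for illustration.
weather_schema = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],   # a request without a city is rejected
    },
}
```

Because the schema says exactly what a valid request looks like, both sides can check the form before anything is executed.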

Examples of common tools illustrate how natural this concept can be. A calculator is one of the simplest, ensuring that numbers are handled with perfect precision rather than the approximate guesses models often produce. Retrieval tools are another, allowing models to search large databases or knowledge bases and provide answers grounded in documents rather than memory. External knowledge services, such as weather information or financial data feeds, allow models to access current information that lies outside their training. These tools may sound specialized, but the underlying principle is the same: the model recognizes that the task requires more than its own capabilities, and it creates a structured request for the external service to fulfill. Together, these common tools turn the model into a practical assistant, not just a generator of plausible text.
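On the system side, each of these tools is ultimately just a function that a registry maps to the names appearing in the model's requests. The sketch below is a minimal, assumed setup; multiply and lookup_weather are stand-ins for real services.

```python
# Minimal, illustrative tool implementations on the system side.
def multiply(a: float, b: float) -> float:
    """Exact arithmetic the model would otherwise approximate."""
    return a * b

def lookup_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in for an external weather service."""
    return {"city": city, "temperature": 21, "unit": unit}

# A registry maps tool names in model requests to real functions.
TOOLS = {"multiply": multiply, "get_weather": lookup_weather}

# Dispatching the earlier structured request:
call = {"name": "get_weather", "arguments": {"city": "Lisbon"}}
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # {'city': 'Lisbon', 'temperature': 21, 'unit': 'celsius'}
```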

Tool use plays an especially central role in the design of agents, which are systems that coordinate multiple steps and multiple tools to achieve a goal. An agent is more than a model with access to one helper; it is a planner that can sequence actions. Consider a travel assistant agent: it might use a flight search tool to find available options, a calculator to compare prices, and a calendar tool to check whether proposed times fit into a schedule. Each tool is called in turn, with outputs feeding into the next step. The model orchestrates this chain of activity, deciding what information is needed, which tool to request it from, and how to combine results into a final answer. Tool use, therefore, is the foundation that enables agents to move from single-turn responses to complex, multi-step reasoning across diverse domains.
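A rough way to picture an agent is as a loop in which the model repeatedly decides whether to call a tool or to answer. The sketch below assumes a planner function, here called model_decide_next_step, that stands in for the model's decision; it is a placeholder, not a real API.

```python
# A skeletal agent loop. model_decide_next_step is a hypothetical planner
# (the model) that returns either a tool call or a final answer.
def run_agent(user_goal: str, tools: dict, model_decide_next_step) -> str:
    history = [{"role": "user", "content": user_goal}]
    for _ in range(10):  # cap the number of steps to avoid endless loops
        step = model_decide_next_step(history)
        if step["type"] == "final_answer":
            return step["content"]
        # Otherwise the model asked for a tool; run it, feed the result back.
        output = tools[step["name"]](**step["arguments"])
        history.append({"role": "tool", "name": step["name"], "content": output})
    return "Stopped: too many steps."
```

The important point is the feedback: each tool result goes back into the history so the next decision can build on it.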

The benefits of tool use for users are clear and compelling. Accuracy improves because tools handle the kinds of tasks that models struggle with, such as arithmetic or the retrieval of up-to-date information. Trust increases because users can see that answers are grounded in external evidence rather than guesses. Utility expands because models can now help with tasks that range far beyond language, from booking appointments to analyzing data. The user experience shifts from a conversation with a model to a dialogue with a system that can reason, act, and verify. This transformation makes AI more than a text engine: it becomes a collaborator that can assist with decisions, manage workflows, and provide outputs that are both useful and actionable.

Challenges, however, must be acknowledged. Tool use introduces new points of failure. If a schema is misaligned, the model may call a tool incorrectly, leading to errors. Latency increases because each tool invocation adds time to the interaction. Errors can propagate across steps, especially in multi-tool agents: if one tool returns incomplete data, later steps may build on those mistakes. These challenges remind us that tool use is not free; it adds complexity to the system and must be carefully managed. Without robust design and monitoring, tool use risks creating new frustrations even as it solves old problems. Recognizing these challenges early is critical to building tool use that is trustworthy rather than fragile.

Evaluating tool performance requires its own framework. It is not enough to test whether a model can generate plausible language; we must also test whether its tool calls are correct, whether the tools respond reliably, and whether the integration between model and tool succeeds consistently. A good evaluation looks at reliability, correctness, and end-to-end success. Reliability means the tool works as expected without frequent errors. Correctness means the answers returned are accurate and useful. Integration success means that the model and the tool communicate smoothly without mismatches. By treating tool evaluation as part of system evaluation, designers ensure that the entire pipeline—not just the model in isolation—meets the standards users expect.
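In practice this can be as simple as counting the three outcomes separately over a set of test cases. The harness below is a sketch under assumed names: run_system is a placeholder for the pipeline being tested, and each case records the tool call and final answer we expect.

```python
# Illustrative evaluation over test cases that specify the expected
# tool call and the expected final answer. All names are hypothetical.
def evaluate(cases, run_system):
    stats = {"reliability": 0, "correctness": 0, "end_to_end": 0}
    for case in cases:
        try:
            trace = run_system(case["prompt"])          # returns calls + answer
            stats["reliability"] += 1                   # completed without error
            if trace["tool_call"] == case["expected_call"]:
                stats["correctness"] += 1               # right tool, right arguments
            if trace["answer"] == case["expected_answer"]:
                stats["end_to_end"] += 1                # user-visible success
        except Exception:
            pass                                        # counted as a reliability failure
    return {name: count / len(cases) for name, count in stats.items()}
```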

Safety considerations loom large when external tools are introduced. Every external call creates opportunities for misuse, whether intentional or accidental. A poorly designed system might allow sensitive data to leak into tools that should never receive it, or it might accept malicious responses from untrusted services. Tools themselves may be misused if schemas are not carefully constrained, allowing models to trigger actions that were not intended. Ensuring safety requires strict design, monitoring, and governance. Just as an organization would not allow employees to access every system without controls, AI systems must enforce boundaries on what tools can be used, what data can be shared, and how results are validated. Safety is not an add-on in tool use; it is a fundamental requirement.

Observability is another essential feature in tool-enabled AI. Once models begin calling external systems, designers must track what is being invoked, how often, and with what success. Observability means having logs, metrics, and monitoring tools that reveal whether tool use is functioning as intended. Without observability, errors may go unnoticed until they impact users directly. For example, if a financial assistant is silently failing to fetch updated stock prices, users may act on outdated information. By maintaining observability, organizations can detect failures early, trace their causes, and fix them before trust is lost. Tool use cannot be treated as invisible plumbing; it must be observable infrastructure that can be managed actively.
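A simple way to get that visibility is to wrap every tool so each invocation is logged with its latency and outcome. The wrapper below is a sketch using Python's standard logging module; the tool names are the hypothetical ones from earlier.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_calls")

def observed(name, fn):
    """Wrap a tool so every invocation is logged with latency and outcome."""
    def wrapper(**kwargs):
        start = time.perf_counter()
        try:
            result = fn(**kwargs)
            log.info("tool=%s status=ok latency=%.3fs",
                     name, time.perf_counter() - start)
            return result
        except Exception:
            log.warning("tool=%s status=error latency=%.3fs",
                        name, time.perf_counter() - start)
            raise
    return wrapper
```

Wrapping tools this way means silent failures show up as error counts and latency spikes rather than as puzzled users.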

User experience improves dramatically when outputs are augmented by tools. Instead of vague or approximate answers, users receive results that are precise, current, and actionable. A student asking a math question does not just get a guess but the exact solution, confirmed by a calculator. An employee asking about company policy receives an answer linked directly to the latest handbook. These experiences feel more trustworthy and useful because they are backed by external functions. They also set new expectations: once users see what tool-augmented outputs can do, they are less tolerant of unsupported guesses. Tool use thus shifts the standard for what “good” AI means, raising the bar from fluency to grounded usefulness.

Industry adoption of tool use reflects this shift in expectations. Major AI platforms now advertise tool integration as a core feature, allowing users to connect models with calendars, email, spreadsheets, and databases. Enterprises increasingly demand these integrations, not as optional extras but as table stakes for deploying AI in real workflows. The market is moving toward systems that combine language with action, where tool use defines competitiveness. This adoption also drives innovation, as vendors race to provide more sophisticated schemas, more reliable orchestration, and more comprehensive safety frameworks. Tool use is no longer a niche research topic; it is a defining characteristic of modern AI systems.

Without tools, models remain limited. They are confined to what they saw during training and to the boundaries of their context windows. They cannot access real-time knowledge, cannot guarantee accuracy in tasks like arithmetic, and cannot interact with external systems that hold the data or perform the actions users need. This limitation highlights why tool use matters so much: it is the difference between an isolated generator of language and a connected assistant that can participate meaningfully in workflows. Without tools, AI remains a conversation partner. With tools, it becomes an operational collaborator.

The logical next step after enabling single tools is orchestration—the ability to manage multiple tools in coordinated workflows. Orchestration takes tool use from isolated calls to structured sequences, where the model can plan, decide, and execute across multiple steps. This evolution transforms tool use from a technical feature into a design philosophy: AI as a system that thinks, acts, and collaborates with an ecosystem of helpers. Orchestration will be our next topic, as we examine how multiple tools can be woven into coherent, goal-directed pipelines.

Schema reliability is one of the most important aspects of building trust in tool use. When a schema is well designed, it acts like a sturdy bridge between the model and the external service. The model knows exactly what kind of information to provide, the tool knows exactly how to interpret that request, and the user receives a result that makes sense. Poor schemas, by contrast, are like poorly written instructions. Imagine trying to fill out a government form where the boxes are unlabeled or the questions are ambiguous—you would almost certainly make mistakes, and the office processing your form might misunderstand what you meant. In AI systems, such schema errors translate into misfired tool calls, incorrect outputs, or failures to respond at all. Ensuring reliability means thinking carefully about every field a schema requires, balancing clarity with flexibility, and making sure that what the model produces aligns precisely with what the tool expects.

Handling tool failures is another unavoidable part of designing systems where models rely on external services. Even if schemas are perfect, the tools themselves may encounter errors—networks can go down, databases may be temporarily unreachable, or services may return incomplete results. If the model is not prepared for this possibility, the failure cascades into the user experience, leaving them with half-answers or confusing errors. Resilient design involves fallback strategies, such as retrying the call, switching to a backup service, or gracefully acknowledging that the requested operation could not be completed. For example, a travel assistant unable to reach the airline database might still provide partial results from cached information rather than returning nothing. This approach mirrors how humans work around problems: if one door is locked, they find another route. Tool use systems must be equally adaptive, building resilience into every stage of the process.
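One minimal version of that resilience is a wrapper that retries the primary tool a few times and then falls back to a backup source, such as a cache. The sketch below assumes two interchangeable callables; it is a pattern, not a specific library's API.

```python
import time

def call_with_fallback(primary, backup, retries=2, delay=1.0, **kwargs):
    """Try the primary tool, retry briefly, then fall back to a backup source."""
    for attempt in range(retries):
        try:
            return primary(**kwargs)
        except Exception:
            time.sleep(delay * (attempt + 1))   # simple backoff between retries
    # Primary is unavailable; return degraded (e.g., cached) results instead.
    return backup(**kwargs)
```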

Tool chaining builds on the basic idea of tool use by allowing the output of one tool to serve as the input for another. This is the foundation of more complex workflows, where single tools are not enough to answer a user’s request. Consider the example of planning a dinner party. The system might first call a recipe database to generate a menu. It might then send the ingredient list to a grocery ordering tool. Finally, it could send the delivery time to a scheduling tool to ensure groceries arrive before the event. Each tool adds value, but it is the chaining of these steps that produces a seamless end-to-end solution. Chaining demonstrates how tool use is not just about isolated calls but about building pipelines that reflect how real tasks unfold in everyday life. By mastering chaining, AI systems begin to act less like single-purpose assistants and more like project managers coordinating multiple moving parts.
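Written out, the dinner-party chain is just each tool's output becoming the next tool's input. Every function below is a hypothetical stand-in for an external service, kept deliberately tiny to show the shape of the pipeline.

```python
# Hypothetical chained workflow mirroring the dinner-party example.
def find_recipes(guests: int) -> list:               # recipe database
    return ["pasta", "salad"]

def order_groceries(items: list) -> dict:             # grocery ordering tool
    return {"items": items, "delivery": "2024-05-10 16:00"}

def add_to_calendar(event: str, time: str) -> str:    # scheduling tool
    return f"Scheduled '{event}' for {time}"

menu = find_recipes(guests=6)                                   # step 1: menu
order = order_groceries(menu)                                   # step 2: groceries from the menu
confirmation = add_to_calendar("grocery delivery", order["delivery"])  # step 3: schedule
print(confirmation)
```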

The balance between determinism and flexibility in schemas highlights a central design tension. Determinism ensures predictability: the model knows exactly what to provide, and the tool knows exactly how to interpret it. This is valuable for tasks where precision matters, such as financial transactions or compliance reporting. Flexibility, however, acknowledges the messiness of human queries and the diversity of real-world data. A schema that is too rigid may reject perfectly reasonable requests simply because they do not fit neatly into predefined categories. One that is too flexible risks ambiguity, making it harder for tools to know what to do. Designers must therefore calibrate schemas to allow variation without sacrificing clarity, much as a teacher creates assignments with structured instructions but leaves room for creativity in how students answer. This balance ensures that AI systems remain useful across a wide range of scenarios while still functioning reliably.
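The trade-off often comes down to single fields. The two hypothetical schema fragments below describe the same parameter twice: once locked to an enumerated list, once left open to any string.

```python
# Two illustrative versions of the same parameter: strict vs. flexible.
strict_schema = {
    "properties": {
        # Only these three values are accepted; anything else is rejected.
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["currency"],
}

flexible_schema = {
    "properties": {
        # Any free-form name or code is accepted; the tool must interpret it.
        "currency": {"type": "string", "description": "Any currency name or code"},
    },
    "required": [],
}
```

The strict version is right for compliance-grade tasks; the flexible one tolerates messy user phrasing at the cost of more interpretation downstream.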

Efficiency in tool invocation is another practical challenge. Each time a tool is called, time and computational resources are consumed. If the model calls the same tool repeatedly for overlapping tasks, latency grows, costs rise, and the user experience suffers. Imagine asking a colleague to calculate five sums, one after another, when they could have been added together in a single spreadsheet. The principle is the same: efficiency matters. Designing systems to minimize redundant tool calls, batch similar requests, or cache frequent results can make a dramatic difference. Efficiency does not only affect speed; it also affects scalability. In enterprise environments where thousands of queries may involve tool use every second, inefficiency quickly becomes unsustainable. Tool use must therefore be designed not only for correctness but also for operational practicality, ensuring that power and precision do not come at the expense of usability.
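Caching and batching are the two simplest levers. The sketch below assumes a hypothetical exchange_rate tool: repeated lookups hit an in-memory cache, and a batch helper makes one call cover many conversions instead of one call each.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def exchange_rate(base: str, quote: str) -> float:
    """Stand-in for a costly external call; repeated queries hit the cache."""
    return 1.08 if (base, quote) == ("EUR", "USD") else 1.0

def convert_batch(amounts, base="EUR", quote="USD"):
    """Batching: one rate lookup covers every amount in the list."""
    rate = exchange_rate(base, quote)           # single tool invocation
    return [round(amount * rate, 2) for amount in amounts]

print(convert_batch([10, 25, 99.5]))            # three results, one external call
```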

Human oversight remains critical in tool use, especially when systems operate in sensitive or high-impact contexts. Even when schemas are precise and tools are reliable, the possibility of unexpected interactions or subtle errors cannot be eliminated. In healthcare, for example, a diagnostic support system might retrieve evidence from medical literature and suggest possible treatments, but a physician must review these suggestions before acting. The same principle applies in finance, law, and security-sensitive industries. Oversight ensures accountability, reduces risk, and provides a safeguard against blind trust in automation. Far from undermining the value of AI, oversight strengthens it, creating a partnership where machines handle speed and scale while humans provide judgment and ethical responsibility. In this way, tool use becomes part of a broader socio-technical system, one that blends efficiency with responsibility.

Cross-domain tool use illustrates the versatility of this approach. In healthcare, tools might include medical knowledge bases, patient record systems, or imaging analysis services. In finance, tools may handle currency conversion, risk modeling, or compliance tracking. In law, tools could retrieve precedents, parse statutes, or generate structured case summaries. Each of these domains has its own schemas, requirements, and safety constraints, but the principle is the same: the model recognizes when it cannot complete a task alone, and it calls on specialized resources. Cross-domain integration demonstrates that tool use is not limited to a narrow set of functions but can expand into virtually any area of human expertise, provided the right schemas and safeguards are in place. This flexibility is what makes tool use such a powerful foundation for the future of AI.

Evaluation benchmarks for tool use are emerging as researchers and practitioners recognize the need to test these systems systematically. Benchmarks must go beyond language accuracy and measure whether tools are invoked correctly, whether responses meet schema requirements, and whether multi-step workflows complete successfully. They also measure reliability under stress: do tools fail gracefully when inputs are incomplete? Do they scale effectively when requests are frequent? These benchmarks create common ground for comparing different systems, just as benchmarks for retrieval have shaped progress in that field. By standardizing how tool use is evaluated, the industry can ensure that claims of capability are backed by measurable evidence. Benchmarks thus serve as both a proving ground for innovation and a safeguard for users who rely on these systems in critical settings.

Open source ecosystems are accelerating the spread of tool integrations by providing accessible libraries and frameworks. Developers no longer need to build every schema from scratch; they can draw from shared collections of tools for retrieval, calculation, translation, or scheduling. This democratization allows smaller organizations and research groups to experiment with tool use without massive resources. Open ecosystems also encourage innovation by enabling communities to extend, refine, and adapt tools for specific domains. In many ways, open source has become the laboratory for tool use, where new ideas are tested and shared widely. These ecosystems reduce barriers to entry, expand experimentation, and push the entire field forward faster than proprietary development alone could achieve.

Security in tool interfaces is one of the most urgent concerns in this domain. Whenever a model communicates with an external service, there is a risk of malicious inputs or unintended actions. Schema validation becomes a defense mechanism, ensuring that requests remain within defined boundaries and preventing injection attacks or misuse. Security also requires careful consideration of what data is passed to tools and what results are accepted back. Without these protections, sensitive information could leak or untrusted responses could corrupt the model’s outputs. Treating security as integral to tool use is essential, especially in environments where data privacy or regulatory compliance is at stake. Safe schema design, strict validation, and constant monitoring are non-negotiable elements of secure tool use.
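A first line of defense is to validate every request against an explicit allowlist and the tool's schema before anything runs. The sketch below reuses the hypothetical weather schema from earlier; it is an illustration of the pattern, not a complete security layer.

```python
ALLOWED_TOOLS = {"get_weather", "multiply"}     # explicit allowlist

def validate_call(call: dict, schema: dict) -> dict:
    """Reject calls to unknown tools or with unexpected or missing fields."""
    if call["name"] not in ALLOWED_TOOLS:
        raise ValueError(f"Tool {call['name']!r} is not permitted")
    properties = schema["parameters"]["properties"]
    for field in call["arguments"]:
        if field not in properties:
            raise ValueError(f"Unexpected field {field!r}")
    for field in schema["parameters"].get("required", []):
        if field not in call["arguments"]:
            raise ValueError(f"Missing required field {field!r}")
    return call
```

Only requests that pass this gate ever reach a real service, which keeps a malformed or malicious call from becoming an action.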

Scaling tool use across enterprise systems introduces additional challenges. When hundreds of thousands of users interact with a platform, the orchestration of tools must be robust enough to handle volume without sacrificing reliability. This means building monitoring frameworks that track tool performance, detect failures, and balance loads across servers. It also means designing orchestration layers that can handle multiple tool calls in parallel, preventing bottlenecks. At scale, even minor inefficiencies become costly, and small errors can cascade into significant disruptions. Enterprises therefore require tool use systems that are not just clever but industrial-grade, capable of sustaining high throughput, high reliability, and strong governance. Scaling is not merely a matter of technical capacity; it is a question of organizational trust.
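At that scale, independent tool calls are usually dispatched in parallel rather than one after another. The sketch below uses Python's asyncio to fan out two hypothetical calls at once; the simulated latency stands in for real network time.

```python
import asyncio

async def call_tool(name: str, **kwargs):
    """Stand-in for an asynchronous tool invocation over the network."""
    await asyncio.sleep(0.1)        # simulated network latency
    return {"tool": name, "args": kwargs}

async def fan_out(calls):
    """Run independent tool calls concurrently instead of sequentially."""
    return await asyncio.gather(*(call_tool(name, **args) for name, args in calls))

results = asyncio.run(fan_out([("get_weather", {"city": "Lisbon"}),
                               ("get_weather", {"city": "Porto"})]))
print(results)
```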

Research into tool learning explores how models can improve at deciding when and how to use tools effectively. Today, much of tool use depends on hand-crafted prompts and manually defined schemas. Research aims to make models more autonomous, enabling them to learn through examples, feedback, and trial and error. For example, a model might learn that certain tasks, like checking weather conditions, always require a tool call rather than generating an answer from memory. Over time, it could refine its strategies for when to call tools, which tools to prioritize, and how to balance efficiency with accuracy. This research represents a move toward adaptive intelligence, where models are not just given tools but learn to wield them wisely, much like humans develop instincts for when to use calculators, reference books, or software applications.

Limitations and risks remain a constant theme in discussions of tool use. While tools expand capabilities, they also add complexity. Each integration introduces new dependencies, new potential points of failure, and new demands for monitoring. Over-reliance on tools can create brittleness, where systems fail entirely if a single tool becomes unavailable. Misaligned schemas or poorly designed interfaces can lead to frustrating errors. These risks remind us that tool use is not a panacea; it is a design choice that requires ongoing vigilance. The benefits are real, but they come with responsibilities to manage complexity, enforce safeguards, and design for resilience.

Looking ahead, the future of tool use is one of deeper integration and broader adoption. Models will increasingly act as orchestrators of services, calling on external systems not as exceptions but as routine parts of their operation. Tool use will likely be expected in enterprise deployments, where integration with databases, compliance systems, and communication platforms is mandatory. The trajectory points toward models that are not only fluent but also capable of action, bridging language understanding with service execution. Tool use will therefore remain central to the evolution of AI, serving as the connective tissue between general intelligence and domain-specific capability.

Finally, the conversation about tool use leads naturally to orchestration, the subject of the next episode. Orchestration is the art of managing multiple tools in coordinated workflows, where planning, sequencing, and decision-making become just as important as the tools themselves. While tool use provides the ability to call individual services, orchestration determines how those services work together to accomplish complex goals. It is the logical extension of everything discussed here, and it is the next frontier in making AI systems not just reactive but proactive collaborators in human tasks.
