The complete builder's guide to creating a Claude-powered chatbot from scratch: architecture, features, code patterns, and everything Anthropic's leaked source code revealed about how production AI chatbots actually work.
Anthropic's revenue hit $30 billion annualized by March 2026, up 1,400% year-over-year - CNBC. Claude Code alone grew to over $2.5 billion in run-rate revenue. Eight of the Fortune 10 are Claude customers. The company closed a $30 billion Series G at a $380 billion valuation in February 2026 - Anthropic, and is now fielding offers above $800 billion - TechCrunch.
Behind those numbers is a product. And that product, at its core, is a chatbot. A very sophisticated chatbot with artifacts, deep research, memory, file processing, design generation, and an agentic coding assistant, but structurally, it is still a conversation interface backed by an LLM, a set of tools, and a context management system.
On March 31, 2026, a security researcher discovered that Anthropic accidentally shipped the complete source code of Claude Code inside an npm package: 512,000 lines of TypeScript across 1,900 files - VentureBeat. That leak gave the developer community an unprecedented look at how Anthropic builds production AI systems. As we analyzed line by line in our leaked source deep dive, the architecture is surprisingly simple at its core, and very buildable if you understand the principles.
This guide covers everything you need to build a Claude-like chatbot from the ground up. We start with first principles (what is a chatbot, really?), then work through every major feature Claude offers, explain how each one works architecturally, and show you how to implement it. Whether you are building a customer support bot, an internal tool, or a full agent platform, the patterns here are the same ones Anthropic uses in production.
Contents
- First Principles: What a Claude-Like Chatbot Actually Is
- Inside Claude Code: What the Leaked Source Revealed
- The TAOR Loop: The 50 Lines That Run Everything
- The Claude Feature Map: Every Product Feature Explained
- Building Block 1: The Conversation Engine (Messages API + Streaming)
- Building Block 2: Tool Use and Function Calling
- Building Block 3: Extended Thinking and Deep Research
- Building Block 4: Artifacts (Sandboxed Interactive Content)
- Building Block 5: Memory and Personalization
- Building Block 6: File Upload and Document Processing
- Building Block 7: Context Management (The Hard Problem)
- Building Block 8: The Frontend Stack
- Building Block 9: Auth, Database, and Infrastructure
- The Complete Architecture Diagram
- Cost Analysis: What It Actually Costs to Run
1. First Principles: What a Claude-Like Chatbot Actually Is
Before writing a single line of code, strip away the marketing. Strip away "AI assistant," "conversational AI," "agentic experience." Look at what Claude.ai actually does when a user types a message, and reduce it to its structural components.
A Claude-like chatbot is four things stacked on top of each other. The first layer is a conversation loop: user sends text, system sends text back. This is as old as IRC bots. The second layer is an LLM inference call: instead of pattern-matched responses, the system calls a language model API that generates the response. The third layer is tool execution: the model can request actions (search the web, run code, read a file), and the system executes those actions and feeds results back. The fourth layer is context management: as conversations get long, the system must compress, summarize, and prioritize what fits in the model's context window.
Everything else (artifacts, deep research, memory, design generation, file processing) is a feature built on top of these four layers. Artifacts are tool-generated content rendered in a sandbox. Deep research is extended thinking combined with a web search tool. Memory is a persistent store that injects into the system prompt. Once you understand the four layers, every feature becomes a composition of them.
This is the structural insight that the Claude Code leak confirmed. Despite 512,000 lines of code, the core engine is a while(true) loop that calls the model, executes any requested tools, appends the results, and loops again. Everything else is infrastructure around that loop: UI rendering, permission checks, context compression, analytics, security. The loop itself is approximately 50 lines - PromptLayer.
The reason this matters for builders: you do not need 512,000 lines to build a production chatbot. You need the loop, then you add layers incrementally. Our guide to making LLMs autonomous covers the progression from simple chatbot to full agent. This guide focuses on the chatbot end of that spectrum, with enough depth to extend into agent territory when you are ready.
2. Inside Claude Code: What the Leaked Source Revealed
On March 31, 2026, security researcher Chaofan Shou found a 57MB source map file inside the npm package @anthropic-ai/claude-code version 1.0.33. That file contained URLs pointing to the complete, unobfuscated TypeScript source hosted on Anthropic's R2 storage. Within hours, the code was mirrored to GitHub (Kuberwastaken/claude-code, 1,100+ stars, 1,900+ forks), and the developer community had the most detailed look ever at how a production AI system actually works - Layer5.
The internal project codename was "Tengu." The bundler was Bun (not Node.js). The UI framework was React with Ink (for terminal rendering). The codebase spanned 1,900 TypeScript files across 55+ directories. According to Pragmatic Engineer's analysis, 90% of the codebase was written by Claude itself, with the team running 60-100 internal builds daily and averaging 5 PRs per developer per day during peak development.
Here is what mattered most for chatbot builders. The leak revealed that Claude Code uses exactly four capability primitives instead of hundreds of integrations: Read (file reading, image viewing, PDF parsing), Write/Edit (file creation, string replacement), Execute (arbitrary shell via Bash), and Connect (MCP servers for extensibility). Every complex operation, from codebase refactoring to website deployment, is composed from these four primitives. The model decides which primitive to use and how. The harness just executes.
The system prompt in src/constants/prompts.ts was approximately 1,000 lines with 60 system prompt components and 40 system reminders assembled conditionally based on mode, available tools, MCP servers, and feature flags. A key directive: "Target response fewer than 4 lines (excluding tool use). You should be concise, direct, and to the point." The prompt was split by a SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker, with 70-80% of content above the marker (globally cacheable across sessions) and 20-30% below (per-session customization). This optimization alone cuts API costs significantly through Anthropic's prompt caching.
The most architecturally interesting revelation was the multi-model strategy. Claude Code uses three different models for different purposes: Sonnet 3.5/4.0 for the main reasoning loop, Haiku 3.5/4.5 for lightweight operations (title generation, security classification, permission checking), and Opus 4.6 for deep planning (ULTRAPLAN). This is a pattern every chatbot builder should adopt: use the most capable (expensive) model only when needed, and route simpler tasks to cheaper models. For the complete line-by-line analysis, see our inside Claude Code article.
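The routing itself can be as simple as a lookup table keyed by task type. A minimal sketch (the task categories and model identifiers here are illustrative assumptions, not values from the leaked source):

```python
# Illustrative multi-model routing table. Task names and model IDs are
# assumptions for the sketch; swap in your own categories and models.
ROUTES = {
    "title_generation": "claude-haiku-4-5",   # cheap, high-volume tasks
    "permission_check": "claude-haiku-4-5",
    "conversation": "claude-sonnet-4-6",      # main reasoning loop
    "deep_planning": "claude-opus-4-6",       # rare, expensive tasks
}

def pick_model(task_type: str) -> str:
    """Route a task to the cheapest model that can handle it."""
    return ROUTES.get(task_type, "claude-sonnet-4-6")  # default to mid-tier
```

The payoff is that the vast majority of calls (titles, classifications, permission checks) never touch your expensive model.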
3. The TAOR Loop: The 50 Lines That Run Everything
The core of Claude Code (and by extension, the core of any Claude-powered chatbot) is what the source code reveals as the TAOR pattern: Think, Act, Observe, Repeat. It is implemented in src/query.ts as an async generator function called queryLoop.
The architecture is deliberately simple. A while(true) loop runs until the model signals it is done (via end_turn stop reason) or until a budget is exhausted. Each iteration has three phases. Think: assemble the system prompt, conversation history, and available tools, then stream an API call to Claude. Act: if the model's response contains tool_use blocks, execute those tools. Observe: collect tool results, append them as tool_result blocks to the message history. Then loop back to Think.
The only mutable state across iterations is an append-only message array and a small State object tracking token budgets, compaction status, and turn counts. This simplicity is intentional: an append-only message log enables persistence (save/restore sessions), replay (rerun from any point), and compression (summarize older messages without losing newer ones).
For builders, the takeaway is that the core engine of a Claude chatbot is approximately this:
from anthropic import Anthropic

client = Anthropic()
messages = [{"role": "user", "content": user_input}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        system=system_prompt,
        messages=messages,
        tools=available_tools,
        max_tokens=8192,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "end_turn":
        break  # Model is done, return response to user
    # Model requested tool use, execute it
    tool_results = execute_tools(response.content)
    messages.append({"role": "user", "content": tool_results})
That is the loop. Everything else (streaming, context management, permissions, UI rendering) is infrastructure around it. The Temporal framework provides an excellent reference implementation of this pattern in Python - Temporal.
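The `execute_tools` helper the loop calls is a dispatch over a registry of local functions. A minimal sketch, assuming dict-shaped content blocks (the real SDK returns typed objects) and a hypothetical `get_time` tool:

```python
# Hypothetical tool registry; in a real system each entry wraps a search
# client, database call, shell command, and so on.
TOOL_REGISTRY = {
    "get_time": lambda args: "2026-05-01T12:00:00Z",
}

def execute_tools(content_blocks):
    """Run every tool_use block and return the tool_result blocks the API expects."""
    results = []
    for block in content_blocks:
        if block["type"] != "tool_use":
            continue  # skip text/thinking blocks
        handler = TOOL_REGISTRY.get(block["name"])
        try:
            output = handler(block["input"]) if handler else f"Unknown tool: {block['name']}"
        except Exception as exc:
            # Surface tool failures to the model instead of crashing the loop
            output = f"Tool error: {exc}"
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(output),
        })
    return results
```

Note that errors are returned to the model as results rather than raised: the model can often recover by retrying with different arguments or choosing another tool.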
The cost model of this loop is worth understanding before you build. Each iteration consumes tokens: input tokens (the entire message history plus system prompt) and output tokens (the model's response). Because input tokens are re-sent every iteration, long conversations with many tool calls become expensive fast. This is why context management (section 11) is not optional. Claude Code's leaked source contains five different compaction strategies to keep the context window manageable. As we explored in our guide to AI agent costs, agents consume 4x more tokens than standard chat, and up to 15x in multi-agent systems.
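The compounding effect of re-sent input tokens is easy to model. A back-of-envelope estimator, using Sonnet-class prices from this article as defaults and a rough assumption that each turn grows the history by the response plus a similar volume of tool results:

```python
def loop_cost_usd(iterations, history_tokens_start, output_tokens_per_turn,
                  in_price=3.0, out_price=15.0):
    """Estimate TAOR-loop API cost. Prices are $ per million tokens
    (Sonnet-class defaults assumed); growth model is a rough heuristic."""
    total_in = total_out = 0
    history = history_tokens_start
    for _ in range(iterations):
        total_in += history                       # full history re-sent every turn
        total_out += output_tokens_per_turn
        history += output_tokens_per_turn * 2     # response + tool results grow context
    return (total_in * in_price + total_out * out_price) / 1_000_000
```

Even this crude model shows input costs growing quadratically with iteration count, which is why compaction pays for itself quickly.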
4. The Claude Feature Map: Every Product Feature Explained
Before building, you need to understand what you are building. Claude.ai has grown from a simple chat interface into a multi-feature platform. Here is every major feature as of May 2026, what it is structurally, and which building blocks (sections 5-13) implement it.
Chat (Core): User sends messages, Claude responds. Building blocks: Conversation Engine (section 5), Streaming (section 5), Frontend (section 12).
Artifacts: Interactive code, documents, and React components rendered in a sandboxed panel alongside the chat. Supported types include HTML pages, React with hooks and state, SVGs, Mermaid diagrams, and code in any language. In October 2025, Anthropic shipped 3-4x faster artifact updates through inline text replacement instead of full regeneration - Hyperdev. Building blocks: Tool Use (section 6), Artifacts sandbox (section 8).
Deep Research / Extended Thinking: Claude reasons step-by-step in a thinking block before delivering a final answer. Research mode runs multiple connected web searches, refining queries iteratively. Advanced Research extends this to 45 minutes of autonomous investigation across hundreds of sources - Second Talent. Building blocks: Extended Thinking API (section 7), Tool Use for web search (section 6).
Memory: Claude automatically distills long-term-worthy information (profession, preferences, context) roughly every 24 hours. This is loaded into every future conversation. It is not RAG (it does not store and search all conversations). Users can view, edit, and delete memories - ShareUHack. Building blocks: Memory system (section 9).
Learning Mode: Socratic questioning instead of direct answers. When you ask "How do I solve this calculus problem?", Learning Mode responds with "What do you think happens to the function as x approaches this value?" Universities like Northeastern are deploying it institution-wide - Northeastern. Building blocks: System prompt modification (section 5), Writing Styles.
Claude Design: Launched April 17, 2026. Conversational prompt-to-prototype tool powered by Opus 4.7. Generates UI mockups, pitch decks, slides, and prototypes from text prompts. Exports to Canva, PDF, PPTX, or standalone HTML - TechCrunch. Building blocks: Tool Use (section 6), Artifacts (section 8), specialized system prompt. We covered this product in depth in our Claude Design guide.
Projects: Persistent workspaces with files, custom instructions, and project-specific context. Instructions layer on top of account-wide preferences - Claude Help Center. Building blocks: Context Management (section 11), Database (section 13).
File Upload: Up to 30MB per file, 20 files per chat. Supports PDF, DOCX, CSV, TXT, HTML, images. PDFs under 100 pages get both text and visual analysis - Claude Help Center. Building blocks: File Processing (section 10).
Writing Styles: Presets (Normal, Concise, Explanatory) or custom styles from text instructions or uploaded writing samples - Claude Help Center. Building blocks: System prompt modification (section 5).
Claude Code: Terminal-based agentic coding assistant. Uses the TAOR loop with 18+ built-in tools. Included in Pro ($20/month) and Max ($100-200/month) - Claude Code Docs. Building blocks: All nine building blocks plus terminal UI.
Claude Cowork: Desktop agent that can see and control your computer. Three primitives: observe (screenshot), act (mouse/keyboard), remember (conversation history). Uses a virtual X11 display (Xvfb). Each action takes 2-5 seconds - Claude API Docs. For the full Cowork analysis see our Cowork insider guide.
5. Building Block 1: The Conversation Engine (Messages API + Streaming)
The foundation of every Claude chatbot is the Messages API. This is Anthropic's primary interface for communicating with Claude models. You send a structured request with a model identifier, a system prompt, a message history, and optional tools. You get back a structured response with the model's output.
The Python SDK is the fastest way to start:
from anthropic import Anthropic

client = Anthropic()  # Uses ANTHROPIC_API_KEY env var

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
)

print(response.content[0].text)
For TypeScript (the same language Anthropic uses internally):
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: "You are a helpful assistant.",
  messages: [
    { role: "user", content: "Explain quantum computing in simple terms" }
  ],
});
Streaming is non-negotiable for production chatbots. Without streaming, the user stares at a blank screen for 2-10 seconds while the model generates its full response. With streaming, tokens arrive as they are generated, creating the typewriter effect users expect. The Anthropic SDK provides streaming through client.messages.stream():
const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  messages: [{ role: "user", content: "Tell me about Mars" }],
});

stream.on("text", (text) => {
  process.stdout.write(text); // Each token as it arrives
});

const finalMessage = await stream.finalMessage();
On the backend, you expose this as a Server-Sent Events (SSE) endpoint. SSE is the de facto standard for LLM streaming (used by OpenAI, Anthropic, and virtually every LLM API). It is one-way (server to client), simpler than WebSockets, and natively supported by browsers. Set the response headers to Content-Type: text/event-stream, Cache-Control: no-cache, Connection: keep-alive. Format each event as data: <json>\n\n - Upstash.
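A minimal, framework-agnostic sketch of the server side: a formatter for SSE frames and a generator your endpoint can iterate. In production the chunks come from the streaming API call, and your HTTP layer sets the `text/event-stream` headers described above:

```python
import json

def sse_event(payload: dict) -> str:
    """Format one SSE frame: 'data: <json>' terminated by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"

def stream_tokens(tokens):
    """Yield SSE frames for each token, then a done marker.
    In a real endpoint, `tokens` comes from client.messages.stream()."""
    for tok in tokens:
        yield sse_event({"type": "text", "text": tok})
    yield sse_event({"type": "done"})
```

On the browser side, `EventSource` (or a fetch-based reader) consumes these frames and appends each `text` chunk to the visible message.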
Model selection matters enormously for cost and quality.
Claude Code's leaked source revealed the multi-model approach: Sonnet for the main reasoning loop, Haiku for cheap classification tasks (title generation, permission checking, security analysis), Opus for deep planning. Adopt this pattern. Route simple tasks to Haiku ($1/$5 per MTok), standard conversations to Sonnet ($3/$15), and complex reasoning to Opus ($5/$25). The prompt caching feature makes this even more cost-effective: cache reads cost only 10% of standard input price. Structure your system prompt with a stable prefix (cacheable) and a dynamic suffix (per-session).
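Splitting the system prompt into a cacheable prefix and a per-session suffix looks like this with Anthropic's `cache_control` blocks (the helper name is ours; the block format follows the prompt caching API):

```python
def build_system_prompt(stable_prefix: str, session_suffix: str) -> list:
    """Return a system prompt as content blocks: the stable prefix is marked
    cacheable, the per-session suffix is not."""
    return [
        {"type": "text", "text": stable_prefix,
         "cache_control": {"type": "ephemeral"}},  # cached across requests
        {"type": "text", "text": session_suffix},  # varies per session
    ]
```

Pass the returned list as the `system` parameter of `messages.create`. Every request whose prefix bytes match a cached entry pays the discounted cache-read rate on that portion.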
6. Building Block 2: Tool Use and Function Calling
Tool use is what separates a chatbot from a chat interface. Without tools, Claude can only generate text. With tools, it can search the web, run code, query databases, send emails, and interact with any external system. Anthropic's tool use API is the standard implementation of the function calling pattern - Claude API Docs.
The flow works in four steps. Define: describe your tools with a name, description, and JSON schema for inputs. Send: include the tool definitions in your API request. Receive: if Claude decides to use a tool, it responds with stop_reason: "tool_use" and tool_use content blocks containing the tool name and arguments. Execute and return: your code runs the tool, then sends the result back as a tool_result content block.
tools = [{
    "name": "web_search",
    "description": "Search the web for current information",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"}
        },
        "required": ["query"]
    }
}]

messages = [{"role": "user", "content": "What happened in tech news today?"}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=tools,
    messages=messages,
)

# Check if Claude wants to use a tool
if response.stop_reason == "tool_use":
    # Echo the assistant turn (including its tool_use blocks) into history
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # Execute the tool
            result = search_web(block.input["query"])
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
    # Send all results back in a single user turn
    messages.append({"role": "user", "content": tool_results})
The strict: true option (added in 2026) guarantees that Claude's tool calls always match your schema exactly, eliminating the need for client-side validation of tool arguments - Claude API Docs.
Anthropic also offers server-side tools that run on their infrastructure: web_search (real-time web search, $10 per 1,000 searches), code_execution (sandboxed Python), and web_fetch (fetch any URL, no additional cost). These are particularly useful for chatbot builders because they require zero infrastructure on your side. For capabilities beyond what Anthropic provides natively, platforms like Suprsonic offer a unified API that gives your chatbot access to scraping, enrichment, speech, image generation, and 15+ other capabilities through a single integration, which is significantly simpler than building individual integrations for each.
The Claude Code leak revealed an important optimization: tools are sorted alphabetically in the system prompt. This is not random. It maximizes Anthropic's prompt cache hit rate because the tool list is deterministic regardless of the order tools were registered. A small detail, but it shows the level of cost optimization in production systems.
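Replicating this is a one-liner: sort the tool registry before serializing it into the request, so the prompt bytes are identical from run to run no matter when each tool was registered.

```python
def canonical_tools(tools: list) -> list:
    """Sort tool definitions by name so the serialized prompt is
    deterministic, maximizing prompt-cache hits."""
    return sorted(tools, key=lambda t: t["name"])
```

Any other stable ordering works too; what matters is that the same tool set always serializes to the same bytes.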
7. Building Block 3: Extended Thinking and Deep Research
Extended thinking gives Claude a scratchpad where it reasons step-by-step before delivering a final answer. This is not a separate model. It is the same model given permission to think longer. The reasoning happens in a thinking block that appears before the text response - Anthropic.
Enable it with the thinking parameter:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens for thinking
    },
    messages=[{"role": "user", "content": "Analyze this codebase for security vulnerabilities"}]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking}")  # Show reasoning
    elif block.type == "text":
        print(f"[Answer]: {block.text}")  # Final answer
Adaptive Thinking (introduced with Sonnet 4.6) replaces the binary on/off mode with granular control via an effort parameter. Instead of a fixed token budget, you specify how hard Claude should think. Opus 4.7 only supports adaptive mode (manual mode returns a 400 error) - Claude5.ai.
For multi-turn conversations with thinking, you must pass thinking blocks back unchanged in the message history. Each thinking block includes an encrypted signature field that maintains reasoning continuity across turns. Display options: "summarized" (condensed summary, still charged for full tokens) or "omitted" (faster streaming, signature only) - Claude API Docs.
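A small sketch of the bookkeeping this implies, assuming dict-shaped blocks for illustration: history gets every block back verbatim (signature included), while the UI renders only the text blocks.

```python
def next_turn_history(messages: list, assistant_blocks: list) -> list:
    """Append the assistant turn with thinking blocks passed back verbatim.
    Editing or dropping a thinking block (or its signature field) breaks
    reasoning continuity across turns."""
    return messages + [{"role": "assistant", "content": assistant_blocks}]

def visible_text(assistant_blocks: list) -> str:
    """What the UI shows: only final text blocks, not the scratchpad."""
    return "".join(b["text"] for b in assistant_blocks if b["type"] == "text")
```

The separation matters: what the user sees and what the model sees on the next turn are different views of the same response.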
Deep Research combines extended thinking with tool use. The model thinks about what information it needs, uses web search tools to find it, thinks about the results, searches again with refined queries, and iterates until it has enough information to answer comprehensively. Claude's Advanced Research mode extends this to 45 minutes of autonomous investigation. For a builder, implementing deep research means: enable extended thinking, provide web search tools (either Anthropic's server-side web_search or your own), and allow the TAOR loop to run for multiple iterations.
The cost implications are significant. Extended thinking tokens are billed at the same rate as output tokens. A 10,000-token thinking budget on Opus 4.7 costs $0.25 per request just for thinking. For cost-sensitive applications, Sonnet with adaptive thinking gives a strong balance. The leaked source confirmed that Claude Code uses Sonnet as its default reasoning model, not Opus, reserving Opus only for the ULTRAPLAN deep planning feature.
8. Building Block 4: Artifacts (Sandboxed Interactive Content)
Artifacts are Claude's most distinctive UI feature: interactive code, documents, and visualizations rendered in a panel alongside the chat. When Claude generates a React component, the user does not see code. They see a running application. When Claude writes an SVG, it renders visually. When Claude produces a document, it is formatted and editable.
Building artifacts requires two components: generation (getting Claude to produce the content) and rendering (executing it safely in the browser).
For generation, you define artifact creation as a tool. Claude calls the tool when it determines the user's request would benefit from interactive content rather than plain text. The tool's output includes the content type (HTML, React, SVG, Markdown, Mermaid) and the content itself.
For rendering, the standard approach is sandboxed iframes. The open-source LibreChat project demonstrates this pattern using CodeSandbox's Sandpack library. Sandpack provides a sandboxed JavaScript execution environment that runs in the browser. Your CSP headers need frame-src 'self' https://*.codesandbox.io. The assistant-ui library provides an Artifacts component for React that handles the rendering pipeline.
The security model is critical. User-generated code (which is what Claude outputs) must never have access to your application's DOM, cookies, or JavaScript context. The iframe sandbox attribute (sandbox="allow-scripts allow-forms") prevents the artifact from accessing the parent page. For React components, Sandpack compiles and runs them in isolation. For HTML, a srcdoc attribute injects the content without a network request.
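If your server assembles the page, the HTML path can be sketched in a few lines: escape the model-generated content and inject it via `srcdoc` inside a sandboxed iframe (a minimal sketch; a production version would also set CSP headers):

```python
import html

def artifact_iframe(artifact_html: str) -> str:
    """Wrap model-generated HTML in a sandboxed iframe via srcdoc.
    Omitting 'allow-same-origin' keeps the artifact off your DOM,
    cookies, and JavaScript context."""
    return (
        '<iframe sandbox="allow-scripts allow-forms" '
        f'srcdoc="{html.escape(artifact_html, quote=True)}"></iframe>'
    )
```

The `quote=True` escaping is what makes it safe to place arbitrary content inside the `srcdoc` attribute.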
Anthropic's October 2025 optimization (shipping 3-4x faster artifact updates via inline text replacement instead of full regeneration) is worth replicating. Instead of regenerating the entire artifact when the user requests a change, stream only the diff. This is architecturally similar to how code editors work: apply patches rather than rewriting files. The user experience improvement is dramatic.
9. Building Block 5: Memory and Personalization
Claude's memory system is layered: the leaked source code describes a six-layer memory hierarchy. For chatbot builders, the practical implementation involves two distinct systems: session memory (what happened in this conversation) and persistent memory (what the system knows about this user across conversations).
Session memory is the message history. Every message the user sends and every response Claude generates is stored as part of the conversation. This is the context the model sees on each turn. The challenge is that this grows without bound, which is why context management (section 11) exists.
Persistent memory is the harder problem. Claude's approach, as described in the leaked source, uses a MEMORY.md file with YAML frontmatter that stores four types of memories: user (role, preferences), feedback (corrections and confirmations), project (ongoing work context), and reference (pointers to external resources). Memories have a hard cap of 200 lines or 25KB. A Sonnet-powered relevance selector picks up to 5 matching memories per turn to inject into the system prompt.
For builders, the simplest viable implementation:
- After each conversation, use a lightweight LLM call (Haiku) to extract any long-term-worthy information: user preferences, corrections, learned context
- Store extracted memories in a database (MongoDB works well for flexible schemas) with the user's ID, a text description, and an embedding vector
- On each new conversation, retrieve the top-k most relevant memories using vector similarity search
- Inject retrieved memories into the system prompt before the conversation begins
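The retrieval step can be sketched with a plain cosine-similarity ranking standing in for your vector database (the memory row shape here is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_memories(query_vec, memories, k=5):
    """memories: list of {'text': str, 'vec': list[float]} rows.
    Returns the k most similar memory texts to inject into the system prompt."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["text"] for m in ranked[:k]]
```

In production, pgvector or a dedicated vector store does this ranking server-side; the logic is the same.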
The key insight from Claude's implementation is that memory is not RAG over the entire conversation history. It is a curated, compressed representation of what matters. A user who has had 500 conversations does not need all 500 transcripts searchable. They need: "This user is a backend engineer who prefers Python, works at a fintech company, and gets frustrated when I use Java examples." That is 30 tokens of memory, not 500,000 tokens of conversation logs.
Writing Styles are a form of memory too. Claude lets users define custom communication styles via text instructions or uploaded writing samples. Architecturally, this is just a system prompt modification: append the style instructions to the system prompt before each conversation. Store the user's selected style in your database and load it at conversation start.
10. Building Block 6: File Upload and Document Processing
Claude handles files up to 30MB, up to 20 files per chat. Supported formats include PDF, DOCX, CSV, TXT, HTML, ODT, RTF, EPUB, and images. PDFs under 100 pages get both text extraction and visual element analysis (charts, diagrams, graphics). Each PDF page consumes 1,500-3,000 tokens - Claude API Docs.
For builders, file processing has two paths. The Anthropic-native path uses the Files API to upload documents directly to Anthropic's infrastructure. The model processes them server-side, including vision analysis for PDFs with visual content. This is the simplest approach and handles most use cases.
The self-hosted path is necessary when you need to pre-process files (extract specific sections, apply OCR, convert formats) or when you want to use the content for RAG. The standard stack: pypdf for PDF text extraction, docx2txt for Word documents, pandas for CSV/Excel. For RAG, chunk the extracted text into ~400 token segments, generate embeddings (Anthropic does not offer an embedding model, so use OpenAI, Gemini, or a local model), and store in a vector database (PostgreSQL with pgvector, Pinecone, or Qdrant) - FutureSmart.
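The chunking step can be sketched as a sliding window with overlap, approximating token counts by word counts (an assumption for the sketch; use a real tokenizer in production):

```python
def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50):
    """Greedy word-window chunker. Word counts stand in for token counts
    here (roughly accurate for English prose); overlap keeps sentences
    that straddle a boundary retrievable from either chunk."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        if window:
            chunks.append(" ".join(window))
    return chunks
```

Each chunk then gets an embedding and a row in your vector store, keyed back to the source file and page.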
The file upload frontend is a drag-and-drop zone that sends files to your backend via multipart/form-data. Use FastAPI with python-multipart or Next.js API routes with formidable. Store the file metadata in your database, the file itself in S3/R2/GCS, and the extracted text in your conversation context.
11. Building Block 7: Context Management (The Hard Problem)
Context management is the one building block that separates amateur chatbots from production systems. Without it, your chatbot breaks after 20-30 messages because the conversation exceeds the model's context window. Claude Code's leaked source contains five distinct compaction strategies, which tells you everything about how seriously Anthropic takes this problem.
The five strategies, in order of aggressiveness:
Snip Compact: Removes old messages below a threshold. Fast, lossy, suitable for distant history. Think of it as truncation with a cutoff point.
Microcompact: Compresses within-turn tool results. When a tool returns 10,000 tokens of output but only 500 tokens are relevant, microcompact replaces the full output with a reference or summary. This runs before other strategies because tool results are often the largest context consumers.
Auto-Compact: The primary strategy. Triggers at a configurable context window percentage (around 92% based on the leaked code). Uses a model call to generate a summary of the conversation so far, then replaces the older messages with the summary. Runs in a forked subprocess to avoid blocking the main conversation.
Reactive Compact: Emergency fallback triggered by "prompt too long" API errors. When the context window is exceeded despite auto-compact, reactive compact aggressively compresses the entire history and retries.
Context Collapse: The most aggressive option. A staged collapse that intelligently prunes the conversation, preserving recent exchanges and key decision points while discarding intermediate exploration.
For builders, the minimum viable implementation is auto-compact:
- Track the total token count of your message history (use tiktoken or Anthropic's token counting)
- When tokens exceed 80% of the model's context window, trigger compaction
- Send the conversation history to a cheap model (Haiku) with the instruction: "Summarize this conversation, preserving key decisions, user preferences, and unresolved questions"
- Replace all messages before a threshold (e.g., everything except the last 5 exchanges) with the summary
- Continue the conversation with the compressed history
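Those steps can be sketched as a single function, with the tokenizer and summarizer injected so any model works (a sketch of the pattern, not Anthropic's implementation):

```python
def auto_compact(messages, count_tokens, summarize, limit,
                 keep_last=10, trigger=0.8):
    """If history exceeds trigger*limit tokens, replace everything but the
    last keep_last messages with a single summary message.
    count_tokens(msg) and summarize(msgs) are injected callables."""
    total = sum(count_tokens(m) for m in messages)
    if total <= trigger * limit:
        return messages  # under budget, nothing to do
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if not old:
        return messages  # nothing old enough to compress
    summary = summarize(old)
    return [{"role": "user", "content": f"[Conversation summary]\n{summary}"}] + recent
```

In practice `summarize` is a Haiku call with the instruction quoted above, and `count_tokens` is your tokenizer or the API's token counting endpoint.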
The auto-compact pattern cuts context costs dramatically. A 100-message conversation that would otherwise consume 200,000 tokens per turn gets compressed to 5,000 tokens of summary plus the most recent 10,000 tokens. That is a reduction of over 90% in input token costs.
12. Building Block 8: The Frontend Stack
The standard frontend stack for a Claude chatbot in 2026 is Next.js 15 with the App Router and the Vercel AI SDK (v6.0, December 2025). The AI SDK provides a unified API for 25+ LLM providers including Anthropic, with built-in streaming, React Server Components support, and useChat hooks. It has 20 million+ monthly downloads - Vercel AI SDK.
// app/api/chat/route.ts (Next.js API route)
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = streamText({
    model: anthropic("claude-sonnet-4-6"),
    system: "You are a helpful assistant.",
    messages,
  });
  return result.toDataStreamResponse();
}

// components/Chat.tsx (React client component)
"use client";
import { useChat } from "@ai-sdk/react";

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>{m.role}: {m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}
Vercel provides an open-source chatbot template that includes streaming, message persistence, file uploads, and a polished UI. It is the fastest path from zero to production.
For the artifact panel (section 8), you need a split-pane layout: chat on the left, artifact on the right. Libraries like react-resizable-panels handle the draggable divider. The artifact panel renders an iframe with the sandboxed content. Communication between the chat and the artifact panel happens through React state or a shared context provider.
The terminal UI that Claude Code uses (React with Ink, Meta's Yoga flexbox engine, custom double-buffered rendering) is specifically for CLI applications. For web chatbots, standard React with CSS is the right approach.
13. Building Block 9: Auth, Database, and Infrastructure
The backend infrastructure of a production chatbot consists of three systems: authentication, database, and hosting.
Authentication: Clerk is the recommended choice for Next.js chatbots. It provides native Server Component helpers (auth(), currentUser()), 11 React hooks, and embeddable prebuilt components. Most teams deploy to production in a single day. For enterprise requirements (SAML, SCIM, advanced compliance), Auth0 is the standard, but at 3-15x higher cost - Clerk.
Database: A production chatbot needs three data stores. Redis for active session state and real-time caching (sub-millisecond latency). MongoDB for conversation histories, user profiles, and flexible document storage (memories, preferences, project context). PostgreSQL for relational data (billing, subscriptions) and vector search via pgvector (embedding storage for RAG and memory retrieval). This hybrid approach is what Claude Code uses internally: the leaked source shows Redis for session state, with persistent storage for conversation logs and memory - Medium.
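As one concrete example of the pgvector piece, a similarity query for memory retrieval can be built as a parameterized statement using pgvector's `<->` distance operator. The table and column names below are illustrative assumptions:

```typescript
// Build a parameterized pgvector similarity query.
// "memories", "content", and "embedding" are assumed names for illustration.
function memorySearchSQL(limit: number): string {
  return `
    SELECT id, content
    FROM memories
    ORDER BY embedding <-> $1::vector
    LIMIT ${limit}
  `.trim();
}

// Executed with a client such as node-postgres, passing the query embedding
// in pgvector's "[1,2,3]" text format:
//   await pool.query(memorySearchSQL(5), [JSON.stringify(queryEmbedding)]);
```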
Hosting: Vercel for the Next.js frontend (automatic scaling, edge functions, preview deployments). A Python backend (FastAPI) on Railway, Render, or AWS ECS for the AI inference layer, tool execution, and heavy processing. S3 or Cloudflare R2 for file storage.
The infrastructure cost for a production chatbot serving 1,000 daily active users is roughly: Vercel Pro ($20/month), backend hosting ($50-100/month), MongoDB Atlas ($50/month), Redis Cloud ($30/month), PostgreSQL ($20/month), and Anthropic API usage ($500-5,000/month depending on volume and model mix). The API cost dominates. Everything else is noise.
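Summing those line items shows just how lopsided the budget is (the fixed-cost figures are the ones quoted above; API spend is the variable that dominates):

```typescript
// Rough monthly infrastructure cost for ~1,000 DAU, using the figures above.
const fixedCosts = {
  vercelPro: 20,
  backendHosting: 75, // midpoint of the $50-100 range
  mongoAtlas: 50,
  redisCloud: 30,
  postgres: 20,
};

function monthlyTotal(apiSpend: number): number {
  const fixed = Object.values(fixedCosts).reduce((a, b) => a + b, 0);
  return fixed + apiSpend;
}

console.log(monthlyTotal(500));  // low API volume: 695
console.log(monthlyTotal(5000)); // high API volume: 5195
```

At the high end, fixed infrastructure is under 4% of the bill, which is why the rest of this section focuses entirely on API cost optimization.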
14. The Complete Architecture Diagram
This architecture handles all nine building blocks. The TAOR loop is the center. Everything else feeds into it or consumes from it. The key design principle (confirmed by the Claude Code leak): keep the loop thin, push complexity to the edges.
15. Cost Analysis: What It Actually Costs to Run
The cost of running a Claude chatbot is dominated by API usage. Here is a realistic breakdown for a chatbot serving 1,000 daily active users with an average of 5 conversations per user per day and 10 messages per conversation:
Token consumption per conversation (with tool use and context management): approximately 50,000 input tokens and 5,000 output tokens on Sonnet 4.6. With prompt caching (80% cache hit rate on system prompt), effective input cost drops to about 30% of list price.
The multi-model strategy from Claude Code's architecture cuts this significantly. Route 60% of requests to Haiku ($1/$5 per MTok) for simple responses, 35% to Sonnet ($3/$15) for standard conversations, and 5% to Opus ($5/$25) for complex reasoning. That mix reduces the effective API cost by roughly 37% compared to using Sonnet for everything.
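A quick sanity check on that routing math, using the per-MTok prices and traffic shares from the text:

```typescript
// Blended per-MTok price for a traffic mix across models.
// Shares and prices are the ones quoted in the text above.
type Tier = { share: number; inputPrice: number; outputPrice: number };

const mix: Tier[] = [
  { share: 0.6, inputPrice: 1, outputPrice: 5 },   // Haiku
  { share: 0.35, inputPrice: 3, outputPrice: 15 }, // Sonnet
  { share: 0.05, inputPrice: 5, outputPrice: 25 }, // Opus
];

const blendedInput = mix.reduce((sum, t) => sum + t.share * t.inputPrice, 0);
const blendedOutput = mix.reduce((sum, t) => sum + t.share * t.outputPrice, 0);

console.log(blendedInput.toFixed(2));  // "1.90" vs 3.00 for Sonnet-only
console.log(blendedOutput.toFixed(2)); // "9.50" vs 15.00 for Sonnet-only
// Both work out to a ~37% reduction versus routing everything to Sonnet.
```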
Prompt caching provides another 20-30% reduction. Structure your system prompt with a stable prefix (your core instructions, tool definitions, memory) that does not change between requests. Writing that prefix to Anthropic's cache costs a 25% premium over the standard input price, the cache entry lives for 5 minutes (refreshed on every hit), and subsequent reads cost only 10% of the input token price. For a chatbot with a 2,000-token system prompt reused across thousands of requests, the savings compound rapidly.
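Under Anthropic's published cache pricing (reads at 10% of the input price, writes at a 25% premium), the effective input price as a function of cache hit rate is easy to model. This is a simplification, since it treats the entire prompt as the cached prefix:

```typescript
// Effective per-token input price with prompt caching.
// readMult = 0.1 (cache read) and writeMult = 1.25 (cache write) follow
// Anthropic's published pricing for the 5-minute ephemeral cache.
// Simplification: assumes the whole prompt is the cacheable prefix.
function effectiveInputPrice(
  listPrice: number,
  hitRate: number,
  readMult = 0.1,
  writeMult = 1.25
): number {
  // Hits pay the read rate; misses pay the write premium to repopulate.
  return hitRate * readMult * listPrice + (1 - hitRate) * writeMult * listPrice;
}

// 80% hit rate on a $3/MTok model -> ~$0.99/MTok, about 33% of list price,
// consistent with the "about 30%" figure used earlier in this guide.
console.log(effectiveInputPrice(3, 0.8).toFixed(2)); // "0.99"
```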
One developer reported using 10 billion tokens over 8 months with Claude Code. On API pay-as-you-go, that would have cost over $15,000. On Max at $100/month, the same period cost $800, a 93% savings - Verdent. For chatbot builders, the lesson is clear: if your personal usage is heavy, the subscription plans are dramatically cheaper than API pricing. For serving end users, API pricing is the only option, so cost optimization (multi-model routing, prompt caching, context compaction) is essential.
Yuma Heymans (@yumahey), who leads agent development at O-mega and wrote the leaked source analysis referenced throughout this guide, has been building Claude-powered systems since the API launched. The patterns in this guide are not theoretical. They are the same patterns used in production at O-mega, refined through thousands of hours of agent development. The complete Anthropic ecosystem, from chatbot to code agent to desktop automation, is covered in our Anthropic ecosystem guide.
This guide reflects Claude's capabilities and pricing as of May 2026. Anthropic updates models and features frequently. Verify current details on platform.claude.com before building.