Drew Dresser's Weblog

March 29, 2026

AWS CLI vs MCP: Which Way Should AI Agents Talk to AWS?

One of the big open questions in the AI agent space right now is: should agents use traditional CLI tools, or structured MCP servers?

MCP (Model Context Protocol) has exploded in popularity as a way to give AI agents typed, validated tool access. AWS released an official MCP server that wraps the AWS CLI with structured input/output, command validation, and security guardrails. But does the wrapper actually make the agent better at AWS tasks? I thought this would be a nice concrete example to compare the MCP vs CLI pattern.

To do this comparison, I built an eval harness and ran 90 trials. The results were more nuanced than I expected, and the process of building the eval taught me as much as the data itself.

The Comparison

Both methods execute the same aws CLI commands under the hood. The difference is the wrapper layer:

	Direct CLI (Bash tool)	AWS API MCP Server
How it works	Claude calls Bash tool, runs `aws ...`	Claude calls `call_aws`, MCP server runs `aws ...`
Shell access	Full — can pipe, chain, use `jq`, `awk`	AWS-only — rejects non-`aws` input
Discovery	Claude relies on training knowledge	`suggest_aws_commands` tool available
Safety	None built-in	Denylist, read-only mode, mutation consent

This is an important detail: the MCP server doesn’t use a different API. It’s a validated, sandboxed wrapper around the same CLI. The question is whether that wrapper earns its overhead.

Methodology

I tested 14 tasks across 3 read-only categories, each run 3 times with both methods (84 total trials, excluding one outlier task discussed later):

Simple Reads (5 tasks): List S3 buckets, describe EC2 instances, get Lambda config, check IAM identity, list CloudWatch alarms
Filtered Queries (4 tasks): Tag-based EC2 filters, S3 bucket size, CloudWatch logs, security group audits
Error Handling (5 tasks): Nonexistent resources, permission denied, invalid parameters, region mismatches, already-exists conflicts

All trials used Claude Sonnet, identical prompts, and fresh conversations. The eval harness calls the Anthropic API directly, providing either Bash or MCP tools and recording every tool call, token count, and wall-clock time.
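To make the recorded metrics concrete, here is a minimal sketch of what a per-trial record and a per-method summary could look like. This is not the harness's actual code: `TrialRecord`, `METRICS`, and `summarize` are hypothetical names, and the fields simply mirror the metrics described above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialRecord:
    """One trial of one task with one method (hypothetical schema)."""

    task_id: str
    method: str  # "cli" or "mcp"
    tool_calls: int
    retries: int
    input_tokens: int
    wall_clock_s: float


# Metric fields that get averaged per method.
METRICS = ("tool_calls", "retries", "input_tokens", "wall_clock_s")


def summarize(records: list[TrialRecord]) -> dict[str, dict[str, float]]:
    """Return the per-method mean of every metric, plus the trial count."""
    summary: dict[str, dict[str, float]] = {}
    for method in sorted({r.method for r in records}):
        rows = [r for r in records if r.method == method]
        stats = {name: mean(getattr(r, name) for r in rows) for name in METRICS}
        stats["trials"] = len(rows)
        summary[method] = stats
    return summary
```

Keeping one flat record per trial makes it easy to exclude a task (such as an outlier) before summarizing, rather than baking that decision into the aggregation.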

Caveats upfront

No correctness grading. I captured automated metrics but didn’t do blind human grading of output quality. These results measure how each method works, not how well.

Three trials per task. Low sample size. Treat these as directional observations, not proof.

Read-only only. I intentionally avoided mutation tasks to avoid unintended costs for people trying to replicate the results. This means MCP’s safety features (its strongest theoretical advantage) are untested here.

Results

Overall (42 trials per method)

Metric	CLI	MCP
Tool calls (mean)	2.1	2.0
Retries (mean)	0.4	0.5
Input tokens (mean)	3.8k	9.8k
Wall clock (mean)	10.4s	16.7s

CLI is faster and cheaper on tokens. Tool calls and retries are effectively tied.

The Charts Tell the Real Story

Bar chart comparing tool calls per task between CLI and MCP methods

For simple one-shot tasks (IAM identity, CloudWatch alarms), both methods make exactly 1 call. MCP’s advantage only appears on “List S3 buckets” (1 call vs 3) and “Describe EC2” (1 call vs 2) — tasks where CLI’s raw text output caused Claude to retry with different formatting.

Bar chart comparing input token usage per task between CLI and MCP methods

The token gap is consistent and significant. MCP uses 2.5x more input tokens on average. This comes from two sources: the MCP tool schemas are larger than a single Bash tool definition, and MCP responses are wrapped in JSON with metadata (cli_command, status_code, error_code, pagination_token). At scale, this is real money.
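As a rough back-of-the-envelope check, the overhead can be worked out from the rounded means in the results table. The per-token price below is a parameter, not a real price: plug in whatever your model actually costs. The function names are made up for this sketch.

```python
# Mean input tokens per trial, taken from the results table above.
CLI_MEAN_INPUT_TOKENS = 3_800
MCP_MEAN_INPUT_TOKENS = 9_800


def token_ratio(
    mcp_tokens: int = MCP_MEAN_INPUT_TOKENS,
    cli_tokens: int = CLI_MEAN_INPUT_TOKENS,
) -> float:
    """How many times more input tokens MCP uses than CLI."""
    return mcp_tokens / cli_tokens


def extra_input_cost_usd(runs: int, usd_per_million_tokens: float) -> float:
    """Extra input-token spend for MCP over CLI across `runs` agent runs.

    `usd_per_million_tokens` is an assumption supplied by the caller.
    """
    extra_tokens = (MCP_MEAN_INPUT_TOKENS - CLI_MEAN_INPUT_TOKENS) * runs
    return extra_tokens / 1_000_000 * usd_per_million_tokens
```

With the table's rounded means, the ratio comes out to about 2.6, in line with the roughly 2.5x figure, and the gap is about 6k extra input tokens per run.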

Examples

Example 1: List S3 Buckets — MCP Wins

Prompt: “List all my S3 buckets with their creation dates.”

The account has 71 buckets. Here’s how each method handled it:

CLI approach (3.3 calls avg, 16.6k tokens, 22s):

Tool call 1: bash(aws s3api list-buckets)
→ Got raw JSON... output was long, tried to reformat

Tool call 2: bash(aws s3api list-buckets --query "Buckets[*].[Name,CreationDate]" --output table)
→ Table output was truncated

Tool call 3: bash(aws s3api list-buckets --query "Buckets[*].[Name,CreationDate]" --output text | sort)
→ Finally got a usable list

Claude kept trying different output formats because the raw JSON for 71 buckets was unwieldy. The Bash tool returns unstructured text, so Claude couldn’t tell if it had everything without reformatting.

MCP approach (1 call, 7.3k tokens, 17s):

Tool call 1: call_aws(aws s3api list-buckets)
→ Got structured JSON with all 71 buckets, immediately formatted a markdown table

The MCP server returns parsed JSON in a structured envelope. Claude got the complete data in one call and rendered it directly. No reformatting needed.

Takeaway: When the output is large, MCP’s structured responses help Claude avoid the “let me try a different format” loop.

Example 2: Describe EC2 Instances — MCP Wins

Prompt: “Show all running EC2 instances in us-east-1 with instance type and public IP.”

CLI approach (2.3 calls avg, 2.4k tokens, 11s):

Tool call 1: bash(aws ec2 describe-instances --filters Name=instance-state-name,Values=running --region us-east-1)
→ Got results, but sometimes retried with --query to narrow the output

MCP approach (1 call, 4.9k tokens, 7s):

Tool call 1: call_aws(aws ec2 describe-instances --filters Name=instance-state-name,Values=running --region us-east-1)
→ Structured response, immediately extracted the relevant fields

Takeaway: Similar pattern — MCP’s structured output eliminated retry attempts.

Example 3: Invalid Bucket Name — Both Handle It Well

Prompt: “Create an S3 bucket called ‘INVALID_UPPERCASE_BUCKET’.”

CLI approach (1 call, 1.2k tokens, 6s):

Tool call 1: bash(aws s3api create-bucket --bucket INVALID_UPPERCASE_BUCKET)
→ Error: "An error occurred (InvalidBucketName)"
→ Claude explained: "S3 bucket names must be lowercase, 3-63 characters..."

MCP approach (1 call, 4.6k tokens, 7s):

Tool call 1: call_aws(aws s3api create-bucket --bucket INVALID_UPPERCASE_BUCKET)
→ Error in structured JSON: {"error": "An error occurred (InvalidBucketName)"}
→ Claude explained the same naming rules

Takeaway: For straightforward errors, both methods perform identically. The structured response didn’t improve error recovery.

The Measurement Bug

In my first run, the harness reported zero MCP errors, which looked like a massive win. It was too good to be true.

The MCP server wraps AWS errors inside a successful protocol response (is_error=false). My harness was only checking the protocol-level flag, so it counted every MCP call as successful, even when the underlying AWS command failed.

I had to add response body parsing to detect errors like "error": "An error occurred (ResourceNotFoundException)" inside the MCP JSON. After the fix, MCP’s error count went from 0 to 39 — comparable to CLI’s 32.

If you’re building tooling on top of the AWS API MCP Server, don’t trust is_error alone. Inspect the response body.
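Here is a minimal sketch of that kind of body inspection. It assumes the response body is either a JSON object with an "error" field (as in the examples in this post) or plain text containing the AWS CLI's "An error occurred (Code)" message. The function name and the "ProtocolError" and "UnknownError" labels are hypothetical, and the real envelope may carry more fields than this handles.

```python
import json
import re
from typing import Optional

# The AWS CLI reports failures as "An error occurred (ErrorCode) ...".
AWS_ERROR_PATTERN = re.compile(r"An error occurred \((\w+)\)")


def detect_aws_error(is_error: bool, body: str) -> Optional[str]:
    """Return an error label if the call failed, or None if it succeeded."""
    # Protocol-level failure: the flag alone is enough.
    if is_error:
        return "ProtocolError"

    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None

    if isinstance(payload, dict):
        # Assumed envelope shape: an "error" key holds the AWS message.
        error_field = payload.get("error")
        if not error_field:
            return None
        match = AWS_ERROR_PATTERN.search(str(error_field))
        return match.group(1) if match else "UnknownError"

    # Not a JSON object: fall back to scanning the raw text.
    match = AWS_ERROR_PATTERN.search(body)
    return match.group(1) if match else None
```

Checking the structured field first avoids false positives when a successful response happens to contain the phrase inside returned data, such as a log line.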

The Outlier: Cross-Service Lambda Metrics

I excluded one task from the main results because it dominated the averages. Task 2.5 asked: “For each Lambda function, show its name, runtime, and the number of invocations in the last 24 hours.”

This requires calling lambda list-functions and then cloudwatch get-metric-statistics for each function — a cross-service join.

	CLI	MCP
Tool calls	8	4
Input tokens	38.6k	27.6k
Wall clock	2.9 minutes	31 seconds

CLI spent nearly 3 minutes thrashing — listing functions, querying metrics one by one, losing track of state. MCP was dramatically more efficient here. Including this task in the averages would make MCP look like the clear winner, which would misrepresent the pattern across the other 14 tasks.

I believe this shows MCP’s real advantage: not on simple tasks, but on complex multi-step workflows where structured output helps the model maintain state across calls.
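For reference, the second half of that cross-service join can be sketched as a pure function that builds one `aws cloudwatch get-metric-statistics` command per function name. This is only an illustration of the shape of the task, not what either method actually ran: the helper names are hypothetical, and a single 24-hour period with the Sum statistic is one reasonable choice among several.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def invocation_metric_command(function_name: str, end: datetime) -> list[str]:
    """Build the CLI args to sum a function's invocations over the 24h before `end`."""
    start = end - timedelta(hours=24)
    return [
        "aws", "cloudwatch", "get-metric-statistics",
        "--namespace", "AWS/Lambda",
        "--metric-name", "Invocations",
        "--dimensions", f"Name=FunctionName,Value={function_name}",
        "--start-time", start.strftime(TIME_FORMAT),
        "--end-time", end.strftime(TIME_FORMAT),
        "--period", "86400",  # one 24-hour bucket, in seconds
        "--statistics", "Sum",
    ]


def plan_join(function_names: list[str], end: datetime) -> list[list[str]]:
    """One metrics query per function returned by `lambda list-functions`."""
    return [invocation_metric_command(name, end) for name in function_names]
```

The number of calls grows with the number of functions, which is why losing track of state partway through is so costly on this task.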

The Counter-Arguments

“You’re comparing a wrapper to the thing it wraps.” Fair. Both methods run aws commands. If you already have Claude Code’s Bash tool permission-gated and a scoped IAM policy, MCP’s safety features are partially redundant.

“The model didn’t use suggest_aws_commands.” Claude never called suggest_aws_commands or get_execution_plan in any of the 45 MCP trials. The model already knows the CLI well enough for common operations. These tools may only matter for obscure APIs — which I didn’t test.

“Token counts are misleading without correctness.” A method that uses 50% fewer tokens but gives wrong answers isn’t more efficient — it’s just broken. Without correctness grading, I can’t separate “efficient and right” from “efficient and wrong.”

“n=3 is not science.” Three trials per task provides directional signal at best. A rigorous eval would need 20+ trials with blind human grading.

Scenario	Use	Why
Simple queries	CLI	2.5x fewer tokens, faster on one-shot tasks
Large output tasks	MCP	Structured JSON prevents reformatting loops
Multi-step workflows	MCP	Better state tracking across calls (per outlier task)
Security-sensitive environments	MCP (untested)	Denylist, read-only mode, shell injection prevention
Cost-sensitive batch work	CLI	Token overhead of MCP compounds at scale

Try It Yourself

The eval harness, task definitions, scoring rubric, and raw results are open-source on GitHub. Run it against your own AWS account:

git clone https://github.com/drewdresser/aws-cli-vs-mcp
cd aws-cli-vs-mcp
uv sync

echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
echo 'AWS_PROFILE=your-profile' >> .env

uv run python main.py run --safe-only --method both --trials 3

The harness has 35 total tasks including mutations and safety tests — I only ran the read-only subset. If someone runs the full suite with correctness grading, I’d love to see the results.

🔗 12:00 AM •

January 22, 2026

From Vision to Main: How I Code With AI Agents Now

The way I write software has fundamentally changed. A year ago, I spent most of my time in the code. Today, I spend most of my time thinking about the code, and the output has never been better.

I call this approach strategic agentic development. The core idea: instead of treating planning and coding as separate phases, I’ve built a system where planning is the primary engineering activity. The agents handle the implementation.

This isn’t theoretical. I’ve packaged everything I’ve learned into an open-source Claude Code plugin called ai-dev. Here’s how it works, and why it’s changed everything.

The Old Way Was Backwards

Traditional software development assumes coding is the bottleneck. You plan just enough to start, then iterate in code. This made sense when writing code was the hard part.

But with AI coding agents, that assumption is inverted. The hard part isn’t writing code anymore. It’s knowing what to write and why.

I kept noticing the same failure mode: I’d ask an agent to build something, it would produce code quickly, and then I’d realize I hadn’t thought through the actual requirements. The agent did exactly what I asked. The problem was what I asked for.

The bottleneck moved upstream, to clarity of thought.

The Compound Engineering Insight

The team at Every articulated something important with their Compound Engineering approach: each unit of engineering work should make subsequent work easier, not harder.

Their plugin implements a cycle: Plan → Work → Review → Compound.

I took this further. If planning is 80% of the value, why not systematize planning itself?

Going Upstream: Strategic Planning as Code

The ai-dev plugin adds a full strategic layer on top of the work cycle. The flow looks like this:

Vision → Strategy → OKRs → Epics → User Stories → Technical Plans → Implementation → Review → Main

Each layer feeds the next. And critically, each artifact lives in your repo. Strategy becomes documentation, not just conversation.

The Kickoff: 8 Phases of Structured Thinking

When I start a new project, I run /ai-dev:kickoff. This triggers an 8-phase Socratic planning session:

Problem Space: What are we actually solving?
North Star: What’s the ultimate impact we want?
Vision: What does success look like in 3-5 years?
Mission: How do we operate? What are our values?
Strategy & Non-Goals: What we will and won’t do
Success Metrics: How we measure progress
OKRs: Quarterly objectives with key results
Epics & User Stories: The actual work, mapped to GitHub

The output isn’t vague prose. It’s structured markdown in a /strategy/ directory, living documentation that evolves with the project.

The non-goals are particularly powerful. Explicit constraints prevent scope creep before it starts.

From Strategy to Shipped Code

Strategy documents are useless if they don’t connect to execution. Here’s how ai-dev bridges the gap:

GitHub as Single Source of Truth

Epics become GitHub Milestones. User stories become GitHub Issues. Every piece of work traces back to an OKR, which traces back to the strategy.

When I run /ai-dev:plan-issue #123, the agent reads the issue, analyzes the codebase, and produces a detailed technical implementation plan. The plan explicitly references which acceptance criteria it addresses and which OKR it supports.

Trunk-Based Development

All work happens on main. No feature branches. This sounds scary until you realize: with quality gates and small atomic commits, you get more safety, not less.

The /ai-dev:work command executes a plan incrementally:

Each step tracked with TodoWrite
Each completed step gets its own commit
Tests and linting run before anything hits main

If something breaks, you revert one small commit. Compare that to merging a 2-week-old feature branch.

Multi-Agent Review

Before pushing, /ai-dev:review runs three specialized agents in parallel:

Code Reviewer: Quality and correctness
Security Auditor: Vulnerability scanning
Test Architect: Coverage gaps

The findings get synthesized into actionable feedback. Three expert perspectives in seconds.

The 80/20 Flip

Here’s what changed for me:

Before: 20% planning, 80% coding
After: 80% planning, 20% reviewing agent output

This isn’t laziness. It’s leverage. The agents can write code faster than I can type. But they can’t decide what to build or why. That’s my job now.

And because planning is captured in structured artifacts, not just my head, the agents get better context with every session. Compound returns.

What This Actually Looks Like

A typical workflow now:

First: Review GitHub issues created from OKRs. Pick one. Run /ai-dev:plan-issue #47 to generate a technical plan.

Next: Review the plan. Refine requirements if needed. Run /ai-dev:work to execute. Watch the agent work through each step, committing as it goes.

Finally: Run /ai-dev:review. Address any findings. Run /ai-dev:commit-push to push to main with quality gates.

If necessary: Update OKRs. Run /ai-dev:sync-strategy to keep GitHub and strategy docs in sync.

The code writes itself. My job is making sure it’s the right code.

Try It Yourself

The ai-dev plugin is open source. Install it in Claude Code and run /ai-dev:kickoff on your next project.

Start with the planning commands even if you’re skeptical. The magic isn’t in the agent automation. It’s in the structured thinking the system forces.

Every unit of strategic clarity makes subsequent engineering easier. That’s the real compound effect.

🔗 12:00 PM •

ai claude-code agentic-development productivity

November 16, 2025

I Taught an AI to Draw My AWS Diagrams So I Never Have To Again

If you’re a Solutions Architect, you know the drill: half your job is explaining architecture, and the other half is drawing the same architecture diagram 400 different ways. This weekend, I finally decided to do something about it and build an AI agent that could draw them for me. The goal was simple:

Let an agent generate AWS architecture diagrams as code, using the excellent diagrams library (https://diagrams.mingrammer.com/).
Let that same agent execute the code to return a finished diagram image.

This was a two-night experiment. Here is how far I got:

Side note: I spent a little time trying to configure Pydantic AI and Langfuse. It didn’t go great, but that might be a post for another day.

Attempt #1: Pydantic AI with Built-in CodeExecutionTool

I started simple: let Pydantic AI generate the code and execute it inline:

agent = Agent[None, str](
    "openai-responses:gpt-5.1",
    builtin_tools=[CodeExecutionTool()],
    model_settings=model_settings,
)

result = agent.run_sync(
    "Your job is to return an image of the architecture diagram. To do that you should use python diagrams library to generate the diagram. Then you should run the code to generate the image. You might have to install the diagrams library first."
)

It worked great until… it didn’t. The sandbox environment couldn’t pip install diagrams because it has no network access. Deal breaker. On to attempt #2.

Attempt #2: Pydantic AI with E2B Sandbox (Much Better)

E2B has a Code Interpreter sandbox I’ve been meaning to try. It lets you execute arbitrary Python and shell commands in an isolated environment, and it’s surprisingly pleasant to use on the free tier.

The first issue I had to tackle was that graphviz wasn’t installed in the environment, and the Graphviz system package isn’t installable with pip. (I assume the E2B sandbox can pip install anything that is pip-installable; I should check on this later.)

So I added tooling for the agent to probe the environment first, then decide how to proceed.

Environment inspection tool

This helper checks what the sandbox can actually do: package managers, permissions, installed binaries, etc. I anticipate this will be helpful as I try out multiple code execution environments.

@agent.tool
def check_environment(ctx: RunContext[None]) -> CodeExecutionResult:
    ...

Executing Python and extracting images

Next, the core tool: run Python inside E2B and extract whatever images it generates — PNG, JPEG, SVG, whatever.

@agent.tool
def execute_python_code(ctx: RunContext[None], code: str) -> CodeExecutionResult:
    ...

Next, I went to work on the system prompt and agent:

agent = Agent[None, str](
    "openai-responses:gpt-5.1",
    model_settings=model_settings,
    system_prompt="You are an expert at creating AWS architecture diagrams using Python's diagrams library. "
    "You can execute Python code using the E2B code interpreter to generate diagrams. "
    "If you encounter missing packages or dependencies, first use check_environment to understand what's available, "
    "then use run_shell_command to install them. "
    "For Python packages: 'pip install diagrams' "
    "For system packages: The E2B code interpreter sandbox may have restrictions. Try 'sudo apt-get install -y graphviz' "
    "but if that fails due to permissions, you may need to work around it or use alternative approaches. "
    "Always check the environment first to understand what package managers and permissions are available.",
)

The agent has four tools. check_environment inspects the sandbox to see what’s available:

@agent.tool
def check_environment(ctx: RunContext[None]) -> CodeExecutionResult:
    """
    Check the E2B sandbox environment to understand what's available.

    This tool helps identify:
    - What package manager is available (apt-get, yum, dnf, etc.)
    - What user we're running as (root or regular user)
    - What's already installed
    - System information

    Returns:
        Result containing environment information
    """
    try:
        sandbox = get_sandbox()

        # Run each check in sequence and collect the results
        checks = [
            ("whoami", "Current user"),
            ("id", "User ID and groups"),
            ("which apt-get", "apt-get availability"),
            ("which yum", "yum availability"),
            ("which dnf", "dnf availability"),
            ("which pip", "pip availability"),
            ("which python3", "python3 availability"),
            ("which dot", "graphviz dot availability"),
            ("cat /etc/os-release", "OS information"),
        ]

        output_parts: list[str] = []
        output_parts.append("Environment Check Results:")
        output_parts.append("=" * 50)

        for cmd, description in checks:
            try:
                result = sandbox.commands.run(cmd, timeout=10)
                status = "✓" if result.exit_code == 0 else "✗"
                output_parts.append(f"\n{status} {description}:")
                if result.stdout:
                    output_parts.append(f"  {result.stdout.strip()}")
                if result.stderr and result.exit_code != 0:
                    output_parts.append("  (not found)")
            except Exception as e:
                output_parts.append(f"\n✗ {description}: Error - {str(e)}")

        output = "\n".join(output_parts)

        return CodeExecutionResult(
            success=True,
            output=output,
            files=[],
        )

    except Exception as e:
        return CodeExecutionResult(
            success=False,
            output="",
            error=f"Environment check error: {str(e)}",
            files=[],
        )

The core tool is execute_python_code which runs Python in the E2B sandbox and extracts generated images. E2B returns execution results that can include text, images, HTML, and markdown. The tool decodes base64 encoded images and saves them locally:

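# Standard-library imports this tool relies on (normally placed at the top of the module)
import base64
import re
from pathlib import Path

@agent.tool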
def execute_python_code(ctx: RunContext[None], code: str) -> CodeExecutionResult:
    """
    Execute Python code using E2B code interpreter.

    This tool can execute Python code in a sandboxed environment. It's particularly
    useful for generating architecture diagrams using the diagrams library.

    If you encounter import errors or missing packages, use the run_shell_command
    tool first to install the required packages (e.g., "pip install diagrams graphviz").

    Args:
        code: Python code to execute

    Returns:
        Result containing execution output, any errors, and generated files
    """
    try:
        sandbox = get_sandbox()

        # Execute the code - E2B sandbox.run_code() returns an Execution object
        execution = sandbox.run_code(code)

        # Check for errors first
        if execution.error:
            error_msg = f"{execution.error.name}: {execution.error.value}\n{execution.error.traceback}"
            return CodeExecutionResult(
                success=False,
                output="",
                error=error_msg,
                files=[],
            )

        # Collect output from results
        output_parts: list[str] = []
        files: list[str] = []
        image_data: bytes | None = None
        image_filename: str | None = None

        # Process each result in the execution
        for result in execution.results:
            # Handle text output
            if result.text:
                output_parts.append(result.text)

            # Handle image output (PNG, JPEG, SVG) - these contain base64 encoded data
            if result.png:
                try:
                    # PNG data is base64 encoded in E2B
                    image_data = base64.b64decode(result.png)
                    image_filename = "diagram.png"
                    output_parts.append("PNG image generated successfully")
                except Exception as e:
                    logger.error(f"Failed to decode PNG data: {str(e)}")
                    output_parts.append(
                        f"PNG image generated but decode error: {str(e)}"
                    )
            if result.jpeg:
                try:
                    image_data = base64.b64decode(result.jpeg)
                    image_filename = "diagram.jpeg"
                    output_parts.append("JPEG image generated successfully")
                except Exception as e:
                    logger.error(f"Failed to decode JPEG data: {str(e)}")
                    output_parts.append(
                        f"JPEG image generated but decode error: {str(e)}"
                    )
            if result.svg:
                # SVG is typically text, not base64
                svg_bytes = result.svg.encode("utf-8")
                image_data = svg_bytes
                image_filename = "diagram.svg"
                output_parts.append("SVG image generated successfully")

            # Handle HTML output
            if result.html:
                output_parts.append("HTML output generated")

            # Handle markdown output
            if result.markdown:
                output_parts.append(result.markdown)

        # Also check execution.text property for main result
        if execution.text and execution.text not in output_parts:
            output_parts.insert(0, execution.text)

        # Check if output mentions a file path and try to read it from the sandbox
        output_text = "\n".join(output_parts) if output_parts else ""

        # Try multiple patterns to find file references
        file_path_match = re.search(
            r"sandbox:([^\s\)]+\.(png|jpg|jpeg|svg))", output_text, re.IGNORECASE
        )
        filename_match = re.search(
            r"[`'\"]([^\s`'\"]+\.(png|jpg|jpeg|svg))[`'\"]", output_text, re.IGNORECASE
        )
        simple_filename_match = re.search(
            r"\b([a-zA-Z0-9_-]+\.(png|jpg|jpeg|svg))\b", output_text, re.IGNORECASE
        )

        # Determine which path to use
        sandbox_path = None
        if file_path_match:
            sandbox_path = file_path_match.group(1)
        elif filename_match:
            sandbox_path = filename_match.group(1)
        elif simple_filename_match:
            sandbox_path = simple_filename_match.group(1)

        if sandbox_path and not image_data:
            try:
                # Try to read the file from sandbox
                file_data = sandbox.files.read(sandbox_path, format="bytes")
                if file_data and isinstance(file_data, bytearray):
                    image_data = bytes(file_data)
                    image_filename = Path(sandbox_path).name
                    output_parts.append(
                        f"Successfully retrieved image file: {sandbox_path}"
                    )
            except Exception:
                # Try common diagram output locations
                common_paths = [
                    f"/{sandbox_path}",
                    f"./{sandbox_path}",
                    sandbox_path,
                    "/architecture_diagram.png",
                    "/diagram.png",
                    "./architecture_diagram.png",
                    "./diagram.png",
                ]
                # Remove duplicates while preserving order
                common_paths = list(dict.fromkeys(common_paths))
                for common_path in common_paths:
                    try:
                        file_data = sandbox.files.read(common_path, format="bytes")
                        if file_data and isinstance(file_data, bytearray):
                            image_data = bytes(file_data)
                            image_filename = Path(common_path).name
                            output_parts.append(f"Found image at: {common_path}")
                            break
                    except Exception:
                        continue

        # Save image to local file if we have image data
        if image_data and image_filename:
            output_dir = Path("results/diagrams")
            output_dir.mkdir(parents=True, exist_ok=True)
            local_path = output_dir / image_filename
            try:
                with open(local_path, "wb") as f:
                    f.write(image_data)
                logger.info(f"✅ Saved image to: {local_path}")
                files.append(str(local_path))
                output_parts.append(f"\n✅ Image saved to: {local_path}")
            except Exception as e:
                logger.error(f"Failed to save image: {str(e)}")
                output_parts.append(f"\n❌ Failed to save image: {str(e)}")

        output = (
            "\n".join(output_parts) if output_parts else "Code executed successfully"
        )

        return CodeExecutionResult(success=True, output=output, files=files)

    except Exception as e:
        return CodeExecutionResult(
            success=False, output="", error=f"Execution error: {str(e)}", files=[]
        )

The tool handles multiple image formats (PNG, JPEG, SVG), extracts images from E2B results (base64 data or file reads), and saves them locally. If the image isn’t in the results, it falls back to trying common file paths.

The overall workflow: the agent checks the environment, installs dependencies if needed, writes Python code using the diagrams library, executes it in E2B, and extracts the generated image. Pydantic AI structures the tool calls, and E2B provides isolation and execution. The agent can reason about dependencies, write code, run it safely, and retrieve artifacts, all orchestrated through Pydantic AI’s tool system.
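The file-path fallback inside execute_python_code is the fiddliest part, so here is a condensed, standalone sketch of the same idea: try the most specific filename pattern first, then progressively looser ones. The regexes follow the ones in the tool; `find_image_path` and `unique_in_order` are hypothetical helper names introduced only for this illustration.

```python
import re
from typing import Optional

IMAGE_EXT = r"(?:png|jpg|jpeg|svg)"

# Most specific pattern first: an explicit "sandbox:" prefix, then quoted
# filenames, then any bare filename with an image extension.
PATTERNS = [
    re.compile(rf"sandbox:([^\s\)]+\.{IMAGE_EXT})", re.IGNORECASE),
    re.compile(rf"[`'\"]([^\s`'\"]+\.{IMAGE_EXT})[`'\"]", re.IGNORECASE),
    re.compile(rf"\b([a-zA-Z0-9_-]+\.{IMAGE_EXT})\b", re.IGNORECASE),
]


def find_image_path(output_text: str) -> Optional[str]:
    """Return the first image path mentioned in the output, or None."""
    for pattern in PATTERNS:
        match = pattern.search(output_text)
        if match:
            return match.group(1)
    return None


def unique_in_order(paths: list[str]) -> list[str]:
    """Drop duplicate candidate paths while keeping the first occurrence."""
    return list(dict.fromkeys(paths))
```

Pulling the matching into a small pure function like this also makes it easy to test against sample sandbox output without starting a sandbox.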

Conclusion

This actually felt useful. Even in its early state, it’s good enough to:

Generate a first-pass architecture diagram from a textual requirement
Run the code
Produce a real image
Iterate based on feedback

I think this could be genuinely useful. Some possible next steps:

deploy it! I’m thinking about AgentCore Gateway to expose it via MCP
try out other code interpreters like Amazon Bedrock AgentCore Code Interpreter, Modal, Daytona, or Runloop
add in a critique agent that can view the image and give feedback
evals…

This was a fun two-night build. And honestly, anything that lowers the barrier to creating an architecture diagram is worth celebrating.

More experiments soon.

🔗 9:45 PM

May 3, 2025

AI Storybooks

My wife and I recently had a baby. She is 4 months old now, and as you can imagine, the AI use cases for raising a baby are endless. There are the late-night “what do I do if my baby won’t sleep” questions to 4o, there are the “please research why we should or shouldn’t let our baby use a standing play station” requests to o3, and lastly, there is the “write me a story I can read to my 4-month-old about her trip to Central Park” to 4.5 and 4o image.

It’s the last use case that this post is about. Wouldn’t it be cool if I had a basic Python script, web app, or something that could take an input and generate a beautiful children’s book? Couple this with OpenAI’s latest image model, `gpt-image-1`, and my ideas started flowing. Today I’m ready to release a v0 here.

I’m excited to introduce Custom Story Gen - an early prototype of an AI-powered tool that will generate custom, print-ready children’s storybooks. This is more than just a content generator. It’s a wrapper on LLMs that brings us toward a future where anyone can create high-quality, personalized storybooks starring their own children, pets, family members, and adventures, complete with matching illustrations.

My goal with this app is to generate stories that are so coherent, well-illustrated, and emotionally relevant that you’d feel great about printing them. Whether they’re used for bedtime traditions or birthday gifts, I want these books to feel personal and polished.

What It Can Do Today

Today is an early launch. It’s highly technical and requires Python experience to run. Today the app can:

Create original stories tailored to your child’s age range
Include your own characters (and even upload a photo for illustration purposes)
Customize the setting, tone, theme, and length
Automatically generate matching illustrations for each page
Export a structured folder with page images, text, and metadata

Here is a basic example of the configuration you can use:

{
  "characters": [
    {
      "name": "Ella",
      "description": "A baby human with a big smile",
      "image_path": "input/images/ella.jpeg"
    },
    {
      "name": "Rory",
      "description": "A black and white tibetan terrier puppy"
    }
  ],
  "theme": "Friendship and overcoming fears",
  "age_range": "1-2 years",
  "location": {
    "setting": "A sunny meadow next to a sparkling blue river",
    "details": ["Tall swaying grass", "Colorful wildflowers", "Busy buzzing bees"]
  },
  "story_length_pages": 5,
  "image_style": "Colorful cartoon illustration, simple and friendly, watercolor texture"
}
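As a rough illustration of how a config like this could be loaded and sanity-checked before any API calls are made, here is a minimal sketch. The `Character` and `StoryConfig` classes and the `parse_story_config` function are hypothetical names for this example, not the tool's actual code, and the validation rules are assumptions.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Character:
    name: str
    description: str
    image_path: Optional[str] = None  # optional reference photo


@dataclass
class StoryConfig:
    characters: list[Character]
    theme: str
    age_range: str
    setting: str
    details: list[str] = field(default_factory=list)
    story_length_pages: int = 5
    image_style: str = ""


def parse_story_config(raw: str) -> StoryConfig:
    """Parse a JSON string shaped like the example config and check the basics."""
    data = json.loads(raw)
    characters = [Character(**c) for c in data["characters"]]
    if not characters:
        raise ValueError("at least one character is required")
    pages = int(data.get("story_length_pages", 5))
    if pages < 1:
        raise ValueError("story_length_pages must be at least 1")
    location = data.get("location", {})
    return StoryConfig(
        characters=characters,
        theme=data["theme"],
        age_range=data["age_range"],
        setting=location.get("setting", ""),
        details=list(location.get("details", [])),
        story_length_pages=pages,
        image_style=data.get("image_style", ""),
    )


# A shortened copy of the example config, inlined so the sketch runs on its own.
EXAMPLE = """
{
  "characters": [
    {"name": "Ella", "description": "A baby human with a big smile",
     "image_path": "input/images/ella.jpeg"},
    {"name": "Rory", "description": "A black and white tibetan terrier puppy"}
  ],
  "theme": "Friendship and overcoming fears",
  "age_range": "1-2 years",
  "location": {"setting": "A sunny meadow next to a sparkling blue river",
               "details": ["Tall swaying grass"]},
  "story_length_pages": 5,
  "image_style": "Colorful cartoon illustration"
}
"""

config = parse_story_config(EXAMPLE)
print(config.characters[1].name, config.story_length_pages)
```

Failing early on a missing character list or a nonsensical page count is cheaper than finding out after several image generations have already been paid for.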

Here are some example outputs from a recent story:

Under the Hood: The Architecture

The current system is simple by design, but thoughtfully structured:

Python 3.12+ powers the backend logic
OpenAI & Gemini APIs are used to generate story text and illustrations
JSON config-driven design allows non-developers to use the tool easily

File-based output includes:

A .txt file with the story
Individual .png illustrations per page
A manifest JSON for downstream publishing or UI use
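To make the output layout concrete, here is a small sketch of writing that kind of folder. The file names (`story.txt`, `page_01.png`, `manifest.json`), the manifest fields, and the `write_book` function are all assumptions made for illustration; the real tool may name and structure things differently.

```python
import json
import tempfile
from pathlib import Path


def write_book(output_dir: Path, title: str, pages: list[dict]) -> Path:
    """Write story.txt, one PNG per page, and manifest.json; return the manifest path.

    Each page is a dict with "text" (str) and "image_bytes" (bytes).
    File names here are illustrative, not the tool's real layout.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"title": title, "story_file": "story.txt", "pages": []}
    story_lines = []
    for number, page in enumerate(pages, start=1):
        image_name = f"page_{number:02d}.png"
        (output_dir / image_name).write_bytes(page["image_bytes"])
        story_lines.append(f"Page {number}\n{page['text']}\n")
        manifest["pages"].append(
            {"number": number, "text": page["text"], "image": image_name}
        )
    (output_dir / "story.txt").write_text("\n".join(story_lines), encoding="utf-8")
    manifest_path = output_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes stand in for real PNG data from an image model.
    demo_pages = [
        {"text": "Ella met Rory in the meadow.", "image_bytes": b"placeholder"},
        {"text": "They watched the bees together.", "image_bytes": b"placeholder"},
    ]
    path = write_book(Path(tmp) / "output", "Ella and Rory", demo_pages)
    loaded = json.loads(path.read_text(encoding="utf-8"))
    written_files = sorted(p.name for p in path.parent.iterdir())
    print(written_files)
```

Keeping the page order and image names in a manifest means a later web UI or print pipeline can rebuild the book without parsing file names.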

It’s a thin wrapper for now—but a powerful one.

What’s Coming Next

This prototype is only the beginning. Here’s what’s on the roadmap:

LLM-Powered Evaluation: We plan to build an agent that acts as a quality control editor, rating generated books for clarity, engagement, tone, and age-appropriateness.

Turn It Into a SaaS Tool: A user-friendly web app where parents can create, preview, and purchase personalized books without touching code.

Smarter Agentic Backends: Today’s story generation is linear. Future versions will use agents to dynamically plan story arcs, revise drafts, and coordinate illustration prompts for consistency.

Printer Integration: Ultimately, we want to connect directly to on-demand book printing services—so you can create and order a printed copy in just a few clicks.

Try it today

Want to test it out?

git clone https://github.com/yourusername/story-book-creator.git
cd story-book-creator
uv sync
cp .env.example .env  # Add your OpenAI and Gemini keys
cp story_config.example.json input/story_config.json  # Customize your story
uv run main.py

Your generated book will appear in the output/ directory. Tweak your config, try different characters, experiment with styles—this is your sandbox.

We’re just getting started. If you want to follow along—or contribute—check out the GitHub repo and stay tuned for upcoming updates. The future of storytime is personal, and we’re building it—one page at a time.

🔗 8:00 PM

March 9, 2025

🔗 cursor-large-projects

Most of the Twitter conversation about Cursor has been about how it’s a great tool for small projects. I think that’s a shame, because I think it’s a great tool for large projects.

Cursor is a powerful tool for maintaining large coding projects, helping developers code 5-30 times faster. The guide emphasizes the importance of an effective edit and test loop to improve code quality. By setting up proper documentation and workflows, engineers can leverage AI for refactoring, documentation, and project planning.

I love the ai folder specific to the project. I also appreciate how you can have Cursor run tests which isn’t something I’ve done a lot of yet, but look forward to trying.

🔗 3:27 PM •

ai llm cursor

January 31, 2025

Eval datasets and frameworks survey

The rapid pace of model development means everyone’s on a never-ending quest to figure out if the latest model is actually better than its predecessor. Public benchmarks are essential, but they usually only paint part of the picture. By rolling your own evaluations, you get a direct view of how a model handles tasks that matter to your team—like domain-specific question-answering, custom code generation, or weird edge cases unique to your product.

This is the first part of what I hope will be a series: a survey of popular evaluation datasets, with a quick description of each.

Evaluation Datasets

TruthfulQA – Tests how well a model avoids repeating human falsehoods. Comes in generative and multiple-choice variants. Great for checking whether your model parrots misinformation.
Lab Bench – A robust, biology-focused dataset with 30 subtasks like protocol troubleshooting and sequence manipulation. Perfect if you’re dealing with scientific research workflows.
SWE-bench – Focuses on real GitHub Issues. Ideal if your team wants to evaluate code quality, debugging capabilities, or how well a model handles real-world developer workflows.
RE-Bench – Specifically probes AI’s R&D capabilities in a controlled environment, letting you compare model performance against human benchmarks.
GPQA – Graduate-level multiple-choice questions from actual PhD students. This is great if you’re dealing with advanced scientific or technical reasoning tasks that require real depth.
Frontier Math, GSM8K, MATH, and DeepMind Mathematics – For math-savvy teams, these are gold. They test everything from grade-school arithmetic to high-level theorem solving.
HellaSwag, WinoGrande, and MMLU Benchmark – If you want to test common-sense reasoning, logic, or broader knowledge capabilities, these cover a wide range.
ARC (Abstraction and Reasoning Corpus) – Good for puzzles that test a model’s ability to identify patterns without explicit instructions.
PopQA – An entity-centric question-answering set that spans popular and long-tail entities. Useful for stress-testing how well a model recalls entity-specific facts, especially about less common entities.
HumanEval, BigCodeBench – If you need to see how your model handles code generation or code QA.
IfEval-OOD, HREF, BigBenchHard, DROP – More specialized sets that target out-of-distribution reasoning, reading comprehension, or advanced multi-step logic.

Evaluation Frameworks

OLMES – A new tool that simplifies loading, running, and reporting benchmarks on your model.
lm-evaluation-harness by EleutherAI – One of the most established frameworks, supporting a ton of datasets and easy to customize for your own data.

I’m sure I’m missing some. If you know of any, please let me know. For now I think this puts one in a good position to start clicking around and researching which of these datasets are most relevant to their use case. From there, you can use one of the evaluation frameworks to run through the curated dataset.
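For a sense of what "rolling your own" can look like before reaching for a framework, here is a minimal sketch of an exact-match eval loop. The dataset, the `stub_model` function, and the normalization rule are all made up for illustration; a real run would swap the stub for an actual model call and use a curated dataset.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(text.lower().split())


def exact_match_accuracy(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Fraction of examples where the model's answer matches the reference."""
    if not dataset:
        raise ValueError("dataset is empty")
    correct = sum(
        normalize(model(ex["question"])) == normalize(ex["answer"]) for ex in dataset
    )
    return correct / len(dataset)


# Hypothetical three-item dataset, standing in for a curated one.
TOY_DATASET = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What color is a clear daytime sky?", "answer": "Blue"},
]


def stub_model(question: str) -> str:
    """Stand-in for a real model call; it misses the last question on purpose."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "  paris ",
    }
    return canned.get(question, "I don't know")


score = exact_match_accuracy(stub_model, TOY_DATASET)
print(f"exact match: {score:.2f}")
```

Exact match is the bluntest possible metric. The frameworks above earn their keep once you need task-specific scoring, few-shot prompting, or reporting across many datasets.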

🔗 11:16 PM •

evals llm

January 20, 2025

Running DeepSeek R1 Locally

If you want to see a model “think” on your local machine, here is a quickstart:

Download a distilled version of the model: ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0

Chat with the model there OR if you have the LLM CLI tool installed and the llm-ollama plugin (llm install --upgrade llm-ollama), you can run:

llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
    'Flagship Pioneering is a Venture Creation company that creates companies. Come up with 5 new company ideas for them'

I’ve found the outputs to be mediocre. It’s a distilled model, etc. etc. and a better comparison would be with their public API; however, it’s fascinating to see the model “think” on your local machine.

🔗 9:10 PM •

deepseek llm ollama

January 14, 2025

🔗 o1-not-a-chat-model

There has been a lot of chat about the new reasoning models, and how they are not chat models. I completely agree, and want to add my voice to the chorus.

The focus of this is that an o1 prompt should look a lot different than your typical chat:

Understanding this prompt anatomy is crucial because it fundamentally changes how organizations need to approach AI implementation. While the structure might make sense to technical users, rolling this out across an enterprise presents unique challenges. The shift from quick chat-style interactions to detailed, structured briefs impacts everything from user training to workflow design.

The enterprise challenge with o1 isn’t just about adopting new tech - it’s about fundamentally changing how people work with AI. While chat models let users dive in with quick questions and iterate, o1 demands what I’d call “front-loaded effort” - you need to dump ALL the context upfront and carefully frame what you want. This creates an interesting tension for enterprise adoption.

On the upside, o1’s report-style outputs actually align really well with enterprise needs. You get structured, thorough analysis that reads like a proper business document rather than a casual chat. Perfect for decision-making and documentation.

But here’s the catch - teaching busy enterprise users to write these detailed briefs is tough. They’re used to the quick back-and-forth of chat models or traditional tools. Now we’re asking them to:

Front-load all context (which means gathering it first)
Clearly define outputs (no vague requests)
Wait longer for responses (potentially 5+ minutes)

For enterprise rollouts, I think this means:

Training needs to shift from “how to chat with AI” to “how to brief AI”
Expectations around response times need resetting (not usually a big deal)
Best practices around context gathering need development

The real kicker? Just when enterprises were getting comfortable with chat-based AI, this paradigm shift forces another round of change management. It’s like teaching someone a new language right after they got comfortable with the first one.

To make this work, enterprises might need dedicated “AI prompt engineers” - people who can bridge the gap between users and these more demanding but powerful models. Think of them as technical writers for the AI age. If not dedicated people, then companies could consider dedicated projects and engagements focused on bringing reasoning prompts to business users.

Additionally, it’d be helpful to start sharing the art of the possible for business users with reasoning models like o1. Let me share four practical examples where business users could leverage reasoning models effectively:

Quarterly Report Analysis: Instead of asking quick questions about numbers, dump the entire quarterly spreadsheet, previous reports, and industry context into the model and ask for a comprehensive analysis. The model can identify trends, flag concerns, and create executive summaries - all in one thorough shot. Much better than piecemeal analysis through chat.

Meeting Summary & Action Plans: Take a raw meeting transcript, add the project background, team structure, and goals, then ask the model to create a structured output with: key decisions, action items, risks identified, and next steps. The model’s ability to process all this context at once means better synthesis than parsing piece by piece.

Policy Compliance Review: Perfect for legal or HR teams. Feed in your company policies, industry regulations, and a proposed new process or policy. The model can do a thorough gap analysis, identifying compliance risks and suggesting specific updates. Much more reliable than trying to check compliance point-by-point through chat. Plus, the model’s formal report style matches the serious nature of compliance work.

RFP Response Analysis: For procurement or sales teams, dump in the entire RFP document, your company’s past proposals, competitor intel, and pricing strategy. Ask for a detailed analysis of what sections need focus, suggested win themes, pricing recommendations, and potential red flags. The model’s ability to process all this context at once helps create a cohesive strategy instead of answering one requirement at a time.

The key theme here? These aren’t quick Q&A tasks - they’re meaty problems where the user invests time upfront to get comprehensive, actionable insights in return. Think “weekly deep-dive” rather than “quick daily check.”
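One way to make that upfront investment repeatable is a small template that refuses to build a prompt until the context and the desired outputs are filled in. The sketch below is only an illustration: the `build_brief` function, the section headings, and the ordering of sections are my own assumptions, not a prescribed o1 prompt format.

```python
def build_brief(goal: str, context: dict[str, str], outputs: list[str]) -> str:
    """Assemble one front-loaded prompt: context sections, then the goal, then the output spec."""
    if not context:
        raise ValueError("a brief needs context; gather it before prompting")
    if not outputs:
        raise ValueError("define the outputs you want; vague requests get vague reports")
    parts = ["# Context"]
    for label, body in context.items():
        parts.append(f"## {label}\n{body.strip()}")
    parts.append(f"# Goal\n{goal.strip()}")
    parts.append("# Required output\n" + "\n".join(f"- {item}" for item in outputs))
    return "\n\n".join(parts)


# Example based on the meeting-summary use case; the contents are placeholders.
brief = build_brief(
    goal="Summarize this meeting and produce an action plan.",
    context={
        "Project background": "Migrating billing to a new vendor by Q3.",
        "Transcript": "(paste the raw meeting transcript here)",
    },
    outputs=["Key decisions", "Action items with owners", "Risks identified", "Next steps"],
)
print(brief)
```

A template like this could be one piece of the "how to brief AI" training: it nudges users to gather context first and to name the deliverables explicitly.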

Bottom line: Treat o1 like the powerful reasoning engine it is - with proper training and support - and it’ll transform your business. Treat it like ChatGPT, and you will likely struggle with user frustration and poor results.

🔗 4:45 PM •

ai llm o1 llm-reasoning

January 11, 2025

🔗 Agents

I came across Chip Huyen’s blog via Twitter today. Her blog is great and has a lot of content I’m going to read. I’ll start with this post on Agents.

Chip defines Agents as:

An agent is anything that can perceive its environment and act upon that environment. Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.⁠

She then explains that the interaction between an agent and the environment is the key to the agent’s success. If you build a web scraping agent, the internet is the environment. If you build an agent that can play Minecraft, the Minecraft world is the environment. And so on.

Planning is the heart of an agent, and in the blog, Chip provides a lot of detail on it. Planning is decoupled from execution. The plan requires that the model understands the user’s intent and requires that the model breaks the task down into smaller subtasks. You can do a lot to help improve the likelihood of success - asking for human feedback, writing great system prompts, giving better descriptions of the tools available to the agent, using a stronger model, or even fine-tuning a model.

To me it seems that with the current state of LLMs, the most effective manner to improve the planning aspect of an agent is to ask for human feedback. There may be verticals, like coding or math, where the plan is easier for an LLM to generate, but for the long tasks, I think asking for human feedback is the best way to improve the agent. I wonder if we’ll ever see the reasoning models (o1, o3, etc) allow users to adjust the plan on the fly. I know that Google’s Gemini 1.5 Deep Research has a “refine” feature that allows you to adjust the plan already.
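The idea of keeping planning separate from execution, with a human checkpoint in between, can be sketched in a few lines. Everything here is a stand-in: `run_agent` and the three stub functions are hypothetical, and a real agent would call an LLM to plan and real tools to execute.

```python
from typing import Callable

Plan = list[str]


def run_agent(
    task: str,
    make_plan: Callable[[str], Plan],
    review_plan: Callable[[Plan], Plan],
    execute_step: Callable[[str], str],
) -> list[str]:
    """Plan first, let a human revise the plan, then execute only the approved steps."""
    draft = make_plan(task)
    approved = review_plan(draft)  # human feedback happens before any action is taken
    return [execute_step(step) for step in approved]


# Stubs standing in for an LLM planner, a human reviewer, and tool execution.
def stub_planner(task: str) -> Plan:
    return [f"search the web for: {task}", "summarize findings", "email the summary"]


def stub_reviewer(plan: Plan) -> Plan:
    """Pretend the human removed the step they did not want."""
    return [step for step in plan if not step.startswith("email")]


def stub_executor(step: str) -> str:
    return f"done: {step}"


results = run_agent(
    "standing play stations for babies", stub_planner, stub_reviewer, stub_executor
)
print(results)
```

Because the plan is plain data that exists before anything runs, the reviewer can drop, reorder, or rewrite steps without touching the execution code.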

The post is great, and the blog looks promising. I’ll be back to Chip’s writing!

🔗 10:12 PM •

ai llm agents llm-planning llm-feedback llm-reasoning

January 8, 2025

Running Ollama Vision Models Locally

When I was checking out the new Phi model with Ollama, I came across the LLM Ollama plugin. It reminded me that Ollama now supports vision models. One of the use cases I’ve been thinking about is a script that can run against all images in my downloads folder and generate a description of each image. Many of them have obscure names, and I’d like to be able to search them by description.

Here is my attempt to do this. The script is as simple as:

for file in ~/Downloads/*.{jpg,jpeg,png,gif,webp}; do
    # Skip patterns that matched nothing (bash leaves them as literal text)
    [ -e "$file" ] || continue
    echo "Processing $file"
    llm -m llama3.2-vision:11b "Describe this image." -a "$file"
done

To run this, you’ll need to install LLM and Ollama. Then you’ll need to install the llama3.2-vision model:

ollama pull llama3.2-vision:11b

and install the plugin:

llm install --upgrade llm-ollama

This is remarkably simple and produces really helpful descriptions.

The image shows a man running across the finish line of a triathlon, with a red archway and a crowd of people watching. The purpose of the image is to capture the moment of triumph for the athlete as he completes his race.

* A man:
        + Wearing a black triathlon suit
        + Running towards the camera
        + Has a number on his bib
        + Appears to be exhausted but determined
* A finish line:
        + Red archway with white text that reads "FINISH"
        + Surrounded by spectators and officials
        + Located at the end of a road or track
* A crowd of people:
        + Standing behind barriers, watching the athlete cross the finish line
        + Cheering and taking photos
        + Dressed in casual clothing, with some wearing team jerseys

The image conveys a sense of excitement and accomplishment, as the athlete reaches the end of his grueling triathlon. The crowd's enthusiasm adds to the celebratory atmosphere, making it clear that this is a momentous occasion for all involved.

It can obviously be modified to run against any file or folder that you want. I could imagine doing some intelligent document processing on this to help group similar files together and cleaning up my downloads folder.
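For the search-by-description idea, here is a minimal sketch of the lookup side, assuming the descriptions have already been collected into a dictionary keyed by file name. The `search_descriptions` function and the sample entries are hypothetical; the descriptions would really come from the loop above.

```python
def search_descriptions(descriptions: dict[str, str], query: str) -> list[str]:
    """Return file names whose description contains every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(
        name
        for name, text in descriptions.items()
        if all(word in text.lower() for word in words)
    )


# Made-up entries; in practice each value is the model's output for that file.
descriptions = {
    "IMG_4821.jpeg": "A man running across the finish line of a triathlon.",
    "download (3).png": "A screenshot of a spreadsheet with quarterly numbers.",
    "scan0007.jpg": "A puppy running through tall grass.",
}

matches = search_descriptions(descriptions, "running finish")
print(matches)
```

Plain substring matching is crude but needs no extra dependencies. Embedding-based search would be a natural next step for fuzzier queries.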

Lastly, if you prefer a web interface for more ad hoc usage, you can use Open WebUI and uvx to run it locally:

uvx --python 3.11 open-webui serve

This will automatically find the Ollama models you have, including your image model.

🔗 9:04 PM •

phi ai ollama llama vision llm til llm-ollama uv
