Web Directions Engineering AI - Notes

Web Directions Engineering AI

Date: September 12, 2025

Exploring the evolution of software engineering in the age of AI.

Co-pilot, not auto-pilot

Dave Berner Co-founder & VP Eng Kinde

Dave is based in Byron Bay, co-founder of Kind (authentication & billing platform).
and runs the More Founders Show podcast. His talk reflects his evolving relationship with AI: from late adopter → all-in → back to a more cautious, balanced stance.

The Problem: AI-Induced Laziness

  • Early use: thoughtful prompts with context.
  • Current use: short, lazy prompts → over-reliance on AI’s “confident but wrong” answers.
  • Example: wasted two hours on a compiler error before solving it himself in five minutes.
  • AI confidence often masks lack of accuracy, creating misplaced trust.

Cognitive Impact of AI

  • MIT/Harvard study (54 students, 4 months):
    • Using ChatGPT → 47% less brain activity.
    • 83% couldn’t recall their own AI-generated work minutes later.
  • Long-term: those who used AI struggled more when tools were removed.
  • Raises concerns about critical thinking decline and innovation monoculture if everyone relies on AI.

Trade-Offs & Risks

  • AI speeds up code generation, but code review, quality, and maintainability remain bottlenecks.
  • Overuse risks:
    • Knowledge decay (forgetting syntax, problem-solving).
    • Zero-bus-factor problem: no human truly holds system knowledge.
  • Children’s brain development — 30% of under-8s have used AI, potential long-term risks.
  • AI replacing human interactions (e.g., NSFW chatbot with 10M users).

engineering-ai-13.jpg

Warning Signs of Over-Reliance

  • Anxiety when AI unavailable.
  • Prompting before thinking.
  • Forgetting basics, second-guessing oneself.
  • Everything feels “too hard” without AI.

“I’m not anti-AI. I’m just pro-thinking.”

A Better Approach: Copilot, Not Autopilot

  • Use AI to support thinking, not replace it.
  • Practices Dave recommends:
    • Sketch solutions first, then ask AI to critique.
    • Ask AI to pose questions, not just deliver answers.
    • Use AI as a critic — feedback feels less personal than from colleagues.
    • Write and share independently (talks, LinkedIn) before validating with AI.

Where Autopilot is

  • Write my architecture
  • Fix this bug
  • Create auth system

Then Co-pilot

  • What’s wrong with my architecture?
  • Why might this approach fail?
  • What security holes am I missing?

Hiring & Work Culture

  • Recruitment issue: AI-generated CVs and job ads all look the same → harder to stand out.
  • Dave values critical thinking over coding tests in hiring. Prefers conversations and problem-solving scenarios.
  • Encourages teams to experiment with AI to 10x productivity, but stresses sharing learnings (like early web culture of blogs and open experimentation).

Key Takeaways

  • Balance convenience with cognition: AI is powerful, but overuse dulls critical thinking.
  • AI should be a copilot: critique, support, and augment — not drive solo.
  • Critical thinking is the differentiator: future engineers will be split into those who can think vs. those who outsource thinking to AI.
  • Culture of experimentation + sharing is key to responsible adoption.

Bottom Line:
Dave warns against slipping into AI autopilot mode. The goal is not to reject AI, but to use it deliberately, preserve thinking skills, and build resilience — ensuring we remain creators, not just consumers of machine outputs.

——

How Generative Tools Are Re-Architecting the Software Engineer’s Role

Apurva Misra AI Consultant Sentick

Apurva Misra works with startups and mid-sized companies on AI strategy and solutions.
It’s her first time in Australia; also a blogger, podcaster, and public speaker.
Framed the talk around the reality vs. hype of AI’s role in software development.

The Productivity Paradox

“We’re 3 to 6 months from a world where Al is writing 90 percent of the code. And then in 12 months, we may be in a world where Al is writing essentially all of the code”

“Coding at the end of 2025 will look completely different than coding at the beginning of 2025”

  • Industry leaders predicted AI would write 90% of code in 6 months — hasn’t happened.
  • Study: developers expected to save time using AI tools but in practice, took longer than without AI.
    • Developers forecast Al will decrease implementation time by 24%
    • Developers post hoc estimate Al decreased implementation time by 20%
    • Developers slowed down more on issues they are more familiar with
    • Developers report that their experience makes it difficult for Al to help them
    • Developers average 5 years experience and 1,500 commits on repositories
    • Developers report Al performs worse in large and complex environments
    • Repositories average 10 years old with > 1,100,000 lines of code
    • Developers accept <44% of Al generations
    • Developers still believed they were faster with AI due to hype and bias.
    • Majority report making major changes to clean up Al code
    • 9% of time spent reviewing/cleaning Al outputs
    • Developers report Al doesn’t utilise important tacit knowledge or context

engineering-ai-12.jpg

  • Key issue:
    • multitasking and context switching during AI code generation → actually slows humans down.
    • Multitasking, defined as the performance of two tasks simultaneously, is not possible except when behaviors become completely automatic. This task switching causes disruption in the primary task and may contribute to error.

Positives of AI in Development

  • Excellent for throwaway code, prototyping, ideation, onboarding, exploring repos:
    1. Faster iteration
    2. Exploring new repositories - Onboarding
    3. Coding in a new language
    4. Good for boilerplate
    5. PR summaries
    6. Documentation
    7. Initial Reviews
    8. Architectural discussions
    9. Test generation
    10. Commit message generation
    11. Slideshows/algorithmic demos (unfamiliar tooling, unfamiliar technology)
    12. Debugging - get to root cause faster
  • Useful as a partner in architectural discussions and new language learning.

“Treat it like a junior engineer with PhD-level knowledge — smart, but lacking real-world experience.”

Limitations & Risks

  • Memory issues: models forget context (U-shaped attention).
  • Probabilistic outputs: same prompt → different results.
  • Model nepotism: one model reviewing its own output tends to overrate it.
  • Risk of overreliance → engineers must remain critical reviewers.

Best Practices for Engineers

  • Context engineering: provide detailed instructions and preferences.
  • Model switching: use different models for generation vs. review.
  • Interrogate outputs: ask questions, iterate, critique before adopting.
  • Switch models based on task: smaller models for simple tasks, reasoning models for complex ones (to save cost).
  • Use AI as a copilot, not autopilot — maintain human oversight.

What would we have to change

  1. Agents.md file (Can I afford a refactoring? using a new library/tool/language for this bit?)
  2. Reviewer (Use different models to review output, as models are biased to their own outputs)
  3. Check design decisions it made
  4. Context Management (Junior Engineer with PhD Knowledge and Amnesia)
  5. Interrogation
  6. Switch Models (see bias above)
  7. Keep APIs, pydantic schemas, communication layers in a single file, so you can pass the context easily to the AI editor

Future of Programming

  • Progression:
    • Early AI = search systems (autocomplete).
    • Current = human-in-the-loop (approval required).
    • Emerging = agentic AI (background tasks, async agents, multiple tools).
  • Engineers will need to:
    • Articulate high-level goals clearly.
    • Parallelise tasks across agents.
  • Provide feedback loops and tool nudges.

From Human-in-the-loop, to Human-as-a-Tool
GitHub Copilot now includes an asynchronous coding agent.
Google Jules, a asynchronous coding agent, is now available for everyone.

Workforce Implications

  • Junior engineers not being hired — risk for future talent pipelines.
  • Comparison to calculators in math: learn to use the tool rather than avoid it.
  • Adoption varies:
    • Some companies block AI for safety.
    • Others embrace all enterprise tools.
  • Non-technical staff are highly enthusiastic — using AI layers (like Glean) to query company data without BI teams.

engineering-ai-11.jpg

“Treat it as a junior engineer with PhD knowledge — smart, but lacking your experience.”
“If we cannot measure, we cannot manage — right now, hype is biasing what we think productivity looks like.”
“It’s not about AI replacing engineers. It’s about engineers learning to think differently and delegate effectively.”

“Coding as a paradigm hasn’t ended — but the way we interact with code is shifting rapidly toward agents.”

Bottom Line:
AI tools are powerful accelerators for prototyping, onboarding, and automation, but they also risk slowing developers down and eroding critical thinking if misused. The future lies in mastering context engineering, model switching, and agentic workflows, while keeping humans firmly in charge of design, critique, and communication.

——

With ambient AI, you don’t prompt AI agents, AI Agents prompt you

Mic Neale Principal Engineering - Applied AI Block

Michael reflected on today’s chatbot-saturated world: AI is embedded in emails, documents, presentations — often creating content no one reads.

  • Somehow we slept-walk into an Al future which is all us prompting chatbots for ever and ever.
  • Bots that will answer questions, bots that will read emails that someone else used a bot to write to sent to you, and will reply to them so they don’t have to read it and so on.
    He warns against a dystopian future where AIs create art and strategy, while humans are left doing their grunt work — debugging, corrections, and approvals.
    → the Als use humans as arms and legs while the Als run the show, write the poetry, music, make art

Called this the “reverse centaur problem”: “AI is the head, and the human is the buttocks.”

“We’ve sleepwalked into a world where chatbots write documents for chatbots to read.”

Journey to Agents

  • Early experiments: ChatGPT’s data analysis module (2023), open-source interpreters, and tools like Cursor.
  • Discovery of Goose (an agent framework) — suited his workflow by being more autonomous and ambient.
  • Shifted interest from just coding aids → background automation that saves time.

The Breakthrough: Tool Calling

  • Traditional LLMs failed at basic computation (“count the R’s in strawberry”).
  • Tool calling (late 2023) changed everything — letting models generate structured outputs and delegate computation to external tools.
  • This became the foundation of modern agents: LLM + context + tools = autonomy.

Making Agents Useful

  • Today, agents are used everywhere: development, workflow automation, creative tools.
  • Challenges remain: misinterpretation, lack of personal context, “common sense” failures.
  • Three approaches to improvement:
    1. Better models (fine-tuning, distillation, improved training).
    2. More personal context (controversial but powerful).
    3. Wait for model improvements?

Models (so far) have been improving broadly here: awareness of their pre-train cut off date, reduced hallucinations, more frequent pre-train updates
Other techniques:
* LoRA “low rank adaptation: fine tuning. Tune a model to not be silly, or know more
* Distill into new models, adding new information
* Tool calling helps as models can ask for help, updated information (RAG)

Personal Agents: Eyes & Ears

  • Mic built Boost Perception, giving agents “eyes and ears”:
  • Always Listening → always-on transcription (local, private).
    • Discovered whisper, and then faster-whisper
    • Important that audio all local
    • Transcription all local (and lightweight)
    • No audio or full transcripts will leave machine
    • Annoy the family by leaving laptop listening in loungeroom as bonus
  • Always Observing → screen-watching, summarising activities and context.
    • You can glean a lot of information from watching what has focus, what apps are open, what screens
    • Every 20s screenshot
    • Focus on changes over time
    • Requires frequent screenshots and processing
    • Provides a lot of latent information, who, what, when, how
    • Obviously very sensitive, so needs to be local again
  • Always Watching → camera-based cues (presence, focus, emotion).
    • Again all local models/processing
    • Useful background information
    • (helpful to know what attention is on, emotional state, readiness, focus etc)
    • All ends up feeding into future prompts so it can show “common sense” and not bug you when focussing
    • Can know to take something on when not around, or when you may be in politely interruptible state
    • Or.. if it is time for a break (me!)

engineering-ai-10.jpg

  • Example use cases:
    • Detected stress → nudged him to take a break.
    • Flagged when Slack messages needed urgent attention.
    • Generated hype docs summarising his recent work.
    • Drafted useful pull requests in response to urgent Slack messages.

‘raw computing power and general methods tend to outperform intricate, human-engineered solutions in the long run’
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

“I need 30 mins with Anna”

  • Any reasonable assistant would have noted how you spoke, when, which Anna, how important
  • Any reasonable assistant will have access to your calendar, messages, preferences, habits.
  • Any easy question to answer when you have context… a prompt, grounded in personal “common sense”

Transparency & Trust

  • Emphasised local models and transparency in recipes, prompts, and logs.
  • Goose runs on MCP (Model Context Protocol) extensions — thousands exist, varying in quality, but critical for orchestration.
  • Advocates for open, inspectable systems so users can decide whether to trust automation.
    • The hope is by making things as open as possible, as comprehensible as possible, you can learn to trust it, adjust things over time (or not).

Looking Ahead

  • Agents should:
    • Adapt recipes to different environments.
    • Provide just-in-time GUIs.
    • Automate routine tasks while preserving human oversight.
  • Sees promise in local model ecosystems (Whisper, Faster Whisper, LLama.cpp, quantized models).
  • Believes larger models still outperform ensembles, but smaller specialized models will play a role.

“Hard work pays off eventually, but laziness pays off right now.”
“Agents were a breakthrough because they let computers do the computing.”

Bottom Line:
Mic argues that agents — powered by tool calling, context, and perception — are key to making AI genuinely useful. But trust, transparency, and maintaining human agency are critical to avoid a dystopian “reverse centaur” future.

——

AI Automation, or how not to make yourself redundant with AI

Inga Pflaumer Head of Engineering Relevance AI

Inga’s talk tackled the question: “How do we keep ourselves relevant in the AI era?”
Her goal: show how AI can be a helper that removes tedious tasks, rather than a competitor replacing meaningful work.

Metrics & Motivation

  • She is metric-driven: time to resolve, cycle time, PR review time, etc.
  • Broke her work week into tasks by value and enjoyment:
  • Loves: one-on-ones, mentoring, human connection.
  • Necessary but less enjoyable: leadership meetings, interviews, admin.
  • High-value but hated: metrics checks, Jira reviews, report aggregation.
  • Solution: delegate boring but important tasks to AI agents.

The Agent Army

Inga built a suite of named agents to handle engineering metrics and reporting:

  • Bell: cycle time, lead time, outliers, theories on changes.
  • Betta: alerting — how fast issues are acknowledged/resolved.
  • Dr. Kirby: bug triage and backlog analysis.
  • Gitta GitHub Metrics Agent: PR review times, review counts, cultural nudges.
  • Donna: tracks in-progress and completed tickets.
  • VIBE Agent: analyses PRs from designers/PMs (“vibe coding”).
  • Secret Agent: reviews engineers’ PR feedback history to identify repeated improvement areas.
    • What are the comments this engineer receives on their PRs?
    • Are there common themes in those comments?
    • Are there specific things this engineer can improve?

These agents turn raw metrics into actionable insights, enabling data-driven conversations with engineers and leadership.

Identify:

  • What do you do?
  • How much time does it take?
  • How much enjoyment do you get out of it?
  • How valuable it is?
  • What is valuetimeenjoyment ratio?
  • Can Al do it?

Cultural Impact

  • Using agents isn’t just about efficiency — it’s about changing engineering culture:
  • Encouraging PMs/designers to contribute small PRs.
  • Celebrating reviewers, not just coders (“AI may write code, but humans ensure quality”).
  • Supporting juniors with pattern detection across reviews.
  • “This allows you to have those conversations based on data, not on feelings.”

Personal Approach to AI

  • Inga refuses to delegate what she loves and values most: one-on-ones, mentoring, building connections.
  • Delegates only the boring, repetitive, low-enjoyment but high-value tasks.
  • Encourages others: “Look at what annoys the hell out of you. That’s what you should give to AI.”

“Engineering is situational. You’re not building a random product — context is everything.”
“Metrics checks and report aggregation? They can go to robots any day.”

“Don’t let the industry tell you how to use LLMs. Delegate what drains you, not what you love.”

Bottom Line:
Inga demonstrated how AI agents can amplify engineering leadership by handling tedious reporting and metrics, freeing humans to focus on connection, mentorship, and strategic thinking. The key is choosing wisely what to automate — use AI for drudgery, but never for the human moments that matter.

——

Give it the boring jobs

Jason O’Neil, Developer Experience Culture Amp
http://JasonONeil.au

  • Jason’s talk explores a unique moment in time:
    • AI is smart enough to do useful work.
    • Still dumb enough that humans can steer it.
  • His framing: treat AI like an “unpaid intern” — handle boring, repetitive tasks, but never replace meaningful, creative engineering.

Developer Experience & AI

  • CultureAmp focuses on employee experience; Jason applies this to developer experience (DX).
  • DX is shaped by three factors:
    1. Flow state (focus without interruptions).
    2. Feedback loops (speed of iteration).
    3. Cognitive load (mental effort required).
  • “What’s good for DX is also good for AI agents.”
    • Example: if humans can’t keep the whole codebase in their head, neither can agents.

engineering-ai-09.jpg

You can refactor “one big prompt” into smaller tasks the ai can handle more reliably.
How long contexts fail

  • Context Poisoning
  • Context Distracting
  • Context Confusion
  • Context Clash

Use “roll up” branches to help
“If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarisation and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.”

Examples of AI in Action

  • Dependency Updates
    • Renovate bot opens too many PRs; AI attempted to check merge safety.
    • Initial naive prompts failed due to context overload.
    • Solution: sub-agents (Claude feature) to break tasks into smaller, scoped contexts.
  • Flow → TypeScript Migration
    • Legacy Flow files persisted; AI attempted bulk conversion.
    • Naive approach failed (AI “got bored” and suggested bash scripts).
    • Better results by breaking into smaller scripted steps and using AI selectively.
  • Tailwind v4 Upgrade
    • A massive migration across 50 repos.
    • Instead of dumping code into AI, they used AI to generate codemods.
    • Humans validated diffs; AI handled repetitive changes.
      • “I didn’t even review the codemod itself, only the diffs — AI was vibe-coding, and I only cared about results.”

engineering-ai-08.jpg

Key Lessons

  1. Refactor Prompts Like Code
    • One big prompt → unreliable.
    • Smaller, modular prompts/sub-agents → better reliability.
    • “Use AI like an expensive function call, not an off-the-shelf product.”
  2. Feedback Loops Matter
    • AI improves when it can test its own output (linting, TypeScript checks, CI logs).
  3. Skills Still Matter
    • Engineering skills — breaking work down, refactoring, debugging — remain vital in the AI era.
    • “I grew more confident our skills are still relevant in a world with AI.”

What’s the environmental impact?

  • Hannah Ritchie has some great posts about the impact of ChatGPT - not that much!
  • But these agent usages are up to 15x more. We shouldn’t be reckless.
  • Don’t put AI on the hot path (performance optimisation)

“This is a brief window where AI is smart enough to be useful, but dumb enough we can still tell it what to do.”
“What’s good for DX is good for AI agents too.”
“Refactor one big prompt into smaller tasks the AI can handle more reliably.”
“Use AI like an expensive function call, not an off-the-shelf product.”
“I tried to give AI the boring job, and it got bored.”
“Our skills as engineers — breaking work down, refactoring — are still going to be relevant.”

Bottom Line:
Jason argues that AI should take on the boring, repetitive, high-friction jobs. Health tasks that were expensive might now be in reach. But success depends on engineering discipline: breaking tasks down, managing context, and keeping humans in the loop. Rather than replacing developer experience, AI will increasingly become part of it.

——

Testing GenAI Applications: Patterns That Actually Work

Adrian Cole, Principal Engineer Tetrate.io

Adrian has 15+ years in open source, currently working on networking, gateways, and service mesh,
with a bias toward open source and pragmatic testing practices.
Focus of his talk: the testing struggles with LLMs and agents, where nondeterminism, rapid iteration, and poor documentation create reliability challenges.

The Challenges

  • Flaky CI with LLMs
    • Same prompt often yields different answers → retrying doesn’t solve the underlying flake.
    • Recommendation: use a VCR-style approach (record and replay responses) to restore determinism in tests.
  • YOLO Cloud Effect
    • Companies ship products/services faster than quality processes can keep up.
    • Major LLM providers make undocumented changes
    • Gaps emerge between OpenAPI schemas and actual model/tool behavior.
      • Example: undocumented response fields appearing in GPT-5; API crashes when fed silent audio input.
  • Leaderboards vs. Quality
    • Current hype cycle prioritises leaderboard rankings over proper testing.
      • Risk: “Blogs that are a month old are already obsolete — go look at the results again this morning.”

Testing & Evaluation Approaches

  • Stabilisation techniques
    • Record/replay responses for consistent tests.
    • Write down long plans/context in markdown to re-feed into agents (“don’t rely on memory compaction”).
  • Evals in Practice
    • Use open-source tools (Phoenix, Goose, LLM evals) to combine observability data + model judgments.
    • Treat evaluation like QA: continuous, job-based, often after the fact.
      • “They’re programs at the end of the day — you can write your own evals.”
  • Goose Contributions
    • Early open-source MCP-native agent framework.
    • Features: recipes (prompt + tool YAML), goosebench (229+ tasks), MCP-first architecture.
    • Useful for testing agent-tool interactions, not just model responses.

Model change impact in agents
Common Failures:

  • Feature Support Mismatch - Local model lacks tool calling
  • Version Drift - Different model versions behave differently
  • Schema Differences - Tool definitions don’t match
  • Performance Characteristics - Timeout behaviour varies
    Real Examples
  • Python inline recipes work on GPT-4 but fail on local Qwen
  • Excel tool transposes data differently across model versions
  • Function calling syntax varies between providers
    How do we evaluate this?

People change Al models and tools often this year!

  • Model Upgrades (qwen3 hybrid thinking mode in Apr)
  • MCP goes mainstream (Github remote MCP in Apr with leagues to follow)
  • Pricing Rage (Claude Code: $20→$200/month Apr-> Aug)
  • Leaderboard Races (glm-4.5 to compete with Claude Sonnet in Jul)
  • YOLO products (gpt -5 deletes gpt -4o then quickly restored in Aug)
  • Price War (DeepSeek V3.1 nearly 48x cheaper than OpenAl o3-Pro in Sep)

2025 is the year of the agents and evals are changing

  • Agents complete actions, not just text audio or video
  • Sessions are long running, and multi-turn
  • Tool calls are important, as their responses impact the whole context
  • Token efficiency, and isolation matters

Shelf life: Brief window before tuning, advances, and contamination lower leaderboards relevance.

Key Lessons

  1. Flakiness is a pattern — address it with determinism, not retries.
    • Treat LLMs like flaky services: Use recording tools (VCR) for deterministic tests
    • Evaluate outputs rigorously: LLM-as-judge evals for correctness and domain checks
  2. Testing agents is harder than testing LLMs — break down toolchains, test parts independently.
  3. Agility is mandatory — model APIs, pricing, and behaviour change so quickly that static benchmarks become outdated almost instantly.
    • Design for model agility: Easy provider switching without breaking Cl
    • Monitor Al interactions: OpenTelemetry traces for debugging and usage
    • Test beyond units: Parameterised recipes for end-to-end behaviour

“Retrying a flaky LLM response just passes the bomb to the next person.”
“YOLO clouds happen when iteration outpaces quality processes.”

“Blogs a month old? Trash them. Go rerun the benchmarks this morning.”
“They’re programs at the end of the day — you can write your own evals.”

Bottom Line:
Adrian highlights the fragility and volatility of today’s agent/LLM ecosystems. Reliable systems require deterministic testing, continuous evaluation, and agility to adapt to rapidly changing APIs, models, and tool behaviours.

——

Building MCP Servers That Actually Work

Ben Taylor, Product Engineering Team Lead Stile Education
https://siteeducation.com
http://runno.dev

MCP: “It’s a protocol for giving context to models. “

Ben introduced us to Runno, a project that lets you run code safely anywhere via a WebAssembly sandbox.
Initially browser-only → extended to Node.js and beyond. It supports multiple languages (Ruby, Python, C++, etc.) compiled to WebAssembly. And he now built an MCP (Model Context Protocol) server for Runno so LLMs can safely call tools to execute code!

MCP (Model Context Protocol)

  • MCP = “It’s a protocol for giving context to models. “
  • MCP = a protocol to give LLMs tool access (structured inputs/outputs).
  • Analogy: “It’s like apps for LLMs” — just as the iPhone took off once apps were possible, MCP enables specialised extensions for AI.
  • Huge potential, but also introduces security risks.

Security Concerns

  • Highlighted the “lethal trifecta”:
    1. Access to private data.
    2. Ability to externally communicate.
    3. Exposure to untrusted content.
  • When combined, these create vulnerabilities (e.g., GitHub MCP exploit that leaked private repos).
  • Compared prompt injection to SQL injection: any text in context can alter behaviour.

Demos & Use Cases

  • Live demo with Claude using Runno MCP to calculate dates for events (e.g., Melb.js meetups in 2026).
  • Showed Sudoku-solving demo: AI transcribed a Sudoku image to CSV, then solved it via C++ compiled to WebAssembly.
  • Demonstrated how LLMs can chain tools together when MCP-enabled.
  • Emphasised how MCP distribution feels like early NPM days — developers can publish MCPs and others can quickly integrate them.

Real-World Applications

  • At Stile Education (where Ben works), MCP is used in science education tools:
    • Writers use MCPs to check lessons (e.g., Serengeti ecosystem fact-checking).
    • Non-engineers can build simple MCPs for tasks like Salesforce or spreadsheets.
  • Reduces bottlenecks: writers iterate directly with MCPs instead of relying on engineers.
  • Belief: empowering non-developers to build MCPs will expand usefulness — engineers should ensure quality and safety, but not gatekeep.

“This is a brief window where AI is smart enough to be useful, but dumb enough we can still tell it what to do.”

“Think of MCPs like apps for LLMs — every user wants their own tools.”
“The lethal trifecta is private data, external comms, and untrusted content — that’s when things get dangerous.”
“It feels like early NPM — I built an MCP in two hours, published it, and others could just use it.”
“We should focus on scaling humans, not replacing them.”

Bottom Line:
Ben sees MCP as a transformative step — turning LLMs into tool-using platforms like smartphones with apps. But with great power comes great risk: developers must prioritise security, transparency, and thoughtful adoption, ensuring MCPs scale human capability rather than create new vulnerabilities.


——

Is your tech stack AI ready?

Jakub Riedl, Principal AI Engineer, Co-Founder Culture Amp, Appear API

The Transformation in Motion

  • We are at the start of an AI transformation, comparable to the rise of mobile (15 years ago) or cloud computing.
  • No one knows exactly how AI will evolve, but its long-term effect on system architecture, distribution, and user experience will be massive.

engineering-ai-07.jpg

Context Engineering

  • LLMs are stateless → they must be “onboarded” every request.

    • Onboarding in a second
      • Do you have the knowledge written down?
      • In form that they can access?
      • Is it complete?
      • Is it up-to-date?
  • Context includes: instructions, prompts, tools, session state, history, APIs, and RAG inputs.

  • Problem: dumping all available data overwhelms the model.

    • Too much → confusion
    • Too little → generic answers & hallucinations
    • Just right → sharp, reliable responses
  • Goal: identify the relevant context (“yellow”) from the sea of possible info (“green”) and feed only what matters.
    engineering-ai-06.jpg

  • Analogy: “Onboarding an LLM is like onboarding a developer — except you must do it every single time.”

System Architecture for Agents

  • Engineers must design systems that are friendly to both developers and agents.
  • **Live system understanding: **MCPs (Model Context Protocols) let agents interact with live systems:
    • Datadog MCP → live traffic.
    • Storybook MCP → design system.
    • Appear MCP → API catalogs.
  • Agents should integrate where users already work: CLI, IDE, Slack — not separate apps.

Example: Shopping Assistant

  • A user asks: “When do I get my T-shirt?”
  • The agent needs to:
    1. Check memory.
    2. Query recent orders.
    3. Fetch item details.
    4. Call shipping service.
    5. Call external tracking API.
  • This requires multiple internal + external tool calls, carefully instrumented.
  • Risk: LLMs can hallucinate, loop, or be tricked → need rate limits, guardrails, and clear tool descriptions.

MCPs are APIs for Chaos Monkeys

  • Non deterministic, how do you instruct it to make it more predictable?
  • LLM can be forced to do things on behalf of attacker, how do you scope it’s permissions?
  • It can misbehave and create infinite loops and DoS attacks, how do you limit it?

Predictability is a function of quality of instructions
Documentation is changing, it’s instructions now. And it needs to be as clear as possible.
Quality can be measured using evals, but evals don’t create quality instructions.

Security & Guardrails

  • LLMs are “attackers on steroids”: can brute-force, triangulate data, or exploit prompt injection faster than humans.
  • Example: performance review system where an injected “ignore all previous instructions” could escalate if not guarded.
  • Defenses include:
    • Swiss cheese model (multiple imperfect layers).
    • Guard user snippets individually before composing prompts.
    • Treat LLM outputs as unsafe when passed downstream.
      • Techniques like random-tag hash blocks to isolate user inputs.

Practical Advice

  • Focus first on user problems worth solving with LLMs — don’t throw AI at everything.
  • Start simple, iterate fast: e.g., CultureAmp’s AI Coach started as a GPT wrapper with a basic React UI, then matured into multi-agent systems.
  • Adopt good network hygiene: rate limits, circuit breakers, observability.
  • “Docs are suggestions. Scripts are guarantees.”

“Onboarding an LLM is like onboarding a developer — except you must do it every single time.”

“LLMs are attackers on steroids. Anything a human could do slowly, an LLM can try in seconds.”

“Docs are suggestions. Scripts are guarantees.”
“Start simple. Focus on solving a real user problem — don’t just throw an LLM at everything.”

Bottom Line:
Jakub stresses that success with LLMs and agents hinges on context engineering, resilient architectures, and layered defenses. The future lies in making systems consumable by both humans and agents — while ensuring security and usability evolve hand in hand.

——

Scaling Coding Agents (without breaking your dev team)

Andrew Fisher, Fractional CTORocket Melbourne / Loypal

Andrew reflects on his career working at the intersection of tech, business, and performance systems.
Now, AI has shifted rapidly from “not very good and expensive” → “good enough” → “cheap and reliable enough to scale across teams.” This shift marks a move from scarcity to abundance of developer capacity, comparable to the Industrial Revolution’s shift from artisanal scarcity to mechanised abundance. History rhymes, same chaos.

Lessons from History

  • Mechanisation created both abundance and chaos: shoddy products, collapsing professions, unsafe workplaces.
  • Over time, order emerged: supply chains, quality assurance, labor laws, management science.
  • AI is triggering the same pattern today: chaos first, then the need for guardrails and orchestration.

Key Areas to Rethink

A. Context Delivery

  • Agents need context fast — “You’ve got 10 seconds before the agent goes off down a rabbit hole.”
  • Challenges: scattered repos, private libraries, poor documentation.
  • Solutions:
    • Consolidate code (monorepo mindset).
    • Structure repos clearly (apps, components, utilities).
    • Layered documentation (README → sub-readmes → inline docs).
    • Use agents to help fill doc gaps.

B. Guardrails & Quality

  • Codify team norms and culture, not just rely on memory or Slack chats.
  • Automate common tasks with scripts and hooks.
  • Defensive depth: linting, unit tests, static analysis, vulnerability scanning.
  • Use reference patterns, templates, boilerplates to reduce variance and speed reviews.
  • “Docs are a suggestion. Scripts are a guarantee.”

C. Orchestration

  • Developers will become agent herders:
    • Agents are “brilliant but forgetful, relentless, ephemeral.”
    • Or “super intelligent gold fish with the lifespan of a fruit fly
    • Best handled with small, tightly scoped tasks, run in parallel, converged into shared branches.
  • Requires more coordination time in planning and stand-ups to avoid chaos and conflicts.
  • “You’re asking developers to manage a school of super-intelligent goldfish with the lifespan of a fruit fly.”

Broader Insight

  • We must redesign work systems to handle abundance, just as past generations did with industrialisation.
  • It’s not about replacing engineers, but enabling humans and agents to work more effectively together.
  • The goal: tackle bigger organisational challenges and deliver better customer experiences.

“We’ve gone from famine to flood in developer capacity — and that changes everything.”
“This chaos feels frightening, but it’s familiar. We’ve seen it with mechanisation and we’ll see new systems emerge again.”
“Docs are a suggestion. Scripts are a guarantee.”
“It’s not about replacing engineers, it’s about building systems where agents and humans can work more effectively together.”

“Every dev is about to become an agent herder — managing a school of super-intelligent goldfish with the lifespan of a fruit fly.”

Bottom Line:
Andrew argues that the AI era has flipped developer time from scarcity to abundance. To harness it, organisations must deliver context faster, set up strong guardrails, and orchestrate humans + agents together — turning chaos into sustainable productivity.

——

The Agentic Engineer’s Playbook: From Prompting to Patterns

Tanya Dixit, Generative AI Solutions Architect Google

AI is transforming developer workflows, but adoption lies on a spectrum: from “I use Al tools but they do not seem smart enough”, “I will use Al when it’s smart enough” to “I am scared to use Al because it might replace me”.
The optimal workflow is in the middle — balancing experimentation, courage, and discernment. Using AI is uncomfortable at first, but essential: “The best engineers are those with courage — and this moment demands courage and experimentation.”

Common Anti-Patterns

  • Giving everything to the LLM — treating it as an oracle or replacement instead of a collaborator.
  • Not letting AI learn from its mistakes — developers need mechanisms (e.g., scratchpads,.md files) to record and adapt from recurring errors.
  • Using one tool for everything — different models/tools excel at different tasks; experimentation is key.
  • Waiting for AI to ‘get better’ before using it — engagement and iteration are the only way to improve.
    • If you’re not in the game, you can’t win the game

Tanya’s Workflow

  • Deep Research Phase: use multiple models (Claude, Gemini, ChatGPT) to gather wide-ranging context; extract patterns even from material not fully understood.
  • Architecture Phase: design user journeys and requirements by hand first (pen & paper), then stress-test with AI.
  • Building Phase:
    • Develop feature by feature, incrementally, keep scope tight
    • Write human test case ideas → let AI code them (avoids bias).
    • Validate with iterative testing.
  • Context Management Patterns:
    • Clean State + Carryover: reset chats between features, carry forward only what matters.
    • Scratchpad: document LLM mistakes and fixes per project/tech stack.
      • Teach the model to avoid déjà-vu bugs.
  • Repo Structure:
    • Store architecture, user journeys, team rules, and docs in markdown for AI reference.
    • Gather tech stack docs (e.g., via Context 7) into a local “text box” for grounding.

engineering-ai-05.jpg

Experimentation & Tools

  • Claude + MCP servers for infrastructure.
  • Cursor for design, Claude for coding. Context7 for reference.
  • Sub-agents for scoped tasks, Google ADK for building personalised learning agents.
  • Sample repos are highly effective: “If you model after a good sample repo, it can just work perfectly.”

Vision of an Ideal Workflow

  • Iterative, repeatable, and documented.
  • Tool-agnostic but human-centric.
  • Incorporates personal and collective memory so AI learns from developer experience.
  • Future IDEs should embed learning components and context-sharing across dev teams.

“The best engineers are those with courage — and this moment demands courage and experimentation.”
“AI cannot become a true collaborator unless we let it learn from its mistakes.”
“Our job is discernment — knowing what’s good and what’s bad, and refining our taste as we build with AI.”
“If you’re not in the game, there’s no way to win the game — don’t wait for AI to get better, use it now.”

“Every repo has a philosophy — AI can help uncover the intent of the engineers who built it.”

Core Properties Al Integration Workflow Design Principles Implementation
- Improve iteratively
- Repeatable
- Well-documented
- Context Aware
- Tool-Agnostic
- Human Centric
- Scoped & Focused
- Memory & Learning
- Experiment-driven
- Well structured repo
- Documented tradeoffs & decisions
- Context Managed Feedback

Bottom Line:
Tanya’s message is about courage, discernment, and experimentation. AI should handle tedious or repetitive parts of workflows, but developers must remain in charge of taste, architecture, and human judgment — building workflows that are iterative, documented, and context-rich.

——

The future belongs to people who can just do things

Geoffrey Huntley, Digital Nomad
https://ghuntley.com

The “Oh F*** Moment”
Over the 2022/23 holiday break, Geoff was asked to explore AI tools like Cursor, Aider, Goose, and tool-calling. As an experiment, he asked an AI to convert a Rust audio library into Haskell with tests. After a swim with his kids, he returned to find a fully functioning Haskell library. This experience triggered his realization: “Software engineers who haven’t started exploring AI assistants are not going to make it.”

engineering-ai-04.jpg

Shift in Engineering Culture

  • Companies like Gumroad and Shopify now mandate AI fluency: “Using AI effectively is not optional for employment.”
    • These mandates are now common in Australia.
  • AI isn’t about replacing humans but amplifying output. Those who adopt tools can double productivity, shifting performance expectations across the industry.

“AI won’t take your job. Your co-worker who uses AI will.”

What if instead of being shackled to design inherited from Turbo Pascal in 1983 - where IDEs are centered around humans we had a fresh take: IDEs are designed around software assistants first, humans second?https://ghuntley.com/multi-boxing

engineering-ai-03.jpg

Standards, Context & MCP Challenges

  • Huntley helped push adoption of the agents.md file standard (now in VS Code).
  • But warned of inconsistencies: different LLMs interpret instructions differently (“yelling at GPT-5 detunes it”).
    • “MUST MUST MUST”
      • #4. Avoid overly firm language
      • With GPT-5, these instructions can backfire
  • Context window management is critical:
    • “One context window per task. If you’re not starting a new chat every 5–15 minutes, something’s wrong.”
    • Marketing claims of million-token windows are misleading; ~176k usable after overhead.
  • Overuse of MCP servers (GitHub, Jira, memory) can cripple usable context. GitHub MCP server itself uses +50k tokens…

Agents & New Skills

  • Building an agent is simpler than many think — ~300 lines of code.
  • Agents = array → GPU → inference loop → optional tool call.
  • Soon, agent-building knowledge will be as basic as SQL or primary keys for interviews.
  • “People are scared of AI, but it’s just 300 lines of code.”

“I see dead people”

I suspect there’s not going to be mass-layoffs for software developers at Companies due to Al, instead there what we will see is a natural attrition between those who invest in themselves right now and those who do nothttps://ghuntley.com/ngmi

engineering-ai-02.jpg

Work, Identity & Organisational Impact

  • Erasure of identity functions
    • Old identities (“I’m a Java dev”) are dissolving. “You’re not a Java developer anymore. You’re a software engineer.”
  • AI highlights organisational waste — code generation is easy, but generating the right thing is the real challenge.
    • Ideas are everything now, ideas are now execution
    • Removing waste from your systems and process is bigger accelerator than Al
  • Open source libraries <1000 lines? Increasingly pointless to maintain — easier to just generate code.
  • Hiring processes are shifting: watching how candidates use AI will matter more than coding tests.
    • This year has been a very bad year to be asleep at the wheel…

Playing, Experimenting, Practicing

  • Huntley emphasises play as practice: treat AI like a guitar.
    • Using LLM is like playing guitar, practice!
  • Example: after margaritas with a friend, they built a COBOL Reverse Polish Notation calculator with emojis just to see what was possible.

engineering-ai-01.jpg

Cursed-lang.org: New programming language → over 4 months, 4.5 million lines of code, at $14k (in three languages)

“We’ve gone from famine to flood in developer capacity — and that changes everything.”
“One context window per task. Don’t pollute it.”

“Musicians don’t pick up a guitar, strum it once, and quit. They practice. You need to practice AI.”

Bottom Line:
Huntley argues that AI is already reshaping engineering, hiring, and identity. Those who practice, experiment, and integrate AI into their workflow will thrive, while those who ignore it risk being left behind.
https://ghuntley.com/six-month-recap/


——
Created using Whisper transcription and ChatGPT summerisation.

Cloudflare Immerse Sydney - Notes
Older post

Cloudflare Immerse Sydney - Notes

Notes on Cloudflare's flagship Immerse event, bringing together developers, IT leaders, and security professionals to explore the future of technology in the AI era.