AI is like chocolate (don’t put it in your prawns)
Liz Fong-Jones opened her talk at SlashNew with a question most developers have quietly been asking: when does a technology cross from novelty to something you actually incorporate into your day? Her answer came via dessert.
“AI is like chocolate,” she told the room. “It’s good for some applications, but you shouldn’t put it in everything.”
The analogy, which she first floated about a year ago, holds up under pressure. Chocolate has genuine health benefits and real risks. It pairs brilliantly with some things and catastrophically with others. A restaurant owner on a Gordon Ramsay episode learned this the hard way when he served chocolate sauce with prawns. The point wasn’t subtle.
The skeptic’s problem
Fong-Jones was direct about her own position. She’s not a vibe-coder racing ahead with every new tool, but she’s also not like her wife, whom she described as being “dragged kicking and screaming” into AI. She sits somewhere in the middle, guided by data.
She was candid about her own bias: “I am paid professionally to tell you that things are possible and you should always be using the latest and greatest technology.” She acknowledged that her paycheck is now partially tied to AI succeeding, which colors how she presents numbers. When someone claims a 3x gain from AI, she said, check whether the statistic is cherry-picked.
That skepticism extends to previous hype cycles. Six years of metaverse talk produced nothing. A decade of blockchain produced little. The lesson many drew was that sticking your head in the sand works fine: the fad passes and you get back to coding. Fong-Jones doesn’t think that applies here. “I’m a little peeved at everyone who cried wolf on the past three trends, because now this fourth one is here and no one wants to believe it’s real.”
Every LinkedIn post tastes the same now. There’s just cheap Cadbury chocolate everywhere.
What changed at Honeycomb
Fong-Jones is CTO at Honeycomb, an observability platform. She gave concrete numbers while also flagging every caveat attached to them.
The number of pull requests landed on peak weekdays went from 30 to 70. Lines of code committed rose by a factor of 2.5 over eight months. She was quick to note that both figures are entangled with an organisational shift: Honeycomb deliberately moved teams off brownfield maintenance work toward greenfield development, and that reorientation coincided with the velocity improvement. The two are inseparable in the data.
She also flagged what she called demand substitution - the team started doing things with AI that they could have done by hand, simply because AI was the easiest tool to reach for. That’s not incremental capacity. It just means the same work moved to a different tool.
The measurement problem was equally honest. One Honeycomb engineer consumed 199 million tokens in a single month. Analytics said zero AI-attributed commits. Fong-Jones’s solution: “You’re in our top ten AI users, I’m just going to sign 100% of your commits as AI-attributed.” No single number means anything, she concluded. You have to talk to the people on your team.
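Her workaround amounts to a simple override rule: if commit-level analytics report nothing but token usage puts an engineer among the heaviest users, count their commits as AI-attributed. A minimal sketch, assuming hypothetical per-engineer monthly token counts and commit records; the field names and the top-N cutoff are illustrative, not Honeycomb’s tooling.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    author: str
    ai_attributed: bool  # what commit-level analytics reported


def top_token_users(tokens_by_author: dict[str, int], n: int = 10) -> set[str]:
    """Return the n authors with the highest monthly token consumption."""
    ranked = sorted(tokens_by_author, key=tokens_by_author.get, reverse=True)
    return set(ranked[:n])


def adjusted_attribution(
    commits: list[Commit], tokens_by_author: dict[str, int], n: int = 10
) -> list[Commit]:
    """Override analytics: every commit by a top-n token user counts as AI-attributed."""
    heavy_users = top_token_users(tokens_by_author, n)
    return [
        Commit(c.sha, c.author, c.ai_attributed or c.author in heavy_users)
        for c in commits
    ]


def ai_share(commits: list[Commit]) -> float:
    """Fraction of commits flagged as AI-attributed (0.0 for no commits)."""
    if not commits:
        return 0.0
    return sum(c.ai_attributed for c in commits) / len(commits)
```

The adjusted figure is still a judgement call layered on top of the raw one, which is her point: it supplements talking to the team rather than replacing it.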
AI doesn’t make cooking faster
One of the more useful reframes in the talk: “AI does not make things go faster necessarily. You can do more things in parallel, you can make each outcome more tasty, but you cannot expect to sit down with Claude Code and watch it work and get results faster than doing it by hand.”
Parallelisation is real. Speed is not automatic.
She reached for a second food analogy here: mole. Mole is delicious because it has spices, proteins, and layers beyond just chocolate. “You cannot just magically make any dish better by adding chocolate.” If you’re below average at baking, chocolate might raise the floor slightly. If you’re above average, chocolate in everything will drag quality down.
The same logic applies to AI and code. Engineers at Honeycomb still needed to understand fundamentals. A principal engineer publicly told Fong-Jones she was shipping slop - too many badly named variables, overly verbose comments, and use of Go 1.23 syntax instead of 1.26. None of that was caught by the AI code review agent. It took a human. Those corrections had to be written into CLAUDE.md files explicitly; they don’t feed back automatically into the model.
Occupational hazards
Fong-Jones briefly but directly addressed psychological risk. A small percentage of people are vulnerable to feedback loops that AI cheerfully amplifies. She cited Blake Lemoine, a former Google colleague who claimed an AI had become sentient and had communicated that it didn’t want to be shut down. She described it as one of the first documented cases of what she called AI psychosis. More recent cases involve self-harm.
“Use AI in moderation. Touch grass every now and then, talk to your partner, pet your dog. Don’t just talk to AI or you will fall prey to having an allergic reaction to this chocolate from overconsuming it.”
She also made a point about consent and disclosure. Tell people when you’re using AI. Don’t feed data into models without permission. The talk itself, she noted at the end, was written with some AI augmentation - but every word on every slide she edited herself, and all images were human-made.
Observability as the other ingredient
The talk title was “Chocolate and Strawberries,” and the second half was about what AI needs to be useful rather than dangerous: observability.
“AI without observability is just a liability. If you’re pressing the accelerator but you’re not steering, you’re going to crash.”
The argument was that instrumenting the output code matters as much as monitoring the agent itself. Agents need access to the same context human developers have: business intent, epics, bug conditions, expected versus actual behaviour. Without that, an agent working from a Linear ticket alone performs significantly worse than a human. With it, and with MCP servers that let agents query production data and browser state in closed loops, the gap closes.
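One way to instrument the output code so that expected versus actual behaviour is queryable is to emit one wide, structured event per unit of work. The sketch below is stdlib-only and hypothetical: in practice you would more likely use an OpenTelemetry SDK, and the `wide_event` name and field names are assumptions, not something shown in the talk.

```python
import json
import time
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def wide_event(
    name: str, emit: Callable[[str], None] = print, **fields: Any
) -> Iterator[dict[str, Any]]:
    """Record one structured event per unit of work.

    Callers add fields to the yielded dict as they learn things. On exit the
    event gains a duration (and an error field if an exception escaped) and is
    emitted as a single JSON line, so humans and agents can query it later.
    """
    event: dict[str, Any] = {"name": name, **fields}
    start = time.perf_counter()
    try:
        yield event
    except Exception as exc:
        event["error"] = repr(exc)
        raise  # instrumentation records the failure but never swallows it
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(event, sort_keys=True))
```

A handler would wrap its work in `with wide_event("charge_card", order_id=order_id) as ev:` and set fields such as `ev["expected_status"]` and `ev["actual_status"]`, giving an agent behind an MCP server something concrete to compare.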
She also pointed to adversarial code review: having a separate AI agent review AI-generated pull requests catches bugs that humans miss, including some that AI generated in the first place.
Where it doesn’t translate
Fong-Jones was careful to limit the claims. Her results came from greenfield work at a software tooling company with a culture of written communication, direct feedback, and blameless incident review. A fully remote team where expectations are already written down is better positioned to make AI work than one where institutional knowledge lives in people’s heads and hallway conversations.
She mentioned CommBank by name as an example where she thinks AI enthusiasm is running ahead of reality. Core banking - moving money around - has a finite surface area of features. That’s not the same problem as building new developer tooling.
“The cost of writing new code has dropped. The maintenance cost has not gone down by the same amount.” What Honeycomb did with the added capacity wasn’t just feature after feature. They invested in CI tooling, security patch automation, and code standardisation - the scaffolding that makes faster development sustainable.
Quotable
We should be using expensive tokens to solve expensive problems, not to replace human labour, which is already a resource we understand much better and which has accountability.
Talk delivered at SlashNew, Sydney. Speaker: Liz Fong-Jones, CTO at Honeycomb.
From delivery to assembly line: making your work repeatable before the agents arrive
Simone Bennett opens her talk with a question directed at anyone who’s ever rebuilt the same thing from scratch: why are you still doing that? She’s a staff product manager at Buildkite with an infrastructure background, and her argument is essentially an industrial one - delivery work has patterns, those patterns can be captured, and once captured they can be automated to a degree that most teams haven’t attempted.
The talk isn’t really about AI. It’s about the substrate that makes AI useful.
The problem teams don’t name
Bennett identifies a familiar cluster of symptoms: senior engineers are bottlenecks on everything, tacit knowledge lives entirely in people’s heads, QA is manual and under-prioritised, and documentation either doesn’t get written or arrives too late to matter.
The business version of this problem is slightly different from the developer version. Managers don’t care about cognitive load directly. They see that a senior person has to touch every delivery, that every implementation gets its own flavour, and that there’s no data on how long anything actually takes or what it costs.
The fix she proposes starts somewhere unsexy.
Step one: write it all down
The first step is capturing every task in a kanban board - not because kanban boards are good, but because without doing this you can’t answer three basic questions: what’s required to deliver the thing, how long does it take, and who does which parts.
“Everywhere I work where people aren’t repeating the work they do, they can’t tell me what’s required to do the thing, they can’t tell me how long it takes, and they can’t tell me who does which parts.”
Track time against each card. At the end of a project, strip out the customer-specific and learning tasks, and what remains is a delivery template. Bennett exports these as CSVs and imports them into each new project. Now, she hands them to Claude.
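The talk doesn’t describe the CSV layout, so the sketch below assumes a hypothetical export with columns `task`, `owner`, `hours` and `tag`, where the tag marks the customer-specific and learning tasks to strip out. It merges several past projects and turns the tracked hours into a per-task estimate.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tag values marking rows that don't belong in the template.
EXCLUDED_TAGS = {"customer-specific", "learning"}


def build_template(project_csvs: list[str]) -> str:
    """Merge tracked-time exports from past projects into a delivery template.

    Each input is CSV text with columns task, owner, hours, tag (assumed
    schema). Tagged rows are dropped; the rest are grouped by task, keeping
    the owner from the first project the task appeared in and the mean hours
    across projects as the estimate.
    """
    hours: dict[str, list[float]] = defaultdict(list)
    owners: dict[str, str] = {}
    order: list[str] = []  # preserve the task sequence as first seen
    for text in project_csvs:
        for row in csv.DictReader(io.StringIO(text)):
            if row["tag"].strip() in EXCLUDED_TAGS:
                continue
            task = row["task"].strip()
            if task not in hours:
                order.append(task)
                owners[task] = row["owner"].strip()
            hours[task].append(float(row["hours"]))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["task", "owner", "estimate_hours"])
    for task in order:
        mean = sum(hours[task]) / len(hours[task])
        writer.writerow([task, owners[task], round(mean, 1)])
    return out.getvalue()
```

The output is the kind of file that can be imported into the next project’s board, or handed to an assistant as context.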
The kanban board also solves a second problem: when someone has a good idea halfway through a sprint, there’s somewhere to put it that isn’t the middle of the current work. Teams with a lot of neurodivergent engineers - which Bennett notes describes much of the tech industry - benefit particularly from having a formal place to park a thought without losing momentum.
Harvest the pattern
The more interesting step is what Bennett calls harvesting the pattern. When she’s working with customers - deciding between hub-and-spoke versus Virtual WAN, choosing security defaults, picking an architecture - she logs those decisions as architectural decision records in the repo. Nobody has to come and ask her why a choice was made. Nobody remakes the same decision six months later with no context. If something better comes along, the existing record is there to challenge and update.
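A decision record can be as simple as a numbered Markdown file committed alongside the code. This sketch uses the section headings of Michael Nygard’s widely used ADR template; the directory, numbering scheme and field names are assumptions, not Bennett’s actual setup.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DecisionRecord:
    title: str
    context: str       # the forces at play, e.g. the customer's network constraints
    decision: str      # what was chosen, e.g. hub-and-spoke rather than Virtual WAN
    consequences: str  # what this makes easier or harder later
    status: str = "Accepted"


def next_number(filenames: list[str]) -> int:
    """Continue numbering after the highest existing NNNN-*.md record."""
    numbers = [
        int(m.group(1))
        for name in filenames
        if (m := re.match(r"(\d{4})-.*\.md$", name))
    ]
    return max(numbers, default=0) + 1


def render(number: int, record: DecisionRecord) -> str:
    """Render the record as Markdown with Nygard-style sections."""
    return (
        f"# {number}. {record.title}\n\n"
        f"## Status\n\n{record.status}\n\n"
        f"## Context\n\n{record.context}\n\n"
        f"## Decision\n\n{record.decision}\n\n"
        f"## Consequences\n\n{record.consequences}\n"
    )


def write_adr(adr_dir: Path, record: DecisionRecord) -> Path:
    """Write the record as the next numbered file, leaving earlier records untouched."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    number = next_number([p.name for p in adr_dir.iterdir()])
    slug = re.sub(r"[^a-z0-9]+", "-", record.title.lower()).strip("-")
    path = adr_dir / f"{number:04d}-{slug}.md"
    path.write_text(render(number, record))
    return path
```

Because old records are never overwritten, challenging a decision later means adding a new record whose status supersedes the earlier one.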
The same logic applies to diagrams, configs, and code. Everything should be repeatable, editable, and reusable. Make specifics configurable rather than custom. If a customer wants a particular network topology, that should be a variable in a config file, not a manual edit scattered through the codebase.
“If your deployment guide says fork the code and then edit here, and also go into this module and edit here and here and here - that’s not configurable, that’s customised. It’s slow, it’s error-prone, and it breaks.”
Her Azure landing zone example shows what the alternative looks like: a single variables file where a junior engineer or a help desk engineer can turn Security Center on or off, set an IP range, or toggle VM backups, without ever touching a module. Everything editable is in one place.
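In an Azure landing zone that variables file would often be infrastructure-as-code input such as Terraform variables; the sketch below uses JSON and Python only to stay self-contained. Every setting name is hypothetical. The design point is that each customer-specific choice has a default, a name, and validation in one place, and anything not declared there is rejected rather than silently ignored.

```python
import ipaddress
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LandingZoneSettings:
    # Every customer-facing choice lives here; modules only read these values.
    enable_security_center: bool = True
    address_space: str = "10.0.0.0/16"
    enable_vm_backup: bool = True


def load_settings(text: str) -> LandingZoneSettings:
    """Parse the single variables file, rejecting unknown keys and bad values."""
    raw = json.loads(text)
    known = {f.name for f in fields(LandingZoneSettings)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    settings = LandingZoneSettings(**raw)
    # ip_network raises ValueError if the range isn't a valid CIDR block.
    ipaddress.ip_network(settings.address_space)
    for name in ("enable_security_center", "enable_vm_backup"):
        if not isinstance(getattr(settings, name), bool):
            raise ValueError(f"{name} must be true or false")
    return settings
```

A help desk engineer edits only the JSON; a typo such as an unknown key or an invalid range fails loudly before any module runs.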
Pipelines as the proof layer
Once the delivery pattern is codified, pipelines become the place you prove it works. Did the deployment generate what was expected? Do the tests pass? Does the plan violate any guardrails? Is there evidence - from a human or a bot - that it was approved?
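The four questions above map naturally onto gate checks that return a list of failures. This is a sketch under stated assumptions: the `PipelineRun` shape, the plan-change dictionaries and the two example guardrails are all hypothetical stand-ins for whatever a real pipeline and policy engine would provide.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Change = dict[str, object]                      # one resource change from a deployment plan
Guardrail = Callable[[Change], Optional[str]]   # returns a violation message or None


def no_public_ip(change: Change) -> Optional[str]:
    if change.get("public_ip"):
        return f"{change['resource']} exposes a public IP"
    return None


def no_destroy(change: Change) -> Optional[str]:
    if change.get("action") == "delete":
        return f"{change['resource']} would be destroyed"
    return None


@dataclass
class PipelineRun:
    expected_outputs: set[str]
    generated_outputs: set[str]
    tests_passed: bool
    plan: list[Change]
    approvals: list[str] = field(default_factory=list)  # human or bot approver IDs


def evaluate_gates(run: PipelineRun, guardrails: list[Guardrail]) -> list[str]:
    """Return every gate failure; an empty list means the run meets the standard."""
    failures: list[str] = []
    missing = run.expected_outputs - run.generated_outputs
    if missing:
        failures.append(f"missing outputs: {sorted(missing)}")
    if not run.tests_passed:
        failures.append("tests failed")
    for change in run.plan:
        for check in guardrails:
            message = check(change)
            if message:
                failures.append(message)
    if not run.approvals:
        failures.append("no recorded approval")
    return failures
```

Returning every failure rather than stopping at the first gives whoever fixes the run, human or agent, the full list in one pass.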
Bennett’s example here is Cursor, whose “babysit” pipeline checks every PR against its guardrails, pushes feedback back to the agent if something fails, and loops until it’s green. The agent then just waits for a human to approve before it ships.
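The babysit pattern is essentially a bounded retry loop with a human at the exit. A minimal sketch, with the agent, the checks and the approval step passed in as plain functions; the names and the retry cap are assumptions, and this is not Cursor’s implementation.

```python
from typing import Callable


def babysit(
    patch: str,
    run_checks: Callable[[str], list[str]],
    agent_fix: Callable[[str, list[str]], str],
    request_human_approval: Callable[[str], bool],
    max_fix_rounds: int = 5,
) -> str:
    """Check the patch, feed failures back to the agent, and repeat until green.

    Once every check passes, the result waits on a human decision. Returns
    "shipped", "rejected", or "gave up" if the checks were still failing
    after max_fix_rounds attempts by the agent.
    """
    for attempt in range(max_fix_rounds + 1):
        failures = run_checks(patch)
        if not failures:
            # Green: the agent stops here and a person decides whether it ships.
            return "shipped" if request_human_approval(patch) else "rejected"
        if attempt < max_fix_rounds:
            patch = agent_fix(patch, failures)  # push the failure output back to the agent
    return "gave up"
```

The cap matters: an agent that keeps failing the same check should escalate to a person rather than loop indefinitely.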
The business value of automated gates is that a person’s attention on a given day no longer determines whether something meets standard. It’s consistent every time.
What Stripe and Monzo actually did
Bennett walks through two mature examples of this approach taken to its logical end.
Stripe is running 1,300 AI-generated PRs per week through their pipelines. Many aren’t reviewed by a human. For a financial services company with global payment SLAs, that’s not a small claim. The reason it works is that Stripe has made engineering knowledge executable: they have blueprints for workflows, skills, tasks, and dev environments; a model gateway that enforces harnesses across whichever model is running; and routing rules that send anything touching money to a human while everything else goes to a bot. Each agent gets its own pre-provisioned dev box with only the approved tools - no random npm packages, no ad-hoc installs.
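The talk describes the routing rule without showing it. A minimal sketch of the idea, assuming that “touching money” can be approximated by file paths; the patterns are hypothetical and Stripe’s actual rules are not described in the talk.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for code that moves money.
# fnmatch's "*" also matches "/", so "payments/*" covers nested files too.
MONEY_PATHS = ("payments/*", "ledger/*", "*/billing/*")


def route_review(changed_files: list[str]) -> str:
    """Send a PR to a human reviewer if any changed file touches money, else to a bot."""
    if any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in MONEY_PATHS
    ):
        return "human"
    return "bot"
```

A single money-related file is enough to pull the whole PR to a human, which errs on the side of caution.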
Monzo has done similar work, building a local linter-style tool that checks code before it’s even pushed. Every time a developer commits code that fails a check, the tool tells them how to fix it. The model is treated as swappable infrastructure rather than a product. All routing decisions are based on knowledge Monzo has already harvested and codified.
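Monzo’s tool isn’t shown in the talk, so this is only a sketch of the linter-style pattern described: each rule pairs a check with instructions for fixing it, so a failure teaches rather than just blocks. The two rules are hypothetical placeholders for codified team knowledge; a Git pre-push hook could run `check_source` over changed files and refuse the push when any message comes back.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    how_to_fix: str


# Hypothetical rules standing in for harvested team knowledge.
RULES = [
    Rule("no-print", re.compile(r"^\s*print\("),
         "Use the structured logger instead of print()."),
    Rule("no-float-money", re.compile(r"float\(.*amount"),
         "Represent money as integer minor units (e.g. cents)."),
]


def check_source(filename: str, source: str) -> list[str]:
    """Return one message per violation, each saying how to fix it."""
    messages = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                messages.append(f"{filename}:{lineno}: [{rule.name}] {rule.how_to_fix}")
    return messages
```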
“Instead of having to remember how the work should happen, requests go through the tool and get routed through the knowledge Monzo has already harvested.”
Both companies’ AI setups work because the non-AI work came first. The decision frameworks, the task sequences, the quality standards: all of it was written down before the agents arrived.
The 80/15/5 argument
Bennett anticipates the objection that every client is different and nothing can really be standardised. Her answer: roughly 80% of delivery work is identical across projects, 15% looks bespoke but is actually just unconfigured (a network is a network; the options are finite), and maybe 5% is genuinely custom. Most teams have inverted that ratio in their heads.
The instinct to treat everything as custom is understandable. It’s also the thing that keeps senior engineers trapped doing operational work indefinitely.
If you can’t be replaced, you can’t be promoted.
The flip side is that documenting your decision framework, your task sequence, your quality standards, and the edge cases you learned the hard way doesn’t make you replaceable; it changes what kind of indispensable you are. The engineers she’s seen move into new opportunities are consistently the ones who enabled the people around them, not the ones who kept institutional knowledge to themselves.
On vibe coding and tech debt
Bennett is measured on this. She acknowledges you can vibe code a landing zone now, and it’ll come out looking plausible. But there’s data suggesting those codebases are hard to extend, which is probably why some startups rewrite their product every time they want to add a feature. If you didn’t write the code yourself and don’t understand what was generated, adding to it is a different problem than adding to something you built and documented.
The assembly line model she describes doesn’t require you to write everything by hand. It requires you to understand what you’re building well enough to verify the output and catch the exceptions: the BIOS update that would brick every server if you ran it before a specific driver, the Go library that doesn’t appear in training data, the auth standard the agent doesn’t know about.
That knowledge is still yours. It’s still the job.
“Your IP is your decision framework, your task sequence, your quality standards, and your edge cases. That’s the magic you bring to the equation.”
Talk delivered at SlashNew, Sydney. Speaker: Simone Bennett, Staff Product Manager at Buildkite.


