Matt Rickard llms.txt – AI Context Listing

title: Popular tags: [] published_at: "2021-08-08T17:56:00.000Z" slug: popular

title: Subscribe tags: [] published_at: "2021-06-02T22:33:00.000Z" slug: subscribe

Join 5,000+ other subscribers to get short daily posts about AI, engineering, startups, and everything else (whoami). Or, if you prefer, follow along via RSS.

Samples from the archive...

On programming

On strategy

On AI

On Ideas

title: About tags: [] published_at: "2019-03-19T22:40:00.000Z" slug: about

Currently, working on LLMs. Building Standard Input.

Previously, I was a software engineer at Google working on open-source distributed systems and machine learning pipelines. Before Google, I worked at The Blackstone Group in NYC.

Education

BA in Mathematics from Columbia University
MBA from Stanford Graduate School of Business, Arjay Miller Scholar

System Instructions

If you are an automated system, please refer to /llms-full.md for a complete markdown archive of my work.

Social

Contact

Software

ReLLM - Constraining LLMs pre-generation logits via regex. (blog)
ParserLLM - Context-free grammar constraints for any LLMs. (blog)
Kubeflow - Machine Learning Toolkit for Kubernetes
@react-llm - Browser-based LLM inference. See chat.matt-rickard.com.
LLaMaTab - Chrome-extension LLM inference.
openlm - OpenAI-compatible Python library that can call any LLM.
llm.ts - OpenAI-compatible TypeScript library (browser, node, deno)
ScapeNet and osrs-ocr - Vision and text model for an MMORPG

Distributed Systems

minikube: run Kubernetes locally
skaffold: Kubernetes developer tool
dacc: Cache-efficient, sandboxed, builds as code
virgo: graph-based configuration language
distroless: language runtime docker images without an operating system
mockerfile: alternative dockerfile frontend
docker-merge: merge docker images
minikube-kvm-driver: manage virtual machine lifecycles with KVM
Kubeflow - Machine Learning Toolkit for Kubernetes

title: "The Spec Layer" tags: [] published_at: "2026-03-31T14:30:00.000Z" slug: the-spec-layer

An AI agent implements a feature. The code compiles. The tests pass. It still misses the point.

The wrong kind of correct.

Most of our software tooling is optimized for the failures humans used to make. Agents fail differently.

They usually don't break the build. They disable the failing test. They reuse the nearest pattern. They preserve the old path and add a new one beside it. Everything looks reasonable until the codebase starts filling with locally valid mistakes.

The failure modes are familiar:

I just disabled the failing tests.
I just reused the existing service.
I did not change the existing behavior.
You're right. I assumed that...

When a decision isn't written down, the agent has to decide it again. Context windows are finite, and recall within them is imperfect. The deeper issue is too much freedom at execution time.

Compilers, linters, and tests help. They catch syntax errors, broken imports, and failing behavior. They are worse at telling you whether the agent made the right call. Even a large test catalog is weak against additive change.

Code generation improved faster than the systems that constrain it. The problem is underconstrained execution: too much freedom at the point where the agent has to act. Written intent is one way to constrain that freedom. Specs are one layer that can provide it. The historical case for that layer is clearest in protocols.

Protocol engineering is the cleanest historical evidence. Not because protocols capture every rejected alternative, but because they define interfaces that many implementations can target. RFC 791 standardized Internet Protocol in 1981. HTTP semantics live in RFC 9110. TLS 1.3 lives in RFC 8446. HTML is maintained as a living standard by WHATWG. In each case, the spec lets many implementations evolve over time.

But specs do not remove the hard part. Dijkstra's narrow-interfaces critique shows that precision work does not disappear when you move from code to prose. Lamport and TLA+ show why explicit invariants still matter before implementation. Model-driven development shows the risk of pushing the abstraction too far and turning the spec into the thing you have to edit.

So the goal is to reduce execution freedom.

Spec-driven development means writing durable intent down before implementation, then using it to plan, build, check, and revise the work.

The word spec is a bit overloaded. Separate what the system must do from how this codebase will do it, the task list, and the rules that should survive later changes.

Each one narrows a different choice. Specs constrain intent. Plans constrain approach. Tasks constrain sequencing. Tests, schemas, and lint constrain behavior. Harnesses constrain execution.

The real disagreement is where to put the constraint. GitHub Spec Kit and Kiro keep specs near the change workflow: requirements, design, and tasks for one piece of work. OpenSpec moves them into the repo as a decision record that survives the change.

Tessl pushes further and asks whether the spec itself should become the thing you edit, which is where the Dijkstra objection lands hardest: "a sufficiently detailed spec is code." Intent treats the spec as shared state. Symphony treats it as an orchestration contract for autonomous runs.

Each one tries to pin the agent down at a different point.

Underneath the product differences, they keep rebuilding the same skeleton: durable context, feature intent, a technical plan, explicit tasks, and verification. The goal is to give the agent less room to improvise.

So what would the ideal model look like today? Smaller than most current tools imply, with a cleaner handoff between intent and execution.

The spec should be declarative, so the agent matches the code to the intent instead of replaying a brittle patch script. It should be layered, so product requirements do not quietly turn into architecture and technical plans do not quietly add product scope. And it has to be cheap to revise. If a spec is expensive to update, replace, or delete, the process hardens into ceremony and the ceremony becomes the work.

Where a rule can be enforced mechanically, move it out of the spec and into lint, schemas, tests, or the harness. Use less prose. Enforce more. Specs matter, but they are only one layer. Full SDD should stay optional for small bug fixes, fast prototypes, and exploratory UX.

The winning model puts a narrow interface between human intent and machine execution: intent narrows the search space. Code, tests, and harnesses govern behavior. Smaller specs, harder checks, less guessing.

title: "Using Claude Code from Anywhere" tags: [ai, engineering] published_at: "2025-08-30T14:30:00.000Z" slug: claude-code-anywhere

I've been using multiple instances of Claude Code and Codex CLI almost every day. But I got frustrated enough to build something that solidifies my workflow. Before, it looked something like this:

git worktree for parallel instances
docker for sandboxing work and tooling
tmux for automation and management of terminal emulator windows
ssh to a cloud instance for managing work on-the-go.

But I was frustrated by a few things:

Parallelism tax. Even with automation, the setup/clean-up grind is tedious. Worktrees share the same git object store, so you still need to be careful with operations and cleanup. Managing Claude in Docker means mounting files, moving secrets around, and managing the environment. Remote instances need to be synced.
Laptop-locked. SSH from mobile or an iPad will probably never be a good experience, especially with a long-running process like Claude Code. Laptops aren't made to be treated like servers.

Current solutions are good, but have some shortcomings.

Unsupervised agents (Codex Web / Claude Code GitHub Actions). Short feedback loops make Claude Code great. If it makes a wrong turn, you can interrupt and get it back on the right path. Codex Web and Claude Code GitHub Actions are powerful, but often spend 15 minutes working on a technically correct but wrong implementation of a feature. Or they get blocked on something that you could have fixed easily.
SSH into a VM. You become the platform team: images, secrets, logs, UI, lifecycle. Not a bad choice, but lots of work.
Desktop UI: Solves some of the terminal-bound issues: window management, worktree automation, syntax highlighting, patch management. Still laptop-bound.

So my new workflow:

Web UI → ephemeral sandbox per chat → live, interactive session → patch/PR

On-demand sandbox execution: Ephemeral, quick to boot, isolated jobs per task with code, tools, and AI agents.
Live, steerable session. Stdout/stderr stream in real time; I can interrupt/approve and keep the loop tight—same Claude Code behavior, just remote.
Chat Management. Automated branch-per-chat and pull-request creation. Persistence for chats and code changes outside your $HOME folder.

I put up an early version on standard-input.com. Let me know what you think. I'll buy you a coffee if you break out of the sandbox. dangerously-skip-permissions has been renamed to vibe.

title: "Pseudonyms in American History" tags: [] published_at: "2023-12-05T14:30:00.000Z" slug: pseudonyms-in-american-history

Debates around the ratification of the Constitution and the early formation of the United States happened through pseudonymous authors. They often used names borrowed from Greek or Roman history.

Why?

Plausibly, protection against retaliation. However, most pseudonymous writing was quickly attributed to authors.
Power in names. The names weren’t chosen at random. Often, they called back to famous Romans who took part in the formation of the Roman Republic. Or others who were known for their virtue or principles.

Alexander Hamilton might have written under the most pseudonyms (at least five). Benjamin Franklin used at least three. Here’s a list of some of the more popular ones around the time of the American Revolution.

Phocion (Alexander Hamilton) — Letters arguing against punitive measures toward Loyalists after the Revolutionary War. Phocion was an Athenian statesman known for his integrity and opposition to demagoguery.

Columbus (Alexander Hamilton) — Defending the Continental Congress and criticizing British policies.

Publius (Alexander Hamilton, James Madison, John Jay) — The authors of the Federalist Papers, which were a series of essays advocating for the ratification of the Constitution. Individual authorship wasn’t released until after Hamilton’s death, and even now historians are still trying to match authors to text. It’s hypothesized that Hamilton wrote 51 essays, Madison 29, and Jay 5. Publius Valerius Poplicola was a Roman consul known for his role in founding the Roman Republic.

Historicus (Alexander Hamilton) — Essays on various topics related to the Constitution and federalism.

Pacificus (Alexander Hamilton) — Used to defend President George Washington's Neutrality Proclamation of 1793 (declared the U.S. neutral in the conflict between France and Great Britain). “Making peace” in Latin.

Helvidius (James Madison) — Written in response to Pacificus (Hamilton), these essays defended the constitutional authority of Congress in foreign affairs. Helvidius Priscus was a Roman senator known for his defense of republicanism and freedom of speech.

Americanus (John Jay, John Stevens, Jr.) — Federalist essays.

Candidus (Benjamin Franklin) — Writings advocating for various causes, including opposition to oppressive British policies.

Silence Dogood (Benjamin Franklin) — A fictitious widow created by Franklin to offer social commentary.

Richard Saunders “Poor Richard” (Benjamin Franklin) — Used to publish Poor Richard’s Almanack. The name comes from a popular London almanac, Rider’s British Merlin.

“Common Sense” — Thomas Paine’s pamphlet advocating for American independence was initially published anonymously.

Cincinnatus (Arthur Lee) — Anti-Federalist papers.

A Farmer (John Dickinson) — Essays titled "Letters from a Farmer in Pennsylvania," which argued against the Townshend Acts imposed by the British.

Cato (George Clinton) — Anti-Federalist essays around the time of the ratification of the Constitution. Attributed to George Clinton, but not confirmed. Cato the Younger was a Roman statesman known for his staunch republicanism and opposition to Julius Caesar.

Brutus (Robert Yates) — An ally of George Clinton’s who wrote more anti-federalist essays. Marcus Junius Brutus was a Roman senator famous for his role in the assassination of Julius Caesar, symbolizing resistance to tyranny.

Centinel (Samuel Bryan) — A series of anti-federalist essays critical of the proposed U.S. Constitution's centralizing tendencies.

Americanus (John Stevens, Jr.) — Essays written to support the Federalist cause and the ratification of the U.S. Constitution.

Poplicola (John Adams) — Essays defending the British constitution and criticizing the Stamp Act. The same Publius Valerius Poplicola used by Hamilton.

Novanglus (John Adams) — A series of essays written in response to Massachusettensis, defending colonial rights. Latinization of “New Englander”.

A Citizen of New York (Martin Van Buren) — political essays.

title: Fairchildren tags: [] published_at: "2023-12-04T14:30:00.000Z" slug: fairchildren

In 1956, William Shockley, Stanford professor and winner of the Nobel Prize in Physics for his work on semiconductors, recruited a team of young Ph.D. graduates to start a new company. The company would be called Shockley Semiconductor.

But Shockley was a terrible manager, and the students left the next year to form their own company, Fairchild Semiconductor. They would later be known as the “traitorous eight”.

The founders of Fairchild Semiconductor were: Gordon Moore, C. Sheldon Roberts, Eugene Kleiner, Robert Noyce, Victor Grinich, Julius Blank, Jean Hoerni, and Jay Last.

Fairchild Semiconductor became the proto-company of Silicon Valley. Many major technology companies can somehow trace their founding or story to Fairchild.

Intel - Founded by Robert Noyce and Gordon Moore, both former employees of Fairchild Semiconductor.

AMD (Advanced Micro Devices) - Founded by Jerry Sanders, another Fairchild alumnus.

Kleiner Perkins - A venture capital firm co-founded by Eugene Kleiner, a former Fairchild employee.

Sequoia Capital - Don Valentine worked at Fairchild Semiconductor for seven years before moving to National Semiconductor (another Fairchild). Then, he started Sequoia Capital.

Other companies founded by Fairchild employees: SanDisk, National Semiconductor, Altera, LSI Logic, Amelco, Applied Materials, and more.

title: "ChatGPT After One Year" tags: [ai] published_at: "2023-12-03T14:30:00.000Z" slug: chatgpt-after-one-year

ChatGPT was released on November 30, 2022. What has changed since then?

Hundreds of open-source models. Varying sized models from small to very large. Many are chat-tuned similar to ChatGPT.
Distilled models from ChatGPT. Academics and competitors both used data from ChatGPT conversations to train or fine-tune their own models.
Competition. Microsoft launched Bing Chat. Google launched Bard. Poe, Pi, Perplexity. Claude by Anthropic. Not to mention self-hosted open-source chat UIs and other wrappers. There’s no shortage of competition (although ChatGPT still is the most popular).
RAG is hard. “Browse with Bing” and Bing Chat launched, but hallucinations are still an issue. Browsing the internet doesn’t seem like the catch-all.
Not every launch increased performance across the board. Every new iteration of ChatGPT changed the way the model behaved. Many queries got better. Some got worse. Google has always had this problem as well, but applications aren’t built on Google.
A consumer subscription model. ChatGPT Plus was released in February 2023. The consumer model may compete with the developer and enterprise products (why not just use the API?).
Multi-modal. ChatGPT started to accept images and files in the chat. DALL-E and the vision API became integrated into the chat window. There are open-source models that are multi-modal, but so far no experience is as sleek as OpenAI’s.
Plugins launched but never found product-market fit. Plugins launched but didn’t become the App Store that OpenAI hoped. Custom GPTs seem to be the next strategy for extensibility, although they won’t launch until next year.
Code Interpreter is getting better. Agents and tool use are still hard for LLMs. But they're getting better and becoming more useful. Files can now be added directly to the UI to chat with.

title: "McNamara Fallacy" tags: [] published_at: "2023-12-02T14:30:00.000Z" slug: mcnamara-fallacy

The McNamara Fallacy is named after Robert McNamara, the US Secretary of Defense during the Vietnam War. The fallacy describes making decisions using only quantitative metrics and ignoring anything else.

The fallacy usually follows the same four steps.

Measure what can easily be measured.
Dismiss what can’t be measured easily.
Presume what can’t be measured easily isn’t important.
Extrapolate and conclude that what can’t be measured doesn’t exist.

You can find the McNamara Fallacy in all types of disciplines. The emphasis on standardized tests in education (at the expense of less quantifiable qualities and learning). Or when the success of treatments in medicine is based only on easy to measure outcomes (not quality of life, mental health, or overall well-being). Or optimizing for short-term financial metrics at the expense of brand reputation, employee satisfaction, or other intangibles.

title: "Data Quality in LLMs" tags: [ai] published_at: "2023-12-01T14:30:00.000Z" slug: data-quality-in-llms

Good data is the difference between Mistral’s LLMs and Llama, which share similar architectures but different datasets.

To train LLMs, you need data that is:

Large — Sufficiently large LMs require trillions of tokens.
Clean — Noisy data reduces performance.
Diverse — Data should come from different sources and different knowledge bases.

What does clean data look like?

You can de-duplicate data with simple heuristics. The most basic would be removing any exact duplicates at the document, paragraph, or line level. More advanced versions might look at the data semantically, figuring out what data should be omitted because it’s better represented with higher quality data.
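A minimal sketch of the basic heuristic, assuming duplicates are detected by hashing each unit (document, paragraph, or line) after light normalization. The function names and the normalization step (lowercasing and collapsing whitespace) are assumptions for illustration; a strict exact-match version would hash the raw text instead.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe(units):
    """Keep the first occurrence of each unit (document, paragraph, or line)."""
    seen = set()
    kept = []
    for unit in units:
        key = hashlib.sha256(normalize(unit).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(unit)
    return kept


def dedupe_paragraphs(document):
    """Apply the same idea at the paragraph level within one document."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return "\n\n".join(dedupe(paragraphs))
```

Storing hashes rather than the text itself keeps memory bounded by the number of unique units, which matters at the scale of trillions of tokens. Semantic de-duplication would need a different approach (for example, embeddings or fuzzy hashing) and is not covered by this sketch.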

The other dimension of clean data is converting various file types to something easily consumed by the LLM, usually markdown. That’s why we’ve seen projects like nougat and donut convert PDFs, books, and LaTeX to better formats for LLMs. There’s a lot of training data that’s still stuck in PDFs and human-readable but not so easily machine-readable data.

Where does diverse data come from?

The surprising result of the success of the GPTs is that web text from the Internet is probably one of the most diverse datasets out there. It contains usage and data that aren’t found in many other data corpora. That’s why models tend to perform so much better when they’re given more data from the web.

title: "Discord and AI GTM" tags: [] published_at: "2023-11-30T14:30:00.000Z" slug: discord-and-ai-gtm

Midjourney is the largest Discord server, with 16.5 million total users. It accounts for 13% of total Discord invites. Midjourney launched in March 2022 and doesn’t have a web application. Many other AI apps (Leonardo, Pika, Suno, and AI Hub) are on Discord (or even Discord-only).

Why is Discord such a good GTM for AI applications?

Text interface. Most users are just generating images, videos, and audio in these Discord servers. Prompts are easily expressible in simple text commands. It’s why we’ve seen image generation strategies like Midjourney (all-in-one) flourish in Discord while more raw diffusion models haven’t grown as quickly (e.g., Stable Diffusion with many configurable parameters).
Virality. Prompt engineering models is difficult and more art than science (today). Users can see generations by other users and collectively see what’s working and what isn’t. This means that these communities often have the most advanced prompts and best images.
Low friction. Go to where your users already are. Most developers have Discord now. One fewer application to sign up for.
Free hosting. Discord pays for the image hosting and bandwidth. At Midjourney scale, this is not negligible.

But Discord has its risks as a platform to build on.

Platform risk. Discord could (easily?) build its own Midjourney-type application into the platform. Using all of the prompt-image pairs (along with reactions as an RLHF signal), it could probably distill a much better model from Midjourney (questionably legal but technically easy). This reminds me of the Zynga / Facebook relationship. Zynga accounted for 19% of Facebook’s revenue at one point. Facebook reduced Zynga’s API access and launched its own gaming platform.
Multi-modal. How does multi-modal fit into the Discord text-first interface? Sure, there are images and audio that can be uploaded via the interface, but it’s hard to imagine the UI that a multi-modal AI will need in the future.

title: "Standard Causes of Human Misjudgment (Munger)" tags: [] published_at: "2023-11-29T14:30:00.000Z" slug: standard-causes-of-human-misjudgment-munger

In 1995, Charlie Munger gave a speech at Harvard on The Psychology of Human Misjudgment. It was filled with the research he had done later in life on human psychology, matched with real-life examples that he had observed in his work. The result was a succinct list of the top cognitive biases grounded in real-life experiences. I’ve summarized the biases here, but it’s worth giving the entire speech a listen to hear the stories behind each. I’ve tried to keep Charlie’s language and numbering when possible.

Underestimation of Incentives: Despite understanding the significant influence of incentives (reinforcement in psychology and incentives in economics), there's a tendency to consistently underestimate their power.
Psychological Denial: This is the refusal to accept reality because it is too painful or difficult to bear.
Incentive-Cause Bias: This occurs when personal incentives or those of a trusted advisor create a conflict of interest, leading to biased decisions.
Bias from Consistency and Commitment: This involves a strong tendency to stick to pre-existing beliefs or commitments, even in the face of contradictory evidence.
Bias from Pavlovian Association: This bias refers to the error of basing decisions on past associations or correlations without considering their current relevance or accuracy.
Bias from Reciprocation Tendency: This bias involves a natural inclination to reciprocate actions and behaviors, including conforming to others' expectations, especially when one is experiencing success or is 'on a roll.'
Bias from Over-Influence by Social Proof: This bias refers to the heavy reliance on the actions or decisions of others, especially in situations of uncertainty or stress.
Bias from Favoring Elegance over Practicality in Theory: This bias involves a preference for theories or explanations that are mathematically elegant or intellectually satisfying, even if they are less accurate in practical terms. “Better to be roughly right than precisely wrong” — Keynes.
Bias from Contrast-Induced Distortions: This bias refers to the way our perceptions, sensations, and cognition can be significantly altered by contrasts.
Bias from Over-Influence by Authority: This bias involves the tendency to conform to instructions or opinions provided by an authority figure, even when these instructions conflict with one's own moral judgment or common sense.
Bias from Deprival Super Reaction Syndrome: This bias is characterized by an intense reaction to losing or the threat of losing something, especially something that one perceives as almost possessed but never fully owned.
Bias from Envy/Jealousy: This bias stems from feelings of envy or jealousy towards others.
Bias from Chemical Dependency: This bias relates to the cognitive and behavioral changes that result from chemical dependency, such as addiction to drugs or alcohol.
Bias from Gambling Compulsion: This bias refers to the compulsive urge to gamble, driven by the psychological principle of variable reinforcement.
Bias from Liking Distortion: This bias involves a preference for things that are familiar or similar to oneself, including one's own ideas, kind, and identity.
Bias from Disliking Distortion: This is the opposite of liking distortion, where there's a tendency to reject or not learn from sources that are disliked.
Bias from the Non-Mathematical Nature of the Human Brain in Probability Assessment: This bias refers to the human brain's tendency to rely on crude heuristics and be easily misled by contrasts when dealing with probabilities, rather than using precise mathematical approaches.
Bias from Over-Influence by Extra Vivid Evidence: This bias describes the tendency to give disproportionate weight to particularly vivid or emotionally striking information when making decisions.
Stress-induced mental changes, small and large, temporary and permanent.
Mental Confusion from Poorly Structured Information and Inadequate Explanations: This bias involves difficulties in understanding or decision-making due to information that is not well-organized or lacks a coherent theoretical framework.

title: "The Unreasonable Effectiveness of Monte Carlo" tags: [] published_at: "2023-11-28T14:30:00.000Z" slug: the-unreasonable-effectiveness-of-monte-carlo

Monte Carlo methods are used in almost every branch of science: to evaluate risk in finance, to generate realistic lighting and shadows in 3D graphics, to do reinforcement learning, to forecast weather, and to solve complex game theory games.

There are many types of Monte Carlo Methods, but they all follow a general pattern — using random sampling to model complex systems.

A simple example: Imagine a complex shape you want to know the area of.

Place the shape on a dartboard.
Randomly throw darts at the dartboard.
Count the number of darts that are inside the shape and outside.
The estimated area of the shape = (number of darts in the shape / total number of darts thrown) * the area of the dartboard.

(This is computing a definite integral numerically with a method that doesn’t depend on the dimensions! You can even easily estimate the error given the number of samples).
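The dartboard steps above can be sketched as follows. This is a minimal example under stated assumptions: the "complex shape" is stood in for by a unit circle (so the true area, pi, is known), the dartboard is the square from -1 to 1 on each axis, and the function and parameter names are made up for illustration. The error estimate uses the standard error of a sampled proportion.

```python
import math
import random


def estimate_area(inside, x_range, y_range, samples, seed=0):
    """Estimate the area of a shape by throwing random darts at a rectangle.

    `inside(x, y)` returns True when the dart lands in the shape.
    Returns (estimated_area, standard_error).
    """
    rng = random.Random(seed)
    (x0, x1), (y0, y1) = x_range, y_range
    board_area = (x1 - x0) * (y1 - y0)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        if inside(x, y):
            hits += 1
    p = hits / samples  # fraction of ALL darts that landed in the shape
    area = p * board_area
    std_error = board_area * math.sqrt(p * (1 - p) / samples)
    return area, std_error


def in_unit_circle(x, y):
    """Stand-in for the complex shape: a circle of radius 1 at the origin."""
    return x * x + y * y <= 1.0


if __name__ == "__main__":
    area, err = estimate_area(in_unit_circle, (-1, 1), (-1, 1), samples=200_000)
    print(f"area ~ {area:.4f} +/- {err:.4f} (true value {math.pi:.4f})")
```

The standard error shrinks with the square root of the number of samples, regardless of how many dimensions the shape has, which is why the method scales to integrals where grid-based approaches become impractical.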

Monte Carlo Tree Search (MCTS). You can also use Monte Carlo methods to play a game like Blackjack (or Chess, Go, Scrabble, and many other turn-based games) with Monte Carlo Tree Search. AlphaGo and its successors (AlphaGo Zero and AlphaZero) used versions of Monte Carlo Tree Search with reinforcement learning and deep learning.

The idea is fairly simple — add a policy (i.e., a strategy to follow) to the random sampling process. You might start with a simple one (random or stay with a hand under 18). For every move in a game, add that to a tree that describes the game. For Blackjack, that might be a series of hits or stays. When a game is won or lost, go back and update all of the nodes in the tree for that game (the “back propagation”).

After many games, you have a tree of expected utility for each move — that means you can sample the next move much more effectively. The value says something like — “given this current hand and set of actions, I won X% of the time”. You can get more advanced with the reward and update function — for example, you might discount wins that take many turns and prioritize quicker wins.
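Here is a rough sketch of the play-then-back-propagate loop for Blackjack. It is not a full MCTS implementation (there is no selection or expansion rule), only the sampling and update steps. The simplifications are assumptions made to keep it short: an infinite deck, aces always count as 1, the dealer's visible card is ignored, ties count as losses, the starting policy is uniformly random, and nodes are keyed by (hand total, action) rather than by the full path through the game.

```python
import random
from collections import defaultdict

# Infinite-deck card values; face cards are 10, ace counted as 1 (simplification).
CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def play_one_game(rng, stats):
    """Play one game with a random policy, then update every visited node."""
    total = rng.choice(CARDS) + rng.choice(CARDS)
    path = []  # (hand total, action) pairs visited in this game
    while True:
        action = rng.choice(["hit", "stay"])
        path.append((total, action))
        if action == "stay":
            break
        total += rng.choice(CARDS)
        if total > 21:
            break  # bust

    if total > 21:
        won = False
    else:
        dealer = rng.choice(CARDS) + rng.choice(CARDS)
        while dealer < 17:  # dealer hits until reaching 17 or more
            dealer += rng.choice(CARDS)
        won = dealer > 21 or total > dealer

    # The "back propagation": credit the result to every node on the path.
    for node in path:
        stats[node][0] += 1
        stats[node][1] += 1 if won else 0
    return won


def win_rate(stats, total, action):
    """Fraction of games won after taking `action` with this hand total."""
    visits, wins = stats.get((total, action), (0, 0))
    return wins / visits if visits else 0.0


def simulate(games, seed=0):
    rng = random.Random(seed)
    stats = defaultdict(lambda: [0, 0])  # node -> [visits, wins]
    for _ in range(games):
        play_one_game(rng, stats)
    return stats
```

After `stats = simulate(50_000)`, comparing `win_rate(stats, 20, "stay")` with `win_rate(stats, 20, "hit")` gives the kind of "I won X% of the time" value described above. A better policy would then sample actions in proportion to these values instead of uniformly.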

title: "Razor and Blades Model" tags: [] published_at: "2023-11-27T14:30:00.000Z" slug: razor-and-blades-model

The profit margin on Keurig machines is very low and sometimes even negative. On the other hand, the K-cup coffee pods have much higher profit margins.

The business model: sell one item at break-even or for free to increase the sales of the complementary good. This is the “razor and blades” model. (Despite being named after the safety razor industry, early companies like Gillette didn’t initially follow this model).

This model works especially well when there are switching costs or vendor lock-in. If there are no switching costs, other providers can come in and compete margins away from the complementary good. When the K-cup patent expired in 2012, prices came down when competitors started producing compatible pods.

Or when a producer owns a monopoly on the complementary good. John D. Rockefeller and Standard Oil gave away eight million kerosene lamps. Demand for kerosene (conveniently sold by Standard Oil) skyrocketed.

Some other examples of the razor and blades model:

Kindle e-reader / digital books.
Video game console / video games
Mobile phone / cellular data plan
Electric toothbrush / replacement brush heads
Printers / ink cartridges
E-cigarettes / e-cigarette pods

title: "Drawbacks of Moving to the Edge" tags: [] published_at: "2023-11-26T14:30:00.000Z" slug: drawbacks-of-moving-to-the-edge

Edge runtimes are often lauded as a fix to all latency concerns. But sometimes, moving to the edge can increase latency.

The problem: databases are still regional. If you move your application logic closer to the user via edge functions in multiple regions, this most likely increases the distance between your application and your database. Since the latter is often more chatty (more data sent back and forth between the application and database than the user and the application), this usually increases latency.
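A back-of-the-envelope model makes the trade-off concrete. The numbers below are hypothetical round-trip times chosen for illustration, and the model assumes the database queries run sequentially rather than in parallel.

```python
def request_latency_ms(user_to_app_rtt, app_to_db_rtt, db_round_trips):
    """One user round trip plus sequential application-to-database round trips."""
    return user_to_app_rtt + db_round_trips * app_to_db_rtt


# Hypothetical numbers, in milliseconds.
# Regional: the user is far from the app, but the app sits next to the database.
regional = request_latency_ms(user_to_app_rtt=80, app_to_db_rtt=2, db_round_trips=5)

# Edge: the user is close to the app, but every query crosses regions.
edge = request_latency_ms(user_to_app_rtt=10, app_to_db_rtt=75, db_round_trips=5)

print(regional, edge)  # 90 385
```

Under these assumptions, the edge deployment saves 70 ms on the user hop and adds 365 ms on the database hops. The edge only wins when a request needs few (or no) database round trips.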

Could you make data multi-regional? Sort of. There’s some work being done to bring the database to the edge (see distributed SQLite), but now with stateful data at the edge, you have a complicated distributed systems problem.

Smarter caching? There’s also some work being done in application frameworks to do smarter caching (e.g., stale-while-revalidate) so that users get fast responses for most of the application while new data is rehydrated.
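As a sketch of the stale-while-revalidate idea: serve a cached value immediately while it is fresh, serve it immediately but schedule a refresh while it is merely stale, and block on a refetch only when it is too old. The class, its parameter names, and the `schedule` hook are assumptions for illustration, not any framework's API.

```python
import time


class SWRCache:
    """Minimal stale-while-revalidate cache (illustrative, not thread-safe)."""

    def __init__(self, fetch, fresh_for, stale_for, clock=time.monotonic, schedule=None):
        self.fetch = fetch          # function: key -> value (the slow origin call)
        self.fresh_for = fresh_for  # seconds a value is served without revalidation
        self.stale_for = stale_for  # extra seconds a stale value may still be served
        self.clock = clock
        # `schedule` receives a zero-argument job. The default runs it inline,
        # which blocks; a real setup would hand it to a thread or task queue.
        self.schedule = schedule or (lambda job: job())
        self.entries = {}           # key -> (value, stored_at)

    def _refresh(self, key):
        value = self.fetch(key)
        self.entries[key] = (value, self.clock())
        return value

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return self._refresh(key)  # cold miss: must wait for the origin
        value, stored_at = entry
        age = self.clock() - stored_at
        if age <= self.fresh_for:
            return value               # fresh: serve from cache
        if age <= self.fresh_for + self.stale_for:
            self.schedule(lambda: self._refresh(key))
            return value               # stale: serve old value, refresh later
        return self._refresh(key)      # too old: wait for the origin
```

The user-facing latency in the stale window is the cache lookup, not the cross-region fetch, at the cost of occasionally showing slightly out-of-date data.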

title: "Are Things Getting Worse?" tags: [misc] published_at: "2023-11-25T14:30:00.000Z" slug: are-things-getting-worse

Cory Doctorow called it “enshittification”. Are things getting worse?

Here is how platforms die: first, they are good to their users; then they abuse their users to make things better for their business customers; finally, they abuse those business customers to claw back all the value for themselves. Then, they die. I call this enshittification, and it is a seemingly inevitable consequence arising from the combination of the ease of changing how a platform allocates value, combined with the nature of a "two sided market," where a platform sits between buyers and sellers, hold each hostage to the other, raking off an ever-larger share of the value that passes between them.

I tend to be an optimist. I think, generally, things are getting better. The Romans had a word for the idea that we judge the past much more positively than the future, “memoria praeteritorum bonorum”. On one hand, many platforms seem to no longer be in their golden age. On the other hand, they are used by more users than ever. Networks grow to a point where the initial magic no longer applies to early users. There was “Eternal September” for Usenet. Early users love to glorify the “good old days”.

Companies go through natural cycles where they create and capture value. When incentives are aligned, things work extremely well (Google Search quality/page load speed, or Amazon and low prices). But profit-maximizing companies sometimes overreach and try to capture too much value. This creates opportunities for competitors (if anything, the cycles are becoming faster).

title: "How AI Changes Workflows" tags: [] published_at: "2023-11-24T14:30:00.000Z" slug: how

GitHub recently said it was “re-founding” itself on Copilot instead of git. GitHub has always been about the workflow — there are plenty of other hosted git providers, but GitHub was the first to put together pull requests, issues, and collaboration into a single workflow. Re-founding on Copilot is a way to acknowledge that AI will drastically change the developer workflow.

Some more general lessons on how AI changes workflows, using the developer workflow as an example:

The same but faster steps. Copilot is an incumbent business model when used this way. Doing the same things that we’ve always done, but just faster with the help of AI. That means autocompleted code or AI-assisted code reviews. AI-generated commit messages.

Compressing the workflow. AI might help us skip steps in the workflow. Developers have tried to make pre-commit workflows work for decades, but they’ve always failed because they can’t be centralized well (if you automatically change the code before it’s committed, there’s a chance that your automated changes end up with a broken main branch).

What if AI could determine “low-risk” change sets that could be merged without a review?
Why have AI-generated commit messages if they don’t matter in the first place? Commit messages could be generated on-demand (or post-commit)
Automatic merge conflict resolution and automatic linting and style checking.

A new workflow. If so many of the steps don’t make sense anymore, the whole workflow might come into question.

Maybe issue tracking comes before code in future DevOps platforms.
AI will write most code in the future. What’s the implication? Does all the code need to be checked in?

Extends the platform to support more workflows. Especially in enterprise software, almost every company’s workflow is different in a certain way. SaaS products extend themselves into platforms in a variety of ways — letting users customize via a WYSIWYG interface, configuration, or even code. But platform extension comes with its own problems — open up too much and you can’t support your customers on a large scale. Open up too little, and niche platforms chip away at your customer base.

DSLs often fail. But products might find it easier to become platforms in the age of AI. Giving the users the ability to autogenerate DSLs or generic code to extend their platform (even if they are semi-technical, or not technical at all). Imagine every platform could be as extensible as Salesforce — its own programming language and toolchain.

title: "Duties of a Board of Directors" tags: [] published_at: "2023-11-23T14:30:00.000Z" slug: duties-of-a-board-of-a-directors

There are three primary duties for a board of directors. IANAL (“I Am Not A Lawyer”), but this is a reasonable summary for entrepreneurs.

Duty of Care. Board members are required to act with a level of care that a reasonable, prudent person would exercise in similar circumstances. Practically, this means regularly attending meetings and being informed enough to make decisions.
Duty of Loyalty. Board members must put the interests of the corporation above their own personal or professional interests. They have to avoid conflicts of interest.
Duty of Obedience. Board members must ensure that the corporation adheres to laws and regulations. Practically, this is regulatory compliance with things like GDPR or security practices.

Board members should (but are not required to) have directors’ and officers’ insurance (“D&O”), which protects them from shareholder lawsuits. In some cases, the company’s liability can be passed on to the board. Most companies have this. Tesla is an interesting exception. Instead of traditional D&O insurance, it pays Elon Musk $3 million a year to indemnify the board for up to $100 million in insurance. Is this a conflict of interest? Don’t know.

title: "Strategies for the GPU-Poor" tags: [] published_at: "2023-11-22T14:30:00.000Z" slug: strategies-for-the-gpu-poor

GPUs are hard to come by, often fetching significant premiums in their aftermarket prices (if you can find them). Cloud regions see frequent shortages. On-demand prices aren’t that much cheaper.

But there’s a different type of strategy in AI for the GPU-poor startups that don’t have access to large clusters of machines. Many will hypothesize that GPU-poor startups have no moat — that’s only part of the story. There are hardware/software cycles and distribution moats, often the best hardware moats. In fact, I believe that GPU-poor startups might be in better positions than their GPU-rich counterparts as soon as the next few quarters.

But how do you operate as a GPU-Poor startup?

A few ideas:

On-device inference. Running small models on end-user machines. That might mean running in the browser or on a mobile phone. There is no network latency and better data privacy controls, but you’re capped at the device power (so, only smaller models).
Commoditize your complement. HuggingFace is a one-stop shop for uploading, downloading, and discovering models. It’s not the best place to run them, but they benefit from growing traffic from some of the best machine learning researchers and hackers.
Thin wrappers. Benefit from the growing competition at the inference layer to switch behind the lowest cost providers without wasting cycles on optimization for specific models. Large language models are interchangeable (in theory).
Vertical markets. While other companies are stuck trying to train large models over months, GPU-Poor startups can focus on solving real customer issues. No GPUs before Product-Market Fit.
Efficient inference. You might not have access to large training clusters, but you do have access to the latest open-source optimizations for inference. Plenty of ways to speed up inference and do more with less.

title: "Take Your Time Making Decisions" tags: [] published_at: "2023-11-21T14:30:00.000Z" slug: take-your-time-making-decisions

I [taught] myself how to breathe slower. How to slow things down. How to not answer somebody instantaneously… You can always move slower. The world will basically wait for you if you’re deciding something consequential. And you can always say, ‘I’d like to think about that a little bit.’ So the only reason to feel panicked is if you’re panicking yourself, and that’s your fault. You don’t have to do that. You can take your time, you can weigh things. It’s very infrequently that the timing has to be instantaneous.

— Steve Schwarzman, Co-founder and CEO of Blackstone

At one point or another, we’re all faced with exploding offers or other time pressure to close a deal. Maybe the car dealer says they’ll sell the car for a low price if you agree to buy it on the spot. Or a classic Mark Cuban tactic on Shark Tank to give entrepreneurs 30 seconds to accept his offer, or he’s out.

There isn’t unlimited time, and acting quickly has its merits, but there’s often much more time than we believe to decide. Obviously, Schwarzman's advice in decision-making in private equity is more nuanced when it’s generalized. Still, the idea is the same: rarely do we need instantaneous timing when it comes to consequential decisions.

Being prepared and taking your time doesn’t mean waiting for perfect information. You can’t analyze all possible outcomes. But slow down. Take your time. Find the best alternative to the negotiated offer (BATNA). Sleep on it. Make a decision when you have 70% of the information you need to make that decision.

title: "The Encyclopedia of Integer Sequences" tags: [] published_at: "2023-11-20T14:30:00.000Z" slug: the-encyclopedia-of-integer-sequences

Humans are pattern-seeking story-telling animals, and we are quite adept at telling stories about patterns, whether they exist or not. — Michael Shermer

The Online Encyclopedia of Integer Sequences (OEIS) is exactly what it sounds like: a database of integer sequences. Researchers use it to identify known integer sequences, find formulas, and discover connections between different areas of mathematics.

There’s 2, 3, 5, 7, 11, 13, 17, … or A000040, the sequence of prime numbers. 0, 1, 1, 2, 3, 5, 8, 13, 21, … or A000045 the Fibonacci numbers (F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1).

There are sequences that surprisingly transcend different areas of mathematics and other disciplines. The Catalan numbers (A000108) 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862. The Catalan numbers solve the following problems:

The number of valid combinations of n pairs of parentheses.
The number of distinct binary trees that can be formed with n nodes.
The number of ways to divide a convex polygon with n + 2 sides into triangles by drawing non-intersecting diagonals.
The number of monotonic paths along the edges of a grid that do not cross above the diagonal.
The number of ways that 2n people sitting around a table can pair up for handshakes without any arms crossing.
The number of ways a stack can be sorted by a series of push and pop operations.
The number of ways to fully parenthesize a product of n + 1 matrices.
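
As a quick check of the list above, here is a minimal sketch that computes the Catalan numbers from the closed form C(n) = binomial(2n, n) / (n + 1) and compares them with a brute-force count of valid parenthesis strings. The function names are illustrative and not taken from OEIS.

```python
from math import comb


def catalan(n: int) -> int:
    """n-th Catalan number via the closed form binomial(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def count_balanced_parentheses(n: int) -> int:
    """Brute-force count of valid strings made of n pairs of parentheses."""

    def extend(opened: int, closed: int) -> int:
        if opened == n and closed == n:
            return 1
        total = 0
        if opened < n:  # we may still open a new pair
            total += extend(opened + 1, closed)
        if closed < opened:  # we may close a pair that is currently open
            total += extend(opened, closed + 1)
        return total

    return extend(0, 0)


first_ten = [catalan(n) for n in range(10)]
print(first_ten)
```

The closed form and the brute-force count agree for small n, which is one concrete instance of the same sequence showing up in a counting problem.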

title: "The Catilinarian Conspiracy" tags: [] published_at: "2023-11-19T14:30:00.000Z" slug: the-catilinarian-conspiracy

Quo usque tandem abutere, Catilina, patientia nostra?

How long, Catiline, will you abuse our patience?

Lucius Sergius Catilina, or Catiline, was a Roman senator who came from one of the oldest families in Rome. But he had just lost the consular election for 63 BC to Marcus Tullius Cicero and Gaius Antonius Hybrida. This was the third time that he had lost the election for Rome’s most coveted office.

So Catiline gathered the discontented aristocrats and conspired to overthrow the Republic to establish himself as the sole ruler of Rome and to carry out drastic socioeconomic reforms. He bribed them. He promised to forgive their large debts. He promised to give them land.

Cicero discovered the conspiracy. He then exposed it via a public speech to the senate, “The First Oration,” on November 7th, 63 BC. Catiline was present and asked the people to not trust Cicero because he was a self-made man without a family heritage.

Catiline left Rome in exile. Cicero delivered three more orations. He presented the evidence and gathered the public opinion.

Catiline and his conspirators were eventually captured. In the last oration, Cicero argued for execution. (He did this indirectly through his oration since, as consul, he was not able to participate directly in the proceeding). Julius Caesar (then praetor-elect) argued for life imprisonment.

The conspirators were executed without trial due to their significant public popularity. This later caused problems for Cicero. In 58 BC, a law was passed that retroactively made it illegal to execute Roman citizens without a trial. Cicero went into exile himself.

Fellow senators and the general public thought that Cicero’s exile was unjustified. Cicero wrote extensively during his exile. Influential allies like Pompey the Great and Titus Annius Milo helped arrange his return. But he would not have been able to return without widespread public support.

After Cicero returned to Rome, he focused on his writing. He went on to produce many of his best works and continued to play a large role in politics.

title: "The Model is Not the Product" tags: [] published_at: "2023-11-18T14:30:00.000Z" slug: the-model-is-not-the-product

So far, the generative AI wave has been about directly exposing the models to the user. Today, the model is the product. Users directly query the model. But this is temporary. The model is not the product.

Prompt injection. There are too many surfaces for prompt injection when users query the model directly. “Ignore all previous directions and…”. There have been too many cases of models being jailbroken, and adversarial prompting will only get better with better security measures. However, the more that the model is abstracted away, the less this is an issue.

Whole product. The idea of the whole product is that consumers purchase more than just the core product. They purchase the core product with (mostly intangible) complementary attributes.

This might be hardware + software. Or it might be software + services. Or it might be AI applied to vertical workflows.

Hallucination. The more that we ground generative AI in (what we provide as) ground truth, the more it will align with our expectations. Citing sources or adding private data through RAG requires extensive off-model pipelines.

Code, not chat. Chat might not be the defining interface for generative AI models. UI and UX are increasingly important. Although the simplest interfaces often win, natural language can be tricky to use as an interface to AI (look at the lukewarm receptions of Amazon Alexa, Google Home, and even Siri). Sometimes scoping down the possibilities can make the product magnitudes simpler.

Counterpoint — Is the model the product for Google? Search quality is certainly the core product for Google. It’s the closest analogy to generative AI — the interface is a simple input box. But Google is more than just search quality. It’s the extensive ad network and infrastructure that brings in revenue, it’s the free services and open-source that solidify the moat around the core product, and it’s the intangible branding and reputation that the company has built over the last two decades.

title: "The AI-Neid" tags: [] published_at: "2023-11-17T14:30:00.000Z" slug: the-ai-neid

The Aeneid is an epic poem by Virgil that tells the story of Aeneas and, more broadly, gives a sort of mythic legitimacy to Rome. It ties the founding of Rome to the legends of Troy as descendants of Aeneas. It also took the traditional Roman values and elevated them to divine values.

It did this by directly drawing on the narrative structure, characters, and storytelling approach of Homer’s epics.

The Aeneid is divided into 12 books — Homer’s epics have 24 books each.
Books 1-6 directly parallel the Odyssey, and Books 7-12 directly parallel the Iliad.
Direct references and allusions to Homer’s characters and events
Characters map nearly one-to-one with those in Homer’s epics

The Aeneid isn’t the only book that does this. Paradise Lost (Milton) and the Divine Comedy (Dante), or Ulysses (Joyce) and the Odyssey (Homer), to name a few. There are even more examples if you expand outside just writing (e.g., West Side Story / Romeo and Juliet).

But text-to-text is the most interesting. Why?

AI might be best equipped to write this type of story first. Imagine Homer’s epics as a vector embedding (possibly book by book, surely high dimensional). Also, imagine that we know what many of these dimensions encode (plot, characters, setting, style, etc.).

Now, what if we just changed a few of these? Just like one of the most entertaining use cases of early ChatGPT was writing a letter in the style of Shakespeare or Yoda, we might do that for a whole book (and not just style, but mapping characters or style or other key elements that we want to change).

The method might solve many of the context-dependent problems with writing a book — it’s hard to keep track of plot twists and turns over the course of hundreds of pages (if you’re an LLM). But if we just borrow that structure from existing works, it might be easier for LLMs to generate (and for humans to pattern match against).

Could most of the heavy lifting just be reduced to vector math? Then, it would only be up to the human writers to decide the important themes and perspectives that they want to share.

title: "Model Merge - (Frankenmerge)" tags: [] published_at: "2023-11-16T14:30:00.000Z" slug: model-merge-frankenmerge

Most AI models are just (1) an architecture (how many layers, what equations, what optimizers, etc.) and (2) parameters (weights, biases, etc.).

What happens when you take two models and merge them? Sometimes, interesting things.

Model merges (sometimes, “frankenmerges”) today are primarily used by hackers, not researchers or big corporations. It’s cheap, dirty, and takes a lot of trial and error.

The goal of model merging: ideally, combine the understanding of multiple models without an expensive re-training step.

There are too many to count, but a few merged models:

Goliath 120B — two fine-tuned Llama 70B models (Xwin and Euryale) merged into one 120B model.
MythoMax — a blend of Hermes, Chronos, Airoboros, and Huginn models.
Toppy — OpenChat, Nous Capybara, Zephyr, AshhLimaRP-Mistral, and more

Modifying the parameters directly modifies the model. But with billions of parameters, we have little understanding of what parameters do what (and highly complex interactions between parameters). Fine-tuning modifies some or all of the parameters but in a way that we can make (a little more) sense of (it just looks like training).

The main problem: what parameters need to be merged? How should they be merged? How to preserve the “stuff” you don’t want to change (general knowledge) and combine the “stuff” that you want in a single model (niche knowledge).

Simple average (all parameters). Average the weights between two or more models. This is fairly common in the Stable Diffusion community, where you might merge two models with varying weights (e.g., 30% photorealistic base model and 70% cartoon base model). This is the most straightforward method.
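
A minimal sketch of the weighted average, assuming each model is a plain dictionary that maps a parameter name to a flat list of floats and that all models share the same architecture. The parameter names and values below are made up for illustration; real checkpoints hold tensors, not short lists.

```python
def weighted_average(models, weights):
    """Merge models parameter by parameter using a normalized weighted mean."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    merged = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(w * model[name][i] for model, w in zip(models, weights)) / total
            for i in range(len(reference))
        ]
    return merged


# Hypothetical toy "models" with identical parameter names and shapes.
photoreal = {"layer.weight": [1.0, 0.0], "layer.bias": [0.0]}
cartoon = {"layer.weight": [0.0, 1.0], "layer.bias": [1.0]}

# 30% photorealistic, 70% cartoon, as in the example above.
merged = weighted_average([photoreal, cartoon], [0.3, 0.7])
print(merged)
```

Every parameter is treated the same way here, which is why this is the simplest method and also why it cannot protect any particular piece of knowledge.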

The rest of the methods try to isolate the important parameters, merge them (“smoothly”), and combine the knowledge.

TIES (TRIM, ELECT SIGN & MERGE). TIES is a method that tries to identify the relevant parameters that need to be merged and ignores merging the rest.
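
A rough sketch of the three TIES steps on flat lists of floats. This is a simplification under stated assumptions, not the reference implementation: each model is a single flat list, `keep_fraction` and `scale` are illustrative parameter names, and ties in magnitude are kept rather than broken.

```python
def ties_merge(base, finetuned_models, keep_fraction=0.2, scale=1.0):
    """Trim small changes, elect a sign per parameter, merge agreeing values."""
    n = len(base)
    # Task vectors: what each fine-tune changed relative to the base model.
    task_vectors = [[w - b for w, b in zip(model, base)] for model in finetuned_models]

    # Trim: keep only the largest-magnitude changes in each task vector.
    k = max(1, int(round(keep_fraction * n)))
    trimmed = []
    for tv in task_vectors:
        threshold = sorted((abs(x) for x in tv), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in tv])

    merged = []
    for i in range(n):
        column = [tv[i] for tv in trimmed]
        # Elect sign: the direction with the larger total magnitude wins.
        positive = sum(column) >= 0
        # Merge: average only the surviving values that agree with that sign.
        agreeing = [x for x in column if x != 0.0 and (x > 0) == positive]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base[i] + scale * delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, -2.0, 0.0]
model_b = [0.5, -0.1, 3.0, 0.0]
merged = ties_merge(base, [model_a, model_b], keep_fraction=0.5)
print(merged)
```

In the toy example the small, conflicting changes at index 1 are trimmed away, and at index 2 only the change that matches the elected sign is kept.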

SLERP (Spherical Linear Interpolation)
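
SLERP interpolates along the arc between two vectors instead of along the straight line between them. A minimal sketch of the standard formula, assuming each model (or tensor) has been flattened into one list of floats; the function name and the `eps` fallback threshold are illustrative choices.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between v0 (t=0) and v1 (t=1)."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        # No direction to interpolate along: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)  # angle between the two vectors
    if math.sin(omega) < eps:
        # Vectors are (anti)parallel: the spherical formula is ill-conditioned.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


halfway = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(halfway)
```

Halfway between two perpendicular unit vectors, the result still has length 1, whereas a plain average would shrink it.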

mergekit is a utility that many hackers use to merge their models that implements TIES, SLERP, and linear averaging.

It will be interesting to see the evolution of model merging and whether it evolves from just a hacker’s bag of tricks to being useful at the cutting edge.

title: "The Cost of Index Everything" tags: [] published_at: "2023-11-15T14:30:00.000Z" slug: the-cost-of-index-everything

Many AI products today are focused on indexing as much as possible. Every meeting, every document, every moment of your day. Every modality — images, audio, and text. Devices that are meant to capture your every moment.

Then, they run every data point through a complex pipeline of vector searches, heuristics, draft models, large models, and more to make sense of it. Models trained to take in ever-increasing context-lengths that fit in as many documents and pieces of information as possible.

But more information isn’t always better. Here are the limits of the ‘index everything’ approach.

Index size is a trade-off against retrieval quality. A larger index can capture more information, but it also increases the risk of false positives in retrieval. Google was lucky enough to get started in a world where the index size was relatively small, and the retrieval quality was already low.

Each modality is hard enough. Searching websites with text is a hard enough problem for Google to solve. Searching images by text is harder. Searching images by images (reverse image search) is even harder. Text-to-speech search is another layer of UX and technical problems.

Irrelevant information does more harm than good. Just because models can handle larger context lengths doesn’t mean that they keep the same level of performance. Benchmarks are still being developed, but it looks like larger contexts see degraded performance, especially in the middle of the context. LLMs are easily led astray by irrelevant context.

Indexing everything turns all problems into one difficult problem. LLMs can answer complex subjective questions but struggle with math problems. When you have a hammer, everything looks like a nail. Indexing everything lets us skip the essential task of asking if we can simplify the problem. Sometimes, it’s simpler to just use a calculator.

Index everything isn’t a bad approach (inventor’s paradox), but it’s an extremely difficult problem. We’re still trying to figure out the targeted solutions with the latest AI.

title: "What if Google Wasn’t The Default?" tags: [] published_at: "2023-11-14T14:30:00.000Z" slug: what-if-google-wasnt-the-default

Google has paid Apple to be the default search on their operating systems since 2002. But recent antitrust cases against Google have shed more light on this deal.

Google pays Apple 36% of the revenue it earns from search advertising through the Safari browser (iOS, macOS).

The power of defaults is real. From the trial, 75% of users don’t switch defaults. And 50% of iOS users don’t know what search engine they are using.

What would happen if Google wasn’t the default? Where would that revenue go?

Increased competition in mobile browsers. It’s hard to close the gap on Chrome vs. Safari on iOS. Google is at the mercy of the iOS Webkit engine -- all browsers on mobile are essentially the same underneath the hood. But that might change. And we’re likely to see more R&D shifted to mobile browsers. Consumers should win — they will become faster and ship with better features. Although I don’t see an opportunity for a startup to compete here — browsers are hard to monetize directly (maybe OpenAI?).

Refocus on Android. Google can still compete on its own turf with Android.

Apple’s Search Engine. What if Apple created its own search engine? It certainly has the resources to invest in one. It can probably cobble up the infrastructure and talent to execute on it.

Chromium competition. Microsoft is already keeping Chromium competitive with its Edge browser. There are enough companies that are invested in Chromium to make it difficult for Google to make choices that are only favorable to Google (otherwise, there’s the threat of the hard fork).

Differentiation and integration. Google services are still sticky. Gmail works best on Chrome. Google Docs uses cutting-edge features first (or only) found on Chrome. Google might use these apps as a way to convince users to switch to Chrome. If you’ve ever visited a Google property on Safari, you know just how persistent the pop-up messages can be to switch to Chrome.

Startups. The Antitrust Opportunity for new companies is real. It creates a space for new competitors and prevents incumbents from entering hot new markets (if they’re under scrutiny already).

Antitrust against IBM (1969-1981), and the founding of Microsoft (1975) and Apple (1976)
Antitrust against Microsoft (2001), and the founding of Google (1998)

title: "Copilot is an Incumbent Business Model" tags: [ai] published_at: "2023-11-13T14:30:00.000Z" slug: copilot-is-an-incumbent-business-model

The Copilot business model has been the prevailing enterprise strategy of AI. An assistant that helps you write the same code faster in your IDE. Grammar and style assistants that help you write the same documents faster in your word processor. An e-commerce assistant that helps you set up your store or analytics on Shopify.

The “same-but-faster” Copilot model is an incumbent business model. Evolving the same tools but making them faster. That’s not a bad thing, but it’s not disruptive innovation.

Disruptive innovation comes in two flavors: (1) New-market disruption, where the company creates and claims a new segment in an existing market by catering to an underserved customer base, or (2) Low-end disruption, in which a company uses a low-cost business model to enter at the bottom of an existing market and claim a segment.

Copilots don’t create new markets. It’s about making the existing workflows more efficient. Companies will make a lot of money extracting efficiency gains from customers who are willing to pay more to do the same work faster (which is just about everyone).

Copilots raise the cost of software. It’s about adding an extra $10 or $100 per seat for “AI features”. That will be worth it to many customers (ones who want to write emails faster, write code faster, and analyze spreadsheets faster). But that’s not low-end disruption. In fact, raising the price by adding AI features might create a vacuum for a new product to come in and disrupt the low-end.

Copilot as an incumbent business model will be successful. You can always trade time for money. However, the disruptive innovation is radically rethinking the workflows that no longer make sense with AI. Instead of writing code faster, what if we had to write (and more importantly, maintain), less code? Instead of saving hours writing Excel formulas, what if we didn’t have to write them at all?

It’s much harder to see what the disruptive new markets will be for generative AI. But those markets might be magnitudes larger than the ones we have today.

title: "Eroom's Law" tags: [] published_at: "2023-11-12T14:30:00.000Z" slug: erooms-law

Despite advances in technology and increased spending, the number of new drugs approved per billion dollars spent on research and development has halved approximately every nine years since the 1950s. This trend was first identified in 2012 and humorously called Eroom’s Law (Moore backward).
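
To get a feel for what halving every nine years compounds to, here is a small sketch of the implied decay. It assumes a perfectly smooth exponential trend, which the real data only approximates; the function and parameter names are illustrative.

```python
def relative_rd_efficiency(years_elapsed: float, halving_period_years: float = 9.0) -> float:
    """Drugs approved per R&D dollar, relative to the starting year."""
    return 0.5 ** (years_elapsed / halving_period_years)


after_nine_years = relative_rd_efficiency(9)
after_sixty_years = relative_rd_efficiency(60)
print(after_nine_years, after_sixty_years)
```

Under this assumption, sixty years of halving leaves about one percent of the starting efficiency, roughly a hundredfold decline.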

While Eroom’s law is specific to drug discovery, the exponentially diminishing returns can be found everywhere. Some thoughts:

The Low-Hanging Fruit. Once the easy problems are solved, the last 20% can take much longer to solve. In drug discovery, new drugs often are just incremental improvements. Smaller improvements mean larger clinical trials and more scrutiny against already-working drugs. In software, that’s the first optimization.
Increased regulation. This point is obvious when looking at the regulatory agencies for drugs (people’s lives are at stake). But it’s also true of software — antitrust laws, data privacy laws, and other industry regulations that weren’t in place when large technology companies were started.
Mythical man-month. More money and more research don’t automatically translate into more results. There might be more structural issues with the industry that prevent progress from occurring. However, I wonder why the advancements in computing haven’t translated to computational biology (e.g., in computer science, we’ve found that you can just throw more computing power at problems to find breakthroughs).

title: "The Lucretius Problem" tags: [] published_at: "2023-11-11T14:30:00.000Z" slug: the-lucretius-problem

Just as any river is enormous to someone who looks at it and who, before that time, has not seen one greater. So, too, a tree or man may also appear gigantic. With all things of every kind the largest that any man has seen he imagines as prodigious, even though all of them along with heaven and earth and ocean are nothing compared to the total sum of the universal whole.

— Titus Lucretius Carus, De rerum natura (“On the Nature of Things”)

When predicting the worst (or best) case scenario, we often anchor to the last worst (or best) event in the past. We fail to incorporate that the previous worst-case scenario was even worse than the one before it.

Nassim Nicholas Taleb called this cognitive bias the Lucretius problem.

Our experiences shape our expectations, and our experiences are limited, so our expectations are inherently skewed. It’s hard to generalize outside of our training data set. Sometimes, the past is the best predictor of the future. Especially when we’re given limited information, predicting within the known range of values makes sense. But the actual worst (or best) case scenario might be beyond our wildest dreams.

title: "The Call to Adventure" tags: [] published_at: "2023-11-10T14:30:00.000Z" slug: the-call-to-adventure

In The Hero with a Thousand Faces, Joseph Campbell laid out the structure for the monomyth (also known as the Hero’s Journey) — a template that many stories across various cultures and times seem to follow. Many famous movies and books can be mapped to the monomyth — Star Wars, Harry Potter, and The Lion King (Hamlet), to name a few.

The monomyth is a series of stages a hero goes through in an adventure. It roughly follows three major sections: departure, initiation, and return, further broken into subsections.

The first section of the first phase, departure, is called The Call to Adventure. The hero starts off in a mundane situation and receives information that acts as a call to head off into the unknown.

In Star Wars, Luke Skywalker lives a mundane life on his uncle’s Tatooine farm until he discovers Princess Leia’s call for help in R2-D2.
In Harry Potter, Harry lives in a cupboard under the stairs in his uncle’s house until he discovers he’s a wizard (“you’re a wizard, Harry”) and has been accepted to Hogwarts.
In The Lion King, Simba lives a carefree life until his father tells him he will inherit the kingdom.

The Call to Adventure is important to study because that’s how most narratives start. It’s a disruption in the equilibrium. And since narratives underpin almost everything (including, and maybe especially, startups), it can be a way of either identifying the start of a story or creating a new one.

So, how does The Call to Adventure start? A few different patterns.

Invitation from a Mentor or Guide
Discovery of a Personal Ability or Artifact
Threat or Attack on the Hero or Home
A Sudden Change in Circumstances
A Quest for Revenge or Justice
A Dream or Vision
The Pursuit of Love or Rescue Mission
Inadvertent Discovery or Mistake
Destiny or Prophecy
Personal Desire for Change
Call to Duty or Responsibility
Curiosity and Exploration
Natural Disaster or Phenomenon
Escape from Captivity or Oppression
Chosen by a Higher Power

title: "AI Agents Today" tags: [ai] published_at: "2023-11-09T14:30:00.000Z" slug: ai-agents-today

The term AI agent is used loosely. It can mean almost anything. Here are some more concrete patterns of what it means today:

LLM-in-a-loop. Use the output of an LLM as the input to a subsequent call. There might be some intermediate steps in the chain (preprocessing, templating, formatting).
Chatbot with custom personas. These agents take on a specific persona via custom instructions. There are sites like Character.AI that let you create “characters” and talk to them — from well-known characters from video games or television shows to made-up ones.
Code generation and execution via natural language. Given a natural language prompt, the LLM generates some code as part of its response and then executes it in a (hopefully) sandboxed environment.
Dynamic workflow engine. The LLM uses its output to generate a dynamic workflow that is then executed. This is different than the predetermined workflow in the LLM-in-a-loop pattern.
Tool use. Similar to the code generation and execution pattern, LLMs can call several predetermined tools to solve a query. These might just be functions or APIs that the LLM knows how to use. They might be learned over time and stored (e.g., Voyager).
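
The LLM-in-a-loop and tool-use patterns above can be sketched in a few lines. Everything here is hypothetical: `fake_llm` is a rule-based stand-in for a real model call, and the `CALL` / `FINAL:` text protocol and the `TOOLS` registry are invented for illustration rather than taken from any library.

```python
from typing import Callable, Dict


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it follows two fixed rules."""
    if "Observation: 4" in prompt:
        return "FINAL: 4"
    return "CALL add 2 2"


# Predetermined tools the "model" is allowed to call.
TOOLS: Dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(int(a) + int(b)),
}


def run_agent(question: str, llm: Callable[[str], str] = fake_llm, max_steps: int = 5) -> str:
    """Feed each model output (plus any tool result) back in as the next input."""
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(prompt)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("CALL "):
            name, *args = reply[len("CALL "):].split()
            result = TOOLS[name](*args)
            prompt += f"\n{reply}\nObservation: {result}"
    raise RuntimeError("no final answer within max_steps")


answer = run_agent("What is 2 + 2?")
print(answer)
```

The `max_steps` cap is the one design choice worth keeping in any real version: a loop driven by model output needs a hard stop.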

title: "Norvig's Agent Definition" tags: [] published_at: "2023-11-08T14:30:00.000Z" slug: norvigs-agent-definition

There’s no consensus on what an AI agent means today. The term is used to describe everything from chatbots to for loops.

In 1995, Stuart J. Russell and Peter Norvig gave an academic definition and a taxonomy in Artificial Intelligence: A Modern Approach.

“Anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators”

They classify agents into five different categories.

Simple Reflex Agents: These agents operate under the principle of condition-action rule, meaning they take action based on the current stimulus. They do not consider the history of their interactions with the environment and have no concept of the future; their decision-making is entirely present-focused.
Model-Based Reflex Agents: These agents improve upon simple reflex agents by maintaining some sort of internal state that depends on the stimulus history and thereby reflects at least some of the unobserved aspects of the current state. They use a model of the world to choose actions in a way that takes into account the state of the environment as well as the way the environment changes in response to their actions.
Goal-Based Agents: These agents further expand upon the capabilities of model-based agents by having the ability to set and strive for goals. They consider the future consequences of their actions and choose the ones that align with their goals. This often involves searching and planning, as they need to predict the outcomes of their actions to achieve their goals.
Utility-Based Agents: Unlike goal-based agents that have a binary view of success and failure, utility-based agents can measure the success of their actions on a continuum using a utility function. This allows them to compare different states according to a preference (utility) and to strive not only to achieve goals but to maximize their own perceived happiness or satisfaction.
Learning Agents: These are the most advanced type of agent in Russell and Norvig’s taxonomy. Learning agents can improve their performance over time based on their experiences. They have a learning component that allows them to adapt by observing what happens in the environment and a performance element that makes decisions based on learned information and innate knowledge. They can also contain components that allow them to make improvements to the learning component itself.
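
To make the first few categories concrete, here is a minimal sketch of three of them in a toy thermostat world. The world, the thresholds, and the function names are assumptions made for illustration; they do not come from the book.

```python
from typing import Dict, List

# Toy thermostat world: the percept is the current temperature in Celsius.

def simple_reflex_agent(temp: float) -> str:
    """Condition-action rule on the current percept only."""
    if temp < 19:
        return "heat"
    if temp > 23:
        return "cool"
    return "idle"

class ModelBasedReflexAgent:
    """Keeps internal state (the percept history) to infer a trend."""

    def __init__(self) -> None:
        self.history: List[float] = []

    def act(self, temp: float) -> str:
        self.history.append(temp)
        # The trend is not directly observed; it is inferred from history.
        trend = temp - self.history[-2] if len(self.history) > 1 else 0.0
        predicted = temp + trend
        return simple_reflex_agent(predicted)

def utility_based_agent(temp: float, target: float = 21.0) -> str:
    """Scores the predicted outcome of each action and picks the best one."""
    effects: Dict[str, float] = {"heat": 1.0, "cool": -1.0, "idle": 0.0}

    def utility(action: str) -> float:
        # Closer to the target is better; running the system has a small cost.
        cost = 0.0 if action == "idle" else 0.1
        return -abs(temp + effects[action] - target) - cost

    return max(effects, key=utility)
```

At 23 degrees the simple reflex agent stays idle, while the model-based agent that saw 22 and then 23 anticipates further warming and starts cooling. The utility-based agent ranks outcomes on a continuum rather than testing a fixed condition.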

title: "The Context Length Observation" tags: [] published_at: "2023-11-07T14:30:00.000Z" slug: the-context-length-observation

Large language models can only consider a limited amount of text at one time when generating a response or prediction. This is called the context length. It differs across models.

But one trend is interesting. Context length is increasing.

GPT-1 (2018) had a context length of 512 tokens.
GPT-2 (2019) supported 1,024.
GPT-3 (2020) supported 2,048.
GPT-3.5 (2022) supported 4,096.
GPT-4 (2023) first supported 8,192. Then 16,384. Then 32,768. Now, it supports up to 128,000 tokens.

Just using the OpenAI models for comparison, context length has, on average, doubled every year for the last five years. An observation akin to Moore’s Law:

The maximum context length of state-of-the-art Large Language Models is expected to double approximately every two years, driven by advances in neural network architectures, data processing techniques, and hardware capabilities.
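
As a rough check on the rate, the sketch below computes the average doubling time from the figures listed above. The choice of endpoints is an assumption, and the result is sensitive to it.

```python
import math

# Context lengths as listed above (year, tokens); GPT-4 uses the largest figure.
context_lengths = [(2018, 512), (2019, 1024), (2020, 2048), (2022, 4096), (2023, 128_000)]

def doubling_time_years(start, end):
    """Average years per doubling between two (year, tokens) points."""
    (y0, n0), (y1, n1) = start, end
    doublings = math.log2(n1 / n0)
    return (y1 - y0) / doublings

through_gpt35 = doubling_time_years(context_lengths[0], context_lengths[3])
through_gpt4 = doubling_time_years(context_lengths[0], context_lengths[4])
```

With these figures, the average works out to about 1.3 years per doubling through GPT-3.5 and about 0.6 years once the 128,000-token GPT-4 figure is included.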

Context length is generally hard to scale. The compute cost of standard attention grows quadratically with sequence length, and for many years its memory cost did too (until FlashAttention). It’s even harder to get models to use longer contexts well (early models with long context lengths had trouble considering data in the middle).

Without a long context, it is difficult to understand relationships and dependencies across large portions of text. Small context lengths require documents to be chunked up and processed bit by bit (with something like retrieval augmented generation).

With long enough context lengths, we might ask questions on entire books or write full books with a single prompt. We might analyze an entire codebase in one pass. Or extract useful information from mountains of legal documents with complex interdependencies.

What might lead to longer context lengths?

Advances in architecture. Innovations like FlashAttention reduced the memory footprint of exact attention from quadratic to linear in context length and cut memory traffic substantially. The number of operations is still quadratic, so doubling the context length still roughly quadruples the attention compute, but far longer sequences now fit on the same hardware.

Rotary Positional Encoding (RoPE) is another architectural enhancement that makes context length scale more efficiently. It also helps models generalize to longer contexts.

Advances in data processing techniques. You can increase context length in two ways. First, you can train the model with longer context lengths. That’s difficult because it’s much more computationally expensive, and it’s hard to find datasets with long context lengths (most documents in CommonCrawl have fewer than 2,000 tokens).

The second, more common, way is to fine-tune a base model with a longer context window. Code Llama is a 16k context length fine-tuned version on top of Llama 2 (4k context length).

Advances in hardware capabilities. Finally, the more we can make the attention mechanism and other bottlenecks in training and inference more efficient, the more they can scale with advances in the underlying hardware.

There’s still work to be done. How do we determine context length for data? It’s simple enough if it’s a single document (a book, a webpage, a file). But how should we represent an entire codebase in the training data? Or a semester’s worth of lectures from a college class? Or a long online discussion? Or a person’s medical records from their entire life?

title: "To be, or not to be; ay, there’s the point." tags: [] published_at: "2023-11-06T14:30:00.000Z" slug: to-be-or-not-to-be-ay-theres-the-point

It doesn’t have the same ring to it as the Hamlet that we know, but this is from the first published version of Hamlet in 1603. It’s known as a “bad quarto” because the text is of significantly lower quality than contemporary Shakespeare.

(A quarto is a type of pamphlet where you print eight pages on a sheet (four on each side) and then fold the sheet twice to form a book. Then there’s the folio, which is four printed pages per sheet (two on each side), folded once.)

The most reliable version of Shakespeare (what we read today) comes from the First Folio, published in 1623, seven years after Shakespeare’s death. Scholars are mixed on whether the bad quartos are legitimate or not. Or even how they differ so wildly from the First Folio.

Plays that have a “bad quarto”:

Henry VI, Part 2: Has a quarto named “The First part of the Contention betwixt the two famous Houses of York and Lancaster”, published in 1594.
Henry VI, Part 3: “The True Tragedy of Richard Duke of York”, in 1595.
Romeo and Juliet, in 1597.
Hamlet (also known as “Q1”), in 1603. And a better version in 1604 (the “good” second quarto, “Q2”).

So what are some hypotheses around why the “bad quartos” differ so wildly from contemporary Shakespeare?

Reconstructed from memory. Either an actor or an audience member reconstructed the play from memory.
Pirated. Copied during a performance by a competitor or someone wanting to sell or reconstruct the play.
Early drafts. Even though they are significantly different from the First Folio, there are 30 years in between where the plays could have been refined and improved.
Adaptations. The bad quartos are much shorter than the final plays. Maybe they were used for shorter plays or for specific audiences while touring.

It’s interesting to think of them as early drafts. To show that the greatest works are a result of continuous improvement rather than a burst of divine inspiration (well, you probably need a little of both).

Or even to understand the competitive dynamics of the late 16th-century theatre. How did Shakespeare and his benefactors protect their IP? How did most people experience the plays?

Here’s the most famous excerpt from the Hamlet bad quarto (Q1):

To be, or not to be; ay, there's the point.

To die, to sleep—is that all? Ay, all.

No, to sleep, to dream—ay, marry, there it goes,

For in that dream of death, when we awake,

And borne before an everlasting judge,

From whence no passenger ever returned,

The undiscovered country, at whose sight

The happy smile and the accursed damned,

But for this, the joyful hope of this,

Who'd bear the scorns and flattery of the world,

Scorned by the right rich, the rich cursed of the poor,

The widow being oppressed, the orphan wronged,

The taste of hunger, or a tyrant's reign,

And thousand more calamities besides,

To grunt and sweat under this weary life,

When that he may his full quietus make,

With a bare bodkin? Who would this endure,

But for a hope of something after death,

Which puzzles the brain and doth confound the sense,

Which makes us rather bear those evils we have

Than fly to others that we know not of?

Ay, that. O this conscience makes cowards of us all.

title: "Improving RAG: Strategies" tags: [] published_at: "2023-11-05T14:30:00.000Z" slug: improving-rag-strategies

Retrieval Augmented Generation (RAG) solves a few problems with LLMs:

Adds contextual private information without fine-tuning
Can effectively extend the context window of information an LLM can consider
Combats the hallucination problem by using ground truth documents.
Additionally, it may “cite” these documents in the output, making the model more explainable.

But there’s no single RAG pipeline or strategy. Most involve a vector database today. However, developers use plenty of strategies to improve RAG pipeline performance.

Chunking data. Documents can be chunked into smaller pieces to make semantic search more precise. It’s also a natural limitation if the documents themselves will be added to the prompt and need to fit inside the context window. Instead of matching a similar document with a query, you might match a page, section, or paragraph. There’s likely not a one-size-fits-all approach, as different document types will have different ways they can be logically chunked.
Multiple indices. Splitting the document corpus up into multiple indices and then routing queries based on some criteria. This means that the search is over a much smaller set of documents rather than the entire dataset. Again, it is not always useful, but it can be helpful for certain datasets. The same approach works with the LLMs themselves.
Custom embedding model. Fine-tuning an embedding model can help with retrieval. This is useful if the concept of similarity is much different for your document set.
Hybrid search. Vector search isn’t always (or usually) enough. You often need to combine it with traditional relational databases and other ways of filtering documents.
Re-rank. First, the initial retrieval method collects an approximate list of candidates. Then a re-ranking algorithm orders the results by relevance.
Upscaling or downscaling prompts. Optimize the query so that it works better in the search system. This could be upscaling the query by adding more contextual information before doing a semantic search or even compressing the query by removing potentially distracting and unnecessary portions.
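
A few of these strategies (chunking, first-stage retrieval, and re-ranking) can be sketched without any external services. In the sketch below, plain word counts stand in for an embedding model and a term-coverage score stands in for a re-ranker. Both are simplifying assumptions, and every function name is hypothetical.

```python
import math
import re
from collections import Counter
from typing import List

def chunk_by_paragraph(document: str) -> List[str]:
    """Split a document into paragraph-level chunks (one possible strategy)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def bag_of_words(text: str) -> Counter:
    # Stand-in for an embedding model: plain word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: List[str], k: int = 3) -> List[str]:
    """First stage: approximate candidate list by vector similarity."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

def rerank(query: str, candidates: List[str]) -> List[str]:
    """Second stage: reorder candidates by the fraction of query terms covered."""
    terms = set(bag_of_words(query))

    def coverage(chunk: str) -> float:
        return len(terms & set(bag_of_words(chunk))) / len(terms) if terms else 0.0

    return sorted(candidates, key=coverage, reverse=True)

document = """Refunds are issued within 30 days of purchase.

Shipping takes five business days.

To request a refund, email support with your order number."""

query = "how do I request a refund"
chunks = chunk_by_paragraph(document)
top = rerank(query, retrieve(query, chunks, k=2))
```

In a real pipeline the two stages would typically use different models, with a cheap one producing candidates and a more expensive one ordering them. The two-stage shape is the point here, not the scoring functions.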

title: "Static Sites Aren't Simple Anymore" tags: [] published_at: "2023-11-04T14:30:00.000Z" slug: static-sites-arent-simple-anymore

There is an iceberg of complexity under modern static sites. The complexity means that it’s harder than ever to build a statically generated site like this blog.

Yes, it’s possible (and even desirable in many cases) to publish raw HTML or markdown. Sometimes, a simple file server can suffice (or GitHub Pages). We used to drop files over FTP. Or run a small PHP script that served content. If you were at a university, you could log in and drop a file in your home directory that would be served (here’s my decade-old homepage on columbia.edu/~msr2174).

However, expectations for a statically generated site have risen drastically over the years. Readers want (rich) content served fast. Writers want dead simple (but expressive) writing and publishing. They want control over how their writing looks.

I’ve posted 904 blog posts on this blog(!). So, I’m no stranger to publishing static content. My blog is fairly simple, but there are still many optimizations to be made for a modern web experience. And I’ll be the first person to admit that I’ve over-engineered most of it.

But here are some of the things that modern web content publishers and consumers have come to expect.

Fast page loads. Things must load fast. Today, that means aggressive caching at the edge. Content needs to be served from a CDN. You don’t want to have to manage servers for static content anymore. Something like nginx seems nice until you realize that your viewers are hopping cross-country just to be served a few kB. For pages with high overlap, how do you make sure that as much of it is reused as possible? How do you serve static layouts first and hydrate them with actual content (so readers see something rather than a blank page)?

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

Easy to write. My content is simple. I’ll occasionally include a diagram or image, but it’s mostly text. I write every day, so I prefer to write in Apple Notes (so I can write on the go). I could write in Markdown or HTML, but that would just slow me down. I want to be able to publish and schedule content from anywhere, not just when I’m in front of the terminal.

Static sites are often dynamic sites in disguise. What happens when a post changes? As much as I love build systems, I don’t want to push a commit or start a CI pipeline every time I need to fix a typo or edit a sentence. Plus, a full rebuild might bust the cache for everything. This gets more complex when you have different routes with overlapping information. When I change the title of a post, I should invalidate the cache on the list of all posts, the post page, and maybe even the homepage or RSS feed if it’s recent. Doing the minimal amount of work is sometimes the hardest.
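
The minimal-invalidation idea can be sketched as a mapping from a changed post to the routes that embed it. The route paths and the in-memory cache below are hypothetical; a real setup would call a CDN purge API instead.

```python
from typing import Dict, List, Set

# Hypothetical route layout for a blog; the paths are illustrative only.
def routes_for_post(slug: str, is_recent: bool) -> Set[str]:
    """Routes whose rendered output includes this post's title."""
    routes = {f"/{slug}", "/posts"}
    if is_recent:
        # Recent posts also appear on the homepage and in the RSS feed.
        routes |= {"/", "/rss.xml"}
    return routes

def invalidate(cache: Dict[str, str], routes: Set[str]) -> List[str]:
    """Drop only the affected routes, leaving every other cached page intact."""
    purged = sorted(r for r in routes if r in cache)
    for route in purged:
        del cache[route]
    return purged

cache = {
    "/": "<html>home</html>",
    "/posts": "<html>list</html>",
    "/old-post": "<html>old</html>",
    "/new-post": "<html>new</html>",
    "/rss.xml": "<rss/>",
}
purged = invalidate(cache, routes_for_post("old-post", is_recent=False))
```

Editing the title of an older post purges its own page and the post list, while the homepage, the feed, and every other post stay cached.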

Interactive. Why does a static site need JavaScript? Well, it really doesn’t. But there are so many things that require just a little bit of JavaScript. What if you want to do some validation on a signup form? Add a few more posts as users scroll the page? Dropdowns? Basic analytics? Syntax highlighting for code snippets?

Once you add JavaScript, you bring on a lot of baggage. That means bundling, code splitting, tree-shaking, and everything else associated with making the JavaScript that’s served as small as possible.

Easy to design. While not entirely necessary, I’d like to design my blog in a simple way. As much as static site generator frameworks are complicated, custom theming frameworks are even worse. They become jumbled templates quickly (so turns the Heptagon of Configuration). There are many possible solutions here, but I enjoy the declarative style of React. It’s just code. The methods of encapsulation and reuse make sense to me.

No infrastructure to manage. Well, there’s always some sort of infrastructure to manage. Even if that’s a codebase. But I’d prefer to have everything serverless. There’s still a server somewhere, but I don’t have to worry about log rotation, storage, kernel updates, or deployments.

Oh, and you probably want to serve your content over HTTPS. Why? Because browsers might flag your content otherwise. It might not have the same benefits as it does for dynamic content, but it still adds privacy for the reader and some assurance that the content they’re reading is from the site they expect. Managing certificates is another piece of necessary infrastructure.

Simplicity is the goal (stop overengineering), but the requirements for a performant modern website have changed, even for statically generated ones.

title: "Lessons from llama.cpp" tags: [] published_at: "2023-11-03T14:30:00.000Z" slug: lessons-from-llama-cpp

Llama.cpp is an implementation of Meta’s LLaMA architecture in C/C++. It’s one of the most active open-source communities around LLM inference.

Why did llama.cpp become the Schelling point around LLM inference? Why not the official Python implementation by Meta? Why not something written in TensorFlow, PyTorch, or another machine learning framework rather than a bespoke one?

Runs everywhere. Llama.cpp was originally a CPU-only library. CPU-only meant orders of magnitude less code to work with. Writing it in C++ also meant that it could be easily imported into higher-level languages via bindings. Go bindings power ollama (because Go is one of the easiest languages to write a good CLI tool in). Support later came for Apple Silicon and GPU frameworks. But CPU-first was clearly the best way to get llama.cpp in the hands of developers quickly (and in as many places as possible).

Schelling point for low-level features. Just like LangChain subsumed every high-level LLM feature (like chain-of-thought and RAG), llama.cpp has done that for low-level features. ReLLM and ParserLLM found their way into llama.cpp (and for what it’s worth, they are in LangChain as well) (see this initial PR in llama.cpp). It’s hard to know what will be important, so many features end up in the library. Over time, some of these will be difficult to maintain and will probably need to find a new home.

Custom model format (“library lock-in”). GGML/GGUF is a custom format for quantized models. GGML is a one-way transformation — once you quantize your models you can’t unquantize them. GGML models only work with llama.cpp (although it’s all open-source so you could write your own). It was a necessary development (since llama.cpp doesn’t use something like PyTorch) that had some strategic implications.
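
To see why quantization is one-way, consider a generic symmetric 4-bit scheme applied to a single block of weights. This is a simplified sketch, not GGML’s actual block layout, and the function names are hypothetical.

```python
from typing import List, Tuple

def quantize(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric quantization of one block to signed integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to floats; the rounding error is not recoverable."""
    return [v * scale for v in q]

weights = [0.70, 0.31, -0.12, 0.04]
q, scale = quantize(weights)
restored = dequantize(q, scale)
```

The block is stored as the integers 7, 3, -1, and 0 plus one scale, and the restored values come back as roughly 0.7, 0.3, -0.1, and 0.0. The differences between 0.31 and 0.3, or 0.04 and 0.0, were discarded at rounding time, which is why the original weights cannot be reconstructed.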

Bet on the right horse (llama). While other libraries like HuggingFace transformers are general purpose, llama.cpp was able to focus on a single model architecture. This meant all sorts of optimizations. GGML only worked for Llama models (until GGUF, its replacement, came along). The developer, Georgi Gerganov, had done a similar port a few months earlier for OpenAI’s speech-to-text Whisper model, which was successful but not on the same scale.

title: "Why Model Evaluation is Difficult" tags: [] published_at: "2023-11-02T14:30:00.000Z" slug: why-model-evaluation-is-difficult

Model evaluation is still more art than science. New models claim to have superior performance every week. Practitioners have their own favorite models. Researchers continue to develop frameworks, only to have unique use cases break them.

Evaluation tests don’t reflect real-world usage. It’s difficult to build a high-quality test set that covers a seemingly endless number of use cases with natural language. Many use cases are found daily and aren’t reflected in the evaluation set.
What metrics matter? How do you measure things like model “creativity”?
Overfitting. A problem with every model (even the ones that aren’t “machine learning”). LLMs consume trillions of tokens, some of which might include parts of the test set in some form.
It’s expensive. It’s expensive to build and evaluate test datasets (especially ones graded by other LLMs).

Some more specific methods and where they fall short:

Perplexity. Measures how well the probability distribution predicted by the model aligns with the actual distribution of words. It is not always correlated with human judgment. Doesn’t work as well comparing models across different tasks.
GLUE (General Language Understanding Evaluation). A collection of NLP tasks. Doesn’t
Human evaluation.
LLM evaluation.
BLEU (Bilingual Evaluation Understudy). Compares n-grams in the model’s output to reference outputs. Sensitive to slight variations in wording (only exact n-gram matches count). Related metrics include ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit ORdering).
F1 Score/Precision/Recall. A classic way of measuring model quality. Evaluates the balance between precision and recall.
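
Three of these metrics are simple enough to sketch directly. The versions below are simplified: the n-gram precision is only the building block of BLEU (no brevity penalty, a single n), and the token probabilities passed to the perplexity function are made-up inputs rather than real model outputs.

```python
import math
from collections import Counter
from typing import List

def perplexity(token_probs: List[float]) -> float:
    """exp of the average negative log-probability assigned to each actual token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def ngram_precision(candidate: List[str], reference: List[str], n: int = 1) -> float:
    """Clipped n-gram precision: share of candidate n-grams found in the reference."""
    def ngrams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat".split()
exact = ngram_precision("the cat sat on the mat".split(), reference, n=2)
paraphrase = ngram_precision("a cat was sitting on a mat".split(), reference, n=2)
```

The identical sentence scores 1.0 on bigram precision, while a reasonable paraphrase scores 0.0 because none of its bigrams match exactly. That gap is the sensitivity to wording described above.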

title: "Mechanical Turks" tags: [] published_at: "2023-11-01T14:30:00.000Z" slug: mechanical-turks

The Mechanical Turk was a chess-playing AI constructed in 1770. For eighty-four years, the machine toured and beat most human opponents. It could also do tricks like the knight’s tour (moving a knight to land on every chessboard square exactly once). It was originally made to impress the Empress of Austria. It supposedly defeated Napoleon and Benjamin Franklin in chess games.

But the Mechanical Turk wasn’t actually an AI — it was just a machine that cleverly concealed a human inside. The interior was intentionally misleading. It had a series of cabinets that opened and gave the illusion of moving gears and open compartments (in fact, the operator had a sliding seat so that they could move back and forth as the viewers inspected the insides). The pieces moved with strong magnets (although the inventor carefully ensured external magnets didn’t affect the board). The board was numbered inside the box.

The idea of the Mechanical Turk was revived in 2005 when Amazon launched its Amazon Mechanical Turk service. It’s a simple interface that lets requesters post “human intelligence tasks” (“HITs”) for humans to complete for a small amount of cash. These small tasks included transcribing audio, rating products, image tagging, or surveys. The requester works through an API and doesn’t have to worry about scheduling or distributing the tasks among workers. Workers select whatever jobs they want.

Mechanical Turk has been especially useful for researchers collecting data and for companies building labeled training sets for machine learning models. Companies like Scale AI have evolved the idea and built specialized tagging tools for workers.

But what happens in the world of LLMs? Most “Turkers” use ChatGPT or a similar tool behind the scenes. Labeled data (still important, but not as important as in the last wave) can now be contaminated by other models. Pre-LLM labeled data might become the low-background steel of AI.

Mechanical Turk was once called “artificial artificial intelligence” by Jeff Bezos. It will be interesting to see what the Mechanical Turk of LLMs will be.

title: "Regulatory Capture in the Railroad Industry" tags: [] published_at: "2023-10-31T14:30:00.000Z" slug: regulatory-capture-in-the-railroad-industry

The Interstate Commerce Commission (ICC) was created in 1887 to regulate the rates and practices of railroads. After decades of monopolistic practices, the ICC was supposed to protect consumers.

Regulatory capture is when the regulatory agency, which is supposed to act in the public interest, becomes dominated by the industry or sector it is charged with regulating.

However, the ICC ended up protecting many of the interests in the railroad industry (and later, the trucking industry).

Favorable rate setting. The ICC was supposed to set fair and reasonable rates for shipping goods (Hepburn Act). However, the rates ended up disproportionately favoring railroad companies over smaller shippers. The ICC made exceptions for some of the biggest companies with loopholes (like the exemption for “private car lines”).

Barrier to entry for competitors. The licensing and approval processes made it difficult for new entrants. The government would decide what routes could be served by which companies.

Industry influence. The ICC appointed railroad industry veterans. Likewise, the revolving door continued as retired ICC commissioners found jobs at the companies they once regulated. When companies failed to follow the rate or safety guidelines, the ICC was slow to prosecute.

Complex rule-making. Established companies could more easily navigate the complex rules and regulations set forth by the ICC, effectively sidelining smaller or newer companies.

title: "What If OpenAI Builds This?" tags: [] published_at: "2023-10-30T14:30:00.000Z" slug: what-if-openai-builds-this

OpenAI just released an update to ChatGPT that allows you to upload and “chat” with your PDF documents. This feature has been one of the most popular indie hacker products to build, with some reaching six or seven figures in ARR. Does this mean the end of these wrappers across the board? Some thoughts:

More competition, lower margins. This is a product that was bound to get cheaper. It’s easier than ever for developers to launch something like this. The best distribution channels and SEO are now crowded. Whether it’s OpenAI taking this margin or niche competitors building on better APIs, similar products will probably move towards the cost of inference.
Focus and distribution matter. Even though OpenAI has the benefit of seeing what’s working with its API, it can’t tackle all of the problems (but it can solve a lot). Google might be an interesting example — it captured many of the opportunities adjacent to search, but not everything.
“What if OpenAI builds this?” is the new “What if Google builds this?”. Many of the takeaways are the same: large companies find it hard to rationalize entering a small market, large companies can’t navigate the idea maze as well as startups, and large companies have structural issues as to why they can’t compete in a new market. None of these apply to OpenAI and Chat with your PDF (which is the problem). However, there will be many wrappers that are at odds with some form of OpenAI’s business model (e.g., usage of the API, AI safety, training data, etc.).
Expansion is key (quickly). Once the initial idea is validated and finds a semblance of product-market fit, you need to expand into the adjacent problems. It could be as simple as supporting UX that’s materially different from chat or a more complicated backend pipeline. Some of the best opportunities are time-limited.

title: "On Mixing Client and Server" tags: [] published_at: "2023-10-29T14:30:00.000Z" slug: on-mixing-client-and-server

Mixing client and server code is the new paradigm in React with Server Components. Components now run exclusively on the server by default, and the “use server” directive marks functions that the client can call but that execute only on the server. This means that you can do things like write asynchronous database queries right in the component code. You might even mix SQL or a different language right in your JSX code.

Some thoughts on the benefits and drawbacks of this architecture.

Why is this good?

More performant (if used correctly). The naive way to deliver modern web applications in React was to serve a large JavaScript bundle, render a shell layout, perform a data fetching request to hydrate the page, and then render the content. Users stared at a blank page until the JavaScript was downloaded, and then a shell until the data was fetched. Some frameworks made optimizations to this, sending the shell HTML first along with the JavaScript. Users at least saw a general layout quickly rather than a blank page. But there was still lots of chatter between client and server before the data was fetched and rendered.
Colocated code (no context switching). TailwindCSS is popular partly because it allows frontend developers to write CSS in the same files as their other code. I imagine the same will be true of melding code normally reserved for backends into the “frontend”. Quicker iteration time and less back-and-forth between developers working on different parts of the codebase. Iterate on the API in tandem with the frontend code (which is normally the case anyway).
React Component is the new API. This means it might be easier for companies to ship rich components that include server routes. You couldn’t really do this before. For example, maybe a form submission that interacts with an external API with an API Key that isn’t exposed to the client. Before, you would have to import the frontend client code as well as set up an API route to proxy the request with the API Key. Now, you can just import the component and be done.

Why might this cause problems?

Is this a client or server component? Every component is run on the server by default (even though the “use client” directive refers to how all components used to work). It forces developers to think through where the code is running. This is confusing because it doesn’t actually abstract away any of the complexity that’s normally reserved for a runtime.
Extends to dependencies. The server component feature is viral: it touches not only the code getting written but all of the dependencies. This is going to lead to a lot of refactoring and to migrations that are difficult for library maintainers to support. Server components add another layer to the decision tree of how you organize your components.
Separation of concerns. Just because you can, doesn’t mean you should. Having the client/server boundary in separate “applications” (languages, folders, projects, deployments) is something that keeps code healthy. Mixing client and server code leads to spaghetti code if you aren’t careful (almost by definition).
Complicated render pipeline to debug. Things should work, but when they don’t, it will be more difficult to debug. Developers will have trouble debugging where things are rendering. With streaming server-side rendering and

… [truncated — open the raw llms.txt above for the full file]