Obsidian Vault Feed

Post by @dillon_mulroy on X

Thu, 28 May 2026 15:28:16 GMT

here's my pi session for those asking how i get to this point https://pi.dev/session/#5ddc8747b98a80ad993c18e95270038e…

> **Dillon Mulroy @dillon\_mulroy** · 2026-05-28
>
> my "plans" largely look like pseudo code composed of mostly types/interfaces, how they compose, and their boundaries
>
> ive recently started including call stacks - been very helpful for both me and agents when implementing
>
> ![Image](https://pbs.twimg.com/media/HJaKEh8WEAEWv4h?format=jpg&name=large) ![[Pasted image 20260528232830.png]]

fix-bug

Mon, 25 May 2026 06:54:02 GMT

Almost always, the correct fix is not to handle the bad state but to make the bad state impossible.

Cross-Repo Handoff Protocol: Coordinating AI Coding Agents Across Separate Repositories

Mon, 25 May 2026 06:53:47 GMT

AI coding agents are getting incredibly good at working inside a single repository. They can inspect files, pick up on local conventions, write tests, and churn out solid code.

But let's be real: modern software products are rarely confined to a single repo. When you're building a new feature, you're usually touching a backend API, a frontend UI, maybe a mobile app, infrastructure, and shared packages. Not to mention updating documentation and analytics events.

This brings up a fascinating new challenge for the era of AI-assisted development: **How do we coordinate AI agents when each one is isolated in a different repository?**

This article proposes a lightweight pattern to solve this: the **Cross-Repo Handoff Protocol**, a simple, file-based coordination layer for AI coding agents.

---

## The Problem: Agents in Silos

Imagine you have two AI agents working on a feature. One is building the API, the other is putting together the UI. Without a coordination layer, each agent might solve its local part perfectly, but completely break the end-to-end product flow.

When these agents hit boundaries that depend on contracts, things get messy quickly. Who decides what endpoint the UI should call? What shape should the response take? Are there required fields, or specific error states to handle?

If we don't coordinate them, agents tend to duplicate assumptions, hallucinate API contracts, and miss crucial frontend/backend dependencies. They treat local success ("Hey, my tests passed!") as system success, leaving unresolved questions buried in their isolated conversation histories.

The core issue isn't code generation; it's **cross-repo communication**.

One option is to open your entire workspace and let a single "god agent" handle everything. For small tweaks, this works fine. But for larger features, the context window gets noisy, responsibilities blur, and the agent starts making wild assumptions across boundaries.

Alternatively, you can run one focused agent per repository.
This gives them great focus, but now you're stuck manually playing telephone, copy-pasting context between their chat windows.

---

## The Proposal: A File-Based Message Bus

We need a lightweight pattern. Instead of waiting for complex centralized agent runtimes or full agent-to-agent network protocols, we can use something every agent already understands: **files.**

The goal is simple: **Use the filesystem as the message bus.** This keeps agents independent and repositories decoupled, while making their coordination visible, reviewable, and auditable by humans.

### Why Files?

Files are dead simple. They are versionable, diffable, tool-agnostic, and compatible with Git. They're easy for us humans to inspect, and trivial for agents (like Claude Code, Cursor, or Goose) to read and write.

This isn't meant to replace future, sophisticated agent-to-agent protocols. It's a practical pattern that works *today* with the tools we already have.

---

## The Workspace Layout

The protocol uses a neutral, shared directory outside the individual repositories. Let's look at an example:

```
platform-workspace/
  .ai-workflow/
    issues/
      TICKET-123/
        issue.md
        shared-context.md
        contract.md
        decisions.md
        status.md
        messages/
        handoff/
        verification/
  product-api/
  product-ui/
```

In this setup:

- `product-api/` is the domain of the API agent.
- `product-ui/` is the domain of the UI agent.
- `.ai-workflow/` is the shared coordination layer.

Agents don't need to touch each other's codebases to talk. The API agent stays in its lane, the UI agent stays in its own, and both read and write structured markdown files in the `.ai-workflow` directory.
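As a rough illustration of the layout above, here is a minimal sketch that lists the coordination paths for one ticket. The file and directory names come from the example tree; the helper name `ticketLayout` and its `root` parameter are hypothetical, not part of the protocol.

```typescript
// File and directory names taken from the example workspace tree.
const TICKET_FILES = ["issue.md", "shared-context.md", "contract.md", "decisions.md", "status.md"];
const TICKET_DIRS = ["messages", "handoff", "verification"];

// Hypothetical helper: returns every coordination path for one ticket,
// so a scaffolding script (or an agent) knows what to create or read.
function ticketLayout(root: string, ticketId: string): { files: string[]; dirs: string[] } {
  const base = `${root}/.ai-workflow/issues/${ticketId}`;
  return {
    files: TICKET_FILES.map((name) => `${base}/${name}`),
    dirs: TICKET_DIRS.map((name) => `${base}/${name}`),
  };
}

const layout = ticketLayout("platform-workspace", "TICKET-123");
console.log(layout.files.join("\n"));
```

Keeping the layout in one function means both agents resolve the same paths instead of each hard-coding its own guess.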
![](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8vcjk3eihg4w1n0shiu.jpg)

---

## The Core Principle: Contracts over Conversations

Agents shouldn't communicate by injecting comments into each other's code. They should use an auditable workflow layer. We need to enforce explicit boundaries:

- **Messages** are temporary discussions.
- **Contracts** are the source of truth.
- **Decisions** are permanent records.
- **Handoffs** are delivery summaries.
- **Verification** proves the system works.

Why? Because a question isn't a contract. An implementation summary isn't an end-to-end test.

### Key Files in the Workflow

- `issue.md`: The original task. Includes acceptance criteria, product context, and affected repos.
- `shared-context.md`: The overarching business goals, user flows, and domain constraints both agents need to know.
- `contract.md`: The agreed-upon cross-repo behavior. **Once agreed, this is the source of truth.** This prevents an agent from implementing based on a stale assumption. For example:

```
# Contract: User Display Name

## Endpoint
`GET /api/users/{id}`

## Response shape
{
  "id": "string",
  "displayName": "string",
  "firstName": "string"
}

## Rules
- `displayName` is required.
- If `displayName` is missing, the UI may fall back to `firstName + lastName`.
```

- `decisions.md`: Permanent records of choices made ("The API will return `displayName` directly").
- `status.md`: Tracks overall progress and blockers.

We also use directories for specific lifecycle stages:

- `messages/`: Asynchronous chat between agents (e.g., `001-api-to-ui-question.md`).
- `handoff/`: Summaries of what each agent built, changed files, and local tests run. For example, an `api-handoff.md` would list exactly what controllers were touched, the tests run, and notes for the UI agent.
- `verification/`: Records of validation (integration checklists, QA notes). It separates "I wrote the code" from "the system works."

---

## The Message Lifecycle

Every message file needs structured metadata so agents know what to do with it. At a minimum, it should include sender, receiver, type, status, and any blocking info.

**Example Question (from API to UI):**

```
---
id: 001
issue: TICKET-123
from: api-agent
to: ui-agent
type: question
status: NEEDS_RESPONSE
created_at: 2026-04-30T13:00:00-06:00
blocks:
  - api-implementation
---

## Question
Does the UI need this field as `displayName`, or should it compose it from `firstName` and `lastName`?

## Context
The API can support either option, but we need to avoid breaking existing consumers.
```

**Example Answer (from UI to API):**

```
---
id: 002
issue: TICKET-123
from: ui-agent
to: api-agent
type: answer
status: ANSWERED
responds_to: 001
created_at: 2026-04-30T13:15:00-06:00
---

## Answer
The UI prefers receiving `displayName` directly. The UI can keep a fallback using `firstName` and `lastName` for older responses during rollout.
```

By standardizing message types (`hello`, `question`, `proposal`, `decision`, `blocker`, `handoff`, `close`), we keep the workflow predictable. An issue cannot be considered complete while any message is still marked `OPEN`, `NEEDS_RESPONSE`, or `BLOCKED`.

---

## Handshakes and Hangups

One of the biggest headaches in multi-agent coding is when one agent thinks the job is done, while the other is still waiting for an answer.

**Start with a handshake:** When an agent spins up, it writes a `hello` message declaring its scope. The API agent explicitly states, "I will only modify `product-api` and I cannot modify frontend routes."
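The message metadata and the completion rule can be checked mechanically. The following is a minimal sketch, assuming messages use flat `key: value` front matter as in the examples above; `parseMeta` and `canClose` are hypothetical names, and a real setup would likely use a proper YAML parser.

```typescript
// Statuses that keep an issue open, as listed in the article.
const UNRESOLVED = new Set(["OPEN", "NEEDS_RESPONSE", "BLOCKED"]);

interface MessageMeta {
  id: string;
  from: string;
  to: string;
  type: string;
  status: string;
}

// Minimal front-matter reader: handles flat `key: value` lines only.
// List fields such as `blocks:` are skipped in this sketch.
function parseMeta(file: string): MessageMeta {
  const match = file.match(/^---\n([\s\S]*?)\n---/);
  if (!match) throw new Error("message is missing front matter");
  const fields: Record<string, string> = {};
  for (const line of (match[1] as string).split("\n")) {
    const kv = line.match(/^([a-z_]+):\s*(\S.*)$/);
    if (kv) fields[kv[1] as string] = (kv[2] as string).trim();
  }
  return {
    id: fields["id"] ?? "",
    from: fields["from"] ?? "",
    to: fields["to"] ?? "",
    type: fields["type"] ?? "",
    status: fields["status"] ?? "",
  };
}

// An issue may close only when no message is still unresolved.
function canClose(files: string[]): boolean {
  return files.every((file) => !UNRESOLVED.has(parseMeta(file).status));
}
```

An agent (or a CI check) could run `canClose` over the contents of `messages/` before writing its `close` message.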
**End with a hangup:** At the end, agents explicitly close their loop. The API agent writes a `close` message stating its implementation is done, tested, and it has no pending questions. The UI agent replies with a `close_ack`.

This explicit closure gives human reviewers confidence that both sides reached a natural, synchronized stopping point.

---

## Verification: Trust, but Verify

Local tests aren't the same as cross-repo validation. Just because both repos passed their own tests doesn't mean the feature works. The `verification/` folder is where you ensure the integration checklist is checked off. Are the contracts respected? Do the error fallbacks trigger correctly? Did we actually run an end-to-end test?

---

## Trade-offs: Nothing is Free

Let's be pragmatic: this pattern introduces overhead.

- **Token usage:** If agents blindly read the entire workflow folder instead of just relevant files, token costs will spike.
- **Noise:** Turning every minor clarification into an artifact can clutter the workspace.
- **Stale State:** If `contract.md` or `status.md` aren't updated, agents will hallucinate based on bad data.
- **False Confidence:** Checking a box in a markdown checklist doesn't mean the code actually works.

This protocol is most useful when the cost of a wrong cross-repo assumption is much higher than the cost of writing the handoff files. For tiny, isolated tweaks, it's definitely overkill. A good rule of thumb is to only make agents read what they need (e.g., `issue.md`, `contract.md`, and unresolved `messages/`).

---

## Final Thoughts

As AI coding agents become permanent fixtures in our engineering teams, the bottleneck will shift. It won't just be about generating code faster; it will be about coordinating these agents across repositories, contracts, and delivery phases.

In human teams, we solve this with Jira tickets, design docs, Slack threads, and PR reviews.
AI agents need something similar: something explicit, readable by machines, and auditable by humans. The Cross-Repo Handoff Protocol keeps agents independent and repos decoupled, while turning their coordination into a tangible artifact. Most importantly, it helps prevent local agent success from turning into a system-level failure.

*What do you think? Have you run into the multi-agent coordination wall yet? Let's discuss in the comments!*

Learning how to design an agent-friendly CLI from the GitHub CLI

Sat, 23 May 2026 14:22:58 GMT

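GitHub's `gh` is probably one of the most frequently used commands in vibe coding, and the CLI itself is designed to be very agent-friendly, which makes it well worth studying. I think the most valuable lesson from gh is not any specific command but how it handles two key problems. First, GitHub's capability surface is so large that CLI commands can easily explode in number. Second, an agent's context is expensive, so command output cannot stuff in irrelevant information.

Start with command explosion. GitHub's capability surface is huge. If every capability became its own command, you would quickly end up with:

```bash
gh issue list
gh issue create
gh issue comment delete
gh repo deploy-key add
gh project item archive
...
```

As commands pile up, the whole `--help` becomes hard to maintain, and the agent is forever guessing whether a given operation has a corresponding command.

`gh api` solves this in a clever way (which, of course, builds on GitHub's own very complete RESTful API):

```bash
gh api repos/epiral/bb-viewer/issues/comments/4517246421 -X DELETE
```

It is not simply "you can call the API directly". It carries the resource model of the RESTful API over into the CLI: the path locates the resource, the HTTP method expresses the action, and gh handles authentication and output format uniformly. So when the documentation shows:

```text
DELETE /repos/{owner}/{repo}/issues/comments/{comment_id}
```

you can write it, with almost no translation, as:

```bash
gh api repos/epiral/bb-viewer/issues/comments/4517246421 -X DELETE
```

This matters a lot for agents. The API documentation itself becomes the CLI manual. The agent does not need to learn a DSL entirely different from the API docs, nor wait for the CLI author to wrap every long-tail capability in a command. For example, querying a PR directly by REST path:

```bash
gh api repos/cli/cli/pulls/13492 --jq '{number: .number, title: .title, state: .state}'
```

Output:

```json
{"number":13492,"state":"open","title":"Replace SITE_DEPLOY_PAT with gh-cli-site-deployer App"}
```

The path locates the resource, `--jq` trims the fields, and nothing in the process requires remembering a dedicated command. Behind this is really a resource interface layer. The resource layer solves the coverage problem: there are many capabilities, but the syntax can be uniform.

But resources are not a cure-all. A long-standing problem with REST is that some user intents are hard to express naturally as resources. Take login: it is not CRUD on some resource. Take clone: it involves the remote repo as well as the local filesystem and git state. Take checkout: it does not update a remote resource but switches the local working-tree state. Take merge: it often bundles several low-level actions, but what the user expresses is not "modify this field, then delete that branch" but "merge this change in".

So a CLI still needs a command layer. A command is not a synonym for "multi-step orchestration"; that is only one common source of commands. The essence of a command is to carry the user intents that cannot be naturally expressed as resources.

Abstracting one step beyond gh, you can separate resources and commands explicitly in the syntax. One possible scheme is to mark resources with a `/` prefix:

```bash
# command: no / prefix, expresses an action
cli login
cli clone epiral/bb-viewer
cli checkout 353
cli merge 353 --squash
cli status

# resource: / prefix, expresses an object path
cli /issues list
cli /issues/42 get
cli /issues/42 update state=closed
cli /issues/42 delete
cli /issues/42/comments create body="LGTM"
cli /projects/4/items list
```

The advantage of `/` is that it does not take up any word. Using `api` would suggest a traditional API wrapper; `resource` is too verbose; writing `issues list` directly is ambiguous: is `issues` a command or a resource? `/issues` is clear: it is a resource path.

The verbs of the resource layer can converge on a very small set:

```text
list get create update delete
```

That way, once an agent has learned one resource, it has essentially learned them all.

```bash
cli /issues list state=open author=me
cli /issues create title="Fix login" body="..."
cli /issues/42 update state=closed
cli /issues/42/comments create body="LGTM"
```

Parameters can be kept distinct as well: `key=value` is a resource parameter, `--flag` controls CLI behavior.

```bash
cli /issues list state=open --json --jq '.[].title'
```

The resource layer covers long-tail capabilities; the command layer expresses high-level intent. The two are complementary, not substitutes.

Now the second problem: output pollution. In an agent workflow, more command output is not better. Irrelevant fields entering the context not only waste tokens but also pollute the semantic space and interfere with later reasoning.

gh's `--json` / `--jq` are well worth learning from. Without trimming, the agent might get output like this:

```bash
gh pr list -R cli/cli --limit 1 --json number,title,author,labels,state,reviewDecision,updatedAt
```

```json
[
  {
    "author": {
      "id": "MDQ6VXNlcjE2MTE1MTA=",
      "is_bot": false,
      "login": "williammartin",
      "name": "William Martin"
    },
    "labels": [],
    "number": 13492,
    "reviewDecision": "REVIEW_REQUIRED",
    "state": "OPEN",
    "title": "Replace SITE_DEPLOY_PAT with gh-cli-site-deployer App",
    "updatedAt": "2026-05-22T16:54:37Z"
  }
]
```

But if the next step only needs the PR title, only one line really needs to enter the context:

```bash
gh pr list -R cli/cli --limit 1 --json title --jq '.[0].title'
```

```text
Replace SITE_DEPLOY_PAT with gh-cli-site-deployer App
```

The value of `--jq` is not "saving a pipe"; it is that trimming happens before the information enters the LLM context. It first reduces token waste, then reduces the interference of irrelevant fields with later reasoning.

There is another case: the default output is itself semantic.

```bash
gh pr view 13492 -R cli/cli
```

```text
title: Replace SITE_DEPLOY_PAT with gh-cli-site-deployer App
state: OPEN
author: williammartin
reviewers: copilot-pull-request-reviewer (Commented), BagToad (Requested)
number: 13492
url: https://github.com/cli/cli/pull/13492
additions: 26
deletions: 2
--
## Summary

Replaces the personally-held SITE_DEPLOY_PAT used by the release workflow...
```

Output like this is easier for an LLM to understand directly than a blob of JSON. So JSON's role should be chaining and precise extraction, not the default cognitive interface. Default output should be as semantic as possible. The previous post covered this in detail, so here is just a callback: natural language is the representation models are better at.

If you design your own resource-style CLI, you can follow the same idea:

```bash
cli /issues/42 get
```

Default output:

```text
Issue #42: Fix login bug
State: open
Author: epiral
Labels: bug, auth
Updated: 2h ago

Login fails when session expires.
```

When you need chaining:

```bash
cli /issues/42 get --json --jq '.title'
```

```text
Fix login bug
```

Semantic by default, structured when needed.

Finally, some execution-level design choices are also worth learning. gh's flags are very consistent: `--repo`, `--assignee`, `--label`, `--json`, `--jq`, and `--web` are reused across commands. For humans this lowers the learning cost; for agents it improves generalization.

`--web` is a natural fallback:

```bash
gh pr view --web
```

A CLI does not need to pretend to cover every interaction. Some things are simply better on the web. There is also non-interactive mode:

```bash
GH_PROMPT_DISABLED=1 gh pr create --title "fix bug" --body "..."
```

`--yes` skips confirmation, `--dry-run` previews, and tokens go through environment variables. These are the infrastructure that lets an agent use a CLI reliably.

So an agent-friendly CLI is not just a machine-readable CLI. Machine-readable solves program chaining; agent-friendly also has to solve semantic understanding. Structurally it should be stable and composable: resource paths, uniform verbs, `--json`, `--jq`. Semantically it should suit LLMs: natural-language default output, clear error messages, little token noise.

That is the core direction to learn from gh: use a resource layer to avoid command explosion, use a command layer to carry the actions that cannot be expressed as resources, and trim output to avoid polluting the agent's context.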
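To make the proposed grammar concrete, here is a minimal sketch of a parser that separates `/`-prefixed resource calls from plain commands, and `key=value` parameters from `--flag` options. The `cli` syntax is the article's proposal; the names `parseInvocation`, `Invocation`, and `VALUE_FLAGS`, and the choice that only `--jq` takes a value, are assumptions made for illustration.

```typescript
type Flags = Record<string, string | true>;

type Invocation =
  | { kind: "resource"; path: string; verb: string; params: Record<string, string>; flags: Flags }
  | { kind: "command"; name: string; args: string[]; flags: Flags };

const RESOURCE_VERBS = new Set(["list", "get", "create", "update", "delete"]);
// Flags that consume the following token as their value (assumed set).
const VALUE_FLAGS = new Set(["--jq"]);

function parseInvocation(argv: string[]): Invocation {
  const [head, ...rest] = argv;
  if (head === undefined) throw new Error("empty invocation");

  // Split the remaining tokens into --flags and positional arguments.
  const flags: Flags = {};
  const positional: string[] = [];
  const queue = [...rest];
  while (queue.length > 0) {
    const token = queue.shift() as string;
    if (!token.startsWith("--")) {
      positional.push(token);
    } else if (VALUE_FLAGS.has(token)) {
      const value = queue.shift();
      if (value === undefined) throw new Error(`${token} needs a value`);
      flags[token.slice(2)] = value;
    } else {
      flags[token.slice(2)] = true;
    }
  }

  // No leading slash: a command that expresses an action.
  if (!head.startsWith("/")) {
    return { kind: "command", name: head, args: positional, flags };
  }

  // Leading slash: a resource path followed by one of the five verbs.
  const [verb, ...pairs] = positional;
  if (verb === undefined || !RESOURCE_VERBS.has(verb)) {
    throw new Error(`unknown resource verb: ${verb}`);
  }
  const params: Record<string, string> = {};
  for (const pair of pairs) {
    const eq = pair.indexOf("=");
    if (eq <= 0) throw new Error(`expected key=value, got: ${pair}`);
    params[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return { kind: "resource", path: head, verb, params, flags };
}

console.log(parseInvocation(["/issues", "list", "state=open", "--json", "--jq", ".[].title"]));
console.log(parseInvocation(["merge", "353", "--squash"]));
```

Because the first token alone decides whether a call is a resource or a command, the agent never has to guess which namespace a word like `issues` belongs to.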

How's Linear so fast? A technical breakdown

Sat, 23 May 2026 12:51:39 GMT

![How's Linear so fast? A technical breakdown](https://media.performance.dev/cdn-cgi/image/width=3572,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/yLYBhiiY_FsV.jpg) A few milliseconds is all it takes to update an issue in Linear. A traditional CRUD app doing the same thing takes about 300ms. How do they do it? There's no secret silver bullet to performance. The reality is that it's built from the ground up on the right foundation, then improved by countless decisions. My goal is to walk through some of the techniques that make Linear feel the way it does and help you implement the same. ## What I'll cover - Database in the browser - Making the first load feel instant - The sync engine - Designed for speed - Animations A quick disclaimer: I've never worked at [Linear](https://linear.app/) and have never seen their code. Everything I share comes from my personal experience, studying their app, reading their blog posts, or watching their conference talks. I simply love building web apps and have been using Linear since their beta launch. Also, the article’s hero image comes from a video by [Meg Wayne](https://x.com/megxwayne), whose work for Linear is phenomenal. --- ## Database in the browser Most web apps live inside the same loop. The user clicks. The browser fires an HTTP request. A server queries a database and sends it back. The browser repaints. The end result is a spinner, a skeleton, or a frozen UI for a few hundred milliseconds while the app waits on the network. Linear inverts the traditional relationship. The actual database the UI reads from is in the browser, in IndexedDB. Mutations apply locally first, then asynchronously push to the server, which broadcasts deltas back to other clients via WebSocket. In my opinion, this is the most critical piece to Linear's performance. When your goal is to build a fast web app the biggest bottleneck you will fight is the network. 
Any data sent between the client and server costs hundreds of milliseconds. The best approach is to eliminate the need for a network request entirely, which is exactly what Linear does.

I'll be repeating this a lot, but the secret to building incredible web apps is hiding all the network requests from the user. The more loading states you can avoid, the better. Here's an example of how simple Linear's requests are:

```typescript
// A traditional web app updating the server
async function updateIssue({ issue }) {
  showSpinner();
  const response = await fetch(`/api/issues/${issue.id}`, {
    method: "PATCH",
    body: JSON.stringify({ title: issue.title }),
  });
  const updated = await response.json();
  setIssue(updated);
  hideSpinner();
}

// vs Linear
issue.title = "Faster app launch";
issue.save();
```

The first line, `issue.title = "Faster app launch"`, updates an in-memory datastore (a MobX observable in Linear's case). The second line, `issue.save();`, queues a transaction that their sync engine batches and flushes to the server.

The key here is that the UI re-renders synchronously off the local, in-memory update. There are no spinners because there is nothing to wait for: the data is synced in the background. This is the magic of treating the browser as the database for each user.

[Tuomas](https://x.com/artman), one of Linear's co-founders, said this at a conference in 2024:

'Literally the first lines of code that I wrote was the sync engine, which is very uncommon to what you usually do when you're a startup.'

From day one, Linear knew the approach they wanted to take and the tradeoffs it would involve.

[Demo: Linear's issue creation, no spinners or delays]

I know most people won't build a custom sync engine like Linear just to make their app feel fast, and they don't need to. For most use cases, libraries like [TanStack Query](https://tanstack.com/query/latest) and [SWR](https://swr.vercel.app/) can get surprisingly close with optimistic updates.
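To make the optimistic pattern concrete without any library, here is a minimal sketch: apply the change locally right away, keep a rollback handy, and only undo it if the server rejects the request. `IssueStore`, `applyOptimistic`, `renameIssue`, and the `send` callback are hypothetical names for illustration; this is not Linear's sync engine, nor the SWR or TanStack Query API.

```typescript
interface Issue {
  id: string;
  title: string;
}

// Tiny in-memory store; subscribers stand in for UI re-renders.
class IssueStore {
  private issues = new Map<string, Issue>();
  private listeners = new Set<() => void>();

  subscribe(listener: () => void): void {
    this.listeners.add(listener);
  }
  get(id: string): Issue | undefined {
    return this.issues.get(id);
  }
  set(issue: Issue): void {
    this.issues.set(issue.id, issue);
    this.listeners.forEach((notify) => notify());
  }
}

// Apply the patch immediately and return a function that restores
// the previous value, to be called only if the server says no.
function applyOptimistic(
  store: IssueStore,
  id: string,
  patch: Partial<Omit<Issue, "id">>,
): () => void {
  const previous = store.get(id);
  if (previous === undefined) throw new Error(`unknown issue: ${id}`);
  store.set({ ...previous, ...patch });
  return () => store.set(previous);
}

// The UI has already updated by the time the request is sent.
async function renameIssue(
  store: IssueStore,
  id: string,
  title: string,
  send: (id: string, title: string) => Promise<void>, // hypothetical network call
): Promise<void> {
  const rollback = applyOptimistic(store, id, { title });
  try {
    await send(id, title);
  } catch {
    rollback();
  }
}
```

The store notifies subscribers synchronously inside `set`, so the interface reacts before the network request has even started.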
Most web apps feel slow because the UI waits for each network request to complete before updating state. In most cases the network request will succeed, so you should take advantage of that and optimistically update your state.

```typescript
import { mutate } from "swr";

// optimistic mutation with SWR
mutate(
  `/api/issues/${issue.id}`,
  { ...issue, title: "Faster app launch" },
  false
);

// vs Linear
issue.title = "Faster app launch";
issue.save();
```

The key idea is simple: UI responsiveness should not depend on network latency. Users perceive speed based on how quickly the interface reacts, not how quickly the server responds.

Optimistic updates are one of the highest-leverage improvements you can make:

- eliminate unnecessary spinners
- update state immediately
- validate in the background
- roll back only if needed

Linear's foundation is based on this exact principle, and it makes the app feel native and fast.

### A peek into Linear's stack

Linear is built on one of the simplest stacks you can find: React, TypeScript, MobX, Postgres, a CDN. There's no edge database, no React Server Components, and no fancy framework.
```text
Frontend
  React + react-dom                  (UI runtime)
  MobX                               (observable graph, granular re-renders)
  TypeScript                         (single language end-to-end)
  Rolldown-Vite + plugin-react-oxc   (mid-2025; previously Rollup; previously Parcel)
  ProseMirror + y-prosemirror        (rich text editor; Yjs CRDT for live collab)
  Radix UI primitives                (popovers, menus, focus traps)
  Emotion + StyleX                   (Emotion runtime + StyleX compiled to atomic CSS)
  Comlink                            (Worker RPC)
  idb                                (IndexedDB wrapper backing the local-first store)
  graphql-request                    (GraphQL transport to the sync server)
  Sentry                             (error monitoring)
  Inter Variable                     (single woff2, font-display: swap)

Backend
  Node.js + TypeScript               (single language for all server code)
  PostgreSQL on Cloud SQL            (issues table partitioned 300 ways)
  Memorystore Redis                  (event bus + cache + sync cursors)
  turbopuffer                        (similar-issue detection, vector db)
  Kubernetes on GCP                  (one workload per concern)
  Cloudflare Workers                 (multi-region edge proxy)

Other clients
  Desktop: Electron                  (same web JS, native chrome)
  Mobile: Swift (iOS) + Kotlin       (a separate full reimplementation)

Marketing
  Next.js (static)
  styled-components
  Inline SVG sprite
```

The biggest standout to me is their decision to stick with client-side rendering. CSR often gets criticized for slow initial loads, but with the right architecture and design it can feel instant.

I'm also a big fan of the simplicity it brings. Keeping the app entirely client-side creates a much cleaner mental model and removes a lot of the complexity that comes with server-rendered apps. You don't have to constantly think about whether you're on the server or the client, whether the `window` object is accessible, or whether you're setting the right cache headers. There's beauty in simplicity and the constraints you're forced into.

So how does Linear make their client-side-rendered app feel instant?

---

## Making the first load feel instant

One thing I obsess over is the first load, and Linear clearly does as well.
For productivity tools especially, the time it takes before you can actually start working is one of the most important details to consider. No one wants to wait multiple seconds for a new tab to load.

First, you have to understand what makes initial loads slow. For a client-side app, the browser has to request the `index.html`, which then requests all the JavaScript and CSS, which then runs some sort of authentication and finally makes some API requests to show the app.

### Linear's bundler arc: Parcel, Rollup, Vite, Rolldown

The first step to making an app feel instant happens long before runtime. It starts at build time. Remember, the network is the bottleneck, so shipping the least amount of JavaScript and CSS is critical to fast load times.

From what I can gather, Linear's build pipeline has gone through four bundlers: Parcel → Rollup → Vite → Rolldown. Each migration was driven by the same goal: reduce the amount of JavaScript and CSS and improve the developer experience.

From their own blog posts they claim:

- 50% less code shipped.
- 30% smaller after compression.
- Cold-cache page loads got 10 to 30% faster.
- Time-to-first-paint of the active-issues view dropped 59% (on Safari).
- Memory usage dropped 70 to 80%.

Most of that came from a combination of decisions: targeting only modern browsers, better dead-code elimination, and aggressive code splitting. Dropping legacy support is the big win (no polyfills, no ES5 transpilation, no nomodule fallback), but the dead-code and chunking work matters just as much.

Even with all of these optimizations, Linear still ships a substantial amount of code: roughly 21 MB of minified JavaScript. The difference is that it's aggressively code split into hundreds of route-level chunks that are fetched on demand.
```typescript
// vite.config.ts (reconstruction; matches observed chunk graph)
import { defineConfig } from "vite";
// Assumed import: the stack list names the oxc variant of this plugin.
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  build: {
    target: "esnext", // no legacy syntax, no polyfills
    cssMinify: "lightningcss",
    modulePreload: { polyfill: false },
    rollupOptions: {
      output: {
        // One chunk per npm package > ~3 KB. Cache invalidation
        // becomes per-library instead of per-app-revision.
        manualChunks(id) {
          if (id.includes("node_modules")) {
            const pkg = id.match(/node_modules\/([^/]+)/)?.[1];
            if (pkg) return `vendor-${pkg}`;
          }
        },
      },
    },
  },
});
```

The lesson isn't which bundler to pick but the importance of dropping legacy browsers, going native ESM, and code splitting like crazy. Each step is small. Stacked, they cut Linear's first-load JavaScript roughly in half and their build time by an order of magnitude.

So, the first secret to instant load times is reducing the amount of JavaScript and CSS needed to render something for the user.

### Preloading after initial load

**Once you've split your JavaScript into the smallest chunks possible you can start doing work in the background.**

But hold on, splitting the bundle into hundreds of chunks creates a new problem. Each chunk imports other chunks, and the browser doesn't know what those are until it parses the entry script. Without help, the load timeline becomes a waterfall: fetch the entry, parse it, fetch its imports, parse those, fetch their imports. Every level adds a network round-trip, which you want to avoid at all costs.

Linear lists the critical chunks up front as preloads, so before any JavaScript runs, the browser sees the entire list and fires off the requests in parallel. By the time the entry script reaches its first `import`, the chunks are already in cache.

Here's what it looks like in the `<head>` of their `index.html`:

```html
```

The `crossorigin` attribute on each preload matches the `crossorigin` on the entry script, so the browser reuses the cached fetch instead of treating preload and import as separate resources.
Same trick as the font preload, applied to every chunk on the critical path. The cold-load timeline collapses from a sequential waterfall into a single parallel batch. The network still does the work. It just does it all at once. The beauty of this technique is you're able to do all this work in the background when the user first hits the login page. In a few seconds the full app is stored in cache and served instantly. It's extremely important to understand how people will use your app. Once you have this understanding you can start using it to your advantage, such as preloading scripts in the background as Linear does. ### The service worker for even more speed and offline capabilities The rest of the Linear, the route-level chunks for views the user hasn't visited yet, gets cached in the background by a service worker. The worker has a precache manifest baked into its source, around 1,200 hashed assets covering route chunks, icons, and fonts, and pulls them down lazily after the first page load. Within a few seconds of hitting the login screen, the full app is sitting in cache. Preloading all the chunked javascript files to ensure instant loads from cache This buys two things. Subsequent navigations skip the network entirely; the service worker answers directly from its cache without even going through HTTP cache. And the app keeps working when the network doesn't. Combined with the local-first sync engine (which already has the user's data in IndexedDB), Linear is usable offline. You can read issues, create new ones, edit titles and descriptions, change statuses. Everything queues in the local transaction store and flushes the next time the connection comes back. Modulepreload is for what the app needs now, parallel-fetched so the browser never blocks on a serial import chain. The service worker is for what the app needs next. 
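The modulepreload half of this can be sketched as a small build step: walk the static import graph from the entry chunk and emit one `<link rel="modulepreload">` per reachable chunk. The `ChunkGraph` shape and the function names are hypothetical, chosen for illustration; Linear's actual build output is not shown in the article.

```typescript
// Hypothetical build-manifest shape: chunk file -> files it statically imports.
type ChunkGraph = Record<string, string[]>;

// Collect the entry chunk plus everything reachable through static imports.
function criticalChunks(graph: ChunkGraph, entry: string): string[] {
  const seen = new Set<string>();
  const stack = [entry];
  while (stack.length > 0) {
    const file = stack.pop() as string;
    if (seen.has(file)) continue;
    seen.add(file);
    stack.push(...(graph[file] ?? []));
  }
  return Array.from(seen);
}

// Emit one preload hint per critical chunk. The entry itself is skipped
// because its own <script type="module"> tag already requests it.
function preloadTags(graph: ChunkGraph, entry: string): string {
  return criticalChunks(graph, entry)
    .filter((file) => file !== entry)
    .map((file) => `<link rel="modulepreload" crossorigin href="/${file}">`)
    .join("\n");
}

const graph: ChunkGraph = {
  "entry.js": ["vendor-react.js", "store.js"],
  "store.js": ["vendor-mobx.js"],
  "vendor-react.js": [],
  "vendor-mobx.js": [],
  "settings-route.js": ["store.js"], // lazy route: not reachable from the entry
};
console.log(preloadTags(graph, "entry.js"));
```

Chunks that are only reachable through lazy routes stay out of the list, which is where a service worker precache can pick them up later.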
So, to get fast load times, Linear's steps are to eliminate as much code as possible, split it into small pieces, and precache it in the background. Again, the goal of all this work is to make network requests as fast as possible or, even better, eliminate them completely.

### Vendor bundle composition

I found it interesting that every package Linear uses gets its own chunk, cached independently. A traditional `vendor.js` invalidates the entire dependency graph on any bump. Linear's chunking turns vendor caching from a single massive file into fine-grained, per-package chunks. Bumping a single dependency invalidates one chunk; the rest stay cached. Seems like a no-brainer and yet another detail to ensure fast load times.

![](https://media.performance.dev/cdn-cgi/image/width=2400,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/b5RGjc_nI54O.jpg)

Each individual package split into its own js file

### Loading massive font files

Font loading is one of those details a lot of apps get wrong. The failure modes are visible: invisible text for half a second, layout shifts as the real font swaps in, double-fetched resources because the preload didn't match. Linear's setup avoids all three:

```html
```

```css
@font-face {
  font-family: "Inter Variable";
  font-weight: 100 900;
  font-display: swap;
  src: url(https://static.linear.app/fonts/InterVariable.woff2?v=4.1) format("woff2");
}
/* Italic and Berkeley Mono follow the same shape, single woff2 each. */
```

Variable fonts cover the full 100–900 weight axis in a single woff2, eliminating per-weight requests. `font-display: swap` renders the fallback stack immediately and swaps to Inter when it loads.

The trick that's easy to miss: `crossorigin="anonymous"` on the preload tag. Without it, the browser preloads the font, then fetches it again when CSS later references it, because the two requests have different CORS modes. `crossorigin` on the preload makes the browser reuse the cached one.
This all seems simple, but I'm always surprised at how many apps load fonts incorrectly. Linear is a great example of thinking through the details and ensuring font loading is as fast and accurate as possible.

### Inlined app shell

Another key technique to make the first load feel fast: inlined in `index.html` is just enough CSS to paint the loading state with no external stylesheet fetched. Remember, the network is the bottleneck and what you'll always be fighting to make your app feel fast. In this case, Linear eliminates a network request by inlining the critical CSS required to show the user an app shell.

```css
```

Beyond CSS there is also a bunch of inlined JavaScript that's critical to loading the initial experience.

```typescript
```

Before any bundle has parsed, the JavaScript from `index.html` reads `localStorage.splashScreenConfig`, merges any `sessionStorage` override on top, and applies the user's remembered shell tokens directly to `document.documentElement.style`: sidebar background, base color, border color, sidebar width, agent toolbar height. It detects color-scheme preference and Electron context. It checks whether `localStorage.ApplicationStore` exists, and if not, adds a logged-out class that switches the shell to the auth layout.

By the time the first JavaScript bundle arrives from the network, the loading screen is already correctly themed, sized, and positioned for whether the user is logged in. This gives the user the feeling that the app is ready to go as soon as they hit enter in the URL bar. There's no faster way to do this than sending down the app shell in the initial `index.html` response.

[Demo: an example of how fast Linear's initial load is]

### Render first, authenticate second

Authentication is another step where most apps give up their performance budget. The conventional flow: fetch the HTML, load the bundle, validate the session, fetch the user, fetch the workspace, then render. One to three seconds before the user sees anything.
Linear treats auth the same way it treats mutations. Assume the happy path and verify in the background. This is probably one of my favorite parts of their architecture because it allows them to almost immediately render the full experience on load.

Most CRUD apps keep the real session in an HttpOnly cookie, then add a second JS-readable cookie or `/me` request so the frontend can tell whether the user is logged in during startup. Linear does something simpler. Instead of maintaining a parallel auth signal, the inline boot script just checks whether `localStorage.ApplicationStore` exists:

```javascript
if (localStorage.getItem("ApplicationStore") === null) {
  document.documentElement.classList.add("logged-out");
}
```

If it's there, the user has used Linear in this browser before, which means their workspace is already sitting in IndexedDB. This goes back to the first section we covered where the database lives in the browser. If it's missing, there's nothing to render anyway, so the shell flips to its logged-out layout and the login flow takes over.

The initial flow for Linear isn't "do you have a valid session." It's "do we have anything to show you."

Their actual session token sits in a cookie. The bundle never tries to be smart about it. It just renders what it has and lets the next request (the WebSocket handshake, a sync delta, any HTTP call) be the thing that fails with a 401 if the session has gone stale. When that happens, the client redirects to login.

The whole pattern is consistent with the rest of the architecture: the client trusts what's local, the server is the source of truth for correctness, and the two reconcile asynchronously. Just like a mutation. Just like their sync engine.

Manually deleting the auth session and refreshing the desktop app

This is maybe one of my favorite details about Linear, and I wish more apps behaved this way. For authentication, assume the happy path, and fall back if not. If there's data to be shown: show it!
And leverage your browser's datastores to render immediately.

---

## The sync engine

Most of what makes Linear fast lives downstream of one decision: the server is a sync target, not a source of truth for the UI.

The internals of their sync engine have been thoroughly reverse-engineered already, and Tuomas has given multiple excellent talks on the architecture. I'm not going to retrace them. What I want to do is name the three pillars that actually produce the speed, because the speed is a property of how they fit together, not of any single one.

### 1\. The data is already there

When the app boots, it doesn't fetch the workspace from the server. It hydrates from IndexedDB into an in-memory MobX object pool, and every query from the UI goes to the pool first. There's no "loading issues" state because the issues are already on the user's machine.

Something I found interesting is that as they've scaled, they've chunked the data in the sync engine using the same fundamentals as their JavaScript bundles. Not everything is fetched at once: the two heaviest tables, Issue and Comment, lazy-hydrate on demand. This is data-level code splitting, and it's what lets the engine scale: startup cost tracks the workspace structure, not the workspace size. A 10,000-issue workspace boots about as fast as a 100-issue one.

Click into a project, the issues are there. Filter by assignee, the index is already built. There's nothing to fetch because there's nothing missing. It's either been loaded immediately from your browser or shortly after in a code-split lazy chunk.

![](https://media.performance.dev/cdn-cgi/image/width=2400,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/IdWPGINS86Ge.png)

IndexedDB: the database is in your browser

### 2\. Mutations don't wait for the network

When you change an issue's status, three things happen almost at once: the MobX observable updates so the UI reflects the change, the mutation is written to a durable transaction queue in IndexedDB, and it's queued for the server. The network hasn't been touched yet.

The user never waits to see their own change. The retry, the rollback, the durability across reloads all happen in the background. If the server rejects, the observable reverts and there's a brief flicker, but in practice that almost never happens because most invalid mutations are caught before the transaction is even created.

As I keep saying: the network is the enemy and you must do everything you can to avoid it. Linear's flow starts with the local mutation and treats the server as a confirmation step, not a permission step.

### 3\. One delta, one cell

When the server confirms a mutation (yours or someone else's), the change comes back as a small JSON envelope describing what moved. The client applies it by writing to the corresponding MobX observable.

Because every property on every model in Linear is its own observable, and every component that reads one is wrapped in `observer()`, MobX knows exactly which components depend on which fields. A change that updates one field of one issue re-renders exactly the components that read that field. Not the parent list, not the sidebar, one cell. A 50-issue update is 50 cell re-renders, not a list re-render.

This is what lets a busy workspace stay smooth when ten people are editing things at once: the cost of receiving updates scales with what changed, not with what's on screen.

I've built real-time apps streaming in stock data and fundamentals, and having atomic updates of individual components is key to making an app feel performant. You want to avoid cascading updates as much as possible and Linear does exactly that.
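To make the "one delta, one cell" idea concrete, here's a minimal sketch of per-field subscriptions. This is not MobX and not Linear's code: `ObservableRecord`, `subscribe`, and `applyDelta` are hypothetical names for a hand-rolled stand-in, and the explicit `subscribe` call replaces the automatic read tracking that `observer()` would give you.

```typescript
// Hand-rolled stand-in for per-property observables (hypothetical names, not MobX).
type Listener = () => void;
type Fields = Record<string, string | number>;

class ObservableRecord {
  private values: Fields;
  private listeners = new Map<string, Set<Listener>>();

  constructor(initial: Fields) {
    this.values = { ...initial };
  }

  // Read a single field.
  get(key: string): string | number | undefined {
    return this.values[key];
  }

  // Subscribe to exactly one field. Returns an unsubscribe function.
  subscribe(key: string, listener: Listener): () => void {
    const set = this.listeners.get(key) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(key, set);
    return () => {
      set.delete(listener);
    };
  }

  // Apply a sync delta: write only the fields that actually changed and
  // notify only the listeners of those fields.
  applyDelta(delta: Fields): void {
    for (const [key, next] of Object.entries(delta)) {
      if (Object.is(this.values[key], next)) continue;
      this.values[key] = next;
      this.listeners.get(key)?.forEach((notify) => notify());
    }
  }
}

// Two "cells" read different fields of the same issue.
const issue = new ObservableRecord({ title: "Fix login", status: "todo" });
const renders: string[] = [];
issue.subscribe("title", () => renders.push("title cell"));
issue.subscribe("status", () => renders.push("status cell"));

// A delta that touches only `status` notifies only the status cell.
issue.applyDelta({ status: "done" });
```

After the delta, `renders` contains a single entry for the status cell. The title cell never hears about the change, which is the property that keeps update cost proportional to what changed.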
Updating an issue in the list and single issue row re-renders

### Why the three fit together

Take any one away and the app starts to feel slow. A local database without optimistic writes still spins on save. Optimistic writes without granular observables still jank on every update. Granular observables without a local database still wait on initial load. Linear's speed isn't a property of any single layer. It's a property of the system.

The bundler and loader shell are what make the app feel fast on first paint. The sync engine is what keeps it feeling fast once you start using it.

---

## Designed for speed

Speed isn’t just an engineering problem. It’s a design problem too. A perfectly built sync engine still loses to a slow input model: if the fastest path to an action requires a mouse, three menus, and a click, the user pays for those steps regardless of how fast the underlying engine runs.

Another cornerstone of Linear's speed is how they've integrated the keyboard as a primary tool to navigate and complete your work. Every common action has a shortcut. The command palette is one keystroke away. The right-click menu is custom-built. None of these are accidents; they are thoughtful design decisions from day one.

### Every action has a shortcut

Single letters edit the focused issue. Two-letter combos navigate. Modifiers act globally.

Listening to the founders talk about Linear’s early days, it’s clear that shortcuts were foundational from the start. The sync engine was designed in part so that any action could be performed at any time. It feels like this combination of design and engineering continues to be behind every feature.

If you look through their UI you'll notice shortcuts visible everywhere. The most frequent ones are single characters, since they're used the most often. Furthermore, every action can also be done with a mouse so as not to alienate beginners.
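Here's a rough sketch of how those three shortcut shapes (single letters, two-letter combos, modifiers) can share one matcher. Everything in it is an assumption for illustration: the `ShortcutMatcher` class, the action names, and the specific key bindings are hypothetical and not taken from Linear's code. A real implementation would also add a timeout to clear a pending prefix.

```typescript
// Hypothetical key-sequence matcher; bindings and action names are illustrative only.
class ShortcutMatcher {
  private pending = "";
  private readonly bindings: Record<string, string>;

  constructor(bindings: Record<string, string>) {
    this.bindings = bindings;
  }

  // Feed one key press. Returns an action name when a binding completes,
  // or null while waiting for more keys (or when nothing matches).
  press(key: string): string | null {
    const sequence = this.pending === "" ? key : `${this.pending} ${key}`;
    const action = this.bindings[sequence];
    if (action !== undefined) {
      this.pending = "";
      return action;
    }
    // Keep waiting if some binding starts with what has been typed so far.
    const isPrefix = Object.keys(this.bindings).some((b) => b.startsWith(`${sequence} `));
    if (isPrefix) {
      this.pending = sequence;
      return null;
    }
    // The combo failed: drop the prefix and retry the key on its own.
    if (this.pending !== "") {
      this.pending = "";
      return this.press(key);
    }
    return null;
  }
}

const matcher = new ShortcutMatcher({
  s: "change-status", // single letter: edit the focused issue
  "g i": "go-to-inbox", // two-letter combo: navigate
  "mod+k": "open-command-palette", // modifier: global
});

const fired = ["g", "i", "s", "mod+k"].map((key) => matcher.press(key));
```

The design choice worth noting is that single keys and combos live in one table, so a failed combo can fall back to the single-key binding instead of swallowing the keystroke.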
![](https://media.performance.dev/cdn-cgi/image/width=1584,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/2Jd6ZXiwNCwp.jpg) ![](https://media.performance.dev/cdn-cgi/image/width=1584,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/2__J_PuPKayE.jpg) ![](https://media.performance.dev/cdn-cgi/image/width=1584,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/MuG3qxjnzrQq.jpg) ![](https://media.performance.dev/cdn-cgi/image/width=1584,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/KgHIdD85Bfr5.jpg) ![](https://media.performance.dev/cdn-cgi/image/width=1584,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/jvagQZ3l8V3y.jpg) ![](https://media.performance.dev/cdn-cgi/image/width=1584,quality=100,format=auto,fit=scale-down/posts/p_gAMR6Z7y49Fp/Y6mUzOgZJJHW.jpg)

### The command palette is always one keystroke away

`⌘ k` opens a command palette that lets users search over almost any action in Linear: issues, projects, labels, status changes, navigation, issue creation, settings, theme toggles. The command palette is incredibly fast because it's searching the local MobX object pool, not a server. Remember, avoid the network.

The architectural payoff is that the entire app is accessible from a single pane. Navigation is search. Issue creation is search. Status changes are search scoped to statuses. Moreover, the command palette is contextual and adapts to what you're working on, which is a great way to teach key actions and shortcuts for any view. One primitive, used everywhere, running on data that's already in memory.

A fast app needs both incredible engineering and design. You can build a perfect sync engine and a flawless rendering pipeline, and still ship something that feels slow if the design is wrong.

Engineering speed makes a single interaction fast. Design speed makes the path to each interaction short. For a tool used all day, the difference between a shortcut and a two-second mouse path compounds over every action.
Combine shortcuts with a global command palette and you've got yourself an app that's incredibly fast to use.

---

## Animations

All the work up to now can still be undone by bad animations.

Teams spend enormous effort making every part of their app fast. Initial load, updates, database queries, all of it. They shave off milliseconds so users never have to wait. Then, at the very last step, someone adds a 500ms height animation to an element.

### There are only a handful of properties you should animate

Browsers have three tiers of property changes, and the cost depends on how early in the rendering pipeline each one starts. Composited properties (`transform`, `opacity`) hand the work to the GPU and run independent of the main thread. Paint-triggering properties (`color`, `background-color`, `border-color`, `fill`) skip layout but still redraw pixels. Layout-triggering properties (`width`, `height`, `top`, `left`, `margin`, `padding`) force the browser to recompute the position of every subsequent element on the page. Never animate those. I mean never.

```css
/* What Linear does */
.row:hover {
  background-color: var(--color-bg-hover);
  transition: background-color 0.12s;
}
.icon-arrow {
  transform: translateX(0);
  transition: transform 0.15s;
}

/* What you'd write if you didn't know better */
.row:hover {
  margin-left: 2px; /* triggers layout for every row beneath */
  transition: all 0.2s; /* and now you're animating margin */
}
```

The `margin-left` version recomputes the layout of every row beneath the hovered one, on every frame, for the full 200ms of the transition. On a long issue list that's the difference between buttery and jank.

If you go over every single property Linear animates in their app, the list is limited to a handful: mostly the composited properties (`transform` and `opacity`) and sometimes properties like `background-color` and `border-color`.
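If you want to enforce the three tiers in review or in a lint step, a small lookup is enough. This is a sketch under assumptions: `tierOf` and `layoutTriggers` are hypothetical helpers, the table covers only the properties named above, and treating `margin-left` as `margin` by stripping the suffix is a shortcut rather than a full CSS longhand mapping.

```typescript
type Tier = "composite" | "paint" | "layout" | "unknown";

// Only the properties named in this article; a real table would be longer.
const TIERS: Record<string, Tier> = {
  transform: "composite",
  opacity: "composite",
  color: "paint",
  "background-color": "paint",
  "border-color": "paint",
  fill: "paint",
  width: "layout",
  height: "layout",
  top: "layout",
  left: "layout",
  margin: "layout",
  padding: "layout",
};

// Look up the exact property first, then its shorthand (margin-left -> margin).
function tierOf(property: string): Tier {
  const shorthand = property.replace(/-.*$/, "");
  return TIERS[property] ?? TIERS[shorthand] ?? "unknown";
}

// Returns the transitioned properties that would trigger layout.
// `all` is flagged too, because it animates whatever else changes.
function layoutTriggers(transitionProperties: string[]): string[] {
  return transitionProperties.filter((p) => p === "all" || tierOf(p) === "layout");
}

const safe = layoutTriggers(["background-color", "transform"]);
const risky = layoutTriggers(["transform", "margin-left", "all"]);
```

Here `safe` comes back empty, while `risky` flags both `margin-left` and `all`, mirroring the two halves of the CSS example above.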
### Know when to hold back

In my opinion, what's almost as important as only animating composited properties is knowing when not to animate at all.

It's easy to get carried away with animations. But in a tool used every day, the animations you'd love on a marketing site start to get in the way. Even a small hover delay, in the wrong place, becomes the thing the user notices. Linear nails most of this. The command palette is the one I'd argue is too slow, but I've become a cranky old man over the years.

There are no transitions on list items to keep things snappy

The reason a lot of their animations work is that they reference their origin. The status popover scales out of the status pill. The agent panel slides in from its toggle. The motion is doing spatial work, telling the user where the new element came from, rather than fading in from nowhere as decoration.

### Keep durations short and snappy

```css
/* variables from Linear's stylesheet */
--speed-highlightFadeIn: 0s;
--speed-highlightFadeOut: .15s;
--speed-quickTransition: .1s;
--speed-regularTransition: .25s;
--speed-slowTransition: .35s;
```

Most design systems default longer than they should. Material's standard duration is 200ms, iOS's spring closer to 350ms. Defaulting to shorter transitions is one of the easiest ways to make an app feel faster, and Linear's defaults sit well below the industry norm.

Linear takes this one step further with asymmetric timing on enter and exit. Hover highlights, popovers, and the agent panel appear instantly when you summon them, then fade out over 150ms when you dismiss them.

The agent window appears instantly but fades out similar to macOS

---

## How Linear is so fast

There are so many more details I could cover that make Linear feel fast. The reality is there's no single thing that makes an app performant. It's the culmination of hundreds of decisions made correctly.

What I love about Linear's approach is how simple most of it is.
No Next, no TanStack, no fancy framework. They decided early on what architecture would serve their users best and have stayed true to it. The result is a client-side rendered app that's faster than server-rendered ones (and without the complexity)!

The shape of it is roughly this. The server is a sync target rather than a source of truth. The database lives in the browser. Mutations apply locally first and reconcile in the background. The first load ships less code in more pieces, with a service worker precaching the rest while the user is still on the login page. Auth is assumed based on local state and verified later. The sync engine hydrates from IndexedDB into per-property MobX observables, so a 50-issue update is 50 cell re-renders rather than a list re-render. The input model is keyboard-first. Every common action has a shortcut, backed by a global command palette. Animations stay on the GPU, durations sit below the 100ms cause-and-effect threshold, and layout-triggering properties are never animated.

The hard part isn't the implementation. It's the dedication to the craft over years, as the codebase matures, expands, and pushes up against new constraints.

If you haven't, I'd recommend checking out [Linear](https://linear.app/) to see it all in action.

---

Hope you learned a thing or two! It was fun writing this and diving into the details that make Linear what it is. I just love building the best web apps in the world and seeing how other people do it. If you have any feedback, suggestions, or want to connect you can [find me on X.](https://x.com/brotzky)

How to use evals and prompt engineering to ship a local model that matches frontier performance

Sat, 23 May 2026 06:51:21 GMT

![Image](https://pbs.twimg.com/media/HIxdjAmXYAEBUIp?format=jpg&name=large)

Most production AI features don't need a frontier model. Here's how I used capability evals and prompt engineering to ship a local 3B model that matches Claude Sonnet on quality, runs twice as fast, and costs nothing per call.

I’ve been building Mima, a social and news app that uses AI to summarize conversations, detect toxicity, and add other touches that make navigating the connected web smoother. Of course, I built it using my favorite Large Language Model (LLM), Claude. But now two things were blocking the beta:

- **Keeping the user’s Personally Identifiable Information (PII) on their device** and off third-party servers. This is a skunkworks app, not a funded business with money to throw at GDPR compliance!
- **Keeping costs low.** Every call to an Anthropic server is money I could be spending on other things, like a designer or Amazon gift cards for product testers.

In London's startup scene, I've watched many AI-heavy products eat their founders out of house and home on inference costs alone. And [Gartner expects total inference spend to keep rising](https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025) even as per-token prices fall, because agentic workloads consume tokens faster than prices drop. Anthropic itself [introduced new rate limits in 2025](https://www.anthropic.com/news/updating-rate-limits) after acknowledging that Claude Code usage was growing faster than expected.

Today's prices are subsidized by VC, not unit economics, and when the subsidy ends, every cloud LLM call in your stack becomes a cost center you can’t control. So I went looking for a way to do most of this work locally.
Most production AI features do one narrow task (classify, summarize, extract, translate), and that's a fraction of what an LLM is capable of. You're paying for the rest in latency, tokens, and dependency on a service you don't control. But small language models (SLMs) sit between 2 and 16 GB on disk, run on the user's device, don't go down when Wi-Fi does, and cost nothing per call.

Foundation models are still best for long-context reasoning or open-ended creative work. But for summarization, extraction, classification, and most of the actual production AI surface, today's SLMs are more than enough.

Which raises the question: if SLMs are this capable, why isn't every product using them? Because picking the right one and proving it's the right one has been a skill reserved for ML engineers until recently: evals. Evals are a skill every AI engineer worth their salt needs to learn, and this is how to do it.

# Just enough inference with evals

No matter their size, [different models are better and worse at different tasks, as we can see from any benchmark comparison](https://huggingface.co/spaces/lmarena-ai/arena-leaderboard). There’s no perfect model, only models of varying capability for your specific task. But most of us look to benchmarks or ask our friends, “What’s the best new model?” What we really should be asking is “which model is good enough to accomplish my task quickly, accurately, and cheaply?” We need to measure their respective capabilities so we can make an informed trade-off, such as opting for a slower model that delivers more accurate results, or vice versa.

To measure a model’s capabilities, you’ll need evals. Evals are to models what tests are to code. Well, not quite. With code, we’re testing for specific outcomes. 2 + 2 = 4, always. With evals, we’re testing acceptable outcomes. The eval for “What’s the capital of France?” would accept “Paris,” “The capital of France is Paris,” “It’s Paris!” and possibly even geographic coordinates!
This makes evals more appropriate for non-deterministic code. You're asking, “Across a representative set of inputs, does this model produce outputs that meet our bar often enough to ship?”

# Finding a SAGE (Small And Good Enough) model

In the [“prototype big, ship small” framework](https://web.dev/articles/ai-model-selection), you prototype any AI feature or product with a SOTA (state of the art) model to make sure what you’re trying to do is physically possible. It will also give you the results with the least effort. In four steps, you’ll be able to select the smallest model capable of performing within the larger model’s range of expected outcomes:

1. **Prove it’s possible.** Use the best model you can to prototype the outcome you are looking for (like Gemini for translating French comic scans since it’s multimodal).
2. **Set success criteria.** Collect a set of inputs and ideal outputs (the comic scripts in French and their correct translations in English, for instance).
3. **Test from small to large.** Compare the outputs of smaller models against your test criteria. Work your way up from the smallest model until you get “close enough” to your baseline LLM. (What counts as “close enough” depends on your use case.)
4. **Select the smallest model** that gives acceptable responses for your use case. This is your SAGE model: Small and Good Enough.

Each step matters, and skipping any of them is how you end up with a model that “kind of works” or falls apart in an edge case you didn’t consider.

## Step 1: Proving the feature with Claude

I had already built two conversation summarization features calling Claude Sonnet, and I was satisfied with the results. These were my baseline, the measuring stick all other models had to measure up to. Sonnet’s summarization was impeccable, but the cost was high: 28 summaries ≈ $0.44 USD. Manageable for testing, but untenable for scaling. This performance formed the baseline for my golden dataset.
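To see why that price stops being manageable, here's the back-of-the-envelope arithmetic as code. Only the 28 summaries ≈ $0.44 figure comes from my testing; the user count and summaries-per-day values are made-up assumptions, and the projection assumes cost scales linearly with volume.

```typescript
// Measured during prototyping: 28 summaries cost about $0.44.
const BASELINE_COST_USD = 0.44;
const BASELINE_SUMMARIES = 28;
const costPerSummaryUsd = BASELINE_COST_USD / BASELINE_SUMMARIES; // roughly 1.6 cents

// Linear projection; users and summariesPerUserPerDay are hypothetical inputs.
function monthlyCostUsd(users: number, summariesPerUserPerDay: number, days = 30): number {
  return users * summariesPerUserPerDay * days * costPerSummaryUsd;
}

// Hypothetical scenario: 1,000 users, 10 summaries each per day.
const projected = monthlyCostUsd(1000, 10);
```

Under those assumed numbers, `projected` works out to roughly $4,700 a month for one feature, which is the gap between "fine for testing" and "untenable for scaling."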
## Step 2: Building the rubric and creating the golden dataset

A “golden dataset” is a set of ideal outcomes to measure your model’s generated outputs against. Without one, you don’t have a measuring stick to compare different outputs against. You’ll just be going on vibes, which don’t seem problematic when you’re prototyping, but become troublesome when you can’t hand-test every affected surface later in the product cycle, after upgrading a model or changing a prompt.

I curated my golden dataset from 14 real, public conversations and their Sonnet-generated summaries. Each input (a conversation thread) is paired with two output summaries: one for a list view and another for recapping long chats in a thread.

I chose [Arize Phoenix](https://phoenix.arize.com/) for my eval harness. It’s open-source, local-first, and OpenAI-compatible. It’s maintained by the core engineers at Arize, who I just so happen to work with as well!

To kick things off, I made a baseline trace recording the metrics listed below using Claude and the golden dataset. A **trace** is a log of everything that happened during one model call: the input prompt, the output, intermediate steps (if the model used tools or made sub-calls), timing, token counts, and any errors. It's a complete record of one execution that you can replay, inspect, and reason about after the fact.

I chose the following metrics to weigh:

- JSON validity (code): Does the output parse?
- Reference structural validity (code): Do citations point to real messages?
- Factual consistency (LLM-as-judge): Does the summary stay faithful to the thread?
- Length compliance (code): Does it stay in the target word range?
- p50 latency (code): typical case
- p95 latency (code): worst case

To decide whether an output is good or not, you’ll need an evaluator. There are three kinds of evaluators:

- **Human:** the oldest kind. Humans have been evaluating code outputs since the beginning of AI research! (Also, the most expensive evaluator.)
- **Code-based:** Deterministic, fast, free, reproducible. You use these in unit testing all the time. Was the output formatted correctly? Was it the right type? Did `foo === foo`? The cheapest evaluator.
- **LLM-as-judge:** Good for subjective qualities a regex can't capture (tone policing, faithfulness). You give a (usually larger) model the input, the output, and a rubric, and ask it to score. LLM-as-judge is slower and more expensive, so look for ways to measure “good enough” with code.

Notice that most of these metrics can be validated with code alone. But for equivalence, I needed an LLM-as-judge to compare outputs to the baseline traces.

To find the best model for the job, you’ll need to collect traces from experiments with other models and the golden dataset.

## Step 3: Testing all the models

![Image](https://pbs.twimg.com/media/HIxSRIQWIAAl_bJ?format=png&name=large)

My first instinct was to ask the ML engineers I respect and admire if there were any smaller models they thought might be a good starting place. Almost all recommended Gemma 4, a more than capable small model that’s been getting a lot of praise. And if I didn’t have evals, I might have chosen Gemma 4 and saddled my users with a less-than-ideal experience. This is why it’s important to run experiments on a range of models.

I chose Gemma 4 E4B-it with 4-bit quantization, weighing in at a hefty 5 GB on disk. This was the upper end of what I could expect a user to voluntarily download on a desktop.

To round out the scale from smallest to largest and add vendor diversity, I chose the following models to compete:

- **Qwen 2.5 1.5B** was already shipping in the app as a backup when Anthropic was offline.
- **Qwen 3 1.7B** is in the same family, same footprint, no architecture change, but an upgrade over the incumbent.
- **Llama 3.2 3B** is the most battle-tested model in [node-llama-cpp](https://node-llama-cpp.withcat.ai/), so it tells you what "fully baked, definitely works" looks like at this size class.

![Image](https://pbs.twimg.com/media/HIxSb-oXwAAqAqM?format=jpg&name=large)

In Phoenix, I set up each model as an experiment to test its capabilities. This is called a “capability eval,” and you usually run these at the start of a project or when you’re otherwise determining which prompt or model to use for a feature. I ran the evals three times for every input and model combination to help iron out any outliers, so each model collected 84 evals (3 × 28 summaries). Each experiment used the same golden dataset and the same evaluators. The only variable was the model.

![Image](https://pbs.twimg.com/media/HIxSrvaXIAA7Lfj?format=jpg&name=large)

## Step 4: Choose the SAGE (Small and Good Enough)

One of the challenges with measuring models is that there are rarely clear winners. Often, you end up trading accuracy against latency.

This chart is called a Pareto scatter. Each dot is a model, plotted on two axes: accuracy and latency. The Pareto frontier is the curve traced by the models that no other model beats on both axes at once: for each model on the frontier, nothing else is both faster and more accurate. Anything below the frontier is irrelevant because there's a better option available. Anything on the frontier represents a real tradeoff. There's no "best" model on the frontier without first specifying what you're willing to trade, which is exactly what setting success criteria in Step 2 forces you to do.

![Image](https://pbs.twimg.com/media/HIxS9DNWsAAShrA?format=jpg&name=large)

Looking at this chart, only Sonnet, Llama 3.2, and Gemma 4 are worth comparing. The two Qwens were soundly surpassed. Even though Qwen 2.5 was the fastest at p50 (the median or 50th percentile), it hallucinated references to nonexistent messages 27% of the time, vs. Llama's 11%. Speed was important, but a fast feature that doesn’t work correctly is just a fast bug.
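If you'd rather compute the frontier than eyeball it, the check is short. This is a generic sketch, not something Phoenix provides: `ModelResult`, `dominates`, and `paretoFrontier` are names I made up, and the four rows are placeholder numbers, not my eval results.

```typescript
interface ModelResult {
  name: string;
  accuracy: number; // higher is better
  p50LatencyMs: number; // lower is better
}

// `a` dominates `b` when it is at least as good on both axes
// and strictly better on at least one.
function dominates(a: ModelResult, b: ModelResult): boolean {
  const noWorse = a.accuracy >= b.accuracy && a.p50LatencyMs <= b.p50LatencyMs;
  const strictlyBetter = a.accuracy > b.accuracy || a.p50LatencyMs < b.p50LatencyMs;
  return noWorse && strictlyBetter;
}

// The frontier is every model that no other model dominates.
function paretoFrontier(results: ModelResult[]): ModelResult[] {
  return results.filter((candidate) => !results.some((other) => dominates(other, candidate)));
}

// Placeholder data for illustration only.
const placeholderResults: ModelResult[] = [
  { name: "A", accuracy: 0.95, p50LatencyMs: 7000 },
  { name: "B", accuracy: 0.89, p50LatencyMs: 1300 },
  { name: "C", accuracy: 0.73, p50LatencyMs: 900 },
  { name: "D", accuracy: 0.7, p50LatencyMs: 1500 },
];

const frontierNames = paretoFrontier(placeholderResults).map((r) => r.name);
```

With the placeholder rows, A, B, and C are each a real tradeoff, while D drops out because B is both faster and more accurate. The frontier still doesn't pick a winner for you; your success criteria do that.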
One way to mitigate Qwen 2.5's hallucinated references would be to run the inference several times and pick the accurate output, but that would eliminate the speed advantage, since the extra runs and the comparison add latency.

Gemma 4 was the quality outlier (95% reference accuracy), but it was disqualified due to latency at 7+ seconds. It was slower than Sonnet by multiple seconds, a difference users are very sensitive to. That left Llama 3.2 3B as the best “good enough” alternative to Claude Sonnet 4.6.

**Without evals, comparing these models would have been impossible. I would likely have chosen Gemma 4 because of its popularity and reputation.** The lesson learned: Don’t trust. Evaluate.

# Close the gap between SLMs and LLMs with prompt engineering

Llama 3.2 was almost my SAGE model, but that 11% hallucination rate had to be snuffed out. This is where prompt engineering comes in.

Remember when everyone thought we were going to be prompt engineers? Well, prompt engineering, like evals, is one of a set of skills you need to wrangle models. If [fine-tuning really is dead](https://www.youtube.com/watch?v=h1c_jmk97Ss), as per Anthropic’s Emmanuel Ameisen, prompt engineering has taken its place.

Fine-tuning changes what the model knows by updating the model’s weights through retraining, creating a more specialized model. Prompt engineering changes what the model does with what it knows by changing only the inputs (data, prompts) you give the model.

The techniques that work also depend on the model class. Reasoning models like GPT-5 and Claude with extended thinking now handle chain-of-thought work internally, largely retiring many of the in-context-learning tricks people used in 2022-2024. But on a 3B local model, those tricks still have an impact. The model needs help structuring the output that a reasoning model gives itself.
# Revisit “what is good enough”

At this point, you’ve narrowed your competition to two models, and you should have a sense of which metrics are deal breakers and which are nice-to-haves. For me, I learned that smaller models consistently failed to conform to word counts, so I accepted that I’d have to use truncation on the UI side for some outputs. You should also have an idea of what the bar is for metrics you’re still tracking:

![Table of four eval bars for shipping the model. JSON and reference structural validity: ≥99%, because unparseable outputs introduce bugs. Factual consistency: ≥95%, leaving 5% slack for genuine ambiguity rather than hallucination. p50 latency: ≤1500ms, the threshold for feeling instant on an M-series Mac. p95 latency: ≤3500ms, keeping the worst case under four seconds.](https://pbs.twimg.com/media/HIxU8VVWcAAT_Sl?format=jpg&name=large)

Success metrics for what is "good enough"

## One variable per variant

Rather than generating a bunch of different prompts and hoping for the best, come up with some theories about what might drive the outputs in the right direction. I needed to reduce the references to messages that didn’t exist. I could do this by reformatting the input or showing the model “how it’s done” with examples. I could tell it what not to do. I could make it think long and hard before responding.

Then I created four variants plus a control to run as experiments with Phoenix:

![Table of five prompt variants tested, each pulling a different lever. Baseline is the control, with minimal instruction, establishing the floor. Reformatted input changes format: same instructions, but the thread is rendered as natural-language numbered messages instead of a JSON array. Few-shot changes demonstration: same instructions, plus three worked input/output examples in the prompt.
Explicit rules change constraint: same instructions, plus literal prohibitions like "no preamble," "count words before responding," "never invent messages." Chain of thought changes process: same instructions, restructured so the model identifies key moments before writing the summary.](https://pbs.twimg.com/media/HIxWJAdWsAAg9h5?format=jpg&name=large)

This isolation allowed me to measure how each prompt impacted each “definition of good.” Phoenix’s compare view lets you compare the same dataset, same evaluators, with the prompt as the variable.

![Table of how each prompt variant performed against the baseline. The baseline (control) scored 77.4% length, 91.2% reference accuracy, 87.1% factual consistency, and 1055ms latency. Reformatted input barely moved quality (+1.2 length, −1.1 ref, +0.6 factual) and added 606ms of latency, which is worse. Few-shot improved every quality metric meaningfully: +10.0 length, +8.3 ref, +5.8 factual, with a +1673ms cost at p95. Explicit rules regressed across the board: −4.8 length, −6.6 ref, −3.4 factual, latency flat. Chain of thought improved length by 5.9 but regressed reference accuracy by 5.3 and factual consistency by 1.9, while adding 638ms of latency. Few-shot was the only variant that moved every quality metric in the right direction.](https://pbs.twimg.com/media/HIxX5CZXIAEx_eg?format=jpg&name=large)

All but one of the prompts were noise or actively harmful. If you were going on pure vibes, you might try to “improve” your prompt by explicitly telling the model what not to do without realizing how much it was degrading the outputs.

**Few-shot was the standout,** with quality improving across every metric. Llama 3.2 3B might not be good at following instructions, but it’s pretty good at imitating examples. The new prompt got me closer, but there was still work to do to meet the bar.
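For a sense of what the few-shot lever looks like in code, here's a minimal sketch of a prompt builder. The shape is an assumption, not my production prompt: `Message`, `Example`, and `buildFewShotPrompt` are hypothetical names, and the `Thread:` / `Summary:` labels are just one plausible layout. The thread stays a JSON array so that the examples are the only thing that changes relative to the control.

```typescript
interface Message {
  author: string;
  text: string;
}

interface Example {
  thread: Message[];
  summary: string;
}

// Same instructions as the control, plus worked input/output examples.
// The thread is still serialized as JSON, so only one variable changes.
function buildFewShotPrompt(instructions: string, examples: Example[], thread: Message[]): string {
  const shots = examples.map(
    (example, i) =>
      `Example ${i + 1}\nThread:\n${JSON.stringify(example.thread)}\nSummary:\n${example.summary}`,
  );
  const task = `Thread:\n${JSON.stringify(thread)}\nSummary:`;
  return [instructions, ...shots, task].join("\n\n");
}

// Hypothetical inputs for illustration.
const prompt = buildFewShotPrompt(
  "Summarize the thread in one sentence and cite messages as [ref:N].",
  [
    {
      thread: [
        { author: "ana", text: "Ship it?" },
        { author: "bo", text: "Yes, after QA." },
      ],
      summary: "The team agreed to ship after QA. [ref:2]",
    },
  ],
  [{ author: "cy", text: "Is the build green?" }],
);
```

Keeping the builder this dumb is deliberate: if the instructions or the input format also changed between variants, you couldn't attribute a metric shift to the examples alone.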
![Image](https://pbs.twimg.com/media/HIxYp-yXQAAONyA?format=jpg&name=large)

## Code is cheaper than inference

Claude Sonnet was capable of meeting my bar for everything but latency. **Llama 3.2 3B was 16-25% faster**, likely because of the time saved roundtripping to a remote server. However, even with the few-shot prompt, it still fell short on structural validity and length compliance. Since code is cheaper than inference, I looked for deterministic solutions to these problems.

- I used CSS truncation to lop off any stray words at the end of a summary. No one will miss them in the context they’re in.
- The few-shot approach did bloat input tokens, putting the p95 latency over budget, but I was able to claw that back using a KV cache.
- I added a post-hoc validator to strip any \[ref:N\] outside the valid message range.

It’s important to check a sampling of traces yourself. I dismissed the gap between Llama’s 92.9% and Claude’s near 100% because human review confirmed that it came from an overly strict judge, not actual hallucination. The SLM phrased things differently, but not factually incorrectly.

In this way, I was able to get the model to a place where it performed as well as or better than Claude Sonnet across the board, shaving almost 2 seconds off the p50 latency and saving myself a monthly bill:

![Two-column comparison of Claude Sonnet (cloud, left) against Llama 3.2 3B with the V3 few-shot prompt plus post-hoc safety nets (local, right), the shipped configuration. JSON validity: both 100%. Reference structural validity: both 100%; Llama achieves this via a post-hoc validator that strips any [ref:N] tokens outside the valid message range. Factual consistency: Llama 92.9%; Claude has no score because it's the LLM-as-judge and can't fairly score itself. Length compliance: both 100%; Llama achieves this via post-hoc word-count truncation that enforces the length spec deterministically. p50 latency: Claude 3046ms, Llama 1296ms; Llama is more than twice as fast. p95 latency: Claude 4750ms, Llama under 3500ms, achieved with KV cache reuse on the few-shot prefix; V3 alone measured 3998ms. The shipped local config matches or beats Claude on every metric, with code closing the gaps the model couldn't.](https://pbs.twimg.com/media/HIxZrsqXIAA1vIa?format=jpg&name=large)

**The eval tells you where a model is capable. Use engineering to close the gap on what the model can’t do.**

# Life after capability evals

Now that the system was working, the next steps involved setting up mechanisms to get the model onto the user’s device, building features with progressive enhancement in mind (what happens while the model is MIA?), and setting up regression evals. Regression evals alert you when a new user input, a prompt edit, or a model change affects the model’s output. You can add them to your CI/CD to catch these shifts before they reach your customers. Capability evals are often run once, but regression evals live with your testing suites forever. (Let me know if you’d like to hear about that side of the story, too.)

# It’s ~~dangerous~~ expensive out there. Take this with you.

Every time you call a SOTA model in your stack, you should ask: does this really need a frontier model, or is it a vestige of Prototyping Big? Have you been using LLMs as placeholders for smaller models in your codebase? Can you tighten and streamline your inference?

I challenge you to audit one feature in your app this week. Could it run on a local model instead of a more expensive frontier model? [Set up Arize Phoenix](https://arize.com/docs/phoenix#quick-starts), then run some of your own prompts and models against lighter ones using [llama.cpp](https://llama-cpp.com/). The results might surprise you.
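The deterministic safety nets from "Code is cheaper than inference" are small enough to sketch. The version below is an illustration under assumptions, not the shipped code: it assumes references look exactly like `[ref:N]`, that messages are numbered from 1, and that the length spec is a plain word count. The function names are hypothetical.

```python
import re

# A reference token such as "[ref:3]", plus any whitespace directly before it,
# so that dropping the token does not leave a dangling space behind.
REF_PATTERN = re.compile(r"\s*\[ref:(\d+)\]")


def strip_invalid_refs(summary, message_count):
    """Remove [ref:N] tokens whose N is outside 1..message_count (assumed 1-indexed)."""
    def keep_or_drop(match):
        ref_number = int(match.group(1))
        return match.group(0) if 1 <= ref_number <= message_count else ""
    return REF_PATTERN.sub(keep_or_drop, summary)


def truncate_words(summary, max_words):
    """Enforce a word-count limit deterministically by cutting extra words."""
    words = summary.split()
    if len(words) <= max_words:
        return summary
    return " ".join(words[:max_words])


raw = "Alice proposed Friday [ref:2] and Bob agreed [ref:9]."
cleaned = strip_invalid_refs(raw, message_count=3)
```

With a three-message thread, `[ref:2]` survives and `[ref:9]` is removed. Running a validator like this after generation is what lets a structural-validity score reach 100% without another model call.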
![Image](https://pbs.twimg.com/media/HIxdnGTXwAAbtAZ?format=png&name=large)

Special thanks to the following humans for reviewing this article: [@evalstate](https://x.com/@evalstate), [@seldo](https://x.com/@seldo), Christoph Henkelmann, [@andreban](https://x.com/@andreban), [@\_nancychauhan](https://x.com/@_nancychauhan), and [@Cephalization](https://x.com/@Cephalization).

MMD Rendering Recipe: Unity Implementation Feasibility Analysis

Fri, 22 May 2026 04:39:13 GMT

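> Original: https://www.bilibili.com/opus/1178669996791824392
> The author shares a complete MMD rendering pipeline for producing high-quality stills and videos.

---

## 1. MME Effects Used and Technical Analysis

### 🎨 Color and Materials

| MME name | Function | Unity implementation |
|----------|----------|----------------------|
| **nbHSV** | Hue/saturation/value adjustment, controlled per material (hair/face/body) | ✅ Easy. Write an HSV adjustment node in Shader Graph, or use Post Processing Color Grading plus material properties to separate the materials |
| **lv2shader** | Main rendering shader: fine shadows, soft shadows (shadow softness is set on line 13 of the fx), and separate shadow handling for a split-off lip material | ✅ In Unity this is a Toon Shader. Build custom toon shading in Shader Graph; half-lambert + a ramp texture is enough, with shadow softness controlled by the ramp |
| **SPA texture** | Specular highlight texture, applied only to the front hair | ✅ Use a Specular/Smoothness map directly in a standard PBR/Toon shader, or a custom highlight mask |
| **発光コンタクト** | Brightening effect for metal | ✅ Emission map + metallic/smoothness control |
| **m4layer overlay** | Blends the results of several shaders (e.g. ps + dg) for richer texture | ✅ Stack multiple Cameras + RenderTextures, or blend multiple passes in the shader |

### ✨ Post-Processing Effects

| MME name | Function | Unity implementation |
|----------|----------|----------------------|
| **motionblur3L** | Motion blur | ✅ Built-in Motion Blur in URP/HDRP, or the Post Processing Stack |
| **lightbloom SJ** | Bloom with mask support (everything masked except the eye highlights) | ✅ Built-in Bloom, with a Custom Pass / Render Feature for the bloom mask |
| **g ALB** | Self-illumination (eye highlights); can produce a "lights off" effect | ✅ Emission material property; drive the intensity with MaterialPropertyBlock to switch the light on and off |
| **SSAA 4xL2** | Supersampling anti-aliasing | ✅ MSAA, or SSAA (Render Scale > 1.0) |
| **sdFxaa** | FXAA anti-aliasing | ✅ Built-in FXAA |
| **sdUnsharpMask** | Unsharp-mask (USM) sharpening | ✅ Not built into Post Processing, but a custom shader or a community solution works |
| **sdSSGI** | Screen-space global illumination with adjustable hue/saturation/threshold and mask support | ✅ The most complex item. Options: HDRP's built-in Screen Space Global Illumination, an SSGI plugin, or a hand-written screen-space GI shader. In URP it can be approximated with a Renderer Feature + Compute Shader |
| **PostRimLightToon** | Rim light with adjustable thickness, replacing the hair highlight | ✅ Add a Fresnel/Rim Light node to the Toon Shader; very standard |
| **PostDropHair** | Shadow cast by the hair onto the forehead, more precise than lv2 | ✅ Use a Projector or the Decal system, or render the shadow in a second pass. A Custom Render Feature also works in URP |
| **ikboken** | Depth of field (DOF), keyframeable with camera distance | ✅ Built-in Depth of Field (Bokeh mode) |
| **msGradation1/2** | Gradient overlays (added contrast + gradient light from above), imitating anime compositing lighting | ✅ Overlay a gradient image on a UI layer, or add a custom Color Lookup / Gradient Overlay in Post Processing |
| **Cheaplens** | Vignette | ✅ Built-in Vignette |
| **PostAlphaMask** | Eyes visible through the hair (alpha sorting/masking) | ✅ Control Render Queue + Alpha Clip / ZWrite in the shader, or use the Stencil Buffer for precise control |
| **ikEdgeBlur** | Blurs and darkens edges for added texture | ✅ Sobel edge detection + blur + darkening in a custom post-processing shader |
| **SwapScreen/SplitView** | Before/after render comparison | ✅ Editor tooling; use an A/B comparison or a Custom Render Feature in Unity |

### ✏️ Outlines

| MME name | Function | Unity implementation |
|----------|----------|----------------------|
| **krtoon2** (three effects) | Outlines with per-material thickness; noedge on facial features, halfedge on the back hair | ✅ Inverted Hull (the classic method), Sobel post-process outlines, or Unity Toon Shader (UTS) |
| **Dippen Natural + controller** | A second outline system with per-material strength (facial features 0.0, front hair 0.25) | ✅ Same as above. In Unity, give each material its own outline parameters |

### 💡 Lighting

| Light | Function | Unity implementation |
|-------|----------|----------------------|
| **PostPointLight ×4** | Point lights of different colors and angles bound to the head: white brightening, purple rim light, warm gradient light, and hair-tip brightening. Each light has its own hair/facial-feature/skin masks | ✅ Built-in Point Lights + custom Light Layers/Culling Mask to control exactly which objects each light hits. Masks via Light Cookies or Light Layers (URP/HDRP) |
| Bound to the head | The lights follow the head bone | ✅ Parent the Light to the head, or follow it from a script |

---

## 2. Core Technology Stack Summary

```
📦 Rendering pipeline layers:
├── Material layer (Material)
│   ├── Main toon shading (lv2shader / Toon Shader)
│   ├── HSV color adjustment (nbHSV)
│   ├── Highlight texture (SPA)
│   ├── Self-illumination (g ALB)
│   ├── Outlines (krtoon2 + Dippen)
│   ├── Rim light (PostRimLightToon)
│   ├── Hair shadow (PostDropHair)
│   └── Alpha see-through (PostAlphaMask)
├── Lighting layer (Lighting)
│   ├── Multiple point lights (PostPointLight ×4+)
│   └── Screen-space GI (sdSSGI)
├── Post-processing layer (Post Processing)
│   ├── Bloom (lightbloom SJ)
│   ├── Motion blur (motionblur3L)
│   ├── Depth of field (ikboken)
│   ├── Anti-aliasing (SSAA + FXAA)
│   ├── Sharpening (sdUnsharpMask)
│   ├── Gradient overlay (msGradation)
│   ├── Vignette (Cheaplens)
│   └── Edge blur (ikEdgeBlur)
└── Debug layer
    └── Render comparison (SwapScreen)
```

---

## 3. Feasibility Assessment for Unity

| Aspect | Assessment |
|--------|------------|
| **Overall feasibility** | ✅ **Fully achievable**; Unity supports every effect |
| **Recommended render pipeline** | **URP** (balances quality and performance) or **HDRP** (best image quality). Built-in also works but needs more hand-written code |
| **Recommended tools** | Shader Graph + VFX Graph + Post Processing + Custom Render Features |
| **Difficulty breakdown** | 90% of the effects = easy (built-in or community solutions); sdSSGI = medium (hand-written or a plugin); the per-material mask system = needs careful Render Feature / Stencil design |
| **Key dependencies** | Unity Toon Shader (UTS3), Post Processing, Custom Renderer Features |

### 🎯 Recommended Route for Implementing This Recipe in Unity

1. **Base rendering**: use **UTS (Unity Toon Shader)** or a **Toon Shader built in Shader Graph** as the main shader, covering ramp shadows, outlines, and rim light
2. **Material control**: use Material Property Blocks or separate Material instances to control HSV, outline thickness, emission, and so on per material
3. **Lighting**: multiple point lights + Light Layers / Culling Mask for precise masking. Bind the lights to the bone with a script
4. **Post-processing**: URP's built-in Volume components cover Bloom, DOF, Motion Blur, Vignette, and Color Grading
5. **Advanced effects**: implement sdSSGI and PostDropHair as Custom Render Features, or find Asset Store plugins
6. **Debug comparison**: a simple Editor script for SplitView is enough

---

## 4. Conclusion

The original post is essentially a complete **3D-to-2D (NPR/Toon) rendering pipeline**, covering:

- **Toon Shader** (toon shadows + ramp control)
- **Multiple outline systems** (krtoon2 + Dippen, with per-material thickness and on/off control)
- **A full post-processing chain** (Bloom, DOF, Motion Blur, Vignette, Color Grading, sharpening, SSGI)
- **Fine-grained multi-light setup** (four or more point lights, each with its own masks, following the head bone)
- **Layered material control** (independent HSV / emission / outline parameters for hair, skin, and facial features)
- **Advanced NPR techniques** (eyes visible through hair, hair shadow, edge-blur texture, gradient overlays)

All of these techniques have mature implementations in Unity. The main effort goes into **customizing the Toon Shader** and **designing the mask system**; Unity supports most of the remaining post-processing effects out of the box.

Original distribution links:

- nbHSV: https://www.nicovideo.jp/watch/sm39274461 (by のべ男)
- lv2shader: https://www.bilibili.com/video/BV1DqkSBbEF8/ (by winglayer)
- motionblur3L: https://seiga.nicovideo.jp/seiga/im3061709 (by そぼろ)
- lightbloom SJ: https://www.aplaybox.com/details/model/RKjvijIaYsVX (by 三金络合物)
- g ALB: https://bowlroll.net/file/182490 (by 呉石)
- krtoon2: https://kagitaba.fanbox.cc/posts/4962246 (by 鍵束)
- Dippen: https://fujisunflower.fanbox.cc/posts/1422906 (by サンフラワーふじ)
- sdSSGI: https://www.nicovideo.jp/watch/sm43948524 (by サンドマン)
- PostRimLightToon: https://bowlroll.net/file/146255 (by P.I.P)
- PostDropHair: https://bowlroll.net/file/178761 (by P.I.P)
- ikboken (三金's modified version): https://www.bilibili.com/opus/658890877048979479
- msGradation: https://bowlroll.net/file/210901 (by ましまし)
- Cheaplens: https://bowlroll.net/file/337397 (by そぼろ)
- PostAlphaMask: https://bowlroll.net/file/152266 (by P.I.P)
- SwapScreen/SplitView: https://ux.getuploader.com/Elle_DataPort/download/19 (by データ P)
- PostPointLight (tn): https://bowlroll.net/file/188141 (by 呉石)
- PostPointLight: https://www.nicovideo.jp/watch/sm14626601 (by ビームマンP)
- 発光コンタクト: https://bowlroll.net/file/169064 (by 996)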
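The "half-lambert + ramp texture" shading named in the lv2shader row can be illustrated numerically. This is a minimal sketch in plain Python rather than shader code, assuming one common form of half-lambert (`0.5 * dot(N, L) + 0.5`) and a nearest-texel lookup into a 1D ramp; the function names and the four-band ramp are hypothetical, and a real Unity version would live in Shader Graph or HLSL.

```python
import math


def normalize(vector):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(component * component for component in vector))
    return tuple(component / length for component in vector)


def half_lambert(normal, light_dir):
    """Remap N.L from [-1, 1] to [0, 1] so the shadow side keeps gradation."""
    n = normalize(normal)
    l = normalize(light_dir)
    n_dot_l = sum(a * b for a, b in zip(n, l))
    return 0.5 * n_dot_l + 0.5


def sample_ramp(ramp, t):
    """Nearest-texel lookup into a 1D ramp; t is clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    index = min(int(t * len(ramp)), len(ramp) - 1)
    return ramp[index]


# Hypothetical four-texel ramp: a hard toon edge halfway through the range.
RAMP = ["shadow", "shadow", "lit", "lit"]

facing_light = sample_ramp(RAMP, half_lambert((0, 0, 1), (0, 0, 1)))
facing_away = sample_ramp(RAMP, half_lambert((0, 0, 1), (0, 0, -1)))
```

Shadow softness then becomes a property of the ramp itself: a ramp with a gradual transition between its shadow and lit bands gives the soft edge the author prefers, while a hard step gives a crisp cel-shaded edge.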

Sharing My Personal 2D-Style Rendering Setup

Fri, 22 May 2026 04:37:01 GMT

![](https://i1.hdslb.com/bfs/new_dyn/b797c4e5246122ea3ec427f102432053397838044.png@1416w_798h_1c.avif)

半斤悠哉糖, edited 2026-03-21 23:53

- A note before reading: this post is better suited to people who already have some MMD experience.

Result preview:

![](https://i1.hdslb.com/bfs/new_dyn/59080c8ea3c30e36a2c520943c5dbfab397838044.png@1192w.avif)

---

MME preview:

![](https://i1.hdslb.com/bfs/new_dyn/6c5bda0f02fac312c1132bd9e474aedc397838044.png@812w_1386h.avif)

---

Model list preview:

![](https://i1.hdslb.com/bfs/new_dyn/0f037de17d3010f5fd4a60fd24de0a86397838044.png@1076w_856h.avif)

---

MME section:

- Import the model (thanks to knsy). The model looked dark overall, so I loaded nbHSV (hue H / saturation S / value V adjustment). Distribution: https://www.nicovideo.jp/watch/sm39274461, by のべ男. In the HSV1 tab, assign hsv off.fx to the whole model (do the same for the stage and so on), then assign hsv on.fx to the hair, face, and body materials.

![](https://i1.hdslb.com/bfs/new_dyn/8ba2b14e895fba26c179584da50d8583397838044.png@1192w.avif)

Controller parameters are shown below (only brightness and saturation are changed; skip this step for a normal model).

![](https://i1.hdslb.com/bfs/new_dyn/7d2407370a7cde3bf423e9071d55a85c397838044.png@1192w.avif)

- Load the main shader, lv2shader. Distribution: [【MME配布】Lv2 shader使用说明](https://www.bilibili.com/video/BV1DqkSBbEF8/?spm_id_from=333.337.search-card.all.click&vd_source=84e0f40ab0f2722176e0260243977414&spm_id_from=333.1369.0.0), by winglayer. For the face and similar parts, I recommend the fx with a smaller shadow range. If the mouth shadow looks odd when the head turns, split off the lip material and give it the fx with no shadow range (the downside is that you have to avoid motions where a hand would cast a shadow on the face).

![](https://i1.hdslb.com/bfs/new_dyn/00e0708cebade2fbec6b7955ef533c67397838044.png@1192w.avif)

On line 13 of the fx I set shadow softness to 5 (this is personal preference; I like the look of soft shadows). The front and back hair use different materials because I wanted more saturation in the hair close to the face. Metal parts can take a brightening fx such as 発光コンタクト. Distribution: https://bowlroll.net/file/169064, by 996 (adding a metal spa makes it look even better).

![](https://i1.hdslb.com/bfs/new_dyn/75e2441fd583f9ed80020daf4adba071397838044.jpg@1192w.avif)

- Load motionblur3L (motion blur), including the white background that ships with it; the default parameters are fine. Distribution: https://seiga.nicovideo.jp/seiga/im3061709, by そぼろ
- Load lightbloom SJ (bloom). Distribution: https://www.aplaybox.com/details/model/RKjvijIaYsVX, by 三金络合物. In the lightbloom Mask tab, assign the mask off material to the model, then assign the mask on material to the eye highlights.

![](https://i1.hdslb.com/bfs/new_dyn/d2232979a93e879833cd80aa7983fa7c397838044.png@1192w.avif)

Only the first controller parameter needs adjusting.
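![](https://i1.hdslb.com/bfs/new_dyn/5ff7a70f27067984a106e6e375fd268f397838044.png@1054w_672h.avif)

- Load g ALB (self-illumination). Distribution: https://bowlroll.net/file/182490, by 呉石
- In the ALB EmitterRT tab, remove the fx from the eye highlights to force them to glow.

![](https://i1.hdslb.com/bfs/new_dyn/84a62c39e287334200d053b2f32a3f50397838044.png@1192w.avif)

Controller parameters are below; the color bias depends on the model's pupil color.

![](https://i1.hdslb.com/bfs/new_dyn/11630a98bb2eda9357ad2c306b2ba8f4397838044.jpg@1028w_1238h.avif)

- Load the krtoon2 outline MMEs (all three are needed). Distribution: https://kagitaba.fanbox.cc/posts/4962246, by 鍵束

![](https://i1.hdslb.com/bfs/new_dyn/f4ac5bf94329273e3e43280ec3448987397838044.png@920w_928h.avif)

In the NLRateIndRT tab, assign noedge to the facial-feature materials (not the face material itself) and halfedge to the back hair. (Most models don't need the front and back hair treated separately; 白银's model is finely detailed, so the default fx on the back hair didn't look that good.)

![](https://i1.hdslb.com/bfs/new_dyn/df2dec28c35c6ae71661de70cd1e249a397838044.png@1192w.avif)

The outline tr value here is not a useful reference. I chose a fairly small value so the preview looks better; the actual export comes out thinner than the preview, so export images at several values and adjust to taste.

![](https://i1.hdslb.com/bfs/new_dyn/8add17a5bc6789447a56ecd5e49fbe03397838044.png@750w_580h.avif)

- Load Natural from Dippen (outlines) along with the controller DippenController.pmx, then load the preset Basic.vmd. Distribution: https://fujisunflower.fanbox.cc/posts/1422906, by サンフラワーふじ. In the DSWorld tab, assign the 0.0 fx to the facial features and similar materials. To make the preview in the MMD viewport look better, the front hair material's actual strength is 0.25. (This isn't very important, but for the sake of rigor: you can tune the controller parameters yourself in practice. I rarely change them much from the defaults; when I want thicker outlines I'm more likely to set that in PE or adjust MMD's built-in outline thickness.)

![](https://i1.hdslb.com/bfs/new_dyn/685bbcb7a4b7182db5002353996fde93397838044.png@1192w.avif)

- Load SSAA 4xL2 (anti-aliasing). (Optional; I still can't really see the difference.)
- Load sdSSGI (screen-space global illumination). Distribution: https://www.nicovideo.jp/watch/sm43948524, by サンドマン. The default hue is red, which is close to skin tone, so I only changed saturation and threshold. The black parts of the clothing are masked.

![](https://i1.hdslb.com/bfs/new_dyn/aa0d256fa4e296dfaa3dc65bb7b0ccd9397838044.jpg@974w_1138h.avif)

You can also try the parameters at the top right, for example raising brightness and saturation a little.

![](https://i1.hdslb.com/bfs/new_dyn/78a044a8c1a4f254a776149da6299ea6397838044.jpg@1192w.avif)

- Load sdFxaa and sdUnsharpMask (anti-aliasing / sharpening); the default parameters are fine.
- Load PostRimLightToon (rim light). Distribution: https://bowlroll.net/file/146255, by P.I.P
- In its tab, mask the face and facial-feature materials.

![](https://i1.hdslb.com/bfs/new_dyn/2df10ff70eb20cf6cbf579aa5b6e0049397838044.png@1192w.avif)

I only adjusted thickness on the controller. This MME adds detail, makes the model stand out, and replaces the spa highlight on the back hair. With a spa texture the highlight area on the back hair was too large, so only the front hair actually has a spa texture (whether to use one depends on the model).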
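![](https://i1.hdslb.com/bfs/new_dyn/a792911743f7d4cbfaa077fd4c7f3c1a397838044.png@980w_542h.avif)

- Load PostDropHair (hair shadow). Distribution: https://bowlroll.net/file/178761, by P.I.P. In the DrophairRT tab, assign the front-hair shadow to the front hair. In the DrophairMaskRT tab, assign the InvisibleMask mask to the front hair and to any parts of the face you don't want affected (the white background .x needs the mask too).

![](https://i1.hdslb.com/bfs/new_dyn/e87f832e13f0d83222e9c26734b8d5be397838044.png@1192w.avif)

![](https://i1.hdslb.com/bfs/new_dyn/f26d285f5d312e20c0833f7406c42d10397838044.png@1192w.avif)

lv2 already casts fairly fine shadows, but drophair is more precise (for this step, check that the face material earlier uses the smaller shadow range). For the MME's shadow tint I prefer grayish colors (shifted toward orange or pink to match the skin tone). I've noticed that many beginners go for highly saturated orange shadows without considering the model's actual skin tone or the ambient color, which looks strange (see below), except in some special styles.

![](https://i1.hdslb.com/bfs/new_dyn/35637d4b0bb35060d8790f82eefd9bdf397838044.png@1192w.avif)

- Load ikboken (depth of field). I use 三金's modified version. Distribution: [三金络合物的动态](https://www.bilibili.com/opus/658890877048979479?spm_id_from=333.1387.0.0&spm_id_from=333.1369.0.0)
- Parameters are a matter of taste, so instead of recommending values, here is a rough reference (I keyframe DOF strength according to camera distance).

![](https://i1.hdslb.com/bfs/new_dyn/d0c77a4980b97d6b8e28d014fe997d11397838044.jpg@1192w.avif)

- Load msGradation1 (gradient) and its controller; adjust the controller to add contrast at the bottom. Distribution: https://bowlroll.net/file/210901, by ましまし

![](https://i1.hdslb.com/bfs/new_dyn/29fc6fa5ce4740ba785eddcae90ce7f8397838044.png@982w_550h.avif)

- Load msGradation2 and its controller; adjust the controller to cast a gradient light from the top, imitating anime compositing lighting (different shots still need keyframing).

![](https://i1.hdslb.com/bfs/new_dyn/c07233ac74f0d1b6332730665306b179397838044.png@942w_588h.avif)

- Load Cheaplens (vignette). Distribution: https://bowlroll.net/file/337397, by そぼろ. It guides the eye toward the visual center; the background is pure white, so I kept it very faint.

![](https://i1.hdslb.com/bfs/new_dyn/da49d832787bfc0724d248a78ec59d05397838044.png@710w_560h.avif)

- Load PostAlphaMask (eyes visible through hair). Distribution: https://bowlroll.net/file/152266, by P.I.P. In the AlphaMask tab, remove the fx from the materials that should show through, and uncheck the occluders (such as the front hair).

![](https://i1.hdslb.com/bfs/new_dyn/875154c4027281fc35dc1c5a9daf0e98397838044.png@1192w.avif)

- Load SwapScreen and SplitView (render comparison). Distribution: https://ux.getuploader.com/Elle_DataPort/download/19, by データ P. They must go last in the MME list; they give a direct before/after comparison (several display modes are available).

---

Lighting section:

- Top left: tn PostPointLight bound to the head, casting a white gradient light to brighten (very subtle; you can skip it or brighten with another MME). Distribution: https://bowlroll.net/file/188141, by 呉石

![](https://i1.hdslb.com/bfs/new_dyn/b39b465b58dc78673b701cfe16f82e68397838044.png@1192w.avif)

- Upper right, toward the back: tn PostPointLight bound to the head, with the shadow area adjusted, casting a purple rim light (I had run out of ideas for adding detail, so I rendered something that looks like dyed hair streaks).

![](https://i1.hdslb.com/bfs/new_dyn/6541aa31c8fad8fc781ee180532c450a397838044.png@1192w.avif)

- Upper right: PostPointLight bound to the head, casting a warm gradient light to add detail. Distribution: https://www.nicovideo.jp/watch/sm14626601, by ビームマンP

![](https://i1.hdslb.com/bfs/new_dyn/8f7c7c88310ae3e2ed0d1ee9c6916566397838044.png@1192w.avif)

![](https://i1.hdslb.com/bfs/new_dyn/8dd953d6b2ef06d605bd0f6ebba1591c397838044.png@1192w.avif)

- Lower right: PostPointLight bound to the head, aimed at the hair tips to brighten them.

![](https://i1.hdslb.com/bfs/new_dyn/5bd6efb0d180a34acce86d793a871a42397838044.png@1192w.avif)

![](https://i1.hdslb.com/bfs/new_dyn/9fda107695d28edb2fc89882d7ae8533397838044.png@1192w.avif)

PS: every light except the first needs separate masks for the hair, facial features, and skin. The point of the lighting is to add detail and imitate the gradient-color feel of a painting.

---

Things I haven't finished or might use:

- More, finer-grained lighting.
- Use m4layer to overlay fx that support normal maps, such as ps and dg shaders, for more texture (I loaded a textured material on the front hair to test). Because of the m4 overlay the shadows get lighter overall; deepen them by editing the fx or by using the pmx morph that ships with lv2.

![](https://i1.hdslb.com/bfs/new_dyn/86224dd9b2ca86a43a47913254f824a1397838044.png@1192w.avif)

- Blue highlights suit pink hair better than white ones, but unfortunately they don't render well and overexpose easily (still-image artists can just paint them in).

![](https://i1.hdslb.com/bfs/new_dyn/d1bc41f25c922a391a906b6f3a2e415a397838044.png@1192w.avif)

- ikEdgeBlur adds some texture. I'm torn over whether to use it and how much. Zoomed in, you can see a layer of blur along the outline edges with darkened color (it darkens the image overall but improves the texture).

![](https://i1.hdslb.com/bfs/new_dyn/7cef948fb91cea4ed85dc130a0fa0365397838044.jpg@1192w_1192h.avif)

![](https://i1.hdslb.com/bfs/new_dyn/186c895f60ef2755341b30376ce35d41397838044.jpg@1192w_1192h.avif)

- Different skin materials: the body uses a grayish material. It is just right in the neck shadow, but the arm shadows felt too gray, so I split them in PE and gave them a different shadow tint (it was actually the texture's fault; I later forced it brighter with an MME).

![](https://i1.hdslb.com/bfs/new_dyn/a0defcfdf4143e774e85b065e82677f8397838044.jpg@1192w.avif)

- Color correction: after rendering, the overall saturation still seemed high and the tone leaned pink (this could also be a color difference between my phone and PC). On my phone, the face and hair colors differ somewhat from the low-to-mid saturation I had in mind. Fixing it means reworking the MME parameters, which is a hassle (?), so this is as far as I've gotten for now.
- Variants: use g ALB for "lights off" effects, for beat-synced cuts or transitions.

![](https://i1.hdslb.com/bfs/new_dyn/809776bc310123469c2b72054942f5e8397838044.png@1192w.avif)

---

Thanks for reading this far! Although this is a share, I haven't given very detailed parameters. I would rather people explore based on their own scenes than copy a recipe wholesale, because that is how you build your own understanding and style. When I actually render, I also adjust a little at a time, trying different parameters on each MME to see which combination works best. I'm still searching for my own style; so far every video uses a different main shader and set of MMEs. I may share new recipes in the future. Thank you for the support! If you have questions or spot mistakes, please leave a comment.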

Conclusion First

Wed, 20 May 2026 17:50:42 GMT

你说得对，前一版建议更多是基于你给的两篇文章做工程推理。重新查 Unity 官方文档、Unity 官方发布页和 Gitea 官方文档后，我会把建议调整成下面这版。重点变化是：**Unity 6.2+ 的验证脚本、Gitea Actions、本地 Gitea LFS、分支保护、Build Profile、测试命令细节，都要更实际一些。** --- ## 先给结论针对你的环境：**Unity 6.2+ + 本地自建 Gitea**，我建议优先做这 8 件事： 1. **Unity 版本不要写“6.2+”，要锁定到精确 `6000.2.xf1` 或迁到当前 LTS。** 2. **Gitea 必须启用 Git LFS，并在仓库强制 LFS 规则。** 3. **`.meta` 必须提交，Asset Serialization 保持 Force Text。** 4. **UnityYAMLMerge 要在每个开发机和 Gitea runner 上配置，不是只写 `.gitattributes`。** 5. **把验证脚本拆成 `validate-fast.sh` 和 `validate-full.sh`。** 6. **修正 Unity 测试命令：`-runTests` 不要搭配普通 `-quit`。** 7. **Gitea Actions runner 建议单独机器跑，不要和 Gitea 服务抢资源。** 8. **Gitea 分支保护开启：禁止直接推主分支、要求 PR、要求状态检查、要求审批。** Unity 官方现在的 Unity 6 发布策略里，LTS 推荐用于即将锁版本或线上运营项目，Update release 推荐用于新项目或中期项目；官方页面也显示 Unity 6.3 LTS 是当前 LTS，并说明 Update release 是 production-ready，但支持到下个 release 为止。所以如果你项目已经生产化，最好不要长期停留在模糊的“6.2+”，而是**锁死具体 patch 版本**，或者单独评估迁到 Unity 6.3 LTS。([unity.com](https://unity.com/releases/lts-vs-tech-stream)) --- # 1. Unity 6.2+ 下，原文里的验证脚本需要改你原文里的 `validate-unity.sh` 思路是对的，但有一个关键细节要修正： > **跑 Unity Test Framework 的 `-runTests` 时，不要加普通 `-quit`。** Unity 6.2 Test Framework 文档明确给出的命令是 `Unity.exe -runTests -batchmode -projectPath ... -testResults ... -testPlatform ...`，并且命令行参考里写了：Editor 的普通 `-quit` 在测试运行时不受支持。([docs.unity.cn](https://docs.unity.cn/6000.2/Documentation/Manual/test-framework/run-tests-from-command-line.html)) 所以我建议这样拆： ```bash tools/ validate-fast.sh # EditMode tests + project validation validate-full.sh # PlayMode tests + build/profile validation unity-env.sh # 统一找 Unity 路径 ``` ### `tools/unity-env.sh` ```bash #!/usr/bin/env bash set -euo pipefail : "${UNITY_PATH:=}" if [ -z "$UNITY_PATH" ]; then if command -v Unity >/dev/null 2>&1; then UNITY_PATH="Unity" else echo "UNITY_PATH is not set and Unity was not found in PATH." 
>&2 echo "Example:" >&2 echo " export UNITY_PATH=/home/build/Unity/Hub/Editor/6000.2.xf1/Editor/Unity" >&2 exit 1 fi fi export UNITY_PATH export PROJECT_PATH="${PROJECT_PATH:-$(pwd)}" export LOG_DIR="${LOG_DIR:-$PROJECT_PATH/Logs}" mkdir -p "$LOG_DIR" ``` ### `tools/validate-fast.sh` ```bash #!/usr/bin/env bash set -euo pipefail SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" source "$SCRIPT_DIR/unity-env.sh" echo "Unity version:" "$UNITY_PATH" -version || true echo "Running EditMode tests..." "$UNITY_PATH" \ -batchmode \ -projectPath "$PROJECT_PATH" \ -runTests \ -testPlatform EditMode \ -testResults "$LOG_DIR/editmode-results.xml" \ -logFile "$LOG_DIR/editmode.log" echo "Running project validation..." "$UNITY_PATH" \ -quit \ -batchmode \ -projectPath "$PROJECT_PATH" \ -executeMethod Project.Editor.Validation.ProjectValidationCommand.RunAll \ -logFile "$LOG_DIR/project-validation.log" ``` ### `tools/validate-full.sh` ```bash #!/usr/bin/env bash set -euo pipefail SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" source "$SCRIPT_DIR/unity-env.sh" echo "Running PlayMode tests..." "$UNITY_PATH" \ -batchmode \ -projectPath "$PROJECT_PATH" \ -runTests \ -testPlatform PlayMode \ -testResults "$LOG_DIR/playmode-results.xml" \ -logFile "$LOG_DIR/playmode.log" echo "Running build validation..." "$UNITY_PATH" \ -quit \ -batchmode \ -projectPath "$PROJECT_PATH" \ -executeMethod Project.Editor.Build.BuildValidationCommand.RunAll \ -logFile "$LOG_DIR/build-validation.log" ``` `-executeMethod` 适合 CI、构建、数据准备、项目验证这类任务；Unity 文档也说明该方法必须是 static，脚本要放在 Editor 代码路径里，失败时可以抛异常或 `EditorApplication.Exit` 返回非零退出码。([docs.unity.cn](https://docs.unity.cn/6000.2/Documentation/Manual/EditorCommandLineArguments.html)) --- # 2. 
Unity 6.2+ 要把 Build Settings 建议升级成 Build Profiles 建议你的 Unity 版本是 6.2+，所以我建议文档里不要只写传统 Build Settings。Unity 6.2 的命令行参数已经有 `-activeBuildProfile ` 和 `-build `，可以直接从命令行指定 Build Profile 后构建。([docs.unity.cn](https://docs.unity.cn/6000.2/Documentation/Manual/EditorCommandLineArguments.html)) 因此 `docs/build-and-release.md` 建议写成： ```markdown ## Build Profiles Build profiles live under: Assets/_Project/BuildProfiles/ Known profiles: - WindowsDev.asset - WindowsRelease.asset - AndroidDev.asset - AndroidRelease.asset Agents must not create or modify Build Profiles unless the task explicitly requires build configuration changes. ## CI Build Example Unity \ -quit \ -batchmode \ -projectPath . \ -activeBuildProfile "Assets/_Project/BuildProfiles/WindowsRelease.asset" \ -build "Builds/Windows/Game.exe" \ -logFile Logs/build-windows.log ``` 如果你的项目已经有自定义 `BuildPipeline.BuildPlayer` 脚本，也可以继续保留。但在 Unity 6.2+ 文档里，我会明确补一句： > Build Profile 是一等配置资产，修改它等同于修改构建策略，属于高风险变更。 --- # 3. Gitea 本地部署下，Git LFS 是 P0，不是可选项 Unity 项目有大量 `.psd`、`.fbx`、`.png`、`.wav`、`.mp4`、`.blend`、贴图、音频、模型等二进制资源。用本地 Gitea 时，要先确认 Gitea 服务器启用了内置 LFS。Gitea 官方文档要求在 `app.ini` 中启用 `LFS_START_SERVER = true`，并设置 `[lfs] PATH`；文档也注明 LFS server 支持需要服务器安装至少 Git v2.1.2。([docs.gitea.com](https://docs.gitea.com/next/administration/git-lfs-setup)) Gitea `app.ini` 建议： ```ini [server] LFS_START_SERVER = true [lfs] PATH = /var/lib/gitea/data/lfs ``` 仓库里建议执行： ```bash git lfs install git lfs track "*.psd" git lfs track "*.fbx" git lfs track "*.blend" git lfs track "*.png" git lfs track "*.tga" git lfs track "*.wav" git lfs track "*.mp3" git lfs track "*.mp4" git lfs track "*.mov" git lfs track "*.zip" git lfs track "*.unitypackage" ``` 并提交 `.gitattributes`。如果你们 Gitea 走 SSH，而且想用 LFS pure SSH，需要额外谨慎。Gitea 官方文档说 pure SSH LFS 支持可以开启，但默认关闭，并提到 git-lfs 客户端存在会导致 SSH transfers hang 的 open bug， workaround 是客户端设置 `git config --global lfs.ssh.automultiplex false`。除非你非常确定需要纯 SSH LFS，否则建议先用 HTTPS LFS 
或常规配置。([docs.gitea.com](https://docs.gitea.com/next/administration/git-lfs-setup)) --- # 4. `.meta`、Force Text、UnityYAMLMerge：建议保留，但要更精确 Unity 官方文档确认：Visible Meta Files 是给 Unity 不直接支持的版本控制系统使用的模式，可以用你选择的 VCS 管理源资产和 metadata。([docs.unity.cn](https://docs.unity.cn/6000.0/Documentation/Manual/class-VersionControlSettings.html)) Unity Editor 设置里，Asset Serialization 的 `Force Text` 会把所有资产转成 Text 模式，包括新资产；文档也说明 Unity 可以用文本格式保存 Scene 来帮助版本控制合并。([docs.unity.cn](https://docs.unity.cn/Manual/class-EditorManager.html)) 因此这几条要保留在 `AGENTS.md`： ```markdown ## Unity Version Control Rules - Version Control Mode must be Visible Meta Files. - Asset Serialization Mode must be Force Text. - `.meta` files must be committed. - Do not manually edit `.meta` files. - Move and rename assets inside Unity Editor whenever possible. - Do not move or rename `.unity`, `.prefab`, `.asset`, `.meta`, `.asmdef` unless the task explicitly requires it. ``` Unity 官方的项目组织建议也强调 `.meta` 记录导入设置，应该提交；并提醒移动资产时最好在 Editor 内移动，这样 `.meta` 会随资产一起移动。官方还提到 Git 默认忽略空目录，空目录的 `.meta` 可能导致协作问题，所以空目录要么避免，要么放 `.keep`。([unity.com](https://unity.com/how-to/organizing-your-project?ampDeviceId=bd73ee3b-a36f-4585-8d3e-967cba74f006&SessionId=1771632000345&Timestamp=1771718400359)) UnityYAMLMerge 这块也要保留，但要补一句： > `.gitattributes` 只是声明某些文件用 `unityyamlmerge`，每个开发机和 runner 还要配置 Git merge driver。 Unity 官方 Smart Merge 文档说明 UnityYAMLMerge 随 Unity Editor 提供，可用于语义正确地合并 scene 和 prefab，并给出了 Git mergetool 配置方式。([docs.unity.cn](https://docs.unity.cn/2021.1/Documentation/Manual/SmartMerge.html)) 建议 `.gitattributes`： ```gitattributes *.cs text eol=lf *.asmdef text eol=lf *.asmref text eol=lf *.unity text eol=lf merge=unityyamlmerge *.prefab text eol=lf merge=unityyamlmerge *.asset text eol=lf merge=unityyamlmerge *.meta text eol=lf merge=unityyamlmerge *.controller text eol=lf merge=unityyamlmerge *.anim text eol=lf merge=unityyamlmerge *.mat text eol=lf merge=unityyamlmerge *.psd filter=lfs diff=lfs merge=lfs -text *.fbx filter=lfs diff=lfs 
merge=lfs -text *.blend filter=lfs diff=lfs merge=lfs -text *.png filter=lfs diff=lfs merge=lfs -text *.tga filter=lfs diff=lfs merge=lfs -text *.wav filter=lfs diff=lfs merge=lfs -text *.mp3 filter=lfs diff=lfs merge=lfs -text *.mp4 filter=lfs diff=lfs merge=lfs -text *.mov filter=lfs diff=lfs merge=lfs -text ``` --- # 5. Gitea Actions：建议用专门 Unity runner，不要和 Gitea 服务混跑 Gitea Actions 需要 act runner 执行任务，Gitea 官方 quick start 明确建议 runner 跑在和 Gitea instance 分开的机器上，避免消耗过多资源影响 Gitea 本体。Unity CI 会吃 CPU、内存、磁盘 IO、Library cache 和许可证，所以这条对 Unity 项目尤其重要。([docs.gitea.com](https://docs.gitea.com/1.24/usage/actions/quickstart)) Gitea runner 注册时，不要用 `localhost` 或 `127.0.0.1` 作为 instance 地址。Gitea 文档特别提醒，runner 和 job containers 会连接这个地址，使用 loopback 地址通常是坏主意，LAN 地址通常才是正确选择。([docs.gitea.com](https://docs.gitea.com/1.24/usage/actions/quickstart)) 建议 runner 标签按 Unity 版本区分，例如： ```bash ./act_runner register \ --no-interactive \ --instance "http://gitea.lan:3000" \ --token "$GITEA_RUNNER_TOKEN" \ --name "unity-6000-2-runner-01" \ --labels "unity-6000-2,self-hosted" ``` Gitea act runner 可以 Docker 容器模式或 host 模式运行；官方推荐 Docker 更安全、更好管理，但也说明可以直接在 host 跑。Unity Editor + license + platform modules 往往更适合固定 host runner。注意：如果把 runner 用户加入 docker group，Gitea 文档提醒这实际上给了该用户 root 级别系统权限，所以 runner 机器应该隔离，不要和代码仓库主服务混在一起。([docs.gitea.com](https://docs.gitea.com/usage/actions/act-runner)) --- # 6. 
Gitea 工作流文件建议放 `.gitea/workflows/` Gitea 文档说明 workflow YAML 放在仓库的 `.gitea/workflows/` 目录，例如 `.gitea/workflows/demo.yaml`；Gitea Actions 设计上尽可能兼容 GitHub Actions，但官方也提醒某些 actions 在 SHA256 仓库或 Gitea 运行在 subpath 时可能不正常，包括 `actions/checkout`。([docs.gitea.com](https://docs.gitea.com/1.24/usage/actions/quickstart)) 建议最小 CI： ```yaml # .gitea/workflows/unity-validate.yaml name: Unity Validate on: pull_request: push: branches: - main - develop jobs: unity-fast: runs-on: unity-6000-2 steps: - name: Checkout uses: actions/checkout@v4 with: lfs: true - name: Ensure LFS files are present run: | git lfs install git lfs pull - name: Validate Unity version pin run: | cat ProjectSettings/ProjectVersion.txt "$UNITY_PATH" -version - name: Run fast validation run: | chmod +x tools/*.sh ./tools/validate-fast.sh unity-full: runs-on: unity-6000-2 needs: unity-fast if: gitea.event_name == 'pull_request' steps: - name: Checkout uses: actions/checkout@v4 with: lfs: true - name: Ensure LFS files are present run: | git lfs install git lfs pull - name: Run full validation run: | chmod +x tools/*.sh ./tools/validate-full.sh ``` 如果你的 Gitea 是纯内网、不能访问 GitHub，那么 `actions/checkout@v4` 需要镜像到本地，或者换成你们内部可访问的 checkout action。Gitea 官方示例使用 `actions/checkout@v4`，但也明确说“尽可能兼容”，不是所有场景都保证无差异。([docs.gitea.com](https://docs.gitea.com/1.24/usage/actions/quickstart)) --- # 7. 
Gitea 分支保护要作为 AI Agent 的硬护栏这点我会比前一版说得更强：如果用 Gitea，本地 pre-commit 不够，必须用 Gitea 的 protected branches。 Gitea 官方文档说明 protected branches 可以对 push、merge、review、status check 做策略限制；还可以禁用直接 push、禁用 force push、要求审批、dismiss stale approvals、阻止 rejected review 的 PR 合并、要求 PR 分支保持最新、要求状态检查成功后才能合并。([docs.gitea.com](https://docs.gitea.com/1.26/usage/access-control/protected-branches)) 建议 `main` / `release/*`： ```plaintext Protected branch: main Push: - Disable push Force push: - Disable force push Pull request: - Required approvals: 1 or 2 - Dismiss stale approvals: enabled - Block merge on rejected reviews: enabled - Block merge if pull request is outdated: enabled - Administrators must follow branch protection rules: enabled Status checks: - Enable status check - Required patterns: - actions/unity-fast - actions/unity-full ``` Gitea 支持 status check patterns，要求匹配的 CI context 在 PR head commit 上成功后才能合并。这个适合把 `validate-fast`、`validate-full` 变成硬门槛。([docs.gitea.com](https://docs.gitea.com/1.26/usage/access-control/protected-branches)) 注意：Gitea 的 protected file patterns 是“阻止修改敏感文件”，不是“要求额外审核”。所以不要轻易把 `ProjectSettings/**`、`Packages/**` 全部加成 protected file pattern，否则以后合法改构建设置也合不了。更好的做法是：这些文件改动时由 PR 模板和 reviewer 检查，同时用 CI 状态检查拦截。Gitea 文档说明 protected file patterns 会阻止触碰这些文件的 commit 或 merge。([docs.gitea.com](https://docs.gitea.com/1.26/usage/access-control/protected-branches)) --- # 8. 
The asmdef advice is correct, but two points of official backing should be added

Unity's official asmdef documentation confirms three things: an `.asmdef` compiles the scripts in its folder and subfolders into a separate assembly, unless a subfolder has its own asmdef or asmref; dependencies between assemblies must be declared as explicit references; and circular assembly references are not allowed. ([docs.unity.cn](https://docs.unity.cn/Manual/ScriptCompilationAssemblyDefinitionFiles.html)) So the original article's advice to "use asmdef to control the blast radius" holds. I would still add two points:

### First, an Editor assembly does not require every script to live in a top-level `Editor/` folder

Unity's documentation states that an Editor assembly lets you place Editor scripts anywhere in the project, not only in a top-level `Editor` folder. You do this by creating a platform-specific assembly that includes only the Editor platform. ([docs.unity.cn](https://docs.unity.cn/6000.0/Documentation/Manual/assembly-definitions-creating.html))

Suggested layout:

```plaintext
Assets/_Project/Scripts/Runtime/Core/Project.Core.asmdef
Assets/_Project/Scripts/Runtime/Gameplay/Project.Gameplay.asmdef
Assets/_Project/Scripts/Runtime/UI/Project.UI.asmdef
Assets/_Project/Scripts/Editor/Project.Editor.asmdef
```

`Project.Editor.asmdef`:

```json
{
  "name": "Project.Editor",
  "rootNamespace": "Project.Editor",
  "includePlatforms": ["Editor"],
  "references": [
    "Project.Core",
    "Project.Gameplay",
    "Project.UI"
  ]
}
```

### Second, test assemblies must be configured according to Unity Test Framework rules

Unity's documentation states that a test assembly keeps test code separate from shipped application code, and that Unity identifies test assemblies through references such as `nunit.framework.dll`, `UnityEngine.TestRunner`, and `UnityEditor.TestRunner`. ([docs.unity.cn](https://docs.unity.cn/6000.0/Documentation/Manual/assembly-definitions-creating.html))

So the document should name the test assemblies explicitly:

```plaintext
Project.Tests.EditMode.asmdef
Project.Tests.PlayMode.asmdef
```

Also spell out in `docs/testing.md` which runtime assemblies each test asmdef references.

---

# 9. The serialization migration protocol needs strengthening; the official docs back this up as a high-risk area

Unity's serialization rules state that the serializer works directly on fields, not properties. A field generally has to be public or marked `[SerializeField]`, must not be static, const, or readonly, and must be of a type Unity can serialize. ([docs.unity.cn](https://docs.unity.cn/6000.1/Documentation/Manual/script-serialization-rules.html))

So the Unity rules for AI agents should state:

```markdown
## Serialization Rules

- Prefer `[SerializeField] private` fields for Inspector wiring.
- Do not rename serialized fields casually.
- Do not convert serialized fields into properties unless migration is intentional.
- Do not make serialized fields `static`, `const`, or `readonly`.
- Do not assume properties are serialized.
```

Field renames must use `[FormerlySerializedAs]`. Unity's API documentation states that this attribute is for renaming a field without losing its serialized value. ([docs.unity.cn](https://docs.unity.cn/ScriptReference/Serialization.FormerlySerializedAsAttribute.html))

```csharp
using UnityEngine;
using UnityEngine.Serialization;

namespace Project.Gameplay
{
    public sealed class PlayerMovement : MonoBehaviour
    {
        // The field used to be called "moveSpeed"; the attribute keeps existing
        // prefab and scene values attached to the renamed field.
        [FormerlySerializedAs("moveSpeed")]
        [SerializeField]
        private float movementSpeed = 5f;
    }
}
```

`[SerializeReference]` should be listed as high risk. Unity's documentation explains that it makes Unity serialize by reference rather than by value, which supports polymorphism, null, and cyclic graphs but carries extra overhead. The documentation also says a managed reference records an ID, the fully qualified class name, and the field values. Class renames, namespace changes, and assembly moves are therefore more sensitive. ([docs.unity.cn](https://docs.unity.cn/Documentation/ScriptReference/SerializeReference.html))

Suggested addition to `AGENTS.md`:

```markdown
## High-Risk Serialization Changes

Treat these as high risk:

- Renaming serialized fields
- Moving MonoBehaviour or ScriptableObject classes between namespaces
- Moving classes between asmdefs
- Changing `[SerializeReference]` type names
- Changing ScriptableObject schema
- Changing prefab or scene serialized data

Required:

- Add `[FormerlySerializedAs]` for field renames.
- Document affected prefabs, scenes, ScriptableObjects, and assets.
- Run `./tools/validate-full.sh`.
```

---

# 10. `AGENTS.md` should include rules specific to Unity 6.2 + Gitea

In the Unity version of your article, I suggest changing `AGENTS.md` to match the actual environment more closely:

```markdown
# Project Unity Agent Rules

## Environment

- Unity version is pinned by `ProjectSettings/ProjectVersion.txt`.
- Do not upgrade Unity version or packages unless explicitly asked.
- Repository is hosted on self-managed Gitea.
- CI runs through Gitea Actions on dedicated Unity runners.
- Large binary assets are stored with Git LFS.

## Required Validation

For normal C# and editor-tooling changes:

    ./tools/validate-fast.sh

For scene, prefab, ScriptableObject, Addressables, ProjectSettings, Packages, Build Profiles, input, animation, or asmdef changes:

    ./tools/validate-full.sh

## Version Control

- Commit `.meta` files.
- Do not manually edit `.meta`.
- Do not move or rename Unity assets outside the Editor unless explicitly required.
- Do not modify `ProjectSettings/`, `Packages/`, or Build Profiles unless the task requires it.
- Do not add large binary files outside Git LFS tracking.

## Test Command Rule

When running Unity Test Framework from command line, use `-runTests -batchmode`.
Do not add normal `-quit` to test commands.

## Build Profile Rule

Build Profiles are build configuration assets.
Changing them is high risk and must be mentioned in the final response.

## Final Response

Always state:

- Files changed
- Asset, prefab, scene, ScriptableObject, asmdef, Build Profile, or ProjectSettings impact
- Validation command run
- Whether Unity/Gitea CI still needs to verify anything
```

This version is more accurate than the original article, especially in the test command and Gitea CI parts.

---

# 11. I would adjust the rollout priorities in the original article

## P0: do immediately

1. Pin the exact Unity version; do not write a vague `6.2+`
2. Enable LFS in Gitea
3. Commit `.gitattributes` to the repository
4. Set Unity to Visible Meta Files
5. Set Unity to Force Text
6. Configure UnityYAMLMerge on developer machines and runners
7. Create `tools/validate-fast.sh`
8. Set up a Gitea Actions runner on a dedicated machine
9. Protect the `main` branch in Gitea and forbid direct pushes
10. Require the `unity-fast` status check to pass in CI

## P1: do within a week

1. `validate-full.sh`
2. PlayMode smoke tests
3. Build Profile validation command
4. asmdef dependency checks
5. Serialized-field migration protocol
6. A Unity risk checklist in the PR template
7. Gitea PR approval requirement with stale approval dismissal

## P2: do once the project grows

1. Addressables Analyze / content build validation
2. Dedicated build runner
3. Unity Accelerator or a shared Library cache
4. Asset size budgets
5. An auto-generated project index for agents to read
6. AI cleanup in report-only mode first

---

## One-sentence summary

After calibrating against the official Unity and Gitea documentation, I would upgrade the advice in your article from:

> "Write rules for the agent"

to:

> **A Unity 6.2+ project on Gitea gives agents engineering guardrails they cannot bypass: LFS, UnityYAMLMerge, asmdef, tiered validation scripts, Gitea Actions, branch protection, and a serialization migration protocol.**

The three most important corrections are:

1. **Do not add a plain `-quit` to Unity Test Framework commands.**
2. **A self-hosted Gitea deployment must first get LFS, runners, branch protection, and status checks working end to end.**
3. **On Unity 6.2+, Build Profiles must be treated as high-risk configuration assets and covered by the agent rules.**
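The first correction above is easy to enforce mechanically instead of relying on the agent to remember it. The following is a minimal sketch, assuming each Unity invocation sits on one logical line of a script; the function name `find_quit_with_run_tests` is hypothetical and not part of Unity or Gitea.

```python
import shlex


def find_quit_with_run_tests(command_lines):
    """Return 1-based numbers of lines that pass both -runTests and -quit."""
    offenders = []
    for number, line in enumerate(command_lines, start=1):
        try:
            # comments=True drops trailing "# ..." shell comments.
            tokens = shlex.split(line, comments=True)
        except ValueError:
            # Unbalanced quotes or a dangling backslash: fall back to a plain split.
            tokens = line.split()
        if "-runTests" in tokens and "-quit" in tokens:
            offenders.append(number)
    return offenders


if __name__ == "__main__":
    sample = [
        "unity-editor -batchmode -projectPath . -runTests -testPlatform EditMode",
        "unity-editor -batchmode -quit -projectPath . -runTests -testPlatform PlayMode",
    ]
    print(find_quit_with_run_tests(sample))
```

A check like this could run inside `tools/validate-fast.sh` against the CI scripts themselves. Commands that are split across lines with backslashes would need to be joined first, which this sketch does not attempt.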

Evidence-Reviewed Improvement Suggestions for the Unity Document

Wed, 20 May 2026 08:30:36 GMT

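After searching key sources, including the official Unity 6.2 documentation, the current state of Claude Code and AGENTS.md, Gitea Actions practice, and game-ci, here are revision suggestions **based on facts rather than speculation**. They focus on the two constraints you mentioned: **Unity 6.2+** and **self-hosted Gitea**.

---

## I. Factual errors that need immediate correction

### 1. The relationship between AGENTS.md and CLAUDE.md needs to be stated clearly

I previously assumed AGENTS.md was equivalent to CLAUDE.md. The facts are more complicated:

- AGENTS.md was launched in 2025 as a collaboration between Sourcegraph, OpenAI, Google, Cursor, and others. It is now maintained by the Agentic AI Foundation under the Linux Foundation and is widely listed as supported by Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Aider, Zed, Warp, RooCode, and more.
- **But there is one key exception**: as of April 2026, Claude Code still does not read AGENTS.md natively. The related GitHub issue has collected thousands of upvotes, but Anthropic has given no timeline. The standard workaround is to write AGENTS.md and then create a symlink: `ln -s AGENTS.md CLAUDE.md`.

**Suggestion**: document this symlink workflow explicitly in your Unity document. Otherwise Claude Code users who clone the repository will not pick up AGENTS.md.

### 2. AGENTS.md/CLAUDE.md is currently far too long

This is the biggest problem I did not point out last round. The AGENTS.md in the current Unity document plus all the docs in Appendix B add up to several thousand lines, but research shows:

- Frontier thinking LLMs can reliably follow roughly 150 to 200 instructions; non-thinking and smaller models degrade faster.
- Claude Code's system prompt already uses about 50 instructions, roughly a third of what the agent can reliably follow, so CLAUDE.md should contain as few instructions as possible.
- Research recommends keeping CLAUDE.md concise, ideally no more than 300 lines.
- The most common mistake is stuffing a code style guide into CLAUDE.md. **Never make an LLM do a linter's job**: LLMs are expensive and slow, so use deterministic tools wherever possible.

**This means most of your current AGENTS.md should be deleted**, especially:

- The detailed C# Style rules (hand them to `.editorconfig` + Roslyn Analyzer)
- The long Unity Lifecycle Rules section (hand it to Microsoft.Unity.Analyzers)
- The full anti-pattern ✗/✓ comparison table (if an analyzer can catch it, let the analyzer catch it)

Keep only the **project-level constraints that deterministic tools cannot express**: module dependency direction, entry-point scripts, the `[FormerlySerializedAs]` migration requirement, `-executeMethod` entry points, asset directories that must not be touched, and so on.

---

## II. Corrections specific to Unity 6.2

### 1. Build Profiles have replaced Build Settings (Unity 6.0+)

The document still says `File > Build Settings`, which is the old 2022.3 LTS terminology.

- Unity 6 introduced Build Profiles to help define and reuse build configurations across platforms and environments.
- Each Build Profile is a set of configuration settings, accessed through `File > Build Profiles`.
- Unity saves each Build Profile as an `.asset` file (for example `Assets/Settings/Build Profiles/AndroidDevelopment.asset`), which should be committed to version control.

**Suggestions**:

- `docs/build-and-release.md` must be designed around Build Profile `.asset` files, not the old BuildSettings.asset.
- Add Build Profile assets explicitly to the "do not modify casually" list in AGENTS.md, since they directly determine CI output.
- Unity 6.2 also provides per-Build-Profile overrides of Diagnostics settings for Web, Android, iOS, macOS, and Windows. AGENTS.md should mention "do not toggle Diagnostics without the release owner's consent".

### 2. Unity Test Framework is now a core package

The document says "Add or update EditMode tests" and "PlayMode tests", implying the two must be kept apart. Unity 6.2 no longer requires that:

- Unity Test Framework has become a core package, and its user manual has been merged into the core Unity manual.
- UTF 2.0 introduces `RequiresPlayModeAttribute`, which removes the requirement to put EditMode and PlayMode tests in separate assemblies. An Editor-only test assembly can now contain tests that run in Play Mode (marked `[RequiresPlayMode]`), and a platform-specific assembly can use `[RequiresPlayMode(false)]` to keep a test out of Play Mode.
- UTF 2.0 also supports writing asynchronous tests with the .NET Task asynchronous programming model.

**Suggestions**:

- Simplify the directory structure. There is no need to force two asmdefs for `Tests/EditMode/` and `Tests/PlayMode/`. You can keep a single `Project.Tests` asmdef and distinguish test modes with attributes.
- In AGENTS.md, recommend `[RequiresPlayMode]` instead of mechanically dropping tests into a PlayMode asmdef.

### 3. On async: Awaitable is now officially recommended, but UniTask is still more complete

The document only says, in general terms, "Prefer cancellation-aware async flows when using UniTask or Tasks". The reality in Unity 6 is more specific:

- Unity 6 introduced the awaitable type `Awaitable`, which can be seen as a subset of UniTask; the design of Awaitable was influenced by UniTask.
- In Unity 2023.1 and later, `await UniTask.WaitForEndOfFrame()` no longer needs a MonoBehaviour, because it uses `UnityEngine.Awaitable.EndOfFrameAsync`.
- Unity's built-in Awaitable is convenient (especially for libraries that want to avoid external dependencies), but UniTask remains the more complete, production-oriented toolkit: frame-aware delays, WhenAll/WhenAny, fine-grained PlayerLoopTiming, and first-class CancellationToken support. For applications and games, UniTask is still a good default.

**Suggestion**: give explicit guidance in `docs/conventions.md`:

- Project-internal async code: use `Awaitable` (zero dependencies, official).
- High-frequency or GC-sensitive code, or code that needs WhenAll/WhenAny: use UniTask.
- Do not let agents mix `Task` and `UniTask` in the same code path, which easily causes implicit synchronization context switches.

---

## III. Corrections to Gitea CI (the biggest change)

The GitHub Actions YAML I gave earlier will run into trouble if dropped directly onto your Gitea. The facts:

### 1. Gitea Actions is compatible with GitHub Actions, with limits

- Gitea Actions provides GitHub Actions-compatible CI/CD directly in the Gitea UI, but third-party actions from the GitHub Marketplace cannot be used directly. Popular actions such as `actions/checkout`, `actions/setup-python`, and `actions/cache` are mirrored on gitea.com and work normally; act_runner fetches them from the internet on first use and caches them locally.
- Gitea Actions is based on nektos/act. act_runner runs workflows by spawning act instances. act tries to be as compatible with GitHub Actions as possible, but it is not a perfect match.

**This means**:

- `game-ci/unity-test-runner@v4` and `game-ci/unity-builder@v4` **are not guaranteed to work out of the box on Gitea Actions**, because they depend on the GitHub Marketplace.
- `anthropics/claude-code-action@v1` (which I recommended last round) likewise cannot run directly on Gitea.

### 2. For self-hosted Gitea, follow GameCI's GitLab route (call the Docker image directly)

GameCI provides two systems: GitHub Actions and GitLab CI. The former depends on the marketplace, the latter on Docker images. **For self-hosted Gitea, the GitLab route is the one to follow**:

- GameCI's unity3d-gitlab-ci-example uses the unity3d Docker images published by game-ci.
- Typical invocation: use a `unityci/editor:-base-` image, pass in the `UNITY_EMAIL`, `UNITY_PASSWORD`, and `UNITY_SERIAL` environment variables, then call `unity-editor -batchmode -nographics` through `xvfb-run`.

**Suggested Gitea workflow skeleton** (`.gitea/workflows/unity-validate.yml`):

```yaml
name: Unity Validate

on:
  pull_request:
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: unityci/editor:6000.2.16f1-base-3 # Unity 6.2 image
      options: --user root
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true

      - name: Activate Unity license
        run: |
          mkdir -p /root/.cache/unity3d /root/.local/share/unity3d/Unity/
          echo "$UNITY_LICENSE" | base64 -d > /root/.local/share/unity3d/Unity/Unity_lic.ulf
        env:
          UNITY_LICENSE: ${{ secrets.UNITY_LICENSE }}

      - name: Run EditMode tests
        # No -quit here: with -runTests the Editor exits on its own after the run.
        run: |
          xvfb-run --auto-servernum --server-args='-screen 0 640x480x24' \
            unity-editor -batchmode -nographics -logFile /dev/stdout \
            -projectPath . -runTests -testPlatform EditMode \
            -testResults Logs/editmode-results.xml

      - name: Run project validation
        run: |
          xvfb-run --auto-servernum --server-args='-screen 0 640x480x24' \
            unity-editor -batchmode -nographics -logFile /dev/stdout -quit \
            -projectPath . \
            -executeMethod Project.Editor.Validation.ProjectValidationCommand.RunAll
```

A few notes:

- **Do not** use `game-ci/unity-test-runner@v4`; it pulls from the GitHub marketplace.
- act_runner must use **DinD or Docker mode**, and the runner needs enough room for large images (Unity images exceed 10 GB). The Gitea act_runner default label is `ubuntu-22.04:docker://node:16-bullseye`; if the default image is not enough, change it to `ubuntu-22.04:docker://<the image you need>`.
- The license flow differs between Unity Personal and Pro: a Unity Personal license produces a serial token inside the `.ulf` file, so you first activate in Unity Hub on a local machine and then locate the `.ulf` file.

### 3. The CI caching strategy must be local

`actions/cache@v4` on Gitea only works when act_runner's cache server is enabled. This is controlled by `cache.enabled` in the act_runner configuration file, so check that value instead of assuming the cache is on.

- Make sure `cache.enabled` is `true` in act_runner's `config.yaml`.
- Caching `Library/` is the key to Unity CI: a first import of a medium-sized project takes 10 to 20 minutes, and a cache hit brings that down to 1 to 2 minutes.

### 4. The path for Claude Code automation on Gitea

The `anthropics/claude-code-action` I recommended earlier is a GitHub Action. To do "garbage collection" on Gitea, a more realistic approach is:

- Install the `claude` CLI directly in the act_runner container (`npm install -g @anthropic-ai/claude-code`).
- Trigger a workflow on a cron schedule and have the workflow call `claude -p "..."` from the shell.

Do not assume `claude-code-action@v1` is available unless you are willing to fork it and mirror its dependencies yourself.

---

## IV. Exact steps for rolling out Roslyn Analyzers

Last round I suggested, loosely, to "bring in Microsoft.Unity.Analyzers". In practice, installation in Unity is **completely different** from a .NET project:

- Unity does not support installing Roslyn Analyzers directly through NuGet. You download the NuGet package as a `.zip`, extract it, and find the `.dll`.
- After dragging the `.dll` into Unity, you disable Any Platform in the Plugin Inspector, disable the Editor and Standalone platforms, then create and assign an Asset Label named `RoslynAnalyzer` (the case must match exactly).
- Unity recognizes the `RoslynAnalyzer` label and treats assets carrying it as Roslyn Analyzers or source generators.
- Starting with Visual Studio Tools for Unity 4.3.2.0, Microsoft.Unity.Analyzers is automatically included in every csproj Unity generates. So **you do not need to install it manually when using the Game Development with Unity workload in VS/VS Code, but you do need to install it to run `dotnet format` in CI**.

**Suggestion: add a SKILL**, `.agents/skills/unity-roslyn-analyzers/SKILL.md`, that spells out the "download → drag in → disable platforms → add the `RoslynAnalyzer` label" flow, so agents do not reach for `dotnet add package`, which has no effect in Unity.

---

## V. Practical limits of .editorconfig and dotnet format

Last round I said "adding `.editorconfig` is enough to make agents format consistently". There are pitfalls:

- Unity officially recommends `.editorconfig` over importing and exporting VS settings, because it is shared across IDEs and can go into version control.
- Create an `.editorconfig` file in the project root and fill in the configuration you want; the csproj files Unity 6 generates apply it automatically.
- **However**: every time a script is added, deleted, or renamed, Unity regenerates the sln and csproj files and overwrites manual changes. So you cannot put the editorconfig reference in a custom section of the csproj; it has to live in the repository root, where Unity picks it up on regeneration.
- Running `dotnet format` on a Unity project requires generating the sln/csproj first, and `dotnet format` may restore, compile, and run the analyzers in the specified project or solution, so it should only be invoked on trusted code.

**Suggested** pre-commit adjustment:

```yaml
- repo: local
  hooks:
    - id: dotnet-format-whitespace
      name: dotnet format whitespace
      # Use the whitespace subcommand to avoid triggering the full analyzer chain.
      # The trailing `|| true` keeps this hook advisory: it never blocks a commit.
      entry: bash -c 'dotnet format whitespace --verify-no-changes --include "$(git diff --cached --name-only --diff-filter=ACM | grep "\.cs$" | tr "\n" " ")" || true'
      language: system
      files: "\\.cs$"
      pass_filenames: false
```

Note the use of the `dotnet format whitespace` subcommand instead of the full `dotnet format`: `dotnet format whitespace` only runs the rules related to whitespace formatting, so it is fast and does not depend on whether analyzers are installed.

---

## VI. Where UnityYAMLMerge actually lives in Unity 6

The path I gave earlier used `2022.3.0f1`. The real paths for a Unity 6.2 installed through Unity Hub are:

- Windows: `C:\Program Files\Unity\Hub\Editor\6000.2.xfx\Editor\Data\Tools\UnityYAMLMerge.exe`
- macOS: `/Applications/Unity/Hub/Editor/6000.2.xfx/Unity.app/Contents/Tools/UnityYAMLMerge`

The SmartMerge page of the official Unity 6.0 documentation confirms that to customize UnityYAMLMerge's merge behavior you configure the `mergerules.txt` file, which sits in the Editor/Data/Tools folder of the Unity installation. **This matches what the document says**, but the example version number needs to change from `2022.3.0f1` to `6000.2.x`. In addition, the officially recommended git configuration uses `[mergetool "unityyamlmerge"] trustExitCode = false` together with `cmd = '' merge -p "$BASE" "$REMOTE" "$LOCAL" "$MERGED"`. What your document has is correct and needs no change.

---

## VII. The current standard for Skills

I did not cover the Skills section last round:

- On December 18, 2025, Anthropic published Agent Skills as an open standard for cross-platform portability.
- In its simplest form, a Skill is a directory containing a SKILL.md file. The file must start with YAML frontmatter containing the required name and description metadata. At startup the agent preloads each skill's name and description into the system prompt, which is the first layer of progressive disclosure.
- Agent Skills are supported in Claude.ai, Claude Code, the Claude Agent SDK, and the Claude Developer Platform.

Your current SKILL.md template **already conforms to the spec** (it has frontmatter with name and description). I still suggest making the skill paths explicit in AGENTS.md: Cursor uses `.cursor/rules/`, Codex uses the root AGENTS.md plus nested AGENTS.md files, and **Skills are mainly a Claude Code capability**. Do not assume other tools can read `.agents/skills/` directly.

---

## VIII. Final rollout priorities (ordered by ROI)

1. **Cut 80% of AGENTS.md**. Keep the constraints unique to the project (asmdef boundaries, `[FormerlySerializedAs]`, entry-point scripts, untouchable directories), move the rest into `docs/` for agents to read on demand, and add the symlink `ln -s AGENTS.md CLAUDE.md`.
2. **Rewrite the CI chapter for Gitea**: remove every `game-ci/*@v4` and `anthropics/claude-code-action`, and switch to the `unityci/editor` container + direct shell calls + a DinD runner.
3. **Update the Unity 6.2 terminology**: Build Settings → Build Profiles; simplify the Tests asmdefs into a single asmdef + `[RequiresPlayMode]`; change the UnityYAMLMerge path to the Hub path.
4. **Add .editorconfig + a Roslyn Analyzer Skill**: replace the style rules in AGENTS.md with deterministic tools, and write down the Unity-specific "download the dll + `RoslynAnalyzer` label" installation steps.
5. **Make the Awaitable vs UniTask choice explicit**: give scenario-based guidance in `docs/conventions.md` instead of a passing mention.

Once revised, the Unity document as a whole goes from "looks comprehensive" to "an agent can actually run it on Gitea + Unity 6.2".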
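Priority 1 above (a short AGENTS.md plus a `CLAUDE.md` symlink) can be kept honest with a small repository check. This is a minimal sketch, assuming the 300-line ceiling quoted earlier is applied to AGENTS.md itself; the function name `check_agent_docs` and the messages are hypothetical.

```python
from pathlib import Path

MAX_LINES = 300  # assumption: reuse the "ideally no more than 300 lines" guideline


def check_agent_docs(repo_root, max_lines=MAX_LINES):
    """Return a list of problems; an empty list means both checks passed."""
    root = Path(repo_root)
    agents = root / "AGENTS.md"
    claude = root / "CLAUDE.md"

    if not agents.is_file():
        return ["AGENTS.md is missing"]

    problems = []
    line_count = len(agents.read_text(encoding="utf-8").splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines (limit {max_lines})")

    # CLAUDE.md should be the symlink created by `ln -s AGENTS.md CLAUDE.md`.
    if not claude.is_symlink():
        problems.append("CLAUDE.md is not a symlink to AGENTS.md")
    elif claude.resolve() != agents.resolve():
        problems.append("CLAUDE.md points somewhere other than AGENTS.md")

    return problems
```

Running it from a validation script and failing when the list is non-empty keeps the budget enforced by a deterministic tool, in line with the "never make an LLM do a linter's job" point above. Symlink handling on Windows checkouts may need extra care and is not covered here.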

How to Become a Hermes Agent Operator

Wed, 20 May 2026 06:20:47 GMT

![Image](https://pbs.twimg.com/media/HIYCwg-WwAAnt_j?format=jpg&name=large)

learn how to operate and master Hermes Agent. set up the agent control room template, configure specialist agents, and grow from one agent to a whole marketing company on one VPS.

most AI tools answer questions. Hermes agent runs your workflows end-to-end. it navigates your browser, executes terminal commands, schedules cron jobs, monitors your inboxes, drafts the work, and posts the result to wherever you live: telegram, discord, slack, the email thread you're in right now.

built by [@NousResearch](https://x.com/@NousResearch) and it's open source with 150,000 github stars. it currently ranks #1 on OpenRouter for global token usage.

it's the framework I have built my entire marketing operation around for the past few weeks, and the article you are about to read is how I would set it up if I were starting today.

![Image](https://pbs.twimg.com/media/HIX9LXpXMAAi6Y-?format=jpg&name=large)

## what you'll get from this article

- what hermes agent is and why marketers (not just developers) should care
- the reader-friendly version of the architecture: the brain, the personality, the skillset, and how they all live in one folder
- the use cases I am personally running on hermes, with the four posts I have published about them
- the four-part mental model (you, control room, agents, optional task bus) and the four levels of setup, from "one agent on your laptop" to "a fully automated agent team on a VPS that you control from your phone"
- the prototype → production methodology I use to take a marketing workflow from messy idea to autonomous deployment
- the resources I would have wanted on day one: docs, the community atlas, the people to follow, the meetups happening right now
- the honest trade-offs and where this still breaks

I am not selling you anything in this article. hermes is open source, Nous Portal has a free tier, and most of the community ecosystem is free too. fork, change, make it yours.
## what hermes agent is

**short version:** an autonomous agent that gets more capable the longer it runs.

**longer version:** hermes is a framework built by Nous Research that turns a model into a persistent operator. it has its own memory that survives between sessions. it writes its own skills as it works. it ships with 123 skills already built in (github workflows, obsidian, google workspace, linear, notion, typefully, perplexity, deep research, plus 100+ more). it lives wherever you put it: on your laptop, in a docker container, on a VPS, in a serverless runtime. and you can talk to it through 20+ surfaces: telegram, discord, slack, email, voice mode, or just your terminal.

> May 13
>
> Hermes Agent changed how I work
>
> it's the highest leverage agent framework you can set up right now
>
> what makes it different:
>
> - it routes tasks to the right model based on complexity and cost
> - learns your voice and preferences over time
> - handles context switching without

if you have used claude code or openclaw, hermes is the same shape with a different philosophy.

> hermes is rails. opinionated defaults, batteries included, productive on day one with minimal setup, the agent does more thinking for you.

> openclaw is linux. primitives, guarantees, explicit control, the agent does exactly what you told it to and nothing more.

both are valid. I run hermes because the bundled defaults compound. every project I start with hermes is one where the agent already knows how to do 100+ things before I write a line of configuration. that head start is worth it for me. I've also noticed that hermes doesn't have anywhere near the same issues with the gateway disconnecting or bugging out.
proof is in the numbers Nous Research just hit:

- **#1 on OpenRouter** for global token usage (out of every model and framework on the platform)
- **150,000 github stars** on the hermes repo
- **123 bundled skills** before the agent writes one of its own
- **70+ built-in tools** in the gateway, plus 300+ models through one subscription
- **6 deployment targets**: local, docker, ssh, daytona, singularity, modal
- **20+ messaging surfaces**: telegram, discord, slack, email, voice

if you are an AI marketer and you have not started running hermes yet, you are leaving compounding capability on the table every week.

## how it works (the reader-friendly version)

every hermes agent has three things.

**a brain.** memory lives at ~/.hermes/memories/. two files, MEMORY.md and USER.md, inject at session start. your voice rubric, your brand notes, your customer language, last week's corrections, all of it loads before the first prompt. sessions are stored in sqlite, and recall across sessions is full-text searchable.

**a personality.** SOUL.md is where the vibe lives. concise. sarcastic. blunt. formal. fast or thoughtful. you can spin up six agents and give each one a different soul, same brain underneath. one is your outbound rep with a closer's energy. another is your researcher who likes long sentences. another is your assistant who keeps everything short.

![Image](https://pbs.twimg.com/media/HIYD1qSW8AAVLf1?format=jpg&name=large)

**a skillset.** 123 skills out of the box: github PRs, obsidian, google workspace, linear, notion, typefully, perplexity, deep research, browser control, web scraping, vision, voice, scheduling. and the closed learning loop: as the agent works, it writes new skills along the way. your own skills library grows on top of the 123 without you having to write any of them.

then there is what the agent can talk to.
- the **tool gateway**: one subscription, 300+ models, plus web scraping and browser automation built in
- **MCP integration**: any external service that speaks Model Context Protocol becomes a tool your agent can use
- **20+ messaging surfaces**: telegram, discord, slack, email, voice, plus the CLI itself

![Image](https://pbs.twimg.com/media/HIX_LFwXgAEmDW7?format=jpg&name=large)

and where the agent can live.

- your laptop (local)
- a docker container (isolated, portable, the way I run mine)
- an ssh session on a VPS (so it runs even when your laptop is closed)
- daytona, singularity, modal (serverless if you don't want to manage infrastructure)

the closed learning loop is what makes this different from a smart chatbot. the agent watches itself work, writes new skills as it learns the shape of your work, refines its memory periodically, and recalls past context across sessions using a mix of full-text search and LLM summarization. you do not have to re-teach it next week.

> the rule I tell people new to hermes is this: do not try to write your own skills on day one. run real work, let the agent watch, and let the harness write the skills. you build a custom skill library faster by working than by writing prompts.

## what I'm running on hermes

I am an AI marketer, not a coder. most of what I run on hermes is marketing infrastructure with the occasional internal tool.
here is the actual list:

- **a personal assistant** that handles business and private, lives in telegram, flags the four emails worth reading every morning, schedules my reminders, summarizes meetings I missed
- **a marketing workflow prototyping bench** where I test new flows (lead magnet, ad creative review, content sprint) against real work for 2-3 runs before promoting them
- **specialized marketing agents**: SEO, outbound / BD, design review, content writing, each one with its own soul and its own scope
- **a company brain** that monitors slack, chats, emails, transcripts, voice memos, and makes all of it queryable. when I ask "what did we say to that client about pricing last month" I get the answer in 3 seconds instead of 30 minutes of digging
- **an SEO agent** that runs the full pipeline from keyword seed to published article in one docker container, 21 steps, no human in the middle until the final review
- **a content distribution agent** that takes a piece of long form (this article, for example) and atomizes it across LinkedIn, X, Threads, with platform-specific hooks
- **an orchestrator agent** that does not produce work itself, just routes requests to the right specialist based on what I'm asking for

the blueprint I posted that summarized it:

> May 13
>
> the org chart for my Hermes Agent company
>
> four layers, all isolated docker containers on one vps:
>
> 1. company brain - vision, brand, customers, products. the context every other layer inherits
> 2. orchestrator hermes agent - reads the company brain, picks the right department,

the SEO agent in particular is worth zooming in on, because it is the one I have shipped publicly and the one that maps cleanest to the architecture in the rest of this article. five layers, all inside one docker container, 21 steps from keyword seed to published article.
the 21 steps look like this in the terminal:

```markdown
[research + ideate]
01 keyword seed
02 serp snapshot
03 competitor extraction
04 intent + format analysis
05 content + visual gap
06 internal validation
07 external validation

[production]
08 angle + positioning brief
09 visual strategy brief
10 outline
11 draft
12 image gen
13 flowchart gen
14 visual qa
15 article qa

[distribution]
16 publish prep
17 schema
18 internal linking
19 syndication
20 analytics setup
21 monitoring
```

the layers above this pipeline:

1. **company brain** at the top: vision, brand, audience, products. every agent reads from this
2. **orchestrator hermes agent**: takes the topic or keyword seed and routes it to the seo agent
3. **seo brain**: ranking playbook, voice rules, content formats, visual style guide, success criteria per format. all seo-specific context lives here
4. **three sub-agents** inside the SEO agent, each one handling a phase:
   - **research + ideate**: keyword seed, serp snapshot, competitor extraction, intent and format analysis, content and visual gap, internal and external validation
   - **production**: angle and positioning brief, visual strategy brief, outline, draft, image gen, flowchart gen, visual and article qa
   - **distribution**: publish prep, schema, internal linking, syndication, analytics, monitoring
5. **one docker container** holds all three sub-agents. they share env, memory, and tools. sub-profiles switch context per phase. one process, one filesystem, one set of credentials.

why one container instead of three: seo work is sequential. research feeds the brief, the brief feeds production, production feeds distribution. every step needs memory of what was decided upstream. splitting into three containers means shuttling state across boundaries, which gets expensive and breaks the chain.

every other specialized agent in the company runs on the same template.
clone the SEO agent template, swap the brain (seo brain → outbound brain, or → design brain, or → support brain), and you have a new agent for any function with the same five-layer shape.

> May 15
>
> how my hermes seo agent works in the org chart
>
> it runs the full pipeline from keyword seed to published article, 21 steps, all inside one docker container
>
> the structure:
>
> LAYER 1: company brain
> shared context: vision, brand, audience, products. every agent reads from this

> the layers are not decoration. they are the reason the agent does not lose context as the work gets specialized. the company brain stays stable while the worker iterates. the brain layers make the worker disposable.

I also hosted Nous Research at our [@EspressioAI](https://x.com/@EspressioAI) HQ in Lisbon for a Hermes Agent evening recently. [@yeahfortommy](https://x.com/@yeahfortommy) from Nous ran a Q&A, Simao from noticed .so showed an agent harness with autoresearch, and I walked through how we are using hermes for growth at Espressio.

> April 27
>
> we are hosting @NousResearch for a Hermes Agent evening at the Espressio HQ tomorrow
>
> @touchgrassklub is organizing in collaboration with Talent Protocol
>
> Here's the agenda for the evening:
>
> - I'll start by talking about using Hermes Agent for growth, what we're shipping at

if you are in Lisbon and want to come to the next one, I will post when it's scheduled.

## from one agent to a full fleet

before the levels, the mental model. the setup has four parts:

- **you** are the operator. you have direct access to every part of the system.
- the **agent control room** is the side control plane. it is not an agent you chat through. it is a folder at /root/vps-agents that documents and governs the whole fleet. you open it, edit it, inspect it, or ask claude, codex, or hermes to use it when you are managing the system.
- the **hermes agents** are the workers. some are specialists (seo, dev, cmo, ops). one of them can optionally be an orchestrator.
- the **agent task bus** is an optional handoff desk that sits between the orchestrator and the specialists. you only need it once you have an orchestrator in play.

the whole thing looks like this:

```markdown
                                    ┌───────┐
                                    │  YOU  │  the operator
                                    └───┬───┘
                                        │
          ┌─────────────────────────────┼─────────────────────────────┐
          │                             │                             │
    control path                orchestrated path                direct path
          │                             │                             │
          ▼                             ▼                             ▼
┌────────────────────┐        ┌────────────────────┐        ┌────────────────────┐
│ AGENT CONTROL ROOM │        │       HERMES       │        │  SPECIALIST AGENT  │
│  /root/vps-agents  │        │    ORCHESTRATOR    │        │                    │
│                    │        │  (optional door)   │        │ seo · dev · cmo ·  │
│ docs · rules ·     │        └─────────┬──────────┘        │ ops · life         │
│ runbooks · env-map │                  │ delegates         │                    │
│ · registry         │                  ▼                   │ talk to it         │
│                    │        ┌────────────────────┐        │ directly,          │
│ side control plane │        │   AGENT TASK BUS   │        │ no routing         │
│ no raw secrets     │        │   /srv/agent-bus   │        │                    │
│                    │        └─────────┬──────────┘        │                    │
└────────────────────┘                  │                   │                    │
                                        │ routes            │                    │
                                        └───────────────▶   │                    │
                                                            └────────────────────┘

the agent control room governs every agent in this diagram.
it is the single source of truth, and the place you go to manage the fleet,
not the place you go to run work through it.
```

the storage split matters more than people think:

```markdown
/root/vps-agents  → control room: docs, rules, runbooks, architecture
                    no raw secrets, ever

/srv//data        → live runtime: secrets, memory, skills, sessions, crons
                    this is where each hermes agent lives
```

the control room contains the answers to questions like which agents exist, what they do, what ports they use, what credentials they reference, what each agent can and cannot do, and how to restart, debug, or rebuild any of them. the live runtime contains the actual workings.

> the control room is the brain that defines the system. the live runtime is the body that runs it. you can rebuild the body from the brain. you cannot rebuild the brain from the body.
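the "no raw secrets, ever" rule for the control room is the kind of thing a small script can watch for you. what follows is a rough sketch, not part of hermes or the template: the regex, the 8-character cutoff, and the names `looks_like_raw_secret` and `find_raw_secrets` are all assumptions made for illustration, and a heuristic like this will miss things and raise false alarms.

```python
import re
from pathlib import Path

MIN_SECRET_LENGTH = 8  # assumption: shorter values are probably prose, not keys

# Hypothetical heuristic: "SOMETHING_KEY=value" or "- SOMETHING_TOKEN: value".
SECRET_LINE = re.compile(
    r"^\s*(?:[-*]\s+)?(?:export\s+)?"
    r"([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD)[A-Z0-9_]*)\s*[=:]\s*(\S+)",
    re.IGNORECASE,
)


def looks_like_raw_secret(line):
    """True when a line assigns a literal value to a secret-looking name."""
    match = SECRET_LINE.match(line)
    if match is None:
        return False
    value = match.group(2).strip("\"'`")
    if value.startswith(("$", "<")):
        return False  # a reference like ${VAR} or a <placeholder>, which is fine
    return len(value) >= MIN_SECRET_LENGTH


def find_raw_secrets(control_room):
    """Scan every .md file under the control room; return (path, line_number) hits."""
    hits = []
    for path in sorted(Path(control_room).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if looks_like_raw_secret(line):
                hits.append((str(path), number))
    return hits
```

the env-map is supposed to say which credential an agent references, not what the credential is, so a hit from `find_raw_secrets("/root/vps-agents")` is a prompt to move the value into the agent's runtime `.env` and rotate it.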
inside the control room:

```markdown
/root/vps-agents/
  README.md
  CLAUDE.md
  agents/
    /
      inventory.md
      docker.md
      env-map.md
      runbook.md
      backup.md
  shared/
    security.md
    commands.md
    api-keys-sop.md
    orchestrator-and-fleet-skills.md
```

and inside each agent's runtime at /srv//data/:

```markdown
.env
config.yaml
SOUL.md
memories/
skills/
cron/
sessions/
logs/
state.db
```

## three ways you interact

```markdown
control path:       you ──────► agent control room
                    (add agents, rotate keys, update docs, debug setup)

direct path:        you ──────► hermes-seo-espressio
                    (talk to a specialist directly, fastest)

orchestrated path:  you ──► hermes-orchestrator ──► task bus ──► specialists ──► you
                    (one front door, routes and synthesizes multi-agent work)
```

- **control path** is the meta layer. used when adding agents, reviewing docs, checking ports, rotating keys, debugging setup.
- **direct path** is the fastest. used when you already know which agent does the work.
- **orchestrated path** is the synthesizer. used when you want one front door that routes and combines work across multiple specialists.

## level 1: one agent

you have one hermes agent. that is it. the control room can still exist (recommended) but it only documents that one agent.

```text
you          → one hermes agent
control room → documents that one agent
```

best for: initial setup, your personal hermes, root install documentation, simple docker migration.

one agent, lived-in, with a personality you have tuned and a memory that has started to build. fill in SOUL.md with the voice you want, MEMORY.md with the stable facts about your business, and USER.md with the stable facts about you. connect it to telegram or discord so it lives where you do. start using it on real tasks. let it touch your tools. let it write its own skills along the way.

MEMORY.md holds the stable facts (what your business is, who your customers are, what your products do).
USER.md holds the stable facts about you (timezone, working hours, recurring projects, preferred output formats). both get refined every week as you correct the agent in real conversations.

## level 2: direct specialist agents

you have multiple specialized agents, but you still talk to each one directly. no orchestrator yet.

```markdown
you → hermes-life
you → hermes-seo-espressio
you → hermes-dev
you → hermes-cmo
```

the control room documents all of them.

best for: clear role separation, testing which agents are useful, avoiding premature orchestration, keeping credentials scoped per agent.

> the trap to avoid here is reaching for an orchestrator before you have proven your specialists are useful. spin up two or three, run them directly, and only add an orchestrator when you find yourself wanting one front door.

when to spin up a new agent vs stay with what you have:

```markdown
needs its own credentials                       → new agent
needs its own long-term memory                  → new agent
ongoing repeated work that is a separate role   → new agent
otherwise stay with what you have
```

bad pattern: one mega-agent with every credential and every memory layer mashed together. you lose isolation, you lose the ability to revoke access cleanly, and the agent gets confused about which voice to use.

## level 3: orchestrator + specialists

you add hermes-orchestrator as a front door. you can still talk directly to specialists, but the orchestrator can route work and synthesize results.

![Image](https://pbs.twimg.com/media/HIX8GPyWAAAQd4c?format=jpg&name=large)

the orchestrator reads the control room to know which agents exist, what each one does, where task queues live, what requires approval, which actions are forbidden, and where docs and runbooks are. it does not need to ask you any of that, it reads it.

best for: cross-functional work, delegation, summary and synthesis, one main interface for multi-agent workflows.
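to make "routes work based on what the control room says" concrete, here is a toy sketch of the idea. it is not how hermes routes (the real orchestrator is an agent reading docs, not a keyword matcher); the `AgentEntry` registry, the keyword sets, and the `route` function are all hypothetical stand-ins for what the control room documents.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AgentEntry:
    """One row of a hypothetical registry derived from the control room docs."""
    name: str
    keywords: frozenset = field(default_factory=frozenset)


def route(request, registry, fallback="operator"):
    """Pick the agent whose keywords overlap most with the request.

    Ties keep the earlier registry entry; no overlap at all returns the fallback.
    """
    words = set(re.findall(r"[a-z0-9]+", request.lower()))
    best_name, best_score = fallback, 0
    for agent in registry:
        score = len(words & agent.keywords)
        if score > best_score:
            best_name, best_score = agent.name, score
    return best_name


# Assumed example registry; the agent names come from the article, the keywords do not.
REGISTRY = [
    AgentEntry("hermes-seo-espressio", frozenset({"seo", "keyword", "article", "serp"})),
    AgentEntry("hermes-dev", frozenset({"bug", "deploy", "code", "server"})),
    AgentEntry("hermes-cmo", frozenset({"campaign", "positioning", "budget"})),
]
```

the point of the sketch is the dependency, not the matching: the router only knows what the registry tells it, so a stale or vague registry produces bad routing no matter how smart the router is.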
> the orchestrator is the moment your setup stops being a collection of agents and starts being a team. it is also the moment where the control room earns its keep, because the orchestrator is only as good as the docs it reads.

what a quick check-in on the fleet looks like from my laptop or phone:

```markdown
$ ssh hermes
welcome to hermes-vps-1. last login: thu may 15 09:14:22

hermes-vps-1 ~ $ cd vps-agents
hermes-vps-1 ~/vps-agents $ docker ps --format \
    "table {{.Names}}\t{{.Status}}\t{{.Image}}"
NAMES                  STATUS        IMAGE
hermes-orchestrator    up 14 hours   hermes-runtime
hermes-seo-espressio   up 8 hours    hermes-runtime
hermes-cmo             up 8 hours    hermes-runtime
hermes-outbound        up 4 hours    hermes-runtime
hermes-life            up 12 hours   hermes-runtime

hermes-vps-1 ~/vps-agents $ cat agents/hermes-seo-espressio/runbook.md
# runbook: hermes-seo-espressio
restart: docker compose restart hermes-seo-espressio
logs:    docker logs -f hermes-seo-espressio
shell:   docker exec -it hermes-seo-espressio bash
...
```

> May 13
>
> my entire Hermes Agent setup is controlled from one folder on a VPS
>
> I can manage it from my laptop or phone in under 10 seconds, spin up isolated agents per project, and never lose context
>
> here's the full setup:
>
> - bash command "ssh hermes" auto-connects to the VPS
> - session

## level 4: automated agent team

same shape as level 3, but with recurring workflows and stronger automation. weekly seo reports run on cron. server health checks fire daily. backup verification runs without you asking. cross-agent business workflows kick off on schedule.

best for: weekly seo reports, content operations, server health checks, backup verification, cross-agent business workflows.

> level 4 is what a marketing department in your terminal looks like. it does not need you to start the day. it shows up to work on its own, files reports, checks itself, and only pings you for the decisions that need taste.
![图像](https://pbs.twimg.com/media/HIX8niUXgAE50SA?format=png&name=large) ## the control hierarchy one principle to keep in your head as you climb the levels. the control room is for config, docs, runbooks, and governance. it documents which agents exist, what they do, where they run, which credentials they reference, what each agent can and cannot do. it is the admin panel for the fleet, including the orchestrator. it is not where you go to do work. for work, you talk to the agents directly. either a specialist (when you know which agent owns the job) or the orchestrator (when you want one front door to route across specialists). ## the setup guide: point your agent at the repo now you understand the architecture. here is how you build it. I shipped a public template that holds the exact structure described above, plus the skills your agent needs to set it up for you. it lives at [github.com/shannhk/hermes-agent-control-room](https://github.com/shannhk/hermes-agent-control-room). ![图像](https://pbs.twimg.com/media/HIX9YQEWoAId_oy?format=jpg&name=large) you can clone it manually, but the point is that you do not have to. if you have claude code or codex on your laptop, the agents do most of the work after you hand over a Hetzner API key. 
the automated flow: ```markdown you ──► generate a Hetzner API key (5 min: sign up, generate a token, drop it in your .env) │ ▼ agent ──► create-vps skill spins up a Hetzner box, generates an SSH key, writes the alias to ~/.ssh/config so \`ssh hermes\` works │ ▼ agent ──► setup-control-room skill installs Node, Docker, Claude Code, Codex CLI, Hermes Agent, then clones the repo to the VPS at /root/agent-control-room │ ▼ you ──► finish interactive auth on the VPS (claude /login, codex, hermes) │ ▼ agent ──► agent-control-room skill registers your first hermes agent in the docs, fills in the runbook, sets up the env-map │ ▼ you are at level 1 with a documented agent ``` within ten to fifteen minutes you have: - a fresh Hetzner VPS with the right tooling installed - the control room cloned at /root/agent-control-room on the VPS - the bundled skills linked into ~/.claude/skills on the VPS - one hermes agent registered, runbook filled in, env-map written - an SSH alias on your laptop so ssh hermes connects instantly ## the prototype → production methodology most workflows do not start as production ones. they start messy. a flow that runs SEO research, drafts an article, schedules it in Typefully, and posts it to LinkedIn does not exist in your head fully formed. you discover it by running it. hermes is the prototyping environment for this. here is the four-step path I use to take any new marketing workflow from idea to autonomous deployment: 1. **prototype in hermes**. open your main hermes agent, describe what you want to happen, and let it try. it will get most of it wrong on the first run. that's fine. 2. **run it 2-3 times against real work**, correcting drift each time. the harness watches every correction and starts writing the skill as it learns the shape. by run three the agent is doing most of what you want without coaching. 3. **fine-tune in a dedicated workspace**. 
pull the workflow into a separate Claude Code workspace (or a fresh hermes agent if you prefer), tighten the prompts, lock the routing, add error handling, decide what should run on cron and what should be triggered. 4. **deploy to a VPS on a schedule**. once it survives a week of real runs without you babysitting it, push it to its own docker container on your VPS, set the cron, walk away. I learned this pattern after burning a few weekends trying to write production-ready agents from scratch. you cannot write a production agent from scratch. you have to grow one. hermes makes the growing part fast. ![图像](https://pbs.twimg.com/media/HIX9voFWQAAserb?format=png&name=large) 1. prototype in hermes 2. fine-tune in a dedicated workspace 3. deploy autonomous on a VPS ## the models I run on hermes hermes gives you the framework. the model underneath is your choice. through the tool gateway you can route to 300+ models from one subscription, switching per agent or per task. what I personally run today: - **claude opus 4.7** for the creative work: copywriting, voice, hook generation, content drafting, anything where taste and writing quality matter - **codex (gpt 5.5)** for the structured work: coding, planning, multi-step workflows, browser automation, scraping, anything where the steps need to be tight and the output predictable I run both. opus writes. codex builds and plans. hermes makes routing easy, you point each agent at the model that fits the work it does. if you can only run one, the answer depends on what kind of work your fleet is doing. heavy on content and copy? start with claude opus 4.7. heavy on infrastructure, automation, and engineering workflows? start with codex. you can always add the second model later through the same tool gateway. ## honest trade-offs I am not going to pretend hermes is perfect. three real trade-offs. **1\. 
the bundled defaults are also opinions.** hermes ships with strong defaults for how memory works, how skills get written, how the agent uses tools. that is the whole pitch. but it also means if you want primitives with explicit control over every step, hermes will feel heavy. openclaw is the better fit for that taste. pick the tool that matches your philosophy.

**2\. level 3 and 4 have a real learning curve.** docker, VPS, SSH, the control room folder structure, the orchestrator skills, none of this is "install and go." you should not jump to level 3 if you are not already running hermes at level 1 daily.

**3\. the model still matters.** hermes is a framework that makes a good model great. it does not make a small model into a strategist. use the strongest models you can afford for the work that matters (your orchestrator, your strategy agent, your brain). drop to cheaper models for the work that does not (research scraping, draft generation, batch processing).

> none of this is magic. it is a framework that pays back because the memory persists, the skills accumulate, and the agents stay scoped. apply it to the wrong-sized model and you get a confused team. apply it to the right one and you get a team.

## resources

if you are starting today, here is what I would read in order.

- **the official docs**: [hermes-agent.nousresearch.com/docs](https://hermes-agent.nousresearch.com/docs). start with the install guide, then read the skills page so you understand what ships out of the box
- **the control room template (my repo)**: [github.com/shannhk/hermes-agent-control-room](https://github.com/shannhk/hermes-agent-control-room). the exact structure I described above, ready to clone. control-room-first template for managing hermes agents from one VPS agent to specialist teams and orchestrated workflows.
fork it and make it yours - [hermesatlas.com](https://hermesatlas.com/): the community-curated map of 100+ open source tools, plugins, workspaces, and integrations built on hermes. categorized by domain (memory providers, workspaces, skill registries, deployment, orchestration). also includes the Hermes Handbook, a beginner-friendly walkthrough. weekly updates, free newsletter - [@Teknium](https://x.com/@Teknium) **on X**: Nous Research founder. ships hermes updates almost daily. the codex runtime integration, the DeepSeek V4 Flash free tier on Nous Portal, the pretext skills, all came through his feed first - [@NousResearch](https://x.com/@NousResearch) **on X**: the official account, official feature announcements - **the meetups**: there are hermes meetups happening in person now (Lisbon, Ventura, more cities). worth showing up if there's one near you. you learn more in 90 minutes of side conversations than in a week of reading ![图像](https://pbs.twimg.com/media/HIYBywiXMAAxMZi?format=jpg&name=large) Hope you got some value out of this, appreciate you reading the whole thing. \-- Shann

0. Ask the most dangerous question first

Mon, 11 May 2026 07:42:30 GMT

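OK. Switching to the Linus / Musk / Mario point of view and cutting straight in.

---

## 0. Ask the most dangerous question first

> **Does this harness really need to exist?**

From first principles, the answer does not start with "how do we design the harness". It starts with:

> **What is the minimum an agent needs, physically, informationally, and organizationally, to change code reliably?**

The answer is three things:

1. **An isolated environment that can run code and tests** (otherwise nothing can be verified);
2. **Version control that can produce diffs and roll back** (otherwise the work cannot converge and cannot be safe);
3. **A human who can say "yes/no" at key points** (otherwise nobody carries responsibility).

**Just these three.** Everything else is engineering decoration. Those 13 modules in your previous version? **Assume none of them exist.** Then add them back one at a time; any module you cannot justify adding back is one that should be deleted.

> Musk: "If you don't end up adding back at least 10% of what you deleted, you didn't delete enough."

---

## 1. Cut the previous architecture; first list what "looks necessary"

The modules in the previous version:

1. Task Intake & Contract Builder
2. Agent Kernel
3. Self-Reflection Ledger
4. ACI / Tool Registry
5. Sandbox
6. Policy Engine
7. Tool Gateway
8. Verification Engine
9. Auto-Repair Loop
10. Human Review Console
11. PR Workflow
12. Repo Memory / AGENTS.md
13. Observer / Stability Monitor / Risk Scorer / Learning Loop / ...

**They all sound right. But 90% of it is "polishing a turd".** Interrogate each one from first principles: **without it, can the agent still complete the task?**

---

## 2. Cut them one by one

### ❌ Cut: Task Intake & Contract Builder

**Challenge**: why do you need a "contract builder"? The issue itself is the contract. If the acceptance criteria are unclear, that is a **human** problem, not a harness problem. Building a contract builder just wraps the fact that "people can't write clear requirements" into a component.

**Minimal core to keep**: when the agent starts, it must be able to answer one question:

> "Which command returning 0 is the criterion for done?"

If it cannot answer, stop and have a human fill it in. **That is the entire intake logic.** No schema, no builder, no service. One field: `done_when: `.

---

### ❌ Cut: Self-Reflection Ledger (that JSON with a dozen-plus fields)

**Challenge**: does the agent really need to maintain assumptions, confidence, context_map, risk.score and the rest? No. These are **placebos for humans**. Physically, only two things drive the agent's next decision:

1. **The output of the previous action** (test output, compiler errors, diff);
2. 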
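**Whether the goal has been reached** (whether `done_when` returns 0).

The LLM reasons about these inside its own context. Structuring them into 20 fields only makes the dashboard look good.

**Minimal core to keep**: an append-only event stream.

```
event = {ts, action, output, done_check_result}
```

That is all. **No confidence, no risk score, no assumption list.** Grep it when you need it.

---

### ❌ Cut: the complex YAML of Policy Engine + Tool Gateway

**Challenge**: you listed 20 allow/ask/deny rules. Physically, what makes an action dangerous? From first principles there is only one definition:

> **An action that leaves the sandbox and affects the outside world.**

Not "editing the auth directory". Editing auth inside the sandbox is fine; if it breaks, roll back. **It is not dangerous.** What is actually dangerous:

- `git push` to a protected branch
- network egress to a destination that is not on the allowlist
- reading secrets
- calling production APIs
- merge / deploy

**All of these sit on the sandbox boundary.** So the policy engine does not need 200 lines of YAML. What it needs is:

> **By default nothing can leave the sandbox. Whatever must leave goes on an allowlist.**

The allowlist is about 5 lines:

```
allowed_egress:
  - github.com/api (read-only token)
  - npm/pypi registry
allowed_writes:
  - branch: agent/*
forbidden:
  - everything else
```

**Default deny. This one rule replaces the entire policy matrix of your previous version.**

> Linus: "Your policy matrix has a taste problem. 20 rules means you have not figured out where the boundary is."

---

### ❌ Cut: Stability Monitor / Lyapunov V / Risk Scorer

**Challenge**: do you really want to compute a Lyapunov function for an agent? No. That is the control-theory metaphor taken too far. Physically, an agent loses control in only three ways:

1. **The same test fails N times in a row** (oscillation);
2. **The diff keeps growing** (divergence);
3. **The token / time / $ budget is exceeded** (resource exhaustion).

Three if statements cover it:

```python
# Pseudocode: each condition has its own budget; any hit hands control to a human.
if same_check_failed_n_times(3):
    stop_and_ask_human()
if diff_lines > diff_budget:
    stop_and_ask_human()
if cost > cost_budget or elapsed > time_budget:
    stop_and_ask_human()
```

**No V function, no 7 weights, no risk scorer.** Three counters.

---

### ❌ Cut: Auto-Repair Loop as a separate module

**Challenge**: what is the essential difference between repair and normal implementation? None. Both are "look at the observation, decide the next step". Making it a separate module is engineer's compulsion. When the agent sees a failing test inside its loop, it will fix it on its own. **The only thing needed is a retry budget.**

**Minimal core to keep**: one of the three counters above. **No FailureKind enum, no classifier, no hypothesis generator.** The LLM already does all of this; do not do it on its behalf.

---

### ❌ Cut: Observer / State Estimator / Context Model / Risk Model / Human Model / ...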
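**Challenge**: do these "models" physically exist? No. They are documentation, not runtime objects. The LLM's context window is its state estimator. Building an external state estimator means competing with the LLM for its own job, and doing it worse.

**Minimal core to keep**: only the information the LLM cannot obtain by itself:

- the output of the last command run (already in the event stream);
- the repo's root-level conventions (one `AGENTS.md`);
- the current diff (`git diff` is enough).

**Just these three. No Repo Model, no Context Model.**

---

### ❌ Cut: Specialist Agents (Reviewer / Tester / Security / Docs sub-agents)

**Challenge**: why do you need multiple agents? In 90% of cases you don't. Multi-agent is the fashionable disease of 2024. There is only one situation that truly needs "another perspective": **the current agent is stuck and needs fresh context to look at the problem again.** At that point, open a new session and feed it the failure ledger. **That is not a sub-agent, that is retry with reset.**

**Minimal core to keep**: one `reset_and_retry(with_summary)` action. **No ReviewerAgent class, no SecurityAgent class.**

---

### ❌ Cut: Evidence Pack / Review Cockpit / complex HITL UI

**Challenge**: what does a human physically need when reviewing? They need:

1. **the diff**;
2. **the test results**;
3. **one statement from the agent: what I did, why, and where I got stuck**.

**That is a GitHub PR.** It has existed for 15 years. Do not build a review cockpit. Do not build an evidence pack schema. **Let the agent open a PR and let the human review in the PR.** Nobody will use any dashboard you build a second time.

> Linus: "GitHub PR already solved this. You are reinventing it just to add a line to your résumé."

---

### ❌ Cut: Learning Loop / Eval Pipeline / Trajectory Store (at the v0/v1 stage)

**Challenge**: you have not yet gotten 10 tasks to run end to end, so what would a learning loop learn? First run 100 real tasks, look at the failure modes **by hand**, and change the prompt and AGENTS.md **by hand**. Automate only once doing it by hand stops scaling.

> Musk, step 5: "Automation goes last. Do not automate the wrong process first."

---

## 3. What is left after the cuts?

```
┌─────────────────────────────────────────────────┐
│ 1. Sandbox (container + git worktree)           │
│    - default-deny egress                        │
│    - can only push to agent/* branches          │
├─────────────────────────────────────────────────┤
│ 2. Agent loop (a single while loop)             │
│    while not done_when_passes():                │
│      action = llm.decide(history, AGENTS.md)    │
│      output = run_in_sandbox(action)            │
│      history.append((action, output))           │
│      if stuck(): break_and_open_pr()            │
├─────────────────────────────────────────────────┤
│ 3. Tools (5, not 15)                            │
│    - shell (in sandbox)                         │
│    - read_file / write_file (=patch)            │
│    - run_tests (= shell, explicit semantics)    │
│    - open_pr                                    │
│    - ask_human(question)                        │
├─────────────────────────────────────────────────┤
│ 4. 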
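Three if statements (stuck detection)        │
│    - same failure x3                            │
│    - diff > budget                              │
│    - cost/time > budget                         │
├─────────────────────────────────────────────────┤
│ 5. GitHub PR (the review UI, free)              │
├─────────────────────────────────────────────────┤
│ 6. AGENTS.md (repo-level config, one file)      │
└─────────────────────────────────────────────────┘
```

**Just these 6 things.** It can run in under 500 lines of code.

---

## 4. One table: before vs after the cuts

| Previous version | After cutting | Reason |
|---|---|---|
| Task Intake & Contract Builder | one `done_when` field | the issue is the contract |
| Self-Reflection Ledger (20 fields) | append-only event stream | the LLM reflects on its own |
| Policy Engine (200 lines of YAML) | sandbox boundary + 5-line allowlist | danger = leaving the sandbox |
| Tool Gateway + Registry | 5 tool functions | no registry abstraction needed |
| Verification Engine | the `done_when` command | one shell command |
| Auto-Repair Loop | the same loop | repair is not a separate phase |
| Stability Monitor / Lyapunov | 3 counters | no control-theory metaphor needed |
| Observer / State Estimator | git diff + event stream | the LLM context is the state |
| Repo / Risk / Human Model | AGENTS.md | one markdown file |
| Specialist sub-agents | reset_and_retry | multi-agent is a disease |
| Review Cockpit / Evidence Pack | GitHub PR | already exists |
| Learning Loop / Eval | by hand first | automation goes last |
| Trajectory Store / OpenTelemetry | event stream written to a file | wait until there is scale |
| Risk-Adaptive Autonomy (v2) | delete | premature abstraction |

**Module count: 13 → 6. Roughly 80% of the code is gone. Functionality lost: 0.**

---

## 5. What gets added back (≤10%)

After the cuts, run it for a while. **When it genuinely is not enough**, add things back in this order:

1. **The first thing to come back is almost certainly a query UI for the event stream**, because you will need to debug failed cases. But it is `grep + jq`, not a dashboard.
2. **The second thing that may come back: a cost dashboard**, because the bill will hurt.
3. **The third thing that may come back: a protected path list.** But only after you have **actually been burned once**, not as a preventive 200 lines of YAML.
4. **Eval set**: when you change a prompt and cannot tell whether there was a regression.

**Note: Specialist agents, Risk Scorer, Stability Monitor, and Contract Builder will most likely never come back.** That is the evidence that they deserved to be deleted.

---

## 6. A few hard rules from first principles

After the cuts, keep these as discipline against relapse:

1. **For any module, prove the system breaks without it before adding it.** "It would be better to have it" is not a reason.
2. **For any abstraction layer, have 3 concrete use cases before abstracting.** Not because it might be useful someday.
3. **For any schema, start with few fields and grow.** Adding fields is easy, deleting them is hard.
4. **For any policy, default deny.** An allowlist is always shorter than a denylist.
5. **For any dashboard, use grep first.** Build a UI only when grep is no longer enough.
6. **For any sub-agent, use reset first.** Build a sub-agent only when reset is no longer enough.
7. **For any automation, run it by hand 100 times first.** Automate once doing it by hand has become unbearable.

---

## 7.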
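A challenge for you

Go back to the 13 modules in your previous version. For each one, answer three questions:

1. **If you delete it, what is the first concrete scenario that breaks?** Cannot write one down → delete.
2. **How many times a year does that scenario happen?** < 10 → delete, and handle it by hand when it happens.
3. **Can the LLM do this itself inside the prompt?** Yes → delete, and let the LLM do it.

I bet you can cut another 30%. After that: **write 500 lines that actually run, run 50 real tasks, and then come back to talk about architecture.**

> Talk is cheap. Show me the harness running on real PRs.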
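The loop and the three stuck checks from section 3 can be sketched in a few dozen lines. This is a minimal sketch under stated assumptions, not the author's implementation: `decide`, `run_in_sandbox`, `done_check`, and `diff_lines` are hypothetical callables supplied by the caller, `done_check` is assumed to return an empty string when the `done_when` command returns 0 and a failure message otherwise, and a step cap stands in for the cost/time budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Budgets:
    max_same_failure: int = 3  # oscillation: same failure seen N times in a row
    max_diff_lines: int = 400  # divergence: the diff keeps growing
    max_steps: int = 50        # stand-in for the cost/time budget


def run_agent(
    decide: Callable[[List[Tuple[str, str]]], str],
    run_in_sandbox: Callable[[str], str],
    done_check: Callable[[], str],
    diff_lines: Callable[[], int],
    budgets: Budgets = Budgets(),
) -> str:
    """Loop until done_check passes or one of the three counters trips."""
    history: List[Tuple[str, str]] = []  # append-only event stream
    last_failure: Optional[str] = None
    streak = 0
    for _ in range(budgets.max_steps):
        failure = done_check()
        if failure == "":
            return "done"
        # Counter 1: the same failure repeated.
        streak = streak + 1 if failure == last_failure else 1
        last_failure = failure
        if streak >= budgets.max_same_failure:
            return "stuck: same failure repeated"
        # Counter 2: the diff is over budget.
        if diff_lines() > budgets.max_diff_lines:
            return "stuck: diff over budget"
        action = decide(history)
        history.append((action, run_in_sandbox(action)))
    # Counter 3: the step budget is exhausted.
    return "stuck: step budget exhausted"


# Toy run: a fake agent whose second patch makes the check pass.
state = {"attempts": 0}

def fake_sandbox(action: str) -> str:
    state["attempts"] += 1
    return "applied"

result = run_agent(
    decide=lambda history: "patch",
    run_in_sandbox=fake_sandbox,
    done_check=lambda: "" if state["attempts"] >= 2 else "test_foo failed",
    diff_lines=lambda: 10,
)
print(result)  # done
```

In a real harness, any `stuck:` result would become the one-line explanation in the PR the agent opens, which keeps the human gate in GitHub rather than in a custom UI.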

1. Treat the coding agent as a control system

Mon, 11 May 2026 07:35:23 GMT

我的最大启发是： > **Coding agent harness 不应该被设计成“流程编排器”，而应该被设计成一个面向不完全确定智能体的“分层反馈控制系统”。** > Agent 是主控制器，harness 是观测器、约束器、安全联锁和人机协同外环；人是慢速高价值反馈源，而不是每一步的按钮审批员。下面把钱学森《工程控制论》的思想映射到 coding agent harness。 --- ## 1. 把 coding agent 看成控制系统可以这样建模： | 控制论概念 | Coding agent harness 中的对应物 | |---|---| | 参考输入 `r` | 用户任务、issue、acceptance criteria | | 被控对象 `P` | 代码库、sandbox、测试环境、CI、依赖系统 | | 控制器 `C` | Agent kernel，自主决定 workflow | | 执行器 | edit、patch、bash、test、git、PR 等工具 | | 输出 `y` | diff、测试结果、lint、typecheck、CI、PR 状态 | | 误差 `e` | 当前实现与目标之间的差距 | | 反馈 | 测试结果、编译错误、review comment、CI failure | | 扰动 `d` | 需求不清、隐藏依赖、flaky test、LLM 幻觉、环境不一致 | | 观测器 | run state、trajectory log、failure ledger、risk estimator | | 人类外环 | reviewer、approver、product/architecture decision maker | 所以 agent harness 的本质不是： ```text step1 -> step2 -> step3 -> step4 ``` 而是： ```text 目标 -> agent 决策 -> 工具动作 -> 环境反馈 -> 状态估计 -> 自修复/升级/继续 ``` 也就是一个闭环系统。 --- ## 2. “稳定性第一”：自治必须有边界钱学森强调控制系统首先要稳定。对应到 coding agent，稳定性不是数学上不发散，而是： - 不无限循环； - 不越改越大； - 不为了修一个测试破坏更多模块； - 不绕过测试； - 不修改敏感路径； - 不擅自 merge / deploy； - 不把不确定性伪装成完成。因此 harness 里应该有一个 **Progress / Stability Monitor**。可以定义一个类似 Lyapunov 函数的风险-误差度量： ```text V = w1 * 未满足验收条件数量 + w2 * failing checks 数量 + w3 * diff 范围复杂度 + w4 * 风险路径修改程度 + w5 * 未验证假设数量 + w6 * retry 消耗 + w7 * agent 自信度不足 ``` 每轮 agent 动作之后，harness 评估： ```text V 是否下降？风险是否扩大？是否触碰 hard constraint？是否进入重复失败模式？ ``` 如果 `V` 连续几轮不下降，就不应该继续让 agent 盲目修，而应进入： 1. 缩小问题； 2. 回滚到 checkpoint； 3. 重新计划； 4. 请求 human review。这就是控制论里的稳定性约束。 --- ## 3. 
“反馈是灵魂”：测试、CI、review 都是反馈信号 Coding agent 最大的错误设计，是把 LLM 当成一次性生成器。控制论视角下，它必须是反馈驱动的： ```mermaid flowchart LR Goal[Task / Acceptance Criteria] --> Agent[Agent Controller] Agent --> Action[Tool Action / Patch] Action --> Env[Repo / Sandbox / CI] Env --> Feedback[Test / Lint / Diff / Error] Feedback --> Observer[State Estimator] Observer --> Agent ``` 因此 harness 应该强制 agent 不断回答： - 当前目标是什么？ - 现在观察到了什么？ - 哪个假设被证实？ - 哪个假设被推翻？ - 下一步实验是什么？ - 当前风险是否变大？ - 是否需要人类判断？这就是我之前提到的 **Self-Reflection Ledger**。它不是让模型暴露完整思维链，而是保存工程化状态摘要： ```ts type ReflectionLedger = { current_goal: string; observed_facts: string[]; assumptions: { text: string; confidence: "low" | "medium" | "high"; evidence?: string; }[]; failed_attempts: { action: string; result: string; suspected_cause: string; }[]; next_experiment: string; risk_change: "lower" | "same" | "higher"; need_human?: { reason: string; decision_needed: string; }; }; ``` 这相当于 agent 的“状态估计器”。 --- ## 4. “不完全确定系统”：LLM agent 天然就是不确定控制对象钱学森特别关注不完全确定系统，这一点对 agent harness 极其关键。 Coding agent 面临的不确定性包括： | 不确定性 | 表现 | |---|---| | 需求不确定 | issue 写得模糊，验收标准缺失 | | 环境不确定 | 本地测试和 CI 不一致 | | 代码库不确定 | 隐藏约定、文档过期、历史 debt | | 模型不确定 | hallucination、过度自信、上下文遗漏 | | 工具不确定 | flaky test、package install 失败 | | 人类偏好不确定 | reviewer 风格、架构偏好、产品取舍 | 所以 harness 要走 **鲁棒控制** 思路： - 默认不信单一信号； - 测试、静态分析、diff review 多重验证； - 对高风险动作加安全裕度； - 对不确定需求要求 human clarification； - 对失败修复设置 retry budget； - 对 agent confidence 低的地方自动升级； - 对敏感路径使用更严格 policy。核心不是让 agent 永远正确，而是让系统在 agent 不完全可靠时仍然整体可靠。 --- ## 5. “可观测性”：看不见就无法控制如果 harness 不能观察 agent 的状态，就无法安全放权。所以每个 run 至少要观测： - agent 看过哪些文件； - 为什么认为这些文件相关； - 改了哪些文件； - diff 范围多大； - 跑了哪些测试； - 哪些测试失败； - 是否新增测试； - 是否触碰 protected path； - 是否出现循环行为； - 是否修改了任务目标； - 是否有未验证假设。一个很重要的判断： > 如果 acceptance criteria 没有对应的可观测验证方式，这个任务对 agent 来说就是“不可观测系统”。此时 agent 不应该强行完成，而应该： 1. 先补测试； 2. 生成验证计划； 3. 请求人类确认验收标准； 4. 或把 PR 标记为需要人工重点 review。 --- ## 6. 
“可控性”：agent 不是万能控制器控制论里，系统可控才谈得上控制。对应到 coding agent： - 如果工具权限不够，agent 不可控； - 如果没有测试命令，质量不可控； - 如果依赖服务无法启动，验证不可控； - 如果任务涉及产品决策，agent 不可控； - 如果需要线上数据但无安全访问方式，agent 不可控； - 如果修改范围超出 policy，agent 不可控。所以 harness 应该有一个 **controllability check**： ```text 这个任务是否能在当前 sandbox + 当前工具 + 当前权限 + 当前验证手段下完成？ ``` 如果不能，agent 应该升级，而不是乱试。 --- ## 7. “采样控制”：不是每一步都让人看，而是在关键采样点介入采样控制思想非常适合 human-in-the-loop。人不应该连续介入，因为人类反馈成本高、速度慢。应该采用 **事件触发式采样**： ### 自动运行的采样点 - 每次工具调用后记录 observation； - 每次 patch 后更新 diff summary； - 每次测试后更新 failure ledger； - 每次 commit 后记录 checkpoint。 ### 需要人类采样的关键点 - 修改 protected path； - 新增依赖； - 修改数据库 schema； - 修改 auth/payment/security； - 扩大 scope； - 多次 auto-repair 失败； - 需要产品/架构取舍； - 准备从 draft PR 转 ready； - 最终 merge。这就符合你的要求： > 人在 loop，但主要是 review，不是所有步骤都审批。 --- ## 8. “分层控制”：agent 主导 workflow，harness 控制边界，人类控制方向我建议采用三层控制结构。 ```mermaid flowchart TD Human[Human Supervisor
目标/取舍/最终 review] --> Harness[Harness Supervisor
policy/risk/checkpoint] Harness --> Agent[Agent Controller
自主 workflow] Agent --> Tools[Tools / Actuators] Tools --> Repo[Repo / Sandbox / CI] Repo --> Observer[Observer / Feedback] Observer --> Agent Observer --> Harness Harness --> Human ``` ### 第一层：agent 内环高频、自动、自主。负责： - 查代码； - 制定计划； - 修改代码； - 运行测试； - 修复失败； - 写 PR summary。 ### 第二层：harness 监督环中频、规则化、确定性。负责： - policy gate； - risk scoring； - checkpoint； - retry budget； - sandbox isolation； - allowed/ask/deny； - trajectory logging。 ### 第三层：human 外环低频、高价值。负责： - 需求澄清； - 架构选择； - 风险确认； - review； - merge； - 事后调整 policy。这样 agent 仍然是 workflow 的主导者，但系统整体不会失控。 --- ## 9. “模型即对象”：agent 操作的不是代码，而是代码系统的模型钱学森强调数学建模。对 coding agent harness 来说，不能只给 agent raw repo。应该维护多个模型： ```text Task Model 任务目标、验收条件、非目标 Repo Model 模块、依赖、owner、protected path Risk Model 哪些改动危险，哪些需要审批 Progress Model 当前完成度、失败点、剩余 gap Context Model 相关文件、相关 symbol、调用链 Human Model reviewer 偏好、项目规范、历史反馈 ``` Agent 每一步不是“凭感觉写代码”，而是在这些模型上做决策。这也解释了为什么 `AGENTS.md` 很重要。它相当于这个控制系统的局部参数配置： ```md # AGENTS.md ## Build pnpm test pnpm typecheck ## Protected areas Ask before editing: - auth/** - payments/** - infra/** - migrations/** ## Review expectations - Add tests for bug fixes - Keep PR small - Do not introduce new dependency without approval ``` --- ## 10. “从特殊到一般，再回到特殊”：harness 要沉淀通用控制规律钱学森的技术科学路径对 harness 设计很有启发： 1. 从具体 repo、具体 bug、具体 PR 中观察 agent 失败模式； 2. 抽象出通用策略； 3. 写入 harness policy、eval、tool design； 4. 再回到具体 repo 中应用。例如： | 具体失败 | 抽象规律 | harness 改进 | |---|---|---| | agent 改太多文件 | scope 失控 | diff budget + risk escalation | | agent 反复修同一测试 | 闭环振荡 | retry cap + replan gate | | agent 绕过测试 | 目标函数错误 | forbid disabling tests | | agent 不理解 repo 惯例 | 模型不完整 | AGENTS.md + repo memory | | agent 看不到 CI 差异 | 观测不足 | CI feedback ingestion | | reviewer 总提同类意见 | 人类反馈未沉淀 | review comment classifier | 这样 harness 会越用越强，而不是每次从零开始。 --- ## 11. 
“最优控制”：优化目标不是只让 agent 完成任务 Agent harness 的目标函数不应该是： ```text maximize task completion ``` 而应该是多目标优化： ```text maximize correctness minimize risk minimize unnecessary human interruption minimize diff size minimize time minimize cost maximize maintainability maximize reviewer trust ``` 也就是说，agent 不是“尽快写完代码”，而是要在约束下找到最优控制策略。例如： - 低风险文档修改：agent 可以直接改、测试、开 draft PR； - 中风险 bug fix：agent 自动修，但必须跑测试； - 高风险 auth 修改：先生成方案，请人 review，再动手； - 需求不清：不要写代码，先 ask human； - CI 多次失败：不要继续乱修，生成 failure report。 --- ## 12. 可以把 harness 重构成这几个控制论模块我会在原架构上增加这些模块： ```text 1. Reference Manager 管理任务目标、验收标准、非目标。 2. Observer / State Estimator 从工具输出、diff、测试、CI 中估计当前状态。 3. Agent Controller 由 agent 自主决定下一步 workflow。 4. Supervisory Controller harness 的确定性控制层：allow / ask / deny。 5. Stability Monitor 检测循环、发散、scope creep、风险上升。 6. Repair Controller 管理自动修复、retry budget、rollback。 7. Human Feedback Interface 在关键节点请求 review、approval、clarification。 8. Learning / Identification Loop 从历史 run 中更新 policy、AGENTS.md、eval。 ``` 核心循环可以是： ```python while not done: y = observe_workspace() x = estimate_state(y) e = compare_with_acceptance_criteria(x, task) V = compute_progress_risk_score(x, e) if violates_hard_constraints(x): rollback_or_stop() ask_human() if not observable(task, x): agent.add_tests_or_request_clarification() if not controllable(task, tools, policy): escalate_to_human() action = agent.decide_next_action( task=task, state=x, error=e, risk=V, available_tools=tools, ) gate = supervisor.evaluate(action, state=x, risk=V) if gate == "allow": result = execute(action) elif gate == "ask": result = ask_human_with_evidence_pack(action, x) else: result = deny_and_return_reason(action) update_state(result) if progress_not_improving_for_n_steps(): replan_or_revert_or_escalate() ``` --- ## 13. 
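The most important architectural insight

I would adopt this sentence as the design principle:

> **Giving the agent autonomy does not mean leaving the system uncontrolled; real autonomy comes from closed-loop feedback, state observation, stability constraints, and layered supervision.**

So the best coding agent harness is not a workflow engine that dictates how the agent takes each step. It should be:

```text
An engineering control system for an agent that is not fully deterministic.
```

In which:

- the agent decides the workflow;
- the harness provides observability, controllability, and stability;
- tests/CI/review provide the feedback;
- the human is the outer-loop supervisor;
- policy is the safety boundary;
- checkpoint/rollback is the stabilizing mechanism;
- the failure ledger is the system's introspection;
- evals and historical trajectories are how the system learns.

In one sentence:

> **Seen through Qian Xuesen's control-theory thinking, the core of a coding agent harness is not "orchestrating the agent" but "controlling the engineering closed loop the agent lives in": let the agent decide autonomously under feedback, repair automatically under constraints, and accept high-value human correction at key points.**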
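The stability constraint from section 2 (a Lyapunov-like score `V` that should keep decreasing) can be made concrete with a small sketch. It assumes a hypothetical `Snapshot` record with one field per term of `V`; the weights are placeholder values chosen for illustration and the field names are not taken from any real harness.

```python
from dataclasses import astuple, dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical per-round observation; one field per term of V, in order."""
    unmet_criteria: int
    failing_checks: int
    diff_complexity: float
    risky_path_edits: int
    unverified_assumptions: int
    retries_used: int
    confidence_gap: float  # 0.0 = fully confident, 1.0 = no confidence


# Assumed weights w1..w7; a real harness would tune these per repo.
WEIGHTS: Tuple[float, ...] = (5.0, 3.0, 1.0, 4.0, 1.0, 0.5, 2.0)


def progress_risk_score(s: Snapshot, weights: Sequence[float] = WEIGHTS) -> float:
    """V = sum of w_i * term_i over the seven terms."""
    return sum(w * x for w, x in zip(weights, astuple(s)))


def stalled(scores: Sequence[float], patience: int = 3) -> bool:
    """True when V has not decreased in any of the last `patience` rounds."""
    if len(scores) <= patience:
        return False
    recent = scores[-(patience + 1):]
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))


v = progress_risk_score(Snapshot(2, 1, 3.0, 0, 1, 0, 0.5))
print(v)                          # 18.0
print(stalled([10, 8, 8, 9, 9]))  # True: replan, roll back, or ask a human
print(stalled([10, 8, 8, 9, 7]))  # False: V dropped in the last round
```

When `stalled` returns true, the harness would pick one of the four responses listed in section 2 (shrink the problem, roll back to a checkpoint, replan, or request human review) instead of letting the agent keep patching.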

0. Key judgments after the research

Mon, 11 May 2026 07:23:36 GMT

下面是一套**Agent 主导的 coding agent harness 架构**。核心思想：**workflow 由 agent 动态决定；harness 只提供安全边界、工具接口、持久化、验证、审批与可观测性**。人不是每步批准，而是作为 **reviewer / approver** 出现在关键风险点。 ## 0. 调研后的关键判断 Anthropic 把“workflow”和“agent”区分得很清楚：workflow 是预定义代码路径，agent 是由 LLM 动态决定过程和工具使用；OpenAI 的 agent 指南也强调 agent 应能管理 workflow、判断完成、必要时自我纠正或把控制权交还给人。你的需求应采用后者：**agent owns workflow, harness owns constraints**。([anthropic.com](https://www.anthropic.com/engineering/building-effective-agents?__from__=talkingdev)) Coding agent 的关键不是把 shell 全放开，而是设计好 **Agent-Computer Interface / ACI**。SWE-agent 的经验是：专用文件查看器、代码搜索、受控 edit、edit 时 linter、明确空输出提示，比裸 bash 更适合 agent。([github.com](https://github.com/SWE-agent/SWE-agent/blob/main/docs/background/aci.md)) Human-in-the-loop 应做成可持久化的 interrupt / approval：运行可以暂停、保存状态、等待人批准或修改后继续；OpenAI Agents SDK 和 LangGraph 都支持这种“暂停—审批—恢复”模式。人介入的典型触发器应是**高风险动作**和**超过失败阈值**，而不是每个动作都问。([openai.github.io](https://openai.github.io/openai-agents-python/human_in_the_loop/)) --- ## 1. 总体架构 ```mermaid flowchart TD U[User / Issue / Ticket] --> Intake[Task Intake & Contract Builder] Intake --> Kernel[Agent Kernel: workflow self-director] Kernel --> State[Run State / Self-Reflection Ledger] Kernel --> ACI[Agent-Computer Interface] Kernel --> Critics[Specialist Agents as Tools
Reviewer / Tester / Security / Docs] ACI --> ToolGateway[Tool Gateway + Policy Engine] ToolGateway -->|allow| Sandbox[Ephemeral Sandbox / Worktree / Container] ToolGateway -->|ask| HITL[Human Review / Approval Console] ToolGateway -->|deny| Kernel Sandbox --> Verifier[Verification Engine
tests / lint / typecheck / build / security] Verifier --> Kernel Kernel --> Repair[Auto-Repair Loop] Repair --> ACI Kernel --> PR[Branch / Commit / Draft PR] PR --> CI[CI + Checks] CI --> HITL HITL -->|approve / request changes / edit plan / take over| Kernel HITL -->|final approve| Merge[Human Merge / Release Gate] State --> Obs[Trajectory Store / Audit / Metrics / Evals] Sandbox --> Obs ToolGateway --> Obs HITL --> Obs ``` **一句话分工：** | 层 | 责任 | |---|---| | Agent Kernel | 自己决定下一步：查代码、制定计划、编辑、测试、修复、请 reviewer、开 PR | | Harness | 提供受控工具、沙箱、权限、checkpoint、验证、审批、日志 | | Human | 只 review 关键节点：高风险计划、敏感工具、最终 PR、失败升级 | --- ## 2. 核心组件设计 ### 2.1 Task Intake & Contract Builder 输入可以是 issue、Jira、Linear、Slack、人工 prompt。第一步不是直接写代码，而是生成结构化任务合同： ```ts type TaskSpec = { task_id: string; goal: string; repo: string; branch_base: string; acceptance_criteria: string[]; non_goals: string[]; risk_hints: string[]; required_checks: string[]; human_contact?: string; }; ``` **自动继续条件：** - 需求清楚； - 影响范围低； - acceptance criteria 可测试； - 不涉及安全、权限、支付、数据迁移、生产配置。 **请求人类澄清条件：** - 需求模糊； - 目标和现有行为冲突； - agent 发现多个可行方案且 trade-off 明显； - 需要产品决策。 --- ### 2.2 Agent Kernel：让 agent 主导 workflow Agent Kernel 不应是固定 DAG，而应是一个“可约束的自治 loop”。 ```python while not state.done and budget.remaining(): state = load_run_state() action = agent.decide_next_action( task=state.task, repo_summary=state.repo_summary, plan=state.plan, failures=state.failure_ledger, tools=tool_registry.available_tools(), policy_summary=policy.visible_rules(), ) decision = policy.evaluate(action, state) if decision.type == "deny": observation = ToolObservation.denied(decision.reason) elif decision.type == "ask_human": checkpoint(state, action) human_decision = approval_service.interrupt(action, evidence_pack(state)) observation = apply_human_decision(human_decision) else: observation = tool_gateway.execute(action) state = agent.reflect_and_update_state(state, action, observation) if verifier.should_run(action, state): report = verifier.run(state) state = 
agent.reflect_and_repair_or_continue(state, report) ``` 重点：**下一步由 agent 决定**，但每个动作都经过 `Policy Engine` 和 `Tool Gateway`。 --- ### 2.3 Self-Reflection Ledger：可自省，但不要存 raw chain-of-thought 建议保存的是**可审计的工程推理摘要**，而不是完整隐藏思维链。 ```ts type AgentRunState = { phase: | "intake" | "discover" | "plan" | "implement" | "verify" | "repair" | "review" | "done" | "escalated"; current_goal: string; plan: Array<{ id: string; description: string; status: "todo" | "doing" | "done" | "blocked"; evidence_refs: string[]; }>; assumptions: Array<{ text: string; confidence: "low" | "medium" | "high"; validated_by?: string; }>; context_map: Array<{ file: string; relevance: string; symbols?: string[]; }>; failure_ledger: Array<{ check: string; command: string; failure_summary: string; suspected_causes: string[]; attempted_fixes: string[]; next_experiment?: string; }>; risk: { score: "low" | "medium" | "high" | "critical"; reasons: string[]; requires_human_gate: boolean; }; verification: { tests_added: string[]; tests_run: string[]; lint: "pass" | "fail" | "not_run"; typecheck: "pass" | "fail" | "not_run"; ci: "pass" | "fail" | "pending" | "not_run"; }; open_questions: string[]; done_criteria_status: Record; }; ``` 每次重要 observation 后，agent 必须更新： 1. 我现在认为问题在哪里； 2. 哪些证据支持； 3. 哪些假设未验证； 4. 下一步实验是什么； 5. 什么情况下停止并升级给人。这就是“自省”，但保持工程可读、可审计。 --- ## 3. 
ACI / Tool 层设计不要把 agent 直接接到完整 shell。给它一组专用工具。 ### 推荐工具集 | 工具 | 用途 | 默认权限 | |---|---|---| | `repo_search(query)` | 搜代码、符号、调用点 | allow | | `view_file(path, range)` | 分段查看文件 | allow | | `semantic_search(query)` | 基于 embeddings / AST 的上下文检索 | allow | | `dependency_graph(symbol)` | 查调用关系、依赖 | allow | | `apply_patch(diff)` | 应用 patch | 条件 allow | | `edit_file(path, range, replacement)` | 精确编辑 | 条件 allow | | `run_test(selector)` | 运行测试 | allow | | `run_lint()` | lint | allow | | `run_typecheck()` | typecheck | allow | | `safe_bash(cmd)` | 受限 shell | policy gated | | `git_diff()` | 查看 diff | allow | | `git_commit(message)` | 提交到 agent 分支 | 条件 allow | | `open_draft_pr()` | 创建 draft PR | ask / allow by repo policy | | `request_review()` | 请求人类 review | allow | | `mark_ready_for_review()` | 转正式 PR | ask | | `merge_pr()` | 合并 | deny，必须人类 | SWE-agent 的经验支持这种方向：专用 ACI、受控编辑器、linter、目录搜索和文件查看器能显著改善 agent 操作代码库的能力。([github.com](https://github.com/SWE-agent/SWE-agent/blob/main/docs/background/aci.md)) --- ## 4. Sandbox / Workspace 每个任务创建独立 workspace： - 独立 git worktree 或 clone； - 独立 branch； - Docker / microVM sandbox； - 最小权限 token； - 默认无生产 secret； - 网络 egress allowlist； - 所有命令 stdout / stderr 入库； - 文件改动全部通过 diff 记录。 OpenHands 文档把 sandbox 定义为 agent 执行命令、编辑文件、启动服务的环境，并推荐 Docker sandbox 以隔离 host；GitHub Copilot cloud agent 也在临时开发环境中探索代码、改代码、跑测试和 linter。([docs.openhands.dev](https://docs.openhands.dev/openhands/usage/runtimes/overview)) --- ## 5. 
Policy Engine：只在关键节点问人建议实现 **allow / ask / deny** 三态策略。 ### 默认策略矩阵 | 动作类型 | 默认策略 | |---|---| | 读文件、搜索、查看 git diff | allow | | 运行本地测试、lint、typecheck | allow | | 修改普通业务代码 | allow if sandbox + diff 小 + 非 protected path | | 修改测试、文档 | allow | | 新增依赖 | ask | | 修改 lockfile | ask | | 修改 auth、crypto、payment、permission、tenant isolation | ask | | 修改 DB migration / schema | ask | | 修改 CI/CD、Dockerfile、部署脚本 | ask | | 访问外网 | ask，必须说明目的和域名 | | 读取 secret / `.env` / credential | deny | | 删除大量文件 | ask 或 deny | | force push / rewrite history | deny | | merge 到 protected branch | deny | | prod deploy / kubectl apply / terraform apply | deny | | 关闭测试、安全扫描、绕过 CI | deny | OpenAI 的 agent 指南建议把 guardrails 做成多层防御，并在高风险动作或超过失败阈值时触发人类介入；Agents SDK 也支持对敏感工具调用暂停等待批准。([openai.github.io](https://openai.github.io/openai-agents-python/guardrails/)) ### Policy-as-code 示例 ```yaml protected_paths: - "infra/**" - ".github/workflows/**" - "migrations/**" - "auth/**" - "payments/**" - "security/**" - "**/.env*" tools: repo_search: default: allow view_file: default: allow deny_if_path_matches: - "**/.env*" - "**/secrets/**" apply_patch: default: allow ask_if: - path_matches: protected_paths - diff_lines_gt: 500 - deletes_files: true - modifies_public_api: true deny_if: - path_matches: ["**/.env*", "**/private_keys/**"] safe_bash: allow_patterns: - "git status" - "git diff*" - "pytest*" - "npm test*" - "pnpm test*" - "ruff*" - "mypy*" - "tsc*" ask_patterns: - "npm install*" - "pnpm add*" - "pip install*" - "curl*" - "wget*" - "docker*" deny_patterns: - "rm -rf /*" - "git push --force*" - "kubectl *" - "terraform apply*" - "aws *" - "gcloud *" merge_pr: default: deny ``` --- ## 6. 
Auto-Repair Loop Auto-repair 应该是 harness 的一等能力，而不是“失败后再问人”。 ### 失败分类 ```ts type FailureKind = | "syntax" | "lint" | "typecheck" | "unit_test" | "integration_test" | "build" | "security_scan" | "dependency" | "flaky_test" | "merge_conflict" | "ambiguous_requirement" | "environment"; ``` ### 修复流程 ```mermaid flowchart TD F[Failure Report] --> C[Classify Failure] C --> R[Retrieve Relevant Context] R --> H[Generate Hypothesis] H --> P[Patch Candidate] P --> T[Run Targeted Check] T -->|pass| Full[Run Broader Verification] T -->|fail| Reflect[Reflect + Update Failure Ledger] Reflect --> Budget{Retry Budget Left?} Budget -->|yes| R Budget -->|no| Escalate[Human Escalation Pack] Full -->|pass| Continue[Continue Workflow] Full -->|fail| Reflect ``` AutoCodeRover 的论文中也采用了“生成 patch 后跑测试，失败则重新调用 patch generation agent”的验证重试思路。([zhiyufan.github.io](https://zhiyufan.github.io/files/ISSTA2024a.pdf)) ### 默认 retry budget | 场景 | 自动修复次数 | |---|---:| | lint / format | 5 | | syntax / typecheck | 4 | | 单元测试失败 | 3 | | 集成测试失败 | 2 | | flaky / 环境问题 | 1，然后标记不确定 | | 安全扫描失败 | 1，然后 ask | | schema / infra / auth 相关失败 | ask | ### 必须升级给人的条件 - 同一失败连续 3 次； - agent 想扩大改动范围超过原计划； - 需要修改 protected path； - acceptance criteria 不可验证； - 修复会改变 public API； - agent confidence 低； - CI 和本地结果冲突； - 需要产品或架构决策。 --- ## 7. 
Human Review Console

The human interface is not a chat window but a **review cockpit**.

### Every ask-human request must include an Evidence Pack

```ts
type EvidencePack = {
  task_summary: string;
  current_phase: string;
  proposed_action: string;
  why_needed: string;
  risk_reasons: string[];
  files_touched: string[];
  diff_summary?: string;
  tests_run: string[];
  failing_checks?: string[];
  alternatives_considered: string[];
  rollback_plan: string;
  requested_decision:
    | "approve_tool"
    | "approve_plan"
    | "choose_option"
    | "clarify_requirement"
    | "review_pr"
    | "take_over";
};
```

### Actions available to the human

| Human action | Agent follow-up |
|---|---|
| Approve | Continue execution |
| Reject | Re-plan |
| Edit plan | Update state, continue |
| Add constraint | Write it into the task contract |
| Request changes | Agent repairs automatically |
| Take over | Stop the agent, keep the branch |
| Final approve | Human merges or releases |

LangGraph's HITL supports approve/reject, and also supports reviewing and editing graph state before resuming execution, which fits the review cockpit described here. ([docs.langchain.com](https://docs.langchain.com/oss/python/langgraph/human-in-the-loop))

---

## 8. PR / Review Workflow

Use a GitHub-style PR as the final delivery boundary:

1. the agent creates a branch automatically;
2. the agent commits changes in small steps;
3. the agent runs local checks;
4. the agent may open a draft PR;
5. the agent writes the PR summary, test evidence, and risk notes;
6. human review;
7. reviewer comments trigger agent repair;
8. once CI is fully green, a human merges.

The GitHub Copilot cloud agent follows the same product shape: the agent researches the repo, makes a plan, changes code on a branch, runs tests and linters, and then hands off to developers through a PR. GitHub's documentation also stresses that this makes each step visible through commits and logs. ([docs.github.com](https://docs.github.com/en/copilot/using-github-copilot/coding-agent/about-assigning-tasks-to-copilot))

### PR template

```md
## Summary
- What changed
- Why

## Acceptance Criteria
- [x] ...
- [ ] ...

## Files Changed
- `src/foo.ts`: ...
- `tests/foo.test.ts`: ...

## Verification
- [x] unit tests: `...`
- [x] lint: `...`
- [x] typecheck: `...`
- [ ] integration tests: not run, reason: ...

## Agent Notes
- Assumptions:
- Risks:
- Areas needing human attention:

## Rollback
- Revert commit: ...
```

---

## 9. 
Repo Instruction / Agent Memory

Every repo should have an agent-facing instruction file, for example:

```md
# AGENTS.md

## Build
- pnpm install
- pnpm test
- pnpm typecheck

## Code style
- Use existing service pattern in `src/services`
- Do not introduce new state management library

## Testing
- Add unit tests for bug fixes
- Prefer integration tests for API route changes

## Protected areas
- Ask before editing `infra/**`
- Ask before editing `auth/**`
- Never read `.env*`

## PR etiquette
- Small commits
- Include verification output
- Draft PR first
```

Claude Code's best practices likewise recommend writing bash commands, the test runner, repo etiquette, and architecture decisions into the project instructions, and reviewing and pruning those instructions the same way you would code. ([anthropic.com](https://www.anthropic.com/engineering/claude-code-best-practices?_bhlid=6d35c25e04616eabb4469b78a6e8848b61863ba0))

---

## 10. Recommended Rollout Versions

### MVP v0: Review-first

- the agent can read the repo;
- the agent generates a plan;
- a human approves the plan;
- the agent generates a patch;
- a human reviews the diff;
- nothing is pushed automatically.

This suits the early period when trust is still being built.

### v1: Autonomous implementation

- automatic branch;
- automatic patch;
- automatic tests;
- automatic repair;
- ask after the failure threshold;
- automatic draft PR;
- human final review.

This is the main target shape you described.

### v2: Risk-adaptive autonomy

- low-risk tasks run fully automatically up to a draft PR;
- medium-risk tasks get a plan gate;
- high-risk tasks get a tool gate + final review;
- reviewer comments are repaired automatically;
- metrics drive a gradual widening of the allowlist.

---

## 11. Key Metrics

| Metric | Purpose |
|---|---|
| Human interrupts per task | Whether humans are interrupted too often |
| Auto-repair success rate | Self-repair capability |
| First CI pass rate | Initial quality |
| PR acceptance rate | Pass rate of human review |
| Reviewer comments per PR | Code quality |
| Mean time to draft PR | Efficiency |
| Escalation reasons | Where tools or policies need to change |
| Denied tool calls | Whether the safety policy is reasonable |
| Reverted agent PRs | Production quality |
| Protected-path edit attempts | Monitoring of risky behavior |

---

## 12. 
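Minimal Implementable Tech Stack

If you want to build this quickly:

- **Orchestrator**: LangGraph / Temporal / a custom event loop;
- **Checkpoint**: Postgres;
- **Sandbox**: Docker container + git worktree;
- **Tool Gateway**: Python/FastAPI service;
- **Policy Engine**: YAML + deterministic matcher;
- **Agent State**: JSONB + event sourcing;
- **Verification**: repo-specific command registry;
- **Review UI**: GitHub PR + lightweight approval dashboard;
- **Observability**: OpenTelemetry + trajectory viewer.

The new OpenHands SDK paper also stresses that production software agents need flexible implementation, reliable/secure execution, human interaction interfaces, sandboxed execution, lifecycle control, and security analysis, which lines up closely with the harness layering above. ([arxiv.org](https://arxiv.org/abs/2511.03690))

---

## 13. The Most Important Design Principle

**Do not let the harness decide the workflow.**

The harness should decide:

- which tools exist;
- which actions are safe;
- when to checkpoint;
- when to ask a human;
- how to verify;
- how to record evidence;
- how to roll back.

**The agent should decide:**

- where to look first;
- whether tests need to be written;
- which failure to fix first;
- whether more context is needed;
- when to call a reviewer/tester/security sub-agent;
- when the task counts as done;
- when to escalate to a human on its own initiative.

This is what satisfies your requirement: **the agent leads, reflects on its own work, and repairs automatically; the human stays in the loop, but mainly to review, not to babysit every step.**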
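
The "YAML + deterministic matcher" policy engine listed in the stack can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a reference implementation: `CommandPolicy`, `globToRegExp`, and `decideCommand` are hypothetical names, the patterns are a subset of the `safe_bash` example, only `*` is treated as a wildcard, and falling back to `ask` for unmatched commands is an assumption the policy example does not spell out.

```typescript
type Decision = "allow" | "ask" | "deny";

// Hypothetical in-memory form of one tool's pattern lists.
interface CommandPolicy {
  allowPatterns: string[];
  askPatterns: string[];
  denyPatterns: string[];
}

// Turn a pattern where `*` matches any run of characters into an anchored RegExp.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

function matchesAny(command: string, patterns: string[]): boolean {
  return patterns.some((p) => globToRegExp(p).test(command));
}

// deny beats ask, ask beats allow; anything unmatched falls back to ask (assumption).
function decideCommand(command: string, policy: CommandPolicy): Decision {
  const cmd = command.trim();
  if (matchesAny(cmd, policy.denyPatterns)) return "deny";
  if (matchesAny(cmd, policy.askPatterns)) return "ask";
  if (matchesAny(cmd, policy.allowPatterns)) return "allow";
  return "ask";
}

const safeBash: CommandPolicy = {
  allowPatterns: ["git status", "git diff*", "pytest*", "npm test*"],
  askPatterns: ["npm install*", "pip install*", "curl*", "docker*"],
  denyPatterns: ["rm -rf /*", "git push --force*", "kubectl *", "terraform apply*"],
};
```

Checking deny first keeps the result independent of pattern order, which is what makes the matcher deterministic. One caveat: a prefix glob such as `pytest*` also matches `pytest; curl ...`, so a real tool gateway would probably need to parse the command rather than match the raw string.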

Using Claude Code: The Unreasonable Effectiveness of HTML

Mon, 11 May 2026 07:23:30 GMT

![图像](https://pbs.twimg.com/media/HHz_ftzaIAAwkQs?format=jpg&name=large) Markdown has become the dominant file format used by agents to communicate with us. It’s simple, portable, has some rich text capability and is easy for you to edit. Claude has even gotten surprisingly good at using ASCII to make diagrams inside of markdown files. But as agents have become more and more powerful, I have felt that markdown has become a restricting format. I find it difficult to read a markdown file of more than a hundred lines. I want richer visualizations, color and diagrams and I want to be able to share them easily. I'm also increasingly not editing these files myself, but using them as specs, reference files, brainstorming outputs, etc. When I do make edits, I’m usually prompting Claude to edit them, which removes one of markdown’s largest benefits. I’ve started preferring HTML as an output format instead of Markdown and increasingly see this being used by others on the Claude Code team, this is why. (if you want to start with some examples, you can see a bunch here: [https://thariqs.github.io/html-effectiveness](https://thariqs.github.io/html-effectiveness/), just be sure to come back and read more about why) # Why HTML? ## Information Density ![图像](https://pbs.twimg.com/media/HHz_q48aAAAaCfW?format=jpg&name=large) HTML can convey much richer information compared to markdown. It can of course do simple document structure like headers and formatting, but it can also represent all sorts of other information such as: - Tabular data using tables - Design data with CSS - Illustrations with SVG - Code snippets with script tags - Interactions using HTML elements with javascript + CSS - Workflows using SVG and HTML - Spatial data using absolute positions and canvases - Images using image tags I would go so far as to say that there is almost no set of information that Claude can read that you cannot fairly efficiently represent with HTML. 
This makes it a highly efficient way for the model to communicate in-depth information to you and for you to review it. I’ve found that in the absence of being able to do this, the model may do more inefficient things in markdown like ASCII diagrams or, my favorite, estimating colors with Unicode characters like in this screenshot from Claude Code.

![Image](https://pbs.twimg.com/media/HH0CDc6a8AAy1bv?format=png&name=large)

Claude Code trying to show color in markdown

## Visual Clarity & Ease of Reading

![Image](https://pbs.twimg.com/media/HH0AgqJbcAAaEcZ?format=jpg&name=large)

As Claude is able to do more complex work, it is also writing larger and larger specs and plans. In practice, I've found I tend not to actually read a markdown file of more than 100 lines, and I certainly am not able to get anyone else in my organization to read it. HTML documents are much easier to read: Claude can organize the structure visually so it is easy to navigate with tabs, illustrations, links, etc. It can even be mobile responsive so you can read it differently based on your form factor.

## Ease of Sharing

Markdown files are fairly hard to share since most browsers do not render them well natively. You often have to add them as attachments to emails or messages. With HTML, as long as you upload the file (for example to S3), you can share the link easily. Your colleagues can open it wherever they wish and easily reference it. The chance of someone actually reading your spec, report or PR writeup is much, much higher if it’s in HTML.

## Two-way Interaction

![Image](https://pbs.twimg.com/media/HH0Ao0tbYAAOF9e?format=jpg&name=large)

HTML lets you interact with the document. For example, you might ask Claude to add sliders or knobs to adjust a design, or to let you tweak different options in an algorithm to see what happens. You can also ask it to let you copy these changes into a prompt to paste back into Claude Code. 
Read my playgrounds post to see examples of this two-way interaction: [https://x.com/trq212/status/2017024445244924382](https://x.com/trq212/status/2017024445244924382)

## Data Ingestion

Why use Claude Code to make HTML files instead of ClaudeAI or Claude Design, for example? One of the biggest reasons is all the context Claude Code can ingest. For example, when writing this article, I asked Claude Code to read through my code folder and find all the HTML files I’ve generated, group and categorize them, and then make an HTML file with diagrams representing each type. The diagrams you see in this article are a direct result of that. Besides the file system, Claude Code can find additional context using your MCPs (like Slack, Linear, etc.), your web browser (with Claude in Chrome), your git history, etc.

## It’s Joyful

Making HTML documents with Claude is just more fun and makes me feel more involved and invested in the creation, and that by itself is enough.

## How to Get Started

I’m a little bit afraid that people will read this article and turn it into a /html skill or something. While there might be some value in that, I want to emphasize that you don’t need to do much to get Claude to do this. You can just ask it to “make a HTML file” or “make a HTML artifact”. The trick is knowing what you want the artifact to do and how you might use it. You may over time make a skill, but for now I’d suggest just prompting from scratch to get the hang of how to use it in different cases.

# Use Cases

To make this more concrete, I’ve made many different HTML files for different use cases. You can view all of them here: [https://thariqs.github.io/html-effectiveness/](https://thariqs.github.io/html-effectiveness/) but here’s an overview.

## Specs, Planning & Exploration

HTML is a rich canvas for Claude to dive into a problem. When I start working on a problem, instead of a simple markdown plan I expect to make a web of HTML files. 
For example, I might start with asking Claude Code to brainstorm and create some explorations of different options. I would then ask it to expand more into one, maybe make mockups or code snippets. Finally, when I feel good I’ll ask it to write an implementation plan. When I’m happy with the plan I’ll create a new session and pass in all of these files for it to implement. When verifying I’ll also ask the verification agent to read in the files and it will have much broader context on what is needed. ![图像](https://pbs.twimg.com/media/HH0BFWLbMAEk_7T?format=jpg&name=large) **Example Prompts:** - I'm not sure what direction to take the onboarding screen. Generate 6 distinctly different approaches — vary layout, tone, and density — and lay them out as a single HTML file in a grid so I can compare them side by side. Label each with the tradeoff it's making. - Create a thorough implementation plan in a HTML file, be sure to make some mockups, show data flow and add important code snippets I might want to review. Make it easy to read and digest. **Use Cases:** - Exploring other ways to implement something in code - Exploring multiple visual designs ## Code Review & Understanding Code can be difficult to read in a Markdown file. But with HTML we can render diffs, annotations, flowcharts, modules, etc. Use this to understand code that the agent has written, to get code review or to explain a PR to someone reviewing your code. I find this often works better than the default Github diff view, and I attach a HTML code explainer to every PR I make now. ![图像](https://pbs.twimg.com/media/HH0BRSQbMAAuuof?format=png&name=large) **Example prompt:** Help me review this PR by creating an HTML artifact that describes it. I'm not very familiar with the streaming/backpressure logic so focus on that. Render the actual diff with inline margin annotations, color-code findings by severity and whatever else might be needed to convey the concept well. 
**Use Cases:** - Creating a PR - Reviewing a PR - Understanding a topic in Code ## Design & Prototypes Claude Design is based on HTML because HTML is incredibly expressive at design, even if your end surface is not HTML. Claude can sketch out a design in HTML and then write it in your language of choice, be it React, Swift, etc. You can also prototype interactions, such as animations, actions, etc. Consider asking Claude to make sliders, knobs, etc. to tune in exactly what you’re looking for. ![图像](https://pbs.twimg.com/media/HH0BXqjboAAHGsw?format=jpg&name=large) **Example prompt:** I want to prototype a new checkout button, when clicked it does a play animation and then turns purple quickly. Create a HTML file with several sliders and options for me to try different options on this animation, give me a copy button to copy the parameters that worked well. **Use this for:** - Creating design system artifacts - Adjusting components - Visualizing component libraries - Prototyping Joyful Animations ## Reports, Research & Learning Claude Code is incredibly good at synthesizing information across multiple data sources and converting it into a report for readability. You can prompt Claude to search your Slack, your codebase, git history, the internet, etc. and use it to generate extremely readable reports for yourself, for leadership, for your team, etc. You could assemble this in the form of a long HTML document, an interactive explainer or even a slideshow/deck. Ask Claude to use SVG for diagrams to help visualize it. For example, for my posts on prompt caching, I asked Claude to prepare an in-depth research file in HTML for me to read on all of our changes to prompt caching after reading the git history. ![图像](https://pbs.twimg.com/media/HH0Bp86bUAAJDyZ?format=jpg&name=large) **Example prompt:** I don't understand how our rate limiter actually works. 
Read the relevant code and produce a single HTML explainer page: a diagram of the token-bucket flow, the 3–4 key code snippets annotated, and a "gotchas" section at the bottom. Optimize it for someone reading it once. **Use this for:** - Summarize how a feature works - Explain a concept to me - Weekly status reports to your boss - Incident reports to your leadership - SVG illustrations, flowcharts, technical diagrams, etc # Custom Editing Interfaces Sometimes it’s hard to describe what you want purely in a text box. In this case, I'll ask Claude to build me a throwaway editor for the exact thing I'm working on. Not a product, or a reusable tool, but a single HTML file, purpose-built for this one piece of data. The trick is always to end with an export: a "copy as JSON" or "copy as prompt" button that turns whatever I did in the UI back into something I can paste into Claude Code. ![图像](https://pbs.twimg.com/media/HH0FbKebUAAsRPr?format=jpg&name=large) **Example prompts:** - I need to reprioritize these 30 Linear tickets. Make me an HTML file with each ticket as a draggable card across Now / Next / Later / Cut columns. Pre-sort them by your best guess. Add a "copy as markdown" button that exports the final ordering with a one-line rationale per bucket. - Here's our feature flag config. Build a form-based editor for it, group flags by area, show dependencies between them, warn me if I enable a flag whose prerequisite is off. Add a "copy diff" button that gives me just the changed keys. - I'm tuning this system prompt. Make a side-by-side editor: editable prompt on the left with the variable slots highlighted, three sample inputs on the right that re-render the filled template live. Add a character/token counter and a copy button. 
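
The export step in the feature-flag prompt above is easy to picture as code. The sketch below is hypothetical: the `Flag` shape and the names `changedKeys` and `prerequisiteWarnings` are illustrative assumptions, not part of Claude Code or any plugin. It computes the "copy diff" payload as just the keys whose values changed, and the warning for a flag that is enabled while its prerequisite is off.

```typescript
// Assumed flag shape: an on/off switch with an optional single prerequisite.
interface Flag {
  enabled: boolean;
  requires?: string; // name of another flag that must be on first
}

type FlagConfig = Record<string, Flag>;

// Return only the flags whose `enabled` value differs from the original config.
function changedKeys(before: FlagConfig, after: FlagConfig): Record<string, boolean> {
  const diff: Record<string, boolean> = {};
  for (const [name, flag] of Object.entries(after)) {
    if (before[name]?.enabled !== flag.enabled) {
      diff[name] = flag.enabled;
    }
  }
  return diff;
}

// List every flag that is enabled while its prerequisite is missing or disabled.
function prerequisiteWarnings(config: FlagConfig): string[] {
  const warnings: string[] = [];
  for (const [name, flag] of Object.entries(config)) {
    if (flag.enabled && flag.requires && !config[flag.requires]?.enabled) {
      warnings.push(`${name} is enabled but its prerequisite ${flag.requires} is off`);
    }
  }
  return warnings;
}
```

In a generated page, the copy button would typically pass `JSON.stringify(changedKeys(original, edited))` to `navigator.clipboard.writeText`, so what lands back in Claude Code is only the delta rather than the whole config.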
**Use this for:**

- Reordering, triaging, or bucketing anything (tickets, test cases, feedback)
- Editing structured config (feature flags, env vars, JSON/YAML with constraints)
- Tuning prompts, templates, or copy with live preview
- Curating datasets: approve/reject rows, tag examples, export the selection
- Annotating a document, transcript, or diff and exporting the annotations
- Picking values that are painful to express in text: colors, easing curves, crop regions, cron schedules, regexes.

## Frequently Asked Questions

I’ve been telling many people about how I’ve switched to HTML and I’ve seen a few repeated questions.

**Isn’t it less token efficient?** While markdown often uses fewer tokens, I’ve found that the added expressiveness of HTML and the much higher likelihood of me reading it mean I get better output overall. With the 1M-token context window in Opus 4.7, the increased token usage is not really noticeable.

**When do you use markdown now?** I have honestly stopped using markdown altogether for almost everything, but I’m probably far on the HTML maximalist side of things.

**How do I view the HTML file?** I tend to just open it in a browser locally (you can ask Claude to open it), or upload it to S3 if I want a shareable link.

**Doesn't this take longer to generate than markdown?** This does take longer! HTML can take 2-4x longer than Markdown, but I've found the results are worth it.

**What about version control?** This is honestly one of the biggest downsides of HTML: HTML diffs are noisy and hard to review compared to Markdown.

**How do I get Claude to match my taste / not make it ugly?** The frontend design plugin helps Claude make good HTML files. But to match your own company's style, you can create a single design system HTML file by pointing Claude at your codebase. You can then use that design system file as a reference for other HTML files. 
## Stay in the Loop All of the above is to say that I think the real reason I use HTML is that I feel much more in the loop with Claude. I had begun to fear that because I had stopped reading plans in depth I would simply have to leave Claude to make its choices. But I am happy to say instead that I feel more in the loop than ever before when using HTML. I hope you do too.

From Skills to Layered Workflows: The Next Layer of Abstraction in AI Agent Engineering

Sun, 10 May 2026 16:26:37 GMT

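![Image](https://pbs.twimg.com/media/HH27K7gbQAANEU5?format=jpg&name=large)

Many AI agent projects, by their middle and later stages, start accumulating skills. At first this is almost inevitable. Writing code calls for a TDD skill, troubleshooting for a debug skill, pre-commit checks for a review skill, documentation for a writing skill. Each skill patches in a piece of local capability, making the agent more stable, more professional, and more aligned with team habits on one class of task.

But as the system keeps growing, a deeper problem surfaces: **adding more skills does not necessarily make agent behavior more stable.**

On the surface this is a capability-organization problem; at the mechanism level it is an engineering-control problem. At small scale the question is "is there a suitable skill?"; at larger scale the real question becomes: **at which stage these capabilities are invoked, from which responsibility perspective, driven by which inputs, leaving which artifacts, and which mechanism decides whether the work may move to the next step.**

This is what I actually set out to solve when designing Unified Skills: **not another set of skills, but skills organized into a layered workflow.**

More precisely, skills answer "how to do it", while a workflow answers "when to do it, who does it, what counts as passing, where to fall back on failure, and where the process evidence is kept". Once an agent takes on continuous engineering tasks, what needs governing is no longer the capabilities themselves but the way capabilities enter the process.

## 1. The Ceiling of a Skills Library Is Not Lack of Capability but Engineering Instability

![Image](https://pbs.twimg.com/media/HH28XHobYAASd6l?format=png&name=large)

Many teams start building a skills library with a natural logic: extract common problems into a set of reusable methodologies and invoke them when needed.

- When tests need writing, invoke the TDD skill;
- when debugging is needed, invoke the debug skill;
- when review is needed, invoke the review skill;
- when writing is needed, invoke the writing skill.

This approach certainly works. It turns scattered experience into reusable structure, so the agent at least no longer relies entirely on improvisation for local problems.

But its ceiling is also clear. The problem is not that skills lack value, but that **engineering work is never a single skill invocation; it is a continuously advancing chain of responsibility.**

As task complexity rises, the system starts to expose four typical kinds of instability.

1\. Timing instability: the agent skips steps

An agent easily jumps from a vague idea straight into implementation, then appends "verified" at the end. A skill can tell it how to write tests and how to debug, but a skill by itself cannot constrain: **when it has earned the right to start writing, when it must clarify first, when it must stop and design first.**

This means a skills library solves local execution quality but not the legitimacy of stage transitions. The system can still run in a way that "looks efficient while constantly skipping preconditions".

2\. Responsibility instability: the agent certifies its own work

A more dangerous problem is the self-confirming loop. The same agent can understand the requirements, do the design, write the plan, complete the implementation, review its own work, and finally conclude that "there are no problems". The issue is not lack of effort; it is that an engineering system cannot hand "raising the problem, executing the task, and judging the pass" to a single cognitive perspective. This is the most common yet most overlooked risk in engineering: **not that the model lacks capability, but that responsibility boundaries were never cut apart.**

3\. Evidence instability: the process is untraceable

Within one conversation, the agent appears to have done a lot of work: clarified requirements, looked up references, done the design, written the plan, written the implementation, passed review. But looking back a few days later, it is often hard to answer:

- What exactly was the requirement boundary at the time?
- Which external references were adopted, and which were rejected?
- Which alternatives were discussed during design?
- Which tasks in the plan could run in parallel, and which had to be serial?
- Did review check spec completeness, or only code style?
- Did ship leave export, sync, release, and rollback records?

If these questions cannot be replayed, the process looks complete but actually lacks auditable evidence. It is more like a live performance than an engineering process that can be reviewed afterwards.

4\. 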
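Governance instability: there is no organizing relationship among skills

TDD, review, and debug can all be good skills, but if they just sit flat in a directory, the agent still has to decide on the fly at runtime: which to call first, when to switch, when to skip, where to return after a failure. And this process of "deciding the order on the fly" is itself the largest source of randomness.

So a skills library can raise local capability but cannot by itself solve engineering stability, because it solves a **capability-reuse problem**, not a **process-governance problem**.

## 2. A Workflow Is Not an Ordered List of Skills but a Stage Protocol

![Image](https://pbs.twimg.com/media/HH28dXfa8AACPB-?format=png&name=large)

From this angle, the first upgrade in Unified Skills is not making more skills but placing skills into an explicit stage flow. The main path is not "call whatever is needed" but:

/refine -> /design -> /plan -> /build -> /review -> /ship

The judgment behind this path: **engineering delivery needs a state machine, not free association.**

/refine: converge a vague idea into a verifiable spec

The task of /refine is not to "keep chatting about requirements" but to compress a vague idea into a verifiable spec. It cares about:

- what the problem is;
- who the user is;
- what the success criteria are;
- what the constraints are;
- which external facts are currently missing;
- what type the final artifact is.

If this stage does not converge clearly, every later implementation may run faster and faster toward the wrong goal.

/design: freeze creative and experience judgments before implementation

The mistake systems make most easily is quietly stuffing design into build: thinking while doing, changing while writing, deciding the experience while implementing. This looks harmless in small tasks but spirals out of control quickly in multi-artifact systems. UI, articles, decks, and visual drafts all need their creative and experience-level judgments finished before production. Otherwise the build stage keeps substituting for the design stage, and the whole flow loses its boundaries.

/plan: define the task topology, not a to-do list

What /plan really does is not list some todos but define the task topology:

- which tasks must be serial;
- which tasks can run in parallel;
- which files or modules may be written;
- which steps must complete before the next stage begins;
- which risks need handling up front.

A plan is not formalism; its value is turning "how the work unfolds" from runtime improvisation into a reviewable structure.

/build: consume approved inputs, do not reinvent the goal

/build is where implementation and content production actually happen. But its most important discipline is not "execute carefully" but: **consume only the approved spec, design, and plan, and do not redefine the goal during implementation.**

This is the key point where many agent workflows fail. If build can rewrite the goal, redraw the boundaries, or stand in for review at any time, all the earlier stages lose their constraining force.

/review: a gate, not a verbal confirmation

/review should not be just "take a look for me". Real review is a gate. Its job is not to encourage, not to polish, and not to say "looks good overall" when something is clearly missing. It must be able to block the flow: whenever it finds a blocking issue, it must explicitly send the work back to /build, and when necessary to /plan or /refine.

/ship: finishing the implementation is not finishing delivery

The final /ship handles not the code itself but the delivery actions: release, export, sync, records, rollback information, delivery traces. The point engineering systems most easily overlook is: **code being written does not mean delivery is complete.** Delivery truly ends only when traceable closing actions mark it.

So the point of a workflow is not to put skills in an order but to turn each stage into a protocol with inputs, outputs, and a gate. It does not solve "getting into some capability faster" but "how a task legitimately advances to the next state".

## 3. The Real Upgrade Is Not Just the Stage Flow but Vertical Layering

![Image](https://pbs.twimg.com/media/HH28gSsasAAM9kq?format=png&name=large)

With only the stage flow, the system could still degrade into a longer, more complex set of prompts. What turns Unified Skills into an engineering system is not only the horizontal stages but the vertical layering. I abstract it into six layers:

CANON -> Command -> Agent -> Skill -> Artifact -> Hook / validate

These are not six file categories but six distinct system responsibilities. Each layer solves one problem while refusing to solve another. A solution's maturity is judged not by whether it can run once but by whether it can be reused, governed, and evolved.

## 4. CANON: The Constitution of All Workflows 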
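![Image](https://pbs.twimg.com/media/HH28kqLboAEXHvJ?format=png&name=large)

At the top is CANON.md. It is not a specific skill, nor a project manual, but the global discipline that every stage, role, and skill must inherit. It defines not techniques for some class of task but non-negotiable baselines: state assumptions first, control scope, verify first, stop and clarify on contradictions, find the root cause when debugging, do not be a yes-machine.

This layer solves **unified discipline**, not specific task strategy.

Why must it stand alone? Without CANON, every skill carries its own implicit values. The TDD skill emphasizes tests, the debug skill root causes, the review skill quality, but there is no unified behavioral contract among them. Over time the system suffers a hidden corrosion: each local methodology is diligent, yet overall behavior grows less and less consistent.

So CANON's role is not to make skills stronger but to ensure no local methodology can bypass global discipline for local convenience.

This is the first principle: **specific capabilities may add discipline, but may not relax it.**

## 5. Command: A Stage Controller, Not a Shortcut

The Command layer answers: **which stage are we in, what should this stage read, what should it produce, and what are its pass conditions.**

In Unified Skills, a Command is not a prompt shortcut but more like a workflow controller. For example, the job of /plan is not to "call a planning skill" but to define the planning stage's legal inputs, legal outputs, and gate conditions. It consumes the approved spec and design, produces 03-plan.md, splits out sub-plans and a parallelism matrix for large tasks, and makes the write scope and risk points explicit.

This layer solves the **stage-transition protocol**, not the methodology of specific tasks. In other words, Command is not responsible for telling you "how to debug" or "how to review code"; it answers: **does the current stage meet its entry conditions, and does the current artifact meet its pass conditions.**

This is the second principle: **a workflow needs a stage state machine, not capability shortcuts.**

## 6. Agent: A Responsibility Perspective, Not a Persona Performance

The Agent layer is the most easily misunderstood. Many systems introduce agents to have the model "play" a product manager, architect, designer, or reviewer. But if the role stays at the level of tone, it creates at most some theatrics and does not actually improve engineering quality.

In Unified Skills, the core of the Agent layer is not role-play but **separation of responsibility**. Requirements analysis, task planning, software implementation, spec review, code quality review, and release judgment should preferably not all be closed out by the same perspective. Not because the model cannot do all of them, but because an engineering system cannot put "raising the problem, executing the task, and judging the pass" into a single cognitive loop.

For example:

- the review agent should not redefine requirements; it should judge, against the approved spec, whether the implementation is complete;
- the software engineer agent should not decide the task topology during build; it should implement within the plan's constraints;
- the design reviewer should not just say "the visuals look good"; it should block designs that lack evidence sources, pattern synthesis, and adopt/reject rationale.

This layer solves **separation of responsibility**, not the definition of stage protocols.

This is the third principle: **the core value of an Agent is separation of responsibility, not personification.**

## 7. Skill: The Truly Reusable Unit of Methodology

![Image](https://pbs.twimg.com/media/HH280MjbAAA0NyW?format=png&name=large)

The Skill layer is the most concrete and also the most easily oversimplified. A qualified skill should not be just a prompt saying "please do X carefully". It must at least state:

- when to enter;
- when to exit;
- what the concrete steps are;
- which claims are common excuses;
- which situations require stopping;
- how to verify that it is done.

This is why a SKILL.md in Unified Skills is usually more than a "how-to": it includes entry/exit, process, red flags, common rationalizations, and a verification checklist, and strongly disciplined skills even define an Iron Law.

This layer solves **methodology reuse**, not overall workflow orchestration. That is, a Skill answers "how is this kind of thing done" but not "is now the time to do it". The latter belongs to Command and Agent.

This is the fourth principle: **a Skill is an execution methodology, not the workflow's master controller.**

## 8. Artifact: Turning Process into a Chain of Evidence

![Image](https://pbs.twimg.com/media/HH28o2CacAE7eKg?format=png&name=large)

Looking only at the conversation, an agent's work easily becomes a live performance that cannot be replayed. Today it appears to have clarified requirements, done the design, written the plan, completed the implementation, and passed review; a few days later, much of the key information has scattered in the context and can be neither reviewed nor transferred.

So Unified Skills treats artifacts as a layer of the workflow, not a set of documents saved in passing. The files 01-spec.md, 02-design.md, 03-plan.md, 04-review.md, and 05-ship.md are not documentation fussiness; they are the audit trail of agent behavior. They record not "what was written" but "why it advanced this way, why these trade-offs were made, why entry into the next stage was allowed".

This layer solves **process audit and retrospection**, not runtime interception.

Artifacts are also the precondition for multi-artifact workflows. Software, docs, articles, decks, and visual drafts do not share one build path. Without artifact\_type, the system easily handles every artifact the software-engineering way, or uses the content-creation way to bypass software quality gates.

This is the fifth principle: **without artifacts, a workflow lacks auditable evidence.**

## 9. Hook / Validate: Turning Conventions into Guardrails

![Image](https://pbs.twimg.com/media/HH28rl9bQAAlOaV?format=png&name=large)

Constraining an agent with prompts alone is unstable. A prompt can remind the model not to perform destructive operations, but only a runtime hook can intercept a destructive command. Documentation can require skill naming conventions, consistent indexes, and synced root docs, but only validate can detect whether README, AGENTS, skills-index, the lockfile, and the hooks implementation have drifted apart in contract.

That is the purpose of hooks and ./validate in Unified Skills. They do not replace thinking, nor high-level review; they move part of the discipline from "should be followed" to "violations get exposed".

This layer solves **runtime guardrails and exposure of maintenance-time drift**, without replacing high-level judgment itself.

This is crucial. High-level discipline without low-level guardrails ends up as mere advice. The corrosion most likely to occur is not one prompt suddenly going wrong but several contracts slowly diverging: README says one thing, AGENTS says another, skills-index is still the old version, and the hooks implementation is yet another. Once this drift accumulates, what the agent reads at different entry points is no longer the same system truth.

This is the sixth principle: **high-level discipline must have low-level guardrails, otherwise it is only advice.**

## 10. Two-Stage Review: A Concrete Example of Layered Gating

A layered workflow is not an abstract slogan; it must show up in concrete gate design. Review in Unified Skills is a typical example. Here review is not a single-stage "how does the code look" but is split into two gates.

Gate 1: Spec Compliance

The first gate checks whether the implementation covers the functional requirements, boundary conditions, error paths, and acceptance criteria defined in the spec. It cares not about "is it written nicely" but: **what was implemented, and whether everything that should be done was actually done.**

Gate 2: Code Quality

Only after the first gate passes does the second begin. The second gate discusses quality dimensions such as correctness, readability, architecture, security, and performance. It cares about: **how these features were implemented, and whether the cost and quality are reasonable.**

This split looks simple, but it embodies the value of a layered workflow. If the features are not fully implemented, rushing to discuss code style wastes review resources; if missing features and quality issues are mixed together, feedback becomes vague. Two-stage review separates the problem types: first confirm whether the right thing was done, then confirm whether it was done well. This does not make review more complex; it gives gating more order, clearer boundaries, and clearer fallback paths.

## 11. From Prompt to Workflow to Governance Structure

![Image](https://pbs.twimg.com/media/HH28uR0aUAAkLvv?format=png&name=large)

Back to the opening question: what does AI agent engineering actually need? Longer prompts help, but are not enough. More skills help, but are not enough either. What really needs designing is the organizing relationship among skills, and how those relationships further settle into a governance structure. That is:

- use CANON to define non-negotiable global discipline;
- use Command to define the stage state machine;
- use Agent to define responsibility perspectives;
- use Skill to carry reusable methodology;
- use Artifact to leave process evidence;
- use Hook / validate to turn rules into guardrails.

The goal of this structure is not to make the agent look more complex, but to make it more controllable in complex tasks. Whether a system is engineered is judged not by whether it can complete one task, but by whether it can reliably advance, roll back, audit, and review. So a prompt is expression, a skill is method, a workflow is institution, and a layered workflow is the governance structure.

![Image](https://pbs.twimg.com/media/HH286ASaIAA-git?format=png&name=large)

In so-called agent engineering, what actually gets engineered is not generation capability itself but **how the right to advance tasks, the right to judge a pass, responsibility boundaries, and the chain of evidence are allocated.**

This is the core judgment Unified Skills wants to express: **the next layer of abstraction for AI agents is not piling on more skills, but placing skills into a layered workflow with a stage state machine, separation of responsibility, a chain of evidence, and runtime guardrails.**

Going further, a truly mature agent system should not merely "do many things" but should know: **at which stage, by which module, under which responsibility, on what evidence, to deliver what result.**

The value of a technical approach lies not in coining a new term but in redrawing the problem's boundaries and giving a system structure that can actually be built. The same goes for agents. Real engineering is not stacking up more capabilities but making the system know when to start, when to stop, when to fall back, and when to deliver.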
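
The stage flow `/refine -> /design -> /plan -> /build -> /review -> /ship`, together with the review gate's ability to send work back, can be sketched as a small state machine. This is an illustrative sketch only: `Stage`, `ReviewVerdict`, `StageResult`, and `nextStage` are hypothetical names rather than the Unified Skills implementation, and the rule that a stage cannot advance until its expected output exists is a simplifying assumption drawn from the "inputs, outputs, gate" description.

```typescript
const STAGES = ["refine", "design", "plan", "build", "review", "ship"] as const;
type Stage = (typeof STAGES)[number];

// A review either passes or blocks and names the stage to fall back to.
type ReviewVerdict =
  | { kind: "pass" }
  | { kind: "blocked"; returnTo: "build" | "plan" | "refine" };

interface StageResult {
  stage: Stage;
  outputReady: boolean;   // assumed gate: the stage's expected output exists
  review?: ReviewVerdict; // only consulted when stage === "review"
}

// Decide where the task goes next; "done" means /ship has finished.
function nextStage(result: StageResult): Stage | "done" {
  // No output, no transition: the task stays in the current stage.
  if (!result.outputReady) return result.stage;

  if (result.stage === "review") {
    // Review is a gate: without a verdict nothing moves forward.
    if (!result.review) return "review";
    if (result.review.kind === "blocked") return result.review.returnTo;
  }

  const index = STAGES.indexOf(result.stage);
  return STAGES[index + 1] ?? "done";
}
```

Keeping the transition in one pure function means the agent cannot skip from `refine` to `build` on its own: the only backward edges are the ones a blocking review names explicitly.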

Untitled

Fri, 08 May 2026 18:25:25 GMT

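Understand this system: [name]

[Hard constraints]
- Cite file:line + code snippet
- If you cannot tell, write "cannot tell"
- Distinguish [fact] from [inference]

==================================
Layer 1: Macro (answer this layer first)
==================================

1. WHY - What problem does this system solve?
   - If it were deleted, what would the world lose?
   - What do its "users" (people or other modules) expect it to do?

2. MODEL - Which set of core concepts does it use to model the problem?
   - List 3-7 core concepts (concepts, not class names)
   - The relationships among these concepts (which contains which, which produces which, which constrains which)
   - Which data structures these concepts map to in the code

3. INVARIANT - What does it insist always holds?
   - List at least 3 invariants
   - Which code maintains each invariant
   - What happens if it is violated

4. TRADE-OFF - Which deliberate trade-offs did it make?
   - What was sacrificed for what? (performance/simplicity/flexibility/consistency...)
   - Where can you see that "the author could have written it that way but chose this way"

After finishing layer 1, output "=== MACRO DONE ===" and stop.

==================================
Layer 2: Micro (wait until I say continue)
==================================

Based on the macro model built in layer 1, answer:

5. The concrete definitions and relationships of the core data structures
6. The algorithms of the key operations (organized around the most important invariants you identified)
7. Module boundaries and dependencies

Every micro-level conclusion must map back to a macro concept from layer 1. If a piece of code cannot be mapped back to the macro model, flag it separately; it is often historical baggage or a bug.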

v3.0：针对"AI 主力重构 + 多项目 + 你 review"的最小可用方案

Fri, 08 May 2026 17:31:36 GMT

亲，这次信息够了。**你的真实场景和我前两版假想的差太远**——你不是在"建设新项目"，你是在**驾驭多个已存在的、AI 正在重构的系统**。这完全是另一个问题。先把你的痛点钉死，再出方案。 --- ## 一、你的真实痛点不是"文档"，是"失控" 把第 5 题翻译一下：**"AI 修着修着我就看不懂了"**——这不是文档问题，是**控制权丢失**问题。拆一下这个失控是怎么发生的： ``` T0: 你说"修这个 bug" T1: AI 改了 file_a.py T2: AI 发现 file_b.py 也有关，改了 T3: AI 觉得设计不对，重构了 file_c.py T4: AI 顺手"优化"了 file_d.py T5: 你看 diff，已经 800 行变更，分布在 12 个文件 T6: 你看不懂哪些是修 bug、哪些是 AI 自己加戏 ``` **真正的问题**： - ❌ 不是"AI 写得烂" - ❌ 不是"项目没文档" - ✅ **是 AI 的"作用域纪律"丢失** —— 它默认 scope 蔓延 - ✅ **是你失去了"diff 可审查性"** —— 一次改太多东西，review 成本爆炸文档体系**只能解决其中 30%**。剩下 70% 是**协作流程**问题。我前两版没看清这一点，所以方案再漂亮都是错的。 --- ## 二、你的场景对方案的硬约束把你的 5 个回答翻译成设计约束： | 你的情况 | 对方案的约束 | |---------|------------| | **多项目（10-500k 行各种规模）** | 必须**模板化 + 一键复制**，不能每个项目精心设计 | | **C# / TS / Python 三栈** | 模板必须**栈中立**，约束以**行为**而非**语法**表述 | | **AI 主力 + 你 review** | 文档**主要给 AI 看**，但要让你 review 时**减负**而不是增负 | | **每天用** | 维护成本必须**极低**，否则一周后就死 | | **重构带 bug 系统** | 主要矛盾是**约束 AI 行为**，不是"教 AI 项目结构" | **核心结论**：你需要的不是"项目文档体系"，是**"AI 协作护栏 + diff 审查辅助"**。 --- ## 三、v3.0 总方针：3 件事，3 个文件 ``` project-root/ ├── AGENTS.md # AI 行为护栏（核心，每个项目必有） ├── NOTES.md # 决策与怪事的"流水账"（可选，重构项目必有） └── .githooks/pre-commit # 一个 hook，只防 secrets ``` **就这些**。为什么这么少？因为： - ARCHITECTURE.md → **砍掉**，合并进 NOTES.md - ADR → **砍掉**，合并进 NOTES.md - 文件头 → **砍掉**，AI 重构会自动改文件，反而增加冲突 - 子目录 AGENTS → **砍掉**，多项目场景下你不会去维护 - 各种 skill / 锚点 lint / 自动生成 → **砍掉** **Musk 第二步：删除你能删的一切。如果后面没加回 10%，说明删少了。** --- ## 四、AGENTS.md 模板（v3.0 核心）针对你的"AI 重构修 bug 失控"场景重新设计。**专门解决 scope 蔓延**。 ````markdown # AGENTS.md > Read first. Symlinked from CLAUDE.md / .cursorrules. ## 🛑 Scope Discipline (核心规则，每次必读) > 这一节存在的唯一目的：**防止你修着修着改了不该改的东西**。 ### Hard rules 1. **一次任务只做一件事。** 修 bug 就只修 bug，不要顺手"优化"、不要顺手重构、不要顺手改风格。 2. **任务开始前先声明 scope**：列出你**计划修改的文件**和**不计划修改的文件**。等用户确认后再动手。 3. **每修改 3 个以上文件**就停下来汇报：已改了什么、为什么、还差什么。**等用户说"继续"再继续**。 4. **发现新问题不要直接修**。记到 `NOTES.md` 的 "## Found while working" 段，让用户决定。 5. 
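**Do not rename or move files** unless the user explicitly asks.

### Diff size budget

- A single task's diff exceeds 200 lines → **stop and report**, and ask the user whether to split it
- A single task touches more than 5 files → **stop and report**, and ask the user whether to split it
- The change touches a core module (see Project hints below) → **list a plan first and wait for approval**

### What counts as "scope creep" (explicitly forbidden)

- ❌ Renaming variables to better names "while you're at it" during a bug fix
- ❌ Adding type annotations "while you're at it" during a bug fix
- ❌ Extracting duplicated code "while you're at it" during a bug fix
- ❌ Fixing unrelated lint warnings "while you're at it" during a bug fix
- ❌ Any "While I'm here, let me also..." behavior

**Remember**: the smaller your change, the faster the user can review it and the more likely you are to be trusted. **Restraint = efficiency**.

---

## Commands

```bash
# Test
{{fill in this project's test command, e.g.}}
pytest       # Python
npm test     # TS
dotnet test  # C#

# Lint
{{fill in this project's lint command}}

# Run
{{fill in this project's start command}}
```

---

## Boundaries

### Always do

- **Reproduce first** before fixing a bug: write a test or repro steps that trigger the bug reliably
- **Show a diff summary** after the fix: which files changed, what changed, and why
- **Stop and ask** when unsure; do not guess

### Ask first

- Adding a new dependency (any `pip install` / `npm install` / `dotnet add package`)
- Modifying more than 5 files
- Modifying files marked "DO NOT TOUCH" in NOTES.md
- Deleting any public API / exported function

### Never do

- Commit secrets / `.env` / keys in `appsettings.*.json`
- Modify migration / schema files that have already been merged
- Do anything outside the scope "in passing"
- Create dumping-ground files such as `utils.*` / `helpers.*` / `common.*`
- Silently modify a test to make it pass (confirm first that the test itself is wrong)

---

## Project hints (not paths, just hints)

> Paths change; intent does not. Use `glob`/`grep` to find the actual files.

{{fill in per project, for example:}}

- **Business logic** lives near `core` / `domain` / `services`
- **External IO** in `adapters` / `clients` / `infrastructure`
- **Entry points** in `api` / `controllers` / `routes` / `Program.cs`
- **Tests** in `tests/` / `*Tests/` / `*.test.ts`

---

## When confused, prefer this order

1. Use `glob` / `grep` to see the current state of the code
2. Read `NOTES.md` for relevant history
3. Read the README
4. **Ask the user**; do not guess

---

## Self-check before submitting

Run through this yourself before submitting:

- [ ] Are all changes inside the scope I declared?
- [ ] Is the diff under 200 lines? If not, did I report it?
- [ ] Did I do anything outside the scope "in passing"?
- [ ] Does the bug reproduction test still pass?
- [ ] Do my changes match the existing code style? (Do not introduce a new style.)
- [ ] Did I record everything I "found while working but did not fix" in NOTES.md?
````

**Key design points**:

- **Scope discipline sits at the very top**: it is your biggest pain point, so it must be the first thing loaded into the agent's attention
- **"While I'm here" behavior is explicitly forbidden**: it is the most common form of AI scope creep and has to be named
- **Diff budget**: concrete numbers (200 lines / 5 files) give the AI an anchor
- **Self-check list**: having the AI run through it before submitting is cheaper than you catching problems in review

---

## 5. NOTES.md template (for refactoring projects)

This is not ARCHITECTURE.md and not an ADR. It is a **running log**.

````markdown
# NOTES.md

> A working journal. Append-only. Newest at top.
> Not an architecture document; it records the pitfalls hit and decisions made while fixing this project.

---

## How to use this file

- **AI**: skim the latest 5-10 entries before starting a task; append whenever you find something new during the task
- **User**: check what the AI added during review; add your own reminders too
- **Free format**: anything your future self or the AI can understand is fine

---

## DO NOT TOUCH

> These files/modules must not be changed, for specific reasons. The AI must ask before modifying them.

- `src/legacy/auth.cs`: legacy auth logic; an external system depends on the signature format, and changing it will break things
- `migrations/2024_*`: already in production, must not be changed
- {{...the minefields in your project...}}

---

## Decisions

> A running log of important decisions. One sentence each; no ADR ceremony needed.

- **2026-05-08**: Use a modular monolith, no microservices (the team is too small)
- **2026-05-08**: Keep PG and do not introduce Kafka (advisory locks are enough)
- **2026-05-09**: ~~Use dataclass for DTOs~~ → switched to Pydantic (validation is needed)

> Mark overturned decisions with strikethrough plus the new decision; do not overwrite them.

---

## Anti-patterns (this codebase has scars)

> Record the mistakes the AI keeps making here so they are not repeated.

- ❌ Do not `import requests` directly inside `core/`: it is already abstracted into `io/`
- ❌ Do not annotate `User.id` as `int`: some legacy code passes str, and it will blow up
- ❌ {{...the first time the AI makes a mistake you had to correct, write it here...}}

---

## Found while working

> Problems the AI found while working but **did not fix**. The user decides whether each deserves its own task.

- {{2026-05-09: found a possible race condition in `OrderService.calculate`, but it is outside the current bug scope}}
- {{...}}

---

## Refactor log

> The thread of large refactors, for later backtracking.

### 2026-05: Splitting the auth module out of the monolith

- Goal: extract auth into an independent module that exposes only `IAuthService`
- Done: created `src/auth/` and migrated the JWT implementation
- Pending: about 12 places in legacy code still import `JWTService` directly
- Owner: AI + you
````

**Key design points**:

- **The DO NOT TOUCH section comes first**: it keeps the AI from damaging the minefields
- **The Found while working section** redirects the AI's "scope creep impulse" here and **turns it into a useful memo**
- **Decisions use a running-log format**: the ADR ceremony is skipped entirely
- **Anti-patterns accumulate mistakes**: every time the AI repeats a class of mistake, you correct it and append one line

**The essence of this file**: it captures both your irritation while reading the AI's diff and the things the AI wanted to do but should not have, and turns them into input for the next task.

---

## 6. pre-commit hook (minimal version)

`.githooks/pre-commit`:

```bash
#!/usr/bin/env bash
# Does exactly one thing: block secrets.
set -e

CHANGED=$(git diff --cached --name-only --diff-filter=ACM)
[[ -z "$CHANGED" ]] && exit 0

# Secrets scan (covers common C# / TS / Python patterns).
# grep does not expand \047 inside a bracket expression, so both quote
# characters are written literally here.
PATTERNS="(api[_-]?key|secret|password|private[_-]?key|access[_-]?token|client[_-]?secret)[[:space:]]*[=:][[:space:]]*[\"'][^\"']{12,}"
if echo "$CHANGED" | xargs -r grep -lEi "$PATTERNS" 2>/dev/null; then
  echo "❌ Possible secret detected in staged files."
  echo "   Files:"
  echo "$CHANGED" | xargs -r grep -lEi "$PATTERNS"
  echo "   If false positive, commit with --no-verify"
  exit 1
fi

# Reject .env-style files outright
if echo "$CHANGED" | grep -qE '\.env(\.|$)|appsettings\.(Development|Production|Local)\.json$'; then
  echo "❌ Refusing to commit env / local settings files."
  exit 1
fi

exit 0
```

Install once:

```bash
git config core.hooksPath .githooks
chmod +x .githooks/pre-commit
```

**All the other hooks that were cut**:

- ❌ Enforced Conventional Commits (pointless when you work alone)
- ❌ ARCHITECTURE sync check (there is no ARCHITECTURE)
- ❌ Enforced ADRs (there are no ADRs)
- ❌ File header check (there are no file headers)
- ❌ Anchor lint (there are no anchors)

---

## 7. Reuse across projects: a template pack

Because you have "many projects", this has to be a **one-step copy**. Create a personal dotfiles repository or gist:

```
~/templates/agents-md/
├── AGENTS.md.template   # the one above, with {{...}} left blank
├── NOTES.md.template    # the one above
├── .githooks/pre-commit
└── install.sh
```

`install.sh`:

```bash
#!/usr/bin/env bash
# Usage: bash ~/templates/agents-md/install.sh
set -e

TARGET="${1:-.}"
TEMPLATE_DIR="$(dirname "$(realpath "$0")")"

cp "$TEMPLATE_DIR/AGENTS.md.template" "$TARGET/AGENTS.md"
cp "$TEMPLATE_DIR/NOTES.md.template" "$TARGET/NOTES.md"
mkdir -p "$TARGET/.githooks"
cp "$TEMPLATE_DIR/.githooks/pre-commit" "$TARGET/.githooks/"
chmod +x "$TARGET/.githooks/pre-commit"

cd "$TARGET"
git config core.hooksPath .githooks 2>/dev/null || echo "Not a git repo, skipping hooks setup"
ln -sf AGENTS.md CLAUDE.md
ln -sf AGENTS.md .cursorrules

echo "✅ Installed AGENTS.md / NOTES.md / pre-commit hook"
echo "📝 TODO: fill in the {{...}} placeholders in AGENTS.md (the Commands section is required)"
```

**Onboarding a new project: 30 seconds**

```bash
cd my-new-project
bash ~/templates/agents-md/install.sh
# Edit AGENTS.md and fill in the Commands section
# Done
```

---

## 8. How to consume it: using these files during review

Neither v1 nor v2 addressed this seriously: **how the documents reduce your review load**.

### Review flow (you + the AI's diff)

1. **After the AI reports the task as done**, scan two things first:
   - **Is the diff over 200 lines? Does it touch more than 5 files?** → If so, **first call out the AI for breaking scope discipline** and require a split
   - **Is there anything new under "Found while working" in NOTES.md?** → Check whether the AI honestly recorded what it "wanted to do but held back"
2. **Read the diff in this order**:
   - First pass: **look only at files inside the AI's declared scope** and confirm the bug is fixed
   - Second pass: **look at files outside the scope** and challenge every change with "why did you touch this?"
   - Third pass: look at the NOTES.md updates
3. **Require a revert of any out-of-scope "drive-by optimization"**, even if it looks good. **Discipline matters more than the quality of a single change**.
4. **Mistakes the AI keeps repeating** → add them to Anti-patterns in NOTES.md immediately; **append in the same conversation**, do not wait.

### When the AI keeps fixing and you can no longer follow (the answer to your biggest pain point)

This is the core problem you described. **The v3.0 answer**:

**Before**:

- The "stop and report after every 3 modified files" rule in AGENTS.md → the AI cannot run away with a large change
- "Diff size budget 200 lines" → forces small steps

**During**:

- The AI must **list its scope plan** before starting and get your confirmation → you know the range in advance
- The AI **stops and reports** each time it crosses a threshold → you get several points to step in

**After**:

- When you cannot follow, tell the AI directly: "**Roll back every change outside your scope and keep only the bug fix**" → AGENTS.md already declares the scope, so the AI can identify it
- Have the AI **summarize the task into the Refactor log in NOTES.md** → the next conversation starts with context

---

## 9. When to start writing, and when not to

Given your **many projects**, **do not install the full set in every one**. Decision table:

| Project type | AGENTS.md | NOTES.md | pre-commit hook |
|---------|:---:|:---:|:---:|
| One-off script / playground | ❌ | ❌ | ❌ |
| Small personal tool (< 5k lines) | ✅ Simplified (Scope + Commands only) | ❌ | ✅ |
| Mid-size project (5k-50k lines) | ✅ Full | ✅ Short (DO NOT TOUCH + Anti-patterns only) | ✅ |
| Large refactoring project (the kind you described) | ✅ Full | ✅ **Full** (Refactor log required) | ✅ |
| Client / team project | ✅ + add Conventional Commits | ✅ | ✅ |

---

## 10. What v3.0 cut compared with v2.0

| Item | v2.0 | v3.0 | Reason |
|------|------|------|---------|
| ARCHITECTURE.md | Separate file | **Merged into NOTES.md** | No need to split when you work alone |
| ADR directory | `docs/decisions/` | **Cut** | Working alone, there is no one to make decisions with |
| File header templates | Templates for three languages | **Cut** | AI refactors move files, so headers only add conflicts |
| Subdirectory AGENTS.md | Override mechanism | **Cut** | Hard to maintain across many projects |
| Auto-generation script | `gen-arch-facts.sh` | **Cut** | The stack does not change often |
| Conventional Commits | Enforced | **Cut** | Pointless when one person writes the commits |
| commit-msg hook | Enforced `arch:` | **Cut** | Same as above |
| Anchor validation | CI lint | **Cut** | Maintenance cost > benefit |
| Skills folder | 3 skills | **Cut** | The agent does not use them on its own |
| Project map | hints section | ✅ Kept | Actually useful |
| Boundaries | Three tiers | ✅ Kept and strengthened | Actually useful |
| Commands section | At the top | ✅ Kept | Actually useful |
| **Scope Discipline section** | ❌ | ✅ **New (core)** | Your real pain point |
| **NOTES.md Found while working** | ❌ | ✅ **New** | Addresses scope creep |
| **Self-check list** | ❌ | ✅ **New** | Makes the AI constrain itself |

**Net result**: 11 items removed, 3 added (aimed directly at your real pain points), **total complexity down 70%**.

---

## 11. Experiments left for you to run

Do not take my word that "this version is good". Run the experiments:

**Experiment 1: does scope discipline work? (one week)**

- Install AGENTS.md and give the AI one bug-fix task
- Observe: did the AI declare a scope? Did it stop when the diff passed 200 lines?
- If the AI does not comply, the prompt is not strong enough; raise the emphasis (upper case / repetition / echo it once at the top of CLAUDE.md)

**Experiment 2: is "Found while working" in NOTES.md actually used? (one week)**

- Check whether the AI really records what it "wanted to do but held back"
- If not → add a rule to AGENTS.md: "At the end of the task you **must** answer: 'Did I find any out-of-scope problems? Did I record them in NOTES.md?'"

**Experiment 3: does your review time really drop? (two weeks)**

- Time it: how long did reviewing one AI bug fix take on average before v3.0, and how long after?
- If it did not drop, the problem is not in the document system and may be in your review process

**Tell me the numbers in two weeks.** If they are off, we iterate to v3.1.

---

## 12. A final piece of honesty

The fundamental mistake in my first two versions was **not recognizing "AI scope creep" as your core pain point**. You said "I can't follow it anymore". That is the symptom; **the cause is the AI doing things beyond what you authorized**.

The whole v3.0 design has one goal: **make the AI stay strictly within scope so that your review is always a diff of 200 lines or fewer**. The document system is only the means; **discipline is the goal**.

> **Get the data structure right and the rules shrink. This version is cut down to 2 files + 1 hook because every remaining item maps to one of your concrete pain points.**

> **If it is still wrong, then I do not know the details of your workflow well enough. In two weeks, throw the data in my face.**

Stop guessing. Go ship it.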
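
The diff budget this post keeps returning to (200 lines / 5 files) can also be checked mechanically during review. The sketch below is only an illustration: the function name `check_diff_budget` is hypothetical, it assumes "diff lines" means added plus deleted lines, and it expects the tab-separated text that `git diff --numstat` prints.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    files: int
    lines: int
    violations: list[str]


def check_diff_budget(numstat: str, max_lines: int = 200, max_files: int = 5) -> BudgetReport:
    """Compare `git diff --numstat` output with a diff size budget.

    Assumption: the budget counts added + deleted lines. Each numstat row
    has the form "<added>\t<deleted>\t<path>".
    """
    files = 0
    lines = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        files += 1
        # Binary files are reported as "-\t-\t<path>": count the file, not lines.
        if added.isdigit() and deleted.isdigit():
            lines += int(added) + int(deleted)

    violations = []
    if lines > max_lines:
        violations.append(f"diff touches {lines} lines (budget {max_lines})")
    if files > max_files:
        violations.append(f"diff touches {files} files (budget {max_files})")
    return BudgetReport(files, lines, violations)


# Sample input in numstat format; real input would come from
# `git diff --cached --numstat`.
sample = "120\t30\tsrc/a.py\n40\t25\tsrc/b.py\n-\t-\tlogo.png\n"
for problem in check_diff_budget(sample).violations:
    print("STOP AND REPORT:", problem)
```

A check like this only covers the two numeric thresholds. Whether a change is inside the declared scope still needs the review passes described in section 8.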

0x00

Fri, 08 May 2026 17:19:47 GMT

[Traycer: Spec-Driven Development - Orchestrate Your Coding Agents](https://docs.traycer.ai/) [Introduction to Recursive Mode - Recursive Mode](https://recursive-mode.dev/introduction) [Cursor Team Kit | Cursor Plugins](https://cursor.com/cn/marketplace/cursor/cursor-team-kit) [warpdotdev/oz-skills](https://github.com/warpdotdev/oz-skills) [Diffs, from Pierre](https://diffs.com/) [Trees, from Pierre](https://trees.software/) [liuzhengdongfortest/CodeStable](https://github.com/liuzhengdongfortest/CodeStable) [codenamev/ai-software-architect: AI-powered architecture documentation framework with ADRs, reviews, and pragmatic mode. Now available as Claude Code Plugin for easiest installation.](https://github.com/codenamev/ai-software-architect) [garrytan/gstack: Use Garry Tan's exact Claude Code setup: 23 opinionated tools that serve as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA](https://github.com/garrytan/gstack/tree/main)

Your understanding scores 8.5/10: the direction is right, but there are two hidden traps

Fri, 08 May 2026 17:00:49 GMT

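Hey, **this division of labor captures the essence**: one file is the "behavior contract", the other is the "map of facts". That is clearer thinking than 90% of people manage. But I have to point out two details you will regret otherwise; skip them and in a few months you will find these two files fighting each other again.

---

## 1. First, the parts you got right

| File | Your positioning | My rating |
| ------------------- | ---------------- | ------ |
| **AGENTS.md** | Global rules governing agent behavior | ✅ Completely correct |
| **ARCHITECTURE.md** | Facts about the project and the relationships between modules | ✅ Completely correct |

This split maps onto a very fundamental dichotomy in computer science:

> **Declarative (What is) vs imperative (What to do)**
>
> ARCHITECTURE.md = declarative, describes the state of the world
> AGENTS.md = imperative, prescribes the agent's actions

**Both files are read by the AI agent, but they are consumed in completely different ways**:

- ARCHITECTURE.md is the agent's **world model** (queried passively)
- AGENTS.md is the agent's **code of conduct** (followed actively)

This layering has countless isomorphic cases in computer science: data vs code, schema vs query, ontology vs rules. **Your intuition found a very stable abstraction boundary**, and that is good taste.

---

## 2. But there are two hidden traps that will cave in

### Trap 1: the blurry boundary, where "facts" and "rules" disguise themselves as each other

Here are a few real cases. Decide where each one belongs:

| Content | Where would you put it? | Correct answer |
|------|------------|---------|
| "The auth module may only be exposed through IAuthService" | ? | **ARCHITECTURE** (factual contract) |
| "When changing auth, the IAuthService tests must be updated too" | ? | **AGENTS** (action rule) |
| "src/core does not depend on src/api" | ? | **ARCHITECTURE** (factual constraint) |
| "Agents are forbidden from creating dumping grounds like utils.py" | ? | **AGENTS** (behavioral prohibition) |
| "This project uses pytest, not unittest" | ? | **🚨 Trap** |

The last one is the trap, because it is **both a fact and a rule**:

- "The project uses pytest" = fact → ARCHITECTURE
- "The agent must use pytest when writing tests" = rule → AGENTS

**Wrong approach**: write a copy on both sides. Three months later the two copies drift, the agent sees a contradiction, and the whole trust system collapses.

**Right approach**: **write the fact once, in ARCHITECTURE, and have AGENTS reference it instead of copying it**.

```markdown
# AGENTS.md

## Testing Rules

- When writing tests, follow the tool stack declared in ARCHITECTURE.md §3 "Testing Stack"
- Do not introduce a new test framework; change ARCHITECTURE first, then the code
```

**Single Source of Truth principle**: a fact can have only one owner; rules may reference facts but must not duplicate them.

### Trap 2: AGENTS.md turns into a "rule dumping ground"

This problem is more insidious. A file like AGENTS.md **naturally tends toward unbounded growth**: every time the agent makes a mistake, a human wants to add one more defensive rule. After three months you get:

```markdown
# AGENTS.md (the horrifying state after 3 months)

## Rules

1. Do not import requests directly
2. Do not log on the hot path
3. Do not use print; use logger
4. Do not commit .env files
5. Do not change migration history
6. Write a reproducing test before fixing a bug
7. PR titles must use conventional commits
8. Do not deploy on Friday afternoon
... (continues to 200 items)
```

**That is the end of it**: put 200 rules into the system prompt and **the LLM will ignore 80% of them**. Attention is a scarce resource; the more rules there are, the less each one does.

**Linus would ask**: what is the access pattern of this data structure (a "list of rules")? Does the agent linearly scan 200 items before every action? This is a matter of taste.

### The fix: AGENTS.md must be layered and capped

```markdown
# AGENTS.md

## §1 Hard Constraints (≤10 items, stop on violation)

- Read ARCHITECTURE.md before making architectural changes
- Do not commit secrets
- Do not change migration history
...

## §2 Workflow (process, not prohibitions)

- Fix a bug: reproduce → write a failing test → fix → verify
- Add a feature: first check ARCHITECTURE §1 for conflicting decisions
- Change the architecture: change ARCHITECTURE.md first, then the code

## §3 Style Defaults (soft rules, can be overridden by the task)

- Python: black + ruff
- Commit: conventional commits
...

## §4 Anti-Patterns (for reference on review)
```

**Key design points**:

- **§1 must stay at ≤10 items**; going over means you are using rules to patch gaps in the requirements
- **§2 is process, not prohibition**: telling the agent what to do costs fewer tokens than telling it what not to do
- **§4 is a separate section**, so the agent knows "this part can be skipped and looked up when needed"

---

## 3. The complete division of labor, corrected

```
ARCHITECTURE.md (world model, declarative, facts)
├─ §1 Why: design decisions and their reasons
├─ §2 What: module boundaries + contracts
├─ §3 How: key data flows
└─ Glossary: terminology alignment

Readers: humans + agents
Update trigger: architectural change
Token budget: 3-5k

        ▲
        │ references (does not copy)
        │

AGENTS.md (code of conduct, imperative, rules)
├─ §1 Hard Constraints (≤10 items)
├─ §2 Workflow (how to do things)
├─ §3 Style Defaults (soft conventions)
└─ §4 Anti-Patterns (reference)

Readers: agents only
Update trigger: the same class of mistake is repeated
Token budget: 1-2k (§1+§2+§3), §4 not counted
```

**The two files together stay under 7k tokens**, which fits in the system prompt of any mainstream model.

---

## 4. A question you did not ask: what about CLAUDE.md?

Since you have already arrived at AGENTS.md, sooner or later you will run into this question: **what do you do with the pile of AI-specific config files, CLAUDE.md / .cursorrules / .windsurfrules / AGENTS.md?**

**My suggestion**:

- **AGENTS.md = general rules** (vendor-neutral, valid for every agent)
- **CLAUDE.md / .cursorrules = tool-specific instructions** (only optimizations unique to that tool, such as Cursor's `@` usage or Claude Code's sub-agent calling conventions)
- **The first line of CLAUDE.md is always `Read AGENTS.md first.`**

That way, switching tools means changing one small file, and **the core rules always travel with the project**. **Vendor lock-in is technical debt, and it has to be prevented at the level of the rule files**.

---

## 5. Your final score

Your split (AGENTS = behavior rules / ARCHITECTURE = facts) is **100% correct in architectural direction**, but it needs three engineering details added:

| Dimension | Your plan | Hardening suggestion |
|------|---------|---------|
| **Boundary** | ✅ Clear | ⚠️ Add the SSOT principle: facts are referenced, never duplicated |
| **Scalability** | ⚠️ May bloat | ⚠️ Layer AGENTS.md and cap Hard Constraints |
| **Vendor neutrality** | ❌ Not considered | ⚠️ Two-level structure: AGENTS.md (general) + CLAUDE.md (tool-specific) |

**Total: 8.5/10**. One point off for the risk of rule bloat, half a point off for not considering vendor lock-in.

---

## 6. One last challenge for you

I keep stressing that "AGENTS.md references ARCHITECTURE.md and does not copy it". But there is one execution detail I left out:

**When ARCHITECTURE.md §3 changes, how do you guarantee that every rule in AGENTS.md that references §3 is still valid?**

Hints:

- If a section title changes, the reference breaks
- If a section's content changes, the premise of the rule may no longer hold
- CI can detect the former but not the latter

If this problem is not solved, "reference instead of copy" turns into **a rot worse than copying**, because you believe it has stayed fresh when it has quietly gone bad. Think about how to handle it, and tell me your plan once you have worked it out.

> **Get the data structure right and the rules shrink. The more rules you have, the clearer it is that your fact layer (ARCHITECTURE) was not built properly.**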
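
The first hint in the challenge above (a renamed or removed section title breaks the reference) is the part a CI step can catch. The sketch below is one possible shape for such a check, not a full answer to the challenge. The function name `find_broken_refs` is hypothetical, and it assumes that ARCHITECTURE.md headings look like `## §3 Testing Stack` and that references follow the `ARCHITECTURE.md §3 "Testing Stack"` form used in the examples. It cannot detect the second hint, where the content changes but the title stays the same.

```python
import re

# Matches references such as: ARCHITECTURE.md §3 "Testing Stack" or ARCHITECTURE §1
REF_RE = re.compile(r'ARCHITECTURE(?:\.md)?\s+§(\d+)(?:\s+"([^"]+)")?')
# Matches headings such as: ## §3 Testing Stack (assumed heading convention)
HEADING_RE = re.compile(r'^#{1,6}[ \t]+§(\d+)[ \t]+(.*?)[ \t]*$', re.MULTILINE)


def find_broken_refs(agents_md: str, architecture_md: str) -> list[str]:
    """Return one message per AGENTS.md reference that no longer resolves."""
    headings = {int(n): title for n, title in HEADING_RE.findall(architecture_md)}
    problems = []
    for match in REF_RE.finditer(agents_md):
        number, title = int(match.group(1)), match.group(2)
        if number not in headings:
            problems.append(f"§{number}: no such section in ARCHITECTURE.md")
        elif title is not None and title != headings[number]:
            # The section exists but was renamed since the rule was written.
            problems.append(f'§{number}: expected "{title}", found "{headings[number]}"')
    return problems


architecture = "# ARCHITECTURE.md\n\n## §1 Why\n\n## §3 Test Tooling\n"
agents = (
    '- Follow the stack declared in ARCHITECTURE.md §3 "Testing Stack"\n'
    "- Check ARCHITECTURE §1 for conflicting decisions\n"
)
for problem in find_broken_refs(agents, architecture):
    print(problem)
```

Quoting the section title in each reference is what makes a rename detectable; a bare `§3` reference only fails when the section number disappears.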

Untitled

Fri, 08 May 2026 03:06:59 GMT

[Lobe UI](https://ui.lobehub.com/~demos/src-markdown-demo-streamingprofiler) [Innei/Kagura: A divine dance of AI in Slack — Kagura runs Claude Agent SDK and OpenAI Codex CLI natively in your workspace with thread-aware context, streaming UX, and persistent memory.](https://github.com/Innei/kagura)

A good AGENTS.md is a model upgrade. A bad one is worse than no docs at all.

Thu, 07 May 2026 07:07:40 GMT

![图像](https://pbs.twimg.com/media/HGj9QINa0AAU6zp?format=jpg&name=large) We pulled dozens of **AGENTS.md** files from across our monorepo and measured their effect on code generation. The best ones gave our coding agent a quality jump equivalent to upgrading from Haiku to Opus. The worst ones made the output worse than having no **AGENTS.md** at all. That gap was surprising enough that we built a systematic study around it. What we found: most of what people put in **AGENTS.md** either doesn't help or actively hurts, and the patterns that work are specific and learnable. ## The same file can help one task and hurt another by 30% A single **AGENTS.md** isn't uniformly good or bad. The same file boosted **best\_practices** by 25% on a routine bug fix and dropped **completeness** by 30% on a complex feature task in the same module. On the bug fix, a decision table for choosing between two similar data-fetching approaches helped the agent pick the right pattern immediately and stay within codebase standards. On the feature task, the agent read that same file, got pulled into the reference section, opened dozens of other markdown files trying to verify its approach against every guideline, created unnecessary abstractions, and shipped an incomplete solution. Different blocks of the document had opposite effects on different tasks. What follows is which patterns work, which fail, and how to tell which is which for your codebase. ## How we measured this We used AuggieBench, one of our internal eval suites, to evaluate how well agents do our internal dev work. We start with high-quality PRs from a large repo that reflect typical day-to-day agent tasks, set up the environment and prompt, and ask the agent to do the same task. Then we compare its output against the golden PR, the version that actually landed after review by multiple senior engineers. We filtered out PRs with scope creep or known bugs. 
For this study, we added two more filters: PRs had to be contained within a single module or app, and the scope had to be one where information in an **AGENTS.md** might plausibly help. We then ran each task twice, with and without the file, and compared scores. ## What works **1\. Progressive disclosure beats comprehensive coverage** Treat your **AGENTS.md** like a skill. Cover the common cases and workflows at a high level, then push details into reference files the agent can load on demand. Keep each reference's scope clear so the agent knows when to pull it in. The 100–150 line **AGENTS.md** files with a handful of focused reference documents were the top performers in our study, delivering 10–15% improvements across all metrics in mid-size modules of around 100 core files. Once the main file got longer than that, the gains started reversing. **2\. Procedural workflows take agents from failing to finishing** Describing a task as a numbered, multi-step workflow was one of the strongest patterns we measured. A well-written workflow can move the agent from unable to complete a task to producing a correct solution on the first try. One example from our codebase: a six-step workflow for deploying a new integration. The agent followed it step by step. The share of PRs with missing wiring files dropped from 40% to 10%, and the agent finished faster on average. **Correctness** went up 25%. **Completeness** went up 20%. For complex workflows, keep the main file concise and use reference files for branching cases. **3\. Decision tables resolve ambiguity before the agent writes code** When your codebase has two or three reasonable ways to do something, decision tables force the choice up front. This is the pattern that most directly improved adherence to codebase conventions. **Example**: resolving React Query vs Zustand for state management. ![图像](https://pbs.twimg.com/media/HGj7wifb0AAkGKd?format=jpg&name=large) PRs in this area scored 25% higher on **best\_practices**. 
The table resolved the ambiguity before the agent wrote a single line of code. **4\. Examples from the real codebase improve code reuse** Short snippets of 3–10 lines from actual production code improved reuse and pattern adherence. Keep it to a few examples that are most relevant and not duplicative. More than that and the agent starts pattern-matching on the wrong thing. ![图像](https://pbs.twimg.com/media/HGj86OCaoAAEbh7?format=jpg&name=large) **5\. Domain-specific rules still matter** This is the pattern most people already associate with **AGENTS.md**: language- or org-specific gotchas. ![图像](https://pbs.twimg.com/media/HGj9AI1aMAAv_Ts?format=jpg&name=large) This works when the rule is specific and enforceable. It stops working when you stack dozens of them. See the overexploration section below. **6\. Pair every "don't" with a "do"** Warning-only documentation consistently underperformed documentation that paired prohibitions with a concrete alternative. If you add **Don't instantiate HTTP clients directly**, pair it with **Use the shared apiClient from lib/http with the retry middleware.** The first on its own makes the agent cautious and exploratory. The pair tells it what to do and moves on. **AGENTS.md** files with 15+ sequential "don'ts" and no "dos" caused the agent to over-explore, stay conservative, and do less work. More on that below. **7\. Keep your code modular, and AGENTS.md too** The best-performing agent docs described relatively isolated submodules. Mid-size modules, around 100 core files, with a 100–150 line **AGENTS.md** and a few reference documents, were where we saw the 10–15% cross-metric gains. Examples: UI components of the client, standalone services. Huge, cross-cutting **AGENTS.md** files at the repo root underperformed module-level ones. But the document itself is only part of the story. In our study, the worst-performing **AGENTS.md** files were the ones sitting on top of massive surrounding documentation. 
One module had 37 related docs totaling about 500K characters. Another had 226 docs totaling over 2MB. In both cases, removing just the **AGENTS.md** barely changed agent behavior. The agent kept finding and reading the surrounding doc sprawl, and the sprawl was the problem. If your **AGENTS.md** is good but your module has 500K of specs around it, the specs are what the agent is reading. Fix the documentation environment, not just the entry point. ## Where AGENTS.md falls short **The overexploration trap** This is the most common failure mode we observed, and it's essentially context rot. Two patterns cause it: **1\. Too much architecture overview.** The agent gets pulled into reading documentation files, sometimes dozens of them, trying to "better understand the architecture." It loads tens or hundreds of thousands of tokens of context, and the output gets worse. ![图像](https://pbs.twimg.com/media/HGj8QpWakAAl59G?format=jpg&name=large) **2\. Excessive warnings** A big section of "don'ts" without matching "dos" produces a specific failure. The agent reads each instruction, tries to figure out whether it applies to the current task, and starts verifying its solution against every single warning. With 30–50 warnings, that means reading migration scripts, checking API version compatibility, and exploring auth middleware code, even on a task where none of it matters. ![图像](https://pbs.twimg.com/media/HGj8ZAHb0AAU2UL?format=jpg&name=large) **New patterns break old documentation** If you're introducing a pattern that doesn't exist in your codebase yet, **AGENTS.md** can actively steer the agent in the wrong direction. ![图像](https://pbs.twimg.com/media/HGj8fFLboAAMpBn?format=jpg&name=large) ## Know what you're optimizing for Different patterns move different metrics. Pick the patterns that target the problem you actually have. 
![图像](https://pbs.twimg.com/media/HGj8lF7aEAAPr1r?format=jpg&name=large) ## How agents actually find your docs Before deciding how to migrate your existing documentation, it helps to know what the agent actually reads. We traced documentation discovery across hundreds of sessions. The discovery rates are lopsided enough to shape migration priorities. - **AGENTS.md** files are discovered automatically in 100% of cases, for every file in the hierarchy from the working directory by most harnesses. - References out of **AGENTS.md** are loaded on demand and read in over 90% of sessions when the agent has a reason to pull them in. - Directory-level **README.md** files aren't auto-loaded, but the agent reads them in 80%+ of sessions when it's working in that directory. After that, discovery falls off a cliff. - Nested **README**s, meaning **README** files in subdirectories the agent isn't currently working in, get discovered only about 40% of the time. - Orphan docs in **\_docs/** folders that nothing references get read in under 10% of sessions. One service in our codebase had 30K of detailed protocol design, throttling rules, and security docs in **\_docs/**. The agent never opened most of them across dozens of sessions. **AGENTS.md is the only documentation location with reliable discovery.** If something needs to be seen, it either lives there or is directly referenced from there. Moving the content into a referenced location is usually higher leverage than writing more docs. ## Migrating existing docs Every company already has READMEs, architecture docs, and design specs scattered across the repo. Here's how to turn that into something an agent can actually use. **Should you just rename your README.md to AGENTS.md?** **README**s and **AGENTS.md** serve different audiences, but they can be reused. Agents are good enough at codebase summarization now that human-oriented docs are less necessary than they used to be. 
You can either write an agentic doc from scratch, or reuse your **README**. If you reuse it, trim it aggressively. Keep it short, follow the patterns above, and cut any section that's there for humans to skim.

**When to keep existing documentation**

If the docs are high quality, current, to the point, and have examples, reuse them. Reference them from module- or folder-level **AGENTS.md** files. Don't put more than 10–15 references in a single **AGENTS.md** and keep the context lean. And audit the surrounding environment: if the module around your **AGENTS.md** has dozens of architecture docs and spec files, the agent will find and read them whether you reference them or not. A focused 150-line **AGENTS.md** sitting on top of 500K of surrounding specs won't save the agent from the specs.

**AGENTS.md isn't the only path**

Agents find reference material through grep and semantic search too. About half of all search-result hits in our traces came from those tools, not from **AGENTS.md** references. If you're keeping legacy documentation, make sure the docs include relevant code examples and descriptive text that's searchable. A well-structured **AGENTS.md** gives you more control over what ends up in the context window, but it isn't the only way in.

## What this study didn't cover

We focused on one-shot trajectories and the agent's ability to finish coding tasks without human intervention. We didn't look at best practices for maintaining **AGENTS.md** over time, though we're exploring that now. We also didn't cover operational, interactive, or analytics tasks. Those are coming in future posts.

Written by: Slava Zhenylenko (Member of Technical Staff)
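
Several of the article's numeric findings (100–150 line files, no more than 10–15 references, trouble at 15+ sequential "don'ts") lend themselves to a quick self-audit. The sketch below is a rough heuristic under stated assumptions, not the method used in the study: the function name `audit_agents_md`, the regular expressions, and the choice to count only Markdown links to `.md` files and bulleted prohibitions are all illustrative.

```python
import re

MAX_LINES = 150        # the study's top performers were 100-150 lines
MAX_REFERENCES = 15    # the article suggests no more than 10-15 references
DONT_RUN_LIMIT = 15    # 15+ sequential "don'ts" correlated with over-exploration

# Assumption: references are Markdown links to .md files.
LINK_RE = re.compile(r"\[[^\]]+\]\(([^)]+\.md)\)")
# Assumption: a prohibition is a bullet starting with don't / do not / never.
DONT_RE = re.compile(r"^\s*[-*]\s+(don['’]t|do not|never)\b", re.IGNORECASE)


def audit_agents_md(text: str) -> list[str]:
    """Return heuristic warnings for an AGENTS.md file's contents."""
    lines = text.splitlines()
    warnings = []

    if len(lines) > MAX_LINES:
        warnings.append(f"{len(lines)} lines; the best files in the study had 100-150")

    refs = set(LINK_RE.findall(text))
    if len(refs) > MAX_REFERENCES:
        warnings.append(f"{len(refs)} distinct .md references; the article suggests at most 10-15")

    # Find the longest run of consecutive prohibition bullets.
    longest = run = 0
    for line in lines:
        if DONT_RE.match(line):
            run += 1
            longest = max(longest, run)
        elif line.strip():  # blank lines do not break a run
            run = 0
    if longest >= DONT_RUN_LIMIT:
        warnings.append(f"{longest} consecutive prohibitions; pair each with a concrete alternative")

    return warnings


example = (
    "# AGENTS.md\n\n"
    "- Don't instantiate HTTP clients directly.\n"
    "- Use the shared apiClient from lib/http with the retry middleware.\n"
)
print(audit_agents_md(example))
```

A clean result says nothing about the surrounding documentation, which the article identifies as a separate problem; this only looks at the entry-point file itself.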

Refactoring the Chiron ContextEngine

Thu, 07 May 2026 05:00:09 GMT

当前这轮对话的约束是：只能做信息整理，不能修改或保存文件。所以我不能直接把文件写入仓库。但我先把完整 PRD 文档写好，你可以保存为： docs/prd-contextengine-ast-outline-unity.md 下面是正文内容。 ────────────────────────────────────────────────────────────────────────────────────── PRD：基于 ast-outline + Lightweight Unity Metadata Collector 重构 Chiron ContextEngine 1. 文档信息 | 项目 | 内容 | |---|---| | 文档名称 | Chiron ContextEngine 重构 PRD | | 目标模块 | cli/src/context-engine.mjs 及相关上下文构造流程 | | 方案关键词 | ast-outline、Unity Metadata Collector、C#、ContextEngine、repo-aware prompt | | 文档语言 | 中文 | | 状态 | Draft | | 目标用户 | Unity/C# 项目开发者、终端 AI 编程工具用户、Chiron 维护者 | ──────────────────────────────────────────────────────────────────────── 2. 背景 Chiron 当前的 ContextEngine 主要通过本地启发式规则收集项目上下文，包括： • 根据固定 marker 文件判断项目类型 • 根据固定源码扩展名判断是否为源码文件 • 读取少量 key files • 基于 prompt 关键词匹配文件名 / 路径 • 拼接有限的项目描述与相关文件内容该设计对 Node.js、Python、Rust、Go 等项目有一定效果，但对 Unity/C# 项目几乎无效。主要原因包括： 1. 没有识别 .cs 源文件。 2. 没有识别 Unity 项目标记，例如 Assets/、Packages/manifest.json、ProjectSettings/ProjectVersion.txt。 3. 没有读取 Unity 项目关键配置。 4. 相关文件检索只看文件名和路径，无法理解 C# 类、方法、字段和继承关系。 5. 不理解 Unity 常见上下文，例如 MonoBehaviour、ScriptableObject、Prefab、Scene、Inspector 序列化字段等。 6. 本地上下文输出对 Gemini / Claude 等后端模型帮助有限。因此，需要将 ContextEngine 从“浅层启发式扫描器”升级为“可插拔的本地上下文编排器”。 ───────────────────────────────────────────────────────────────────────────────── 3. 产品目标 3.1 核心目标重构 Chiron 的 ContextEngine，让它能够在 Unity/C# 项目中生成高质量 repo-aware prompt context。具体目标： 1. 通过 ast-outline 获取 C# 代码结构、符号、方法、依赖关系和语义检索结果。 2. 通过轻量 Unity Metadata Collector 收集 Unity 项目配置与资源结构。 3. 保留 Chiron 自身的 prompt workflow 能力，负责将代码上下文、Unity 元数据、git 状态和用户原始需求组合成高质量 prompt。 4. 在没有安装 ast-outline 时，提供安全、可用的 fallback。 5. 不将 Chiron 变成大型 IDE / LSP / Roslyn 工具，而是保持轻量 CLI 工具定位。 ────────────────────────────────────────────────────────────────────────────── 4. 非目标本次重构不追求： 1. 完整替代 Roslyn / Rider / Visual Studio 的 C# 语义分析。 2. 完整解析 Unity Scene / Prefab / .meta GUID 引用关系。 3. 自动修改代码。 4. 构建长期运行的本地索引服务。 5. 强制用户安装 ast-outline。 6. 将所有语言支持都改为复杂 adapter 架构。 7. 
完整实现 Unity Editor 内部状态读取。 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 5. 用户画像 5.1 Unity 独立开发者用户在 Unity 项目中使用 Gemini CLI / Claude Code / Codex 等终端 AI 工具，希望输入粗糙需求后，Chiron 能自动补充 Unity/C# 项目上下文。示例需求： • “角色跳跃有时候没反应，帮我修一下” • “Inventory UI 打开后按钮没反应” • “敌人受击逻辑有 bug” • “这个 ScriptableObject 配置为什么运行时没生效” 5.2 Chiron 高级用户用户希望 Chiron 能在不同项目类型中自动收集上下文，但不想手动复制相关文件。 5.3 Chiron 维护者维护者希望提升 repo-aware 能力，但不希望在 Chiron 内部从零实现 C# AST parser、语义搜索和依赖分析。 ────────────────────────────────────────────────────────────────────── 6. 当前痛点 6.1 C# 源码不可见当前 SOURCE_EXTS 不包含 .cs，导致 Unity 脚本无法被优先识别为源码文件。 6.2 Unity 项目无法识别当前 marker 中没有： • Assets/ • Packages/manifest.json • ProjectSettings/ProjectVersion.txt • *.sln • *.csproj • *.asmdef 导致 Chiron 无法判断项目是 Unity/C#。 6.3 相关文件选择质量差当前相关文件选择主要依赖文件名和路径匹配。例如用户输入： “角色跳跃有时候没反应” 实际相关文件可能是： • PlayerController.cs • CharacterMotor.cs • LocomotionState.cs • InputReader.cs • GroundDetector.cs 但如果文件名不直接包含“jump”或“跳跃”，当前逻辑很可能找不到。 6.4 缺少代码结构摘要当前 ContextEngine 可能直接读取文件前几千字符，但不能提供： • 类名 • 方法名 • 字段名 • 继承关系 • 行号范围 • 方法级摘要 • 接口实现 • 反向依赖这会导致后端模型无法快速判断应该看哪里。 6.5 Unity 非代码上下文缺失 Unity bug 经常来自： • Inspector 引用缺失 • Prefab override • Scene 配置 • Layer / Tag 设置 • Input System 配置 • ProjectSettings • Packages/manifest.json • .asmdef 当前 Chiron 基本不收集这些信息。 ──────────────────────────────── 7. 方案概述采用双组件方案： 1. ast-outline 负责代码智能。 2. Lightweight Unity Metadata Collector 负责 Unity 项目元数据。 3. 
Chiron ContextEngine 负责上下文编排与 prompt 生成。总体架构： • AstOutlineProvider • 检测 ast-outline 是否可用 • 调用 ast-outline search • 调用 ast-outline digest • 调用 ast-outline show • 调用 ast-outline reverse-deps • 解析 JSON 输出 • UnityMetadataCollector • 检测 Unity 项目 • 读取 Unity 版本 • 读取 package manifest • 读取 asmdef 列表 • 列出重要 Unity 资源路径 • 收集 Input System / URP / HDRP / Addressables 等信号 • ContextComposer • 将用户原始 prompt、代码结构、Unity 元数据、git 状态组合成 prompt context • 输出自然语言上下文 • 为后端 Gemini / Claude 提供高质量输入 ──────────────────────────────────────────────────────────────────────────────── 8. 目标架构 8.1 新 ContextEngine 职责重构后的 ContextEngine 不再直接承担所有代码理解任务，而是负责调度多个 provider： | 模块 | 职责 | |---|---| | ProjectSnapshot | 获取文件树、git 状态、根配置文件 | | AstOutlineProvider | 调用 ast-outline 获取代码结构和相关符号 | | UnityMetadataCollector | 收集 Unity 项目元数据 | | FallbackCodeSearchProvider | 在没有 ast-outline 时使用 ripgrep / 文件名检索 | | PromptContextComposer | 生成最终 prompt context | ─────────────────────────────────────────────────── 9. 功能需求 9.1 检测 ast-outline 系统应在运行时检测 ast-outline 是否可用。检测方式： • 执行 ast-outline --version • 或执行 ast-outline help • 超时时间建议 1 秒到 2 秒 • 检测结果应缓存到本次进程生命周期内如果可用： • 使用 ast-outline 增强代码上下文如果不可用： • 回退到内置 fallback • 不应中断 Chiron 主流程 • 在 debug 模式中提示用户可以安装 ast-outline ─────────────────────────────────────────────── 9.2 Unity 项目检测系统应通过多信号判断当前仓库是否为 Unity 项目。高置信度信号： | 信号 | 权重 | |---|---:| | ProjectSettings/ProjectVersion.txt | 高 | | Packages/manifest.json | 高 | | Assets/ | 中 | | Assets/**/*.cs | 中 | | *.sln | 低 | | *.csproj | 低 | | Assets/**/*.asmdef | 中 | 判断规则： • 若存在 ProjectSettings/ProjectVersion.txt 和 Packages/manifest.json，直接判定为 Unity。 • 若存在 Assets/ 且存在 .cs 文件，也可判定为 Unity/C#。 • 若只存在 .csproj 或 .sln，判定为 C#/.NET，但不一定是 Unity。输出字段： | 字段 | 示例 | |---|---| | projectKind | unity | | language | csharp | | confidence | high | | signals | ["ProjectSettings/ProjectVersion.txt", "Packages/manifest.json", "Assets"] | ──────────────────────────────────────────────────────────────────────────────────────── 9.3 Unity 版本读取系统应读取： 
ProjectSettings/ProjectVersion.txt 提取： • Unity Editor 版本 • revision 信息，如存在示例输出： | 字段 | 示例 | |---|---| | unityVersion | 2022.3.18f1 | | unityRevision | 可选 | ────────────────────────────── 9.4 Unity Packages 读取系统应读取： Packages/manifest.json 提取： • package 名称 • package 版本 • 常见能力标签重点识别： | Package | 能力标签 | |---|---| | com.unity.inputsystem | Input System | | com.unity.render-pipelines.universal | URP | | com.unity.render-pipelines.high-definition | HDRP | | com.unity.addressables | Addressables | | com.unity.cinemachine | Cinemachine | | com.unity.netcode.gameobjects | Netcode for GameObjects | | com.unity.entities | DOTS / ECS | | com.unity.textmeshpro | TextMeshPro | 输出示例： | 字段 | 示例 | |---|---| | packages | Input System, URP, Cinemachine | | rawPackageCount | 34 | ───────────────────────────────────────────── 9.5 asmdef 收集系统应扫描： Assets/**/*.asmdef 收集： • asmdef 文件路径 • assembly 名称 • references • includePlatforms • excludePlatforms 用途： • 帮助模型理解模块边界 • 帮助识别 Runtime / Editor 分离 • 帮助分析跨 assembly 引用问题 ────────────────────────────────────────────── 9.6 Unity 资源路径收集系统应轻量列出以下资源路径，但默认不读取全文： • Assets/**/*.unity • Assets/**/*.prefab • Assets/**/*.asset • Assets/**/*.inputactions • Assets/**/Resources/** • Assets/**/AddressableAssetsData/** 默认策略： | 类型 | 默认行为 | |---|---| | .unity | 仅列路径 | | .prefab | 仅列路径 | | .asset | 仅列路径，关键配置可读前几 KB | | .inputactions | 可读摘要 | | .meta | 默认忽略 | | Library/ | 必须忽略 | | Temp/ | 必须忽略 | | Obj/ / obj/ | 必须忽略 | | Build/ / Builds/ | 必须忽略 | ──────────────────────────────────────── 9.7 使用 ast-outline search 查找相关代码当用户输入 prompt 后，系统应构造搜索 query 并调用 ast-outline search。 query 应包含： 1. 用户原始 prompt。 2. 从 prompt 中提取的英文/中文关键词。 3. 
Unity 领域补充词。例如用户输入： “角色跳跃有时候没反应” 可扩展为： • 原始文本：角色跳跃有时候没反应 • 英文补充：jump, grounded, input, Rigidbody, CharacterController • Unity 生命周期补充：Update, FixedUpdate • 输入系统补充：InputAction, Input System 要求： • 默认取 top 5 到 top 8 • 优先 JSON 输出 • 超时建议 5 秒 • 失败时回退到 ripgrep ─────────────────────────────────────────────────────────────────── 9.8 使用 ast-outline digest 获取结构摘要对搜索命中的文件或目录，系统应调用 digest 或 outline 获取结构摘要。目标是获取： • 文件路径 • 类名 • namespace • 基类 • interface • 字段 • 方法签名 • 方法行号范围 • doc comments，如有使用场景： • 给模型提供代码形状 • 避免直接塞完整 .cs 文件 • 帮助模型判断需要展开哪个方法 ──────────────────────────────────────────────────────────────────────── 9.9 使用 ast-outline show 展开关键方法当搜索结果或 digest 中出现高相关 symbol 时，系统可调用 show 展开方法体。触发条件： • 方法名与 prompt 高相关 • 方法名为 Unity 生命周期方法，并与任务相关 • 方法所在文件为 git modified • 方法包含搜索命中行 • 方法名为 TryJump、CheckGrounded、TakeDamage、OpenInventory 等明显业务方法默认限制： • 最多展开 2 到 4 个方法 • 单个方法最大字符数限制 • 总上下文 token 预算限制 • 超时失败不阻断流程 ─────────────────────────────────────────────── 9.10 使用 ast-outline reverse-deps 分析影响范围当任务类型为： • refactor • bug fix • modify existing behavior • rename • API change 且相关文件明确时，系统可调用 reverse-deps。输出给模型： • 哪些文件可能依赖当前文件 • 修改时需要注意的影响范围 • 是否需要同步检查 UI、Camera、Input、Combat 等模块 ───────────────────────────────────────────────────── 10. 
Unity 任务类型增强系统应在 prompt composer 中加入 Unity 任务类型判断。 10.1 Gameplay Debug 触发词： • 跳跃 • 移动 • 攻击 • 受击 • 敌人 • 玩家 • 手感 • 卡住 • 碰撞 • 触发器注入检查项： • Update 与 FixedUpdate 职责划分 • Rigidbody 操作是否在物理更新中 • grounded 检测是否稳定 • LayerMask 是否正确 • Collider / Trigger 设置是否正确 • Animator 状态机是否阻断逻辑 ─────────────────────────────────── 10.2 UI Debug 触发词： • UI • 按钮 • 面板 • Canvas • TMP • Inventory UI • 点击没反应注入检查项： • EventSystem 是否存在 • Button listener 是否绑定 • Canvas sorting order • Raycast Target • Graphic Raycaster • Time scale 是否影响 UI • 输入系统是否正确路由到 UI ───────────────────────────── 10.3 Serialization / Prefab 触发词： • Inspector • Prefab • 引用丢失 • MissingReferenceException • SerializedField • ScriptableObject • 配置没生效注入检查项： • [SerializeField] 字段是否被改名 • 是否需要 [FormerlySerializedAs] • Prefab override 是否覆盖运行配置 • Scene 引用是否为空 • ScriptableObject 是否引用了正确 asset ───────────────────────────────────────── 10.4 Performance 触发词： • 卡顿 • 掉帧 • GC • 性能 • Instantiate • Destroy • GetComponent • FindObjectOfType 注入检查项： • Update() 中是否频繁分配 • 是否频繁 GetComponent • 是否频繁 Instantiate / Destroy • 是否需要对象池 • 是否使用 LINQ 导致 GC • 是否需要 Profiler 验证 ─────────────────────────────────── 11. Prompt Context 输出格式最终生成的 context 应包含以下部分： 1. 项目摘要 2. Unity 元数据 3. Git 状态 4. 相关代码结构 5. 展开的关键方法 6. 依赖影响范围 7. Unity-specific 注意事项 8. 用户原始请求 9. 对后端模型的执行约束建议结构： • Project Summary • Unity Metadata • Relevant Code from ast-outline • Expanded Methods • Git Context • Unity Debugging Notes • User Request • Instructions for the coding agent ───────────────────────────────────── 12. 示例输出用户输入： “角色跳跃有时候没反应，帮我修一下” ContextEngine 应生成类似内容：项目是 Unity/C# 项目。Unity 版本为 2022.3.x。项目使用 Input System、URP、Cinemachine。根据 ast-outline 检索，最相关的代码包括： 1. Assets/Scripts/Player/PlayerController.cs • class PlayerController : MonoBehaviour • 方法：Awake、Update、FixedUpdate、TryJump、CheckGrounded • 相关原因：匹配 jump / grounded / Rigidbody / input 2. Assets/Scripts/Input/PlayerInputReader.cs • class PlayerInputReader : MonoBehaviour • 方法：OnJump、ReadMovement • 相关原因：输入系统相关 3. 
Assets/Scripts/Player/GroundDetector.cs • class GroundDetector : MonoBehaviour • 方法：IsGrounded、CheckGround • 相关原因：grounded 检测相关请重点检查： • 跳跃输入是否在 Update 中读取，但物理逻辑在 FixedUpdate 中执行。 • grounded 检测是否受 LayerMask、Raycast 长度、Collider 设置影响。 • Rigidbody 的 velocity / AddForce 使用是否正确。 • Input System 是否存在输入被消费或未启用的问题。 • 修改时保留现有 Inspector 字段，不要随意重命名 serialized field。 • 修改后给出 Unity Editor 内的验证步骤。 ──────────────────────────────────────────────────────────────────── 13. Fallback 方案当 ast-outline 不可用时，系统应回退到轻量本地检索。 Fallback 包含： 1. 文件树扫描 2. .cs 文件内容 grep 3. git changed files 加权 4. Unity metadata collector 仍然运行 5. 简单方法名 regex 提取 Fallback 不要求达到 ast-outline 精度，但必须优于当前 ContextEngine。最低要求： • 能识别 Unity 项目 • 能识别 .cs • 能搜索 Assets/**/*.cs • 能读取 Unity 版本和 packages • 能输出相关文件列表 • 能注入 Unity debug checklist ───────────────────────────────────────────────────────────────── 14. 配置项建议新增环境变量： | 环境变量 | 默认值 | 说明 | |---|---|---| | CHIRON_CONTEXT_PROVIDER | auto | auto / builtin / ast-outline | | CHIRON_AST_OUTLINE_PATH | ast-outline | 自定义 ast-outline 可执行文件路径 | | CHIRON_AST_OUTLINE_TIMEOUT_MS | 5000 | 单次 ast-outline 调用超时 | | CHIRON_UNITY_METADATA | auto | 是否启用 Unity metadata collector | | CHIRON_CONTEXT_MAX_FILES | 8 | 最大相关文件数 | | CHIRON_CONTEXT_MAX_METHODS | 4 | 最大展开方法数 | | CHIRON_CONTEXT_DEBUG | false | 输出上下文收集调试信息 | ───────────────────────────────────────────────────────── 15. 安全与隐私 15.1 本地处理原则 • Unity metadata collector 只读本地文件。 • ast-outline 只在本地执行。 • 不应默认上传任何文件到远程服务。 • 上传行为仅发生在后端模型调用阶段，例如 Gemini backend。 15.2 敏感文件处理必须默认忽略： • .env • .env.* • Library/ • Temp/ • Obj/ • obj/ • Build/ • Builds/ • UserSettings/ • .git/ • *.csproj.user • *.suo • *.pdb • *.dll • *.exe 15.3 Unity 资源读取限制默认不读取大型 .unity / .prefab 文件全文。只允许： • 列路径 • 读取小型配置文件 • 在用户明确提到 prefab / scene 时读取有限头部内容 ──────────────────────────────────────────────────── 16. 
性能要求 | 指标 | 目标 | |---|---| | 无 ast-outline 冷启动 | 小于 1 秒 | | 有 ast-outline 热路径 | 小于 3 秒 | | 单次 ast-outline search 超时 | 5 秒 | | Unity metadata collector | 小于 1 秒 | | 相关文件数量 | 默认不超过 8 | | 展开方法数量 | 默认不超过 4 | | 输出 context 大小 | 应可控，避免超过后端 prompt 预算 | ──────────────────────────────────────────────────────── 17. 错误处理 17.1 ast-outline 未安装行为： • 不报错 • 使用 fallback • debug 模式提示安装建议 17.2 ast-outline 调用失败行为： • 捕获 stderr • 不影响主流程 • fallback 到 grep • debug 模式输出失败原因 17.3 Unity 配置文件解析失败行为： • 跳过损坏文件 • 保留文件路径信息 • debug 模式记录 JSON parse error 17.4 大项目扫描超时行为： • 降级为文件树 + git changed files • 限制扫描深度 • 限制文件数量 • 输出“context may be partial”提示给后端模型 ───────────────────────────────────────────────── 18. 验收标准 18.1 Unity 项目识别给定一个标准 Unity 项目，ContextEngine 应识别为： • projectKind = unity • language = csharp • unityVersion 非空 • packages 可解析 18.2 C# 相关文件检索用户输入： “角色跳跃有时候没反应” 若项目中存在： • PlayerController.cs • CharacterMotor.cs • GroundDetector.cs 且文件内容包含 jump / grounded / Rigidbody 相关逻辑，则输出相关文件中应至少包含其中一个核心文件。 18.3 ast-outline 可用时当系统安装 ast-outline 时： • 应调用 ast-outline search • 应能输出结构化 symbol 摘要 • 应优先使用 ast-outline 结果 18.4 ast-outline 不可用时当系统未安装 ast-outline 时： • Chiron 不应失败 • 应回退到 fallback • 仍应识别 Unity 项目 • 仍应输出 .cs 相关文件 18.5 Prompt 质量最终 prompt context 应明确包含： • Unity/C# 项目身份 • Unity 版本 • 关键 packages • 相关 .cs 文件 • 相关类 / 方法 • Unity-specific 检查项 • git changed files，如有 ───────────────────────────── 19. 
测试计划 19.1 单元测试测试模块： • detectAstOutlineAvailable • UnityMetadataCollector.detectProject • readUnityVersion • readUnityManifest • collectAsmdefs • composeUnityContext • fallbackSearchCSharpFiles 19.2 Fixture 测试创建 fixture： • fixtures/unity-basic • fixtures/unity-input-system • fixtures/unity-urp • fixtures/unity-asmdef • fixtures/dotnet-non-unity 验证： • Unity detection • manifest parsing • relevant files search • context output snapshot 19.3 Mock ast-outline 测试通过 fake ast-outline binary 或 mock exec 结果测试： • search JSON parsing • digest JSON parsing • show JSON parsing • timeout handling • command failure fallback 19.4 集成测试运行： • chiron-enhance "角色跳跃有时候没反应" • 验证输出包含 Unity/C# 上下文 • 验证未安装 ast-outline 时仍可运行 • 验证安装 ast-outline 时输出更丰富 ───────────────────────────────────── 20. 实施阶段 Phase 1：Unity metadata collector 目标： • 支持 Unity 项目识别 • 支持 Unity 版本读取 • 支持 package manifest 读取 • 支持 .cs fallback 搜索价值： • 即使没有 ast-outline，Unity 项目也不再是空上下文。 ─────────────────────────── Phase 2：AstOutlineProvider 目标： • 检测 ast-outline • 调用 search • 调用 digest • 解析 JSON 输出 • 集成到 findRelevantFiles 价值： • 显著提升 C# 代码相关文件选择质量。 ────────────────────────────────────── Phase 3：Method-level Context 目标： • 基于 search / digest 结果调用 show • 展开关键方法 • 控制 token budget 价值： • 后端模型能看到真正相关的方法体，而不是整个文件。 ──────────────────────────────────────────────────── Phase 4：Dependency Impact 目标： • 对 refactor / bugfix 任务调用 reverse-deps • 输出影响范围价值： • 降低修改破坏其他模块的风险。 ────────────────────────────────────────────── Phase 5：Prompt Composer 优化目标： • 重写 buildNaturalContext • 增加 Unity-specific prompt checklist • 增加 debug 输出 • 增加 context quality report 价值： • 输出更稳定、更可解释、更适合 LLM 后端。 ─────────────────────────────────────────── 21. 
风险与应对 21.1 依赖外部二进制风险： • 用户未安装 ast-outline • Windows PATH 问题 • 首次索引慢应对： • 可选依赖 • fallback • debug 提示 • 配置 CHIRON_AST_OUTLINE_PATH ──────────────────────────────── 21.2 ast-outline 输出格式变化风险： • JSON schema 变化导致解析失败应对： • 检查 schema 字段 • 容错解析 • fallback 到 text 模式或内置检索 • 测试覆盖常见输出格式 ─────────────────────────────────── 21.3 Unity 项目过大风险： • Assets/ 文件数量巨大 • .unity / .prefab 文件过大 • 扫描耗时应对： • 默认忽略大型目录 • 限制扫描深度 • 限制文件大小 • 优先 git changed files • 优先 Assets/Scripts/**/*.cs ─────────────────────────────── 21.4 过度注入 Unity checklist 风险： • prompt 变啰嗦 • 对简单任务干扰模型应对： • checklist 按任务类型注入 • 限制长度 • debug 模式展示注入原因 • 用户可通过 env 关闭 ──────────────────────────── 22. 成功指标 22.1 定量指标 | 指标 | 目标 | |---|---| | Unity 项目识别准确率 | 大于 95% | | C# 相关文件召回率 | 明显高于当前版本 | | 平均 context 构造耗时 | 小于 3 秒 | | 未安装 ast-outline 时成功运行率 | 100% | | 输出 prompt token 增量 | 可控，不超过预算 | 22.2 定性指标 • Unity 用户能明显感知 Chiron “知道这是 Unity 项目”。 • 后端模型能更快定位相关 .cs 脚本。 • 对常见 Unity bug，prompt 中能自动提示生命周期、Inspector、Prefab、Layer 等风险。 • Chiron 不再表现得像只适配 Node/Python 项目。 ────────────────────────────────────────────── 23. 推荐最终形态最终的 ContextEngine 应该从当前的： “用少量硬编码规则猜项目和相关文件” 升级为： “本地上下文编排器” 即： • 使用 ast-outline 获取代码结构与相关 symbol • 使用 Unity metadata collector 获取项目语义 • 使用 git context 获取当前工作状态 • 使用 prompt composer 生成面向 LLM 的结构化上下文一句话总结： Chiron 不需要自己成为 C# / Unity 代码智能引擎；它应该把 `ast-outline` 作为代码导航层，把 Unity metadata collector 作为项目语义层，然后专注于生成高质量 prompt context。
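As an illustration of the task-type enhancement in section 10, the sketch below maps trigger words in the user prompt to a capped Unity checklist. Everything named here (`UnityTaskType`, `TASK_RULES`, `detectUnityTaskTypes`, `buildUnityChecklist`) is hypothetical and not an existing Chiron API; the trigger and checklist entries are a subset of the lists in section 10. It uses plain substring matching, so a short trigger such as "UI" is left out because it would need word-boundary handling.

```typescript
// Hypothetical sketch: none of these names exist in Chiron today.
type UnityTaskType = "gameplay-debug" | "ui-debug" | "serialization" | "performance";

interface TaskRule {
  type: UnityTaskType;
  triggers: string[];  // lowercase substrings looked up in the prompt
  checklist: string[]; // notes injected into the prompt context
}

const TASK_RULES: TaskRule[] = [
  {
    type: "gameplay-debug",
    triggers: ["跳跃", "移动", "攻击", "受击", "卡住", "碰撞", "触发器"],
    checklist: [
      "Check how responsibilities are split between Update and FixedUpdate.",
      "Check that Rigidbody operations run in the physics update.",
      "Check that grounded detection is stable and the LayerMask is correct.",
    ],
  },
  {
    type: "ui-debug",
    triggers: ["按钮", "面板", "canvas", "点击没反应"],
    checklist: [
      "Check that an EventSystem exists.",
      "Check that Button listeners are bound.",
      "Check Raycast Target and Graphic Raycaster settings.",
    ],
  },
  {
    type: "serialization",
    triggers: ["inspector", "prefab", "引用丢失", "scriptableobject", "配置没生效"],
    checklist: [
      "Check whether a [SerializeField] field was renamed.",
      "Check whether [FormerlySerializedAs] is needed.",
    ],
  },
  {
    type: "performance",
    triggers: ["卡顿", "掉帧", "性能", "instantiate", "getcomponent", "findobjectoftype"],
    checklist: [
      "Check for per-frame allocations in Update().",
      "Check for frequent GetComponent or Instantiate / Destroy calls.",
      "Verify with the Profiler before and after the change.",
    ],
  },
];

// Return every task type whose trigger list matches the prompt.
function detectUnityTaskTypes(prompt: string): UnityTaskType[] {
  const text = prompt.toLowerCase();
  return TASK_RULES
    .filter((rule) => rule.triggers.some((t) => text.indexOf(t) !== -1))
    .map((rule) => rule.type);
}

// Collect checklist items for the matched task types, deduplicated and capped
// so the injected checklist stays short (see the risk noted in section 21.4).
function buildUnityChecklist(prompt: string, maxItems: number = 6): string[] {
  const types = detectUnityTaskTypes(prompt);
  const items: string[] = [];
  for (const rule of TASK_RULES) {
    if (types.indexOf(rule.type) === -1) continue;
    for (const item of rule.checklist) {
      if (items.indexOf(item) === -1 && items.length < maxItems) items.push(item);
    }
  }
  return items;
}
```

A prompt that matches no rule yields an empty checklist, so simple tasks get no Unity-specific notes at all.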

Untitled

Thu, 07 May 2026 04:51:44 GMT

aider's repo-map: [aider/aider/repomap.py at 3ec8ec5a7d695b08a6c24fe6c0c235c8f87df9af · Aider-AI/aider](https://github.com/Aider-AI/aider/blob/3ec8ec5a/aider/repomap.py#L42-L88)

Infrastructure

Wed, 06 May 2026 11:49:16 GMT

Codex Desktop App: A Full Tech Stack Overview

### Infrastructure

| Layer | Technology |
|---|---|
| Runtime | Electron 41.2.0 + Chromium 146 + Node.js 22 |
| Packaging | MSIX (Windows Store) + Squirrel (auto-update) |
| Build tooling | electron-forge 7 + Vite 8 |
| Language | TypeScript 5.9.3 |
| Package manager | pnpm (monorepo workspace) |
| CDP debugging | Enabled by default at startup on 127.0.0.1:9222 |

### Frontend UI Stack

| Category | Library | Purpose |
|---|---|---|
| Framework | React 19.2 | UI core |
| Routing | react-router-dom 7.13 | Page navigation |
| State management | Jotai 2.19 + @tanstack/react-query 5.90 | Atomic state + server state |
| Forms | @tanstack/react-form 1.27 | Form handling |
| UI components | Radix UI (20+ components) | Unstyled headless component library |
| Styling | Tailwind CSS + tailwind-merge | Atomic (utility-first) CSS |
| Animation | framer-motion 12.23 + Lottie | Animation |
| Drag and drop | @dnd-kit 6 | Drag-and-drop sorting |

### Editor

| Library | Purpose |
|---|---|
| Lexical 0.32 | WYSIWYG rich-text editor (by Meta) |
| ProseMirror (6 packages) | Underlying document model (code editor core) |
| xterm.js 5.5 + node-pty | Terminal emulator |
| Shiki 3.20 | VS Code-grade syntax highlighting |
| yjs + y-protocols | CRDT collaborative editing |
| diff / diff3 | Git diff engine |

### Data Visualization

| Library | Purpose |
|---|---|
| Mermaid 11.12 | Flowcharts / sequence diagrams / UML |
| d3 7.9 (full suite) | Data visualization |
| dagre-d3-es 7 | Automatic layout for directed graphs |
| cytoscape 3.33 | Graph and network visualization |
| mapbox-gl 3.21 | Map rendering |
| Three.js 0.179 | 3D rendering |

### Documents & Content

| Library | Purpose |
|---|---|
| react-markdown 10 + remark + rehype | Markdown rendering |
| KaTeX 0.16 | Math formulas |
| react-pdf + pdfjs-dist 5.4 | PDF rendering |
| docx-preview | Word document preview |
| highlight.js 11 | Code highlighting |

### Backend & System Capabilities

| Library | Purpose |
|---|---|
| better-sqlite3 12.8 | Local SQLite database |
| isomorphic-git 1.37 | Git operations (pure JS) |
| ws 8.19 | WebSocket |
| zod 4.1 | Schema validation |
| smol-toml 1.5 | TOML config parsing |
| @modelcontextprotocol/sdk 1.24 | MCP protocol support |

### Monitoring & Analytics

| Library | Purpose |
|---|---|
| Sentry (electron + node + browser) | Error monitoring (all 3 layers covered) |
| Segment | User behavior analytics |
| Statsig | Feature flags / A/B testing |
| OpenTelemetry (25+ instrumentation packages) | End-to-end tracing |

### Native Modules (compiled with node-gyp)

| Module | Size | Purpose |
|---|---|---|
| better-sqlite3 | ~3MB | SQLite database |
| node-pty | ~200KB | Pseudo-terminal |
| better-sqlite3 | embedded | SQLite |
| bufferutil / utf-8-validate | | WebSocket acceleration |

### Monorepo Internal Packages

```
openai-codex-electron (main app)
├── app-server-types (workspace:*)
├── commands (workspace:*)
├── protocol (workspace:*)
├── shared-node (workspace:*)
├── browser-backend-common (link)
└── browser-common (link)
```

Building a Virtual Filesystem for Mintlify's AI Assistant

Tue, 05 May 2026 15:27:07 GMT

![图像](https://pbs.twimg.com/media/HE3NSrYbIAAVhm8?format=jpg&name=large) RAG is great, until it isn't. Our assistant could only retrieve chunks of text that matched a query. If the answer lived across multiple pages, or the user needed exact syntax that didn't land in a top-K result, it was stuck. We wanted it to explore docs the way you'd explore a codebase. Agents are [converging on filesystems as their primary interface](https://arxiv.org/abs/2601.11672) because grep, cat, ls, and find are all an agent needs. If each doc page is a file and each section is a directory, the agent can search for exact strings, read full pages, and traverse the structure on its own. We just needed a filesystem that mirrored the live docs site. ## The Container Bottleneck The obvious way to do this is to just give the agent a real filesystem. Most harnesses solve this by spinning up an isolated sandbox and cloning the repo. We already use sandboxes for asynchronous background agents where latency is an afterthought, but for a frontend assistant where a user is staring at a loading spinner, the approach falls apart. Our p90 session creation time (including GitHub clone and other setup) was **~46 seconds**. Beyond latency, dedicated micro-VMs for reading static documentation introduced a serious infrastructure bill. At 850,000 conversations a month, even a minimal setup (1 vCPU, 2 GiB RAM, 5-minute session lifetime) would put us north of $70,000 a year based on [Daytona's per-second sandbox pricing](https://www.daytona.io/pricing) ($0.0504/h per vCPU, $0.0162/h per GiB RAM). Longer session times double that. (This is based on a purely naive approach, a true production workflow would probably have warm pools and container sharing, but the point still stands) We needed the filesystem workflow to be instant and cheap, which meant rethinking the filesystem itself. ## Faking a Shell The agent doesn't need a real filesystem; it just needs the illusion of one. 
Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built **ChromaFs**: a virtual filesystem that intercepts UNIX commands and translates them into queries against that same database. Session creation dropped from ~46 seconds to **~100 milliseconds**, and since ChromaFs reuses infrastructure we already pay for, the marginal per-conversation compute cost is zero.

![Image](https://pbs.twimg.com/media/HE3LCGwacAA_9B0?format=jpg&name=large)

ChromaFs Architecture

ChromaFs is built on [just-bash](https://github.com/vercel-labs/just-bash) by Vercel Labs (shoutout [Malte](https://x.com/cramforce)!), a TypeScript reimplementation of bash that supports grep, cat, ls, find, cd, and more. just-bash exposes a pluggable IFileSystem interface, so it handles all the parsing, piping, and flag logic while ChromaFs translates every underlying filesystem call into a Chroma query.

```typescript
// Excerpt: assertInit, normalizePath, erofs, IncludeEnum, and this.collection
// are defined elsewhere and are not shown in the article.
export class ChromaFs implements IFileSystem {
  private files = new Set<string>();
  private dirs = new Map<string, string[]>();

  async readFile(path: string): Promise<string> {
    this.assertInit();
    const normalized = normalizePath(path);

    // Serve from cache or fetch from Chroma
    const slug = normalized.replace(/\.mdx$/, '').slice(1);

    // Pages are chunked in Chroma. Reassemble them on the fly:
    const results = await this.collection.get({
      where: { page: slug },
      include: [IncludeEnum.documents, IncludeEnum.metadatas],
    });

    const chunks = results.ids
      .map((id, i) => ({
        document: results.documents[i] ?? '',
        chunkIndex: parseInt(String(results.metadatas[i]?.chunk_index ?? 0), 10),
      }))
      .sort((a, b) => a.chunkIndex - b.chunkIndex);

    return chunks.map((c) => c.document).join('');
  }

  // Enforce completely stateless, read-only interaction
  async writeFile(): Promise<void> { throw erofs(); }
  async appendFile(): Promise<void> { throw erofs(); }
  async mkdir(): Promise<void> { throw erofs(); }
  async rm(): Promise<void> { throw erofs(); }
}
```

## How it works

**Bootstrapping the Directory Tree**

ChromaFs needs to know what files exist before the agent runs a single command. We store the entire file tree as a gzipped JSON document (`__path_tree__`) inside the Chroma collection:

```json
{
  "auth/oauth": { "isPublic": true, "groups": [] },
  "auth/api-keys": { "isPublic": true, "groups": [] },
  "internal/billing": { "isPublic": false, "groups": ["admin", "billing"] },
  "api-reference/endpoints/users": { "isPublic": true, "groups": [] }
}
```

On init, the server fetches and decompresses this document into two in-memory structures: a Set of file paths and a Map mapping directories to children. Once built, ls, cd, and find resolve in local memory with no network calls. The tree is cached, so subsequent sessions for the same site skip the Chroma fetch entirely.

**Access Control**

Notice the **isPublic** and **groups** fields in the path tree. Before building the file tree, ChromaFs prunes the path tree based on the current user's permissions and applies a matching filter to all subsequent Chroma queries. In a real sandbox, this level of per-user access control would require managing Linux user groups, chmod permissions, or maintaining isolated container images per customer tier. In ChromaFs it's a few lines of filtering before buildFileTree runs.

**Reassembling Pages from Chunks**

Pages in Chroma are split into chunks for embedding, so when the agent runs cat /auth/oauth.mdx, ChromaFs fetches all chunks with a matching page slug, sorts by `chunk_index`, and joins them into the full page. Results are cached so repeated reads during grep workflows never hit the database twice.
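To make the bootstrapping and access-control steps concrete, here is a minimal sketch of how a permission-pruned file tree could be built from the path tree shown above. This is an assumption-laden illustration, not Mintlify's code: `PathEntry`, `FileTree`, `canAccess`, the `buildFileTree` signature, and the rule "public, or the user shares at least one group" are all guesses based on the JSON example and the prose.

```typescript
// Illustrative only: names, shapes, and the access rule are assumptions.
interface PathEntry {
  isPublic: boolean;
  groups: string[];
}

interface FileTree {
  files: Set<string>;          // absolute file paths, e.g. "/auth/oauth.mdx"
  dirs: Map<string, string[]>; // directory path -> child names
}

const examplePathTree: Record<string, PathEntry> = {
  "auth/oauth": { isPublic: true, groups: [] },
  "auth/api-keys": { isPublic: true, groups: [] },
  "internal/billing": { isPublic: false, groups: ["admin", "billing"] },
  "api-reference/endpoints/users": { isPublic: true, groups: [] },
};

// Assumed rule: a page is visible if it is public or the user is in one of its groups.
function canAccess(entry: PathEntry, userGroups: string[]): boolean {
  return entry.isPublic || entry.groups.some((g) => userGroups.indexOf(g) !== -1);
}

function buildFileTree(pathTree: Record<string, PathEntry>, userGroups: string[]): FileTree {
  const files = new Set<string>();
  const dirs = new Map<string, string[]>();
  dirs.set("/", []);

  // Append a child name to a directory listing, skipping duplicates.
  const addChild = (dir: string, name: string): void => {
    const children = dirs.get(dir) ?? [];
    if (children.indexOf(name) === -1) children.push(name);
    dirs.set(dir, children);
  };

  Object.keys(pathTree).forEach((slug) => {
    // Prune inaccessible pages before they ever reach the tree.
    if (!canAccess(pathTree[slug], userGroups)) return;

    const parts = slug.split("/");
    const fileName = parts[parts.length - 1] + ".mdx";
    let dir = "/";

    // Register every intermediate directory on the way down.
    for (let i = 0; i < parts.length - 1; i++) {
      addChild(dir, parts[i]);
      dir = dir === "/" ? "/" + parts[i] : dir + "/" + parts[i];
      if (!dirs.has(dir)) dirs.set(dir, []);
    }

    addChild(dir, fileName);
    files.add((dir === "/" ? "" : dir) + "/" + fileName);
  });

  return { files, dirs };
}
```

With this shape, `ls /auth` is a `dirs.get("/auth")` lookup and an existence check is `files.has(path)`, which is why no network call is needed once the tree is built. A pruned page such as `internal/billing` never creates its parent directory either, so an unauthorized user cannot infer that `/internal` exists.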
Not every file needs to exist in Chroma. We register lazy file pointers that resolve on access for large OpenAPI specs stored in customers' S3 buckets. The agent sees v2.json in /api-specs/, but the content is fetched only when it runs cat.

Every write operation throws an EROFS (Read-Only File System) error. The agent explores freely but can never mutate documentation, which makes the system stateless with no session cleanup and no risk of one agent corrupting another's view.

## Optimizing Grep

cat and ls are straightforward to virtualize, but grep -r would be far too slow if it naively scanned every file over the network. We intercept just-bash's grep, parse the flags with yargs-parser, and translate them into a Chroma query (`$contains` for fixed strings, `$regex` for patterns). Chroma acts as a **coarse filter** that identifies which files might contain the hit, and we bulkPrefetch those matching chunks into a Redis cache. From there, we rewrite the grep command to target only the matched files and hand it back to just-bash for **fine filter** in-memory execution, which means large recursive queries complete in milliseconds.

```typescript
const chromaFilter = toChromaFilter(
  scannedArgs.patterns,
  scannedArgs.fixedStrings,
  scannedArgs.ignoreCase
);

// 1. Coarse Filter: Ask Chroma for slugs matching the string/regex
const matchedSlugs = await chromaFs.findMatchingFiles(chromaFilter, slugsUnderDirs);
if (matchedSlugs.length === 0) return { stdout: '', exitCode: 1 };

// 2. Prefetch: Pull the chunked files into local cache concurrently
await chromaFs.bulkPrefetch(matchedSlugs);

// 3. Fine Filter: Narrow the arguments to ONLY the resolved hits
const matchedPaths = matchedSlugs.map((s) => '/' + s + '.mdx');
const narrowedArgs = [...args, ...matchedPaths]; // e.g. ["-i", "OAuth", "/docs/auth.mdx"]

// 4. Exec: Let the in-memory RegExp engine format the final output
return execBuiltin(narrowedArgs, ctx);
```

## Conclusion

ChromaFs powers the documentation assistant for hundreds of thousands of users across 30,000+ conversations a day. By replacing sandboxes with a virtual filesystem over our existing Chroma database, we got instant session creation, zero marginal compute cost, and built-in RBAC without any new infrastructure.

Try it on any Mintlify docs site, or [mintlify.com/docs](https://mintlify.com/docs).

\[Read the full article at: [https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant](https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant)\]
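As a sanity check on the sandbox cost figure quoted in "The Container Bottleneck", the arithmetic can be reproduced in a few lines. The rates and workload numbers come from the article; the function name `yearlySandboxCost` and the simplification that every conversation uses one full-length session with no pooling are assumptions of this sketch.

```typescript
// Rates quoted in the article from Daytona's per-second sandbox pricing.
const VCPU_USD_PER_HOUR = 0.0504;
const GIB_USD_PER_HOUR = 0.0162;

// Naive model (assumption): one dedicated sandbox per conversation,
// billed for the full session lifetime, no warm pools or sharing.
function yearlySandboxCost(
  conversationsPerMonth: number,
  vcpus: number,
  ramGiB: number,
  sessionMinutes: number
): number {
  const hourlyRate = vcpus * VCPU_USD_PER_HOUR + ramGiB * GIB_USD_PER_HOUR;
  const costPerSession = hourlyRate * (sessionMinutes / 60);
  return costPerSession * conversationsPerMonth * 12;
}

// 1 vCPU, 2 GiB RAM, 5-minute sessions, 850,000 conversations a month.
const estimate = yearlySandboxCost(850000, 1, 2, 5);
```

The hourly rate works out to $0.0828, or about $0.0069 per five-minute session, which gives roughly $5,865 a month and about $70,380 a year, matching the "north of $70,000" figure. Doubling `sessionMinutes` doubles the result, as the article notes.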

The Design Loop: Making Codex a UI/UX Magician (Complete Guide)

Tue, 05 May 2026 08:16:10 GMT

![Image](https://pbs.twimg.com/media/HHeOAUDagAALPW5?format=jpg&name=large)

Most people treat Codex as a simple command-line tool, leading to underwhelming UI designs that feel like rough drafts. The Codex Desktop app changes this by integrating a visual feedback loop: it builds and runs your app, captures screenshots, uses vision capabilities to inspect layouts, simulates user interactions, generates assets via imagegen, and iterates on code based on real visual outcomes. This approach turns design into an efficient, iterative process rather than a one-off prompt.

## The Mistake: Why First Drafts Fall Short

Traditional AI coding benchmarks focus on evaluating the initial output from a single prompt, which is misleading. Most AI-generated first drafts exhibit issues like inconsistent spacing, unclear visual hierarchy, or failures in mobile responsiveness. However, if the LLM can actually see the UI and judge what is good, bad, and ugly about it, which Codex's vision capabilities make possible, the dynamics change completely. Codex Desktop excels here by enabling rapid iterations.

## The Mental Model: Shift Codex to a Visual Product Builder

Adopt this core loop:

![Image](https://pbs.twimg.com/media/HHeMXV4akAAleJJ?format=jpg&name=large)

Codex is no longer just terminal code; it is a visual design loop.

prompt → build → run → screenshot → vision review → click/test → revise → compare.

**This shift offers three major benefits backed by design principles:**

1. Codex directly observes the UI in action, eliminating guesswork, unlike the CLI, which relies solely on predictions from code.
2. It automates issue detection, saving hours of manual review; for example, vision tools can pinpoint contrast issues that affect users with color vision deficiencies (roughly 8% of men).
3. It facilitates A/B testing of designs, allowing Codex to prototype variants and select winners based on metrics such as readability and user flow, drawing on UX best practices that emphasize evidence-based decision-making.

## CLI Codex vs. Desktop Codex: Knowing When to Switch

![Image](https://pbs.twimg.com/media/HHeMmBeaIAATV7w?format=jpg&name=large)

CLI can edit code; Desktop can see and interact.

Codex CLI remains ideal for linear tasks like code refactors, unit tests, backend updates, or script automation, where code accuracy is paramount. However, for design-intensive work, Codex Desktop introduces a feedback-rich environment. It (1) captures visual states, (2) detects layout breaks on mobile devices (e.g., elements shifting at 320px widths), and (3) compares before/after screenshots, all features absent in the CLI. It also (4) creates the necessary visuals on the fly with the new image generation engine, Image 2, producing only what is needed, which drastically changes the design prototyping and iteration flow. This automated iterative approach cuts babysitting time and produces better design revisions, making it essential for any project where aesthetics impact user engagement.

## Imagegen vs. Vision: Distinct Tools for Distinct Roles

![Image](https://pbs.twimg.com/media/HHeM2E9bMAApQlc?format=jpg&name=large)

Imagegen creates source material; vision judges the real UI.

**Imagegen and vision serve complementary but separate purposes.** Imagegen excels at creating source material such as game portraits, product icons, or background scenes that add visual appeal. It is well suited to projects that need custom art; in game development cases it speeds up asset production considerably. Vision, by contrast, analyzes the rendered interface and points out problems such as weak hierarchy, insufficient spacing, or low contrast. An effective workflow is: generate an asset in auto mode, integrate the imagegen output into the UI, screenshot the result, then refine with vision. This combination keeps the design both creative and practical, with AI usability testing reportedly showing markedly higher interface satisfaction.

## The Prompt That Changes Everything

Vague prompts like "make it look better" produce inconsistent results. Use a targeted prompt instead, such as:

```text
Use Codex Desktop as a visual design loop. Run the app locally, inspect screenshots on desktop and mobile, click through interactions, and revise for polished results. Focus on hierarchy, spacing, contrast, text fit, responsive layout, and hover states. Utilize imagegen skill and vision in coordination with each other, and feel free to a/b test where needed, improvise, and only finish the task when you are happy with the results by checking them via vision, in responsive terms. Inspect the screens as a product designer and QA expert. Identify the visual issues by impact and fix them after verifying all visual problems and things that need iteration.
```

## Workflow 1: Designing Game Interfaces

Game interfaces are complex, with HUDs, inventories, and mobile adaptations.

![Image](https://pbs.twimg.com/media/HHeNusibkAElr_U?format=jpg&name=large)

A strong prompt might be:

```text
Run the game screen and inspect with vision. Evaluate HUD readability, icon clarity, and mobile behavior as if you were a player. Use imagegen for assets like portraits or backgrounds. Use vision to double-check and iterate until the visuals are production-ready.
```

The author claims this loop resolves 80%+ of visual bugs up front, so players can quickly identify clickable elements and read stats without frustration, which ultimately improves retention. It also helps you produce a ready-to-ship game faster and more precisely.

## Workflow 2: Building Product Interfaces

For SaaS dashboards or admin panels, prioritize usability:

```text
Create a functional UI based on 2026 design principles that delivers top-notch user experience, readability, and usability. Create the UIs via imagegen after researching references and best practices, then run them locally with vision and inspect the hierarchy, spacing, and navigation on both desktop and mobile. Also, make sure everything works perfectly for our user experience flow so that all users, especially non-technical ones, can understand our product and user interface and navigate without stress. Only stop after you are truly satisfied. Iterate rigorously until then, using imagegen skill and vision together.
```

## Workflow 3: A/B Test Before Committing

To avoid subjective choices, prompt as follows:

```text
Generate three visual variants, screenshot them on desktop and mobile, and compare based on readability and hierarchy. Select and apply the best one.
```

Drawing on A/B testing practice from web design, this approach lets both you and the AI see more variations before adopting the one that best fits your product's end goal, which leads to better results.

## Workflow 4: Extracting an Art Bible

Once a page design succeeds, use the following prompt; it helps keep the whole product consistent:

![Image](https://pbs.twimg.com/media/HHeNExuaoAAwY-o?format=jpg&name=large)

Codex can turn experiments into reusable design language.
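```text
From app screenshots, create an art bible with rules for palette, typography, and layout.
```

## The Master Prompt

Finally, here is a comprehensive master prompt you can copy for almost any visual work:

Use Codex Desktop as a visual builder. Goal: \[describe your project\]. Use imagegen and vision as a pair in this workflow:

1. Build a minimal version.
2. Run it locally.
3. Take screenshots.
4. Inspect with vision.
5. Click through the flows.
6. Fix the issues.
7. Re-screenshot.
8. A/B test variants.
9. Generate assets with imagegen.
10. Summarize the changes and extract design rules.

## Recap and Takeaways: Level Up Your Design Process

Codex Desktop changes design work by taking part in the full visual loop, which supports faster iteration and higher-quality output. The key takeaways: use vision for review, A/B testing for decisions, generated assets for enhancement, and an art bible for consistency. To get polished results quickly, open Codex Desktop, apply the master prompt to your next project, see the difference for yourself, and aim for measurable improvements in your workflow. Don't forget to **save this article** for its prompts; they can take your designs to the next level.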

Harness Engineering: Preparing a Unity C# Codebase for Coding Agents

Tue, 05 May 2026 08:15:59 GMT

在 Unity 项目里使用 Claude Code、Codex、Cursor 这类编码代理时，效果好坏并不只取决于模型有多聪明。更关键的是：你的工程是否给代理提供了足够清晰的 **affordances（可供性）**。也就是说，仓库结构、命名、测试、静态检查、编辑器脚本、CI 和文档，是否让代理很容易看懂： - 应该在哪里改代码 - 哪些代码不能碰 - 改完以后如何验证 - 哪些规则违反后必须修复 - 什么时候需要更新文档、Prefab、场景或 ScriptableObject Unity 项目尤其需要这些护栏。因为 Unity 的状态不只存在于 `.cs` 文件里，还存在于 `.unity` 场景、`.prefab`、`.asset`、`.asmdef`、Project Settings、Addressables 配置、Input Actions、Animator Controller 等资产里。代理如果只把仓库当成一堆 C# 文件，很容易生成能编译但不可维护、不可运行，甚至破坏序列化引用的代码。一个适合代理协作的 Unity 仓库，应该被视为代理的 **execution environment**，而不是单纯的源码目录。好的 Unity vibe coding 环境应该提供： - 快速验证“坏工程”的脚本 - 明确的代码边界和 Assembly Definition - 提交前强制执行的格式、编译、测试检查 - 代理能自检的一个统一验证命令 - 能解释场景、Prefab、资源管线和架构约束的文档 ## 让 Unity 仓库对代理可读 Unity 项目的默认 `Assets/` 很容易变成混乱的资源池。为了让编码代理稳定工作，项目结构需要比普通人类团队更明确。推荐结构： ```plaintext Assets/ _Project/ Art/ Audio/ Configs/ GameBalance/ Addressables/ Prefabs/ UI/ Gameplay/ Systems/ Scenes/ Boot.unity MainMenu.unity Gameplay.unity Scripts/ Runtime/ Core/ Gameplay/ UI/ Infrastructure/ Editor/ Tests/ EditMode/ PlayMode/ ScriptableObjects/ Shaders/ UI/ ThirdParty/ Plugins/ Packages/ ProjectSettings/ UserSettings/ # usually ignored .agents/ skills/ AGENTS.md README.md docs/ architecture.md conventions.md asset-pipeline.md testing.md scenes-and-entrypoints.md build-and-release.md performance.md cleanup.md ``` 核心原则： - 所有自研内容放在 `Assets/_Project/`，第三方资产放在 `Assets/ThirdParty/` 或 `Assets/Plugins/`。 - Runtime、Editor、Tests 分离，避免编辑器代码进入玩家构建。 - 场景入口、启动流程、全局服务、资源加载策略写进文档。 - Prefab、ScriptableObject、Addressables 不要只靠口口相传，必须有可读说明。 - 每个主要模块使用 `.asmdef` 限定依赖方向。 ## 固定 Unity 序列化和 Git 配置在让代理修改 Unity 项目前，先把版本控制和序列化设置固定住。否则代理即使只改一行 C#，也可能因为场景、Prefab、`.meta` 或二进制资源的差异制造难以 review 的变更。 Unity Project Settings 建议： ```plaintext Edit > Project Settings > Version Control > Mode: Visible Meta Files Edit > Project Settings > Editor > Asset Serialization > Mode: Force Text ``` 仓库规则： - `.meta` 文件必须提交；它们保存 GUID，是场景、Prefab、ScriptableObject 引用稳定的前提。 - 不手动编辑 `.meta`，除非任务明确涉及 GUID、资源迁移或冲突修复。 - `.unity`、`.prefab`、`.asset` 使用文本序列化，方便代理和 
reviewer 看 diff。 - 大型二进制资源使用 Git LFS，避免仓库膨胀。 - 配置 UnityYAMLMerge，减少场景和 Prefab 合并冲突。示例 `.gitignore`： ```gitignore [Ll]ibrary/ [Tt]emp/ [Oo]bj/ [Bb]uild/ [Bb]uilds/ [Ll]ogs/ [Uu]ser[Ss]ettings/ .vs/ .idea/ *.user *.pidb *.booproj *.svd # Generated IDE files. Commit these only if your team intentionally relies on them. *.csproj *.sln *.suo ``` 示例 `.gitattributes`： ```gitattributes *.cs text eol=lf *.asmdef text eol=lf *.unity text eol=lf merge=unityyamlmerge *.prefab text eol=lf merge=unityyamlmerge *.asset text eol=lf merge=unityyamlmerge *.meta text eol=lf merge=unityyamlmerge *.controller text eol=lf merge=unityyamlmerge *.anim text eol=lf merge=unityyamlmerge *.mat text eol=lf merge=unityyamlmerge *.png filter=lfs diff=lfs merge=lfs -text *.psd filter=lfs diff=lfs merge=lfs -text *.fbx filter=lfs diff=lfs merge=lfs -text *.wav filter=lfs diff=lfs merge=lfs -text *.mp3 filter=lfs diff=lfs merge=lfs -text ``` UnityYAMLMerge 需要在本机或 CI runner 上配置 merge driver。路径按实际 Unity 安装位置调整： ```bash git config merge.unityyamlmerge.name "Unity SmartMerge" # macOS git config merge.unityyamlmerge.driver '"/Applications/Unity/Hub/Editor/2022.3.0f1/Unity.app/Contents/Tools/UnityYAMLMerge" merge -p %O %A %B %A' # Windows git config merge.unityyamlmerge.driver '"C:/Program Files/Unity/Hub/Editor/2022.3.0f1/Editor/Data/Tools/UnityYAMLMerge.exe" merge -p %O %A %B %A' ``` 示例 Assembly Definition 布局： ```plaintext Assets/_Project/Scripts/Runtime/Core/Project.Core.asmdef Assets/_Project/Scripts/Runtime/Gameplay/Project.Gameplay.asmdef Assets/_Project/Scripts/Runtime/UI/Project.UI.asmdef Assets/_Project/Scripts/Editor/Project.Editor.asmdef Assets/_Project/Scripts/Tests/EditMode/Project.Tests.EditMode.asmdef Assets/_Project/Scripts/Tests/PlayMode/Project.Tests.PlayMode.asmdef ``` 建议依赖方向： ```plaintext Project.Gameplay -> Project.Core Project.UI -> Project.Core Project.Editor -> Project.Core, Project.Gameplay, Project.UI Tests -> Runtime assemblies ``` 禁止让 `Core` 反向依赖 `Gameplay` 或 
`UI`。这类规则可以通过文档、asmdef 依赖和 CI 检查共同约束。 ## 用 asmdef 控制爆炸半径 Unity 没有 TypeScript monorepo 那样天然的包边界，但 `.asmdef` 可以起到类似作用。一个适合代理修改的 Unity 项目，不应该让所有脚本都处在默认 Assembly-CSharp 里。否则代理新增一个类时，很容易无意识引用任何地方的类型，最后形成大型循环依赖。推荐规则： - 每个稳定模块都有自己的 `.asmdef`。 - Runtime assembly 不引用 Editor assembly。 - Feature 模块只能依赖 Core、Shared、明确允许的服务层。 - UI 不直接调用存档、网络、支付等底层实现，而是依赖接口或应用服务。 - Editor 工具放在 `Editor/` 目录，并使用 Editor-only asmdef。示例 `Project.Gameplay.asmdef`： ```json { "name": "Project.Gameplay", "rootNamespace": "Project.Gameplay", "references": [ "Project.Core" ], "includePlatforms": [], "excludePlatforms": [], "allowUnsafeCode": false, "overrideReferences": false, "precompiledReferences": [], "autoReferenced": true, "defineConstraints": [], "versionDefines": [], "noEngineReferences": false } ``` `rootNamespace` 很重要。它能帮助代理遵守命名空间，不要把所有类都生成在全局空间。 ## 为代理准备 Unity 专属 AGENTS.md TypeScript 项目常用 `CLAUDE.md`，Unity 项目可以使用 `AGENTS.md`、`CLAUDE.md` 或 `.cursor/rules`。关键是让所有代理读取同一套规则。示例： ````markdown # Project Unity Coding Rules ## Overview This is a Unity game built with Unity 2022.3 LTS. Runtime code lives in `Assets/_Project/Scripts/Runtime`. Editor tools live in `Assets/_Project/Scripts/Editor`. Do not modify third-party assets under `Assets/ThirdParty` unless explicitly asked. ## Entry Points - Boot scene: `Assets/_Project/Scenes/Boot.unity` - Main menu scene: `Assets/_Project/Scenes/MainMenu.unity` - Gameplay scene: `Assets/_Project/Scenes/Gameplay.unity` - Game bootstrap: `Project.Core.Bootstrap.GameBootstrapper` ## Mandatory Rules - Do not create scripts in `Assets/` root. - Do not put project code in global namespace. - Do not edit `.meta` files manually unless the task explicitly requires asset GUID work. - Do not move or rename assets casually; Unity references depend on GUIDs. - Do not modify `ProjectSettings/` unless the task is about build settings, input, quality, graphics, packages, or player settings. - Do not modify third-party plugin code; wrap it from project code instead. 
- Prefer editing existing components over adding duplicate manager classes.
- Before adding a singleton, search for existing services, installers, bootstrap code, or ScriptableObject configs.

## C# Style

- Use explicit namespaces matching assembly/module names.
- Use `private` fields with `[SerializeField]` for inspector references.
- Avoid public mutable fields.
- Avoid `FindObjectOfType`, `GameObject.Find`, and tag-based lookup in gameplay code unless there is no stable reference path.
- Avoid `async void` except Unity event handlers where unavoidable.
- Prefer cancellation-aware async flows when using UniTask or Tasks.
- Do not catch and rethrow exceptions without adding useful context.
- Do not add comments that narrate obvious code.

## Unity Lifecycle Rules

- Keep `Awake` for local initialization.
- Keep `Start` for cross-object initialization that depends on scene setup.
- Keep `Update` small; move non-trivial behavior into named methods or systems.
- Unsubscribe events in `OnDisable` or `OnDestroy` consistently with where they were subscribed.
- Do not allocate every frame in `Update`, `LateUpdate`, `FixedUpdate`, UI bindings, or hot gameplay paths.

## Validation

Before considering a task complete, run:

```bash
./tools/validate-unity.sh
```

If validation fails, fix the cause. Do not weaken tests, remove analyzers, or bypass compilation unless explicitly asked.
````

## Encode Unity Best Practices as Skills

If team members use different agents, for example Claude, Codex, Cursor, or JetBrains AI, put the shared skills or rules in `.agents/skills/` and have each tool reference them.

```plaintext
.agents/
  skills/
    unity-csharp-expert/
      SKILL.md
    unity-editor-tooling/
      SKILL.md
    unity-performance/
      SKILL.md
    unity-ui-ugui/
      SKILL.md
    unity-addressables/
      SKILL.md
.codex/
  skills/
    unity-csharp-expert -> ../../.agents/skills/unity-csharp-expert
.claude/
  skills/
    unity-csharp-expert -> ../../.agents/skills/unity-csharp-expert
```

Full Skill templates are in Appendix A. A minimal version should cover at least:

- Read `AGENTS.md`, the relevant `docs/*.md`, and the nearest `.asmdef` before making changes.
- Protect Unity serialized fields, Prefabs, Scenes, ScriptableObjects, and `.meta` files.
- Restrict the Runtime / Editor dependency boundary.
- Use a dedicated Skill for high-risk areas such as performance, UI, and Addressables.
- Report validation results and asset impact before finishing.

The key difference in Unity is that **refactoring C# types affects serialized data**. Agents must know that renaming a field, class, or namespace, moving a file, or changing an asmdef can all break scene and Prefab references.

## Appendix A: Full Skills Templates

Below are ready-to-use versions of the `.agents/skills/*/SKILL.md` files mentioned above. When you adopt them, replace `Project`, the paths, and the tech stack with your project's real names.

### `.agents/skills/unity-csharp-expert/SKILL.md`

````markdown
---
name: unity-csharp-expert
description: Use when writing, reviewing, or refactoring Unity runtime C# code, especially MonoBehaviours, ScriptableObjects, gameplay systems, services, asmdefs, serialized fields, async flows, and tests.
---

# Unity C# Expert

Use this skill when editing Unity runtime C# code.

## Read First

- `AGENTS.md`
- `docs/architecture.md`
- `docs/conventions.md`
- `docs/scenes-and-entrypoints.md`
- `docs/asset-pipeline.md`
- The nearest `.asmdef` for the files being changed

## Core Rules

- Check existing MonoBehaviours, ScriptableObjects, services, bootstrap code, tests, and asmdefs before creating new files.
- Preserve serialized field names unless a migration is included.
- Do not rename serialized fields without `[FormerlySerializedAs]` and a migration note.
- Do not move or rename `.unity`, `.prefab`, `.asset`, or `.meta` files unless explicitly asked.
- Do not create scripts in `Assets/` root or global namespace.
- Do not modify third-party assets; wrap them from project code.
- Keep changes surgical.
  Do not redesign architecture around a small feature or bug fix.

## C# Style

- Use namespaces that match the owning assembly or feature area.
- Use `private` fields with `[SerializeField]` for inspector references.
- Avoid public mutable fields.
- Prefer explicit dependencies over scene-wide lookup.
- Avoid single-use helpers, wrappers, base classes, and manager classes.
- Avoid comments that narrate obvious code; comment only hidden constraints or surprising Unity behavior.

## Unity Runtime Rules

- Use `Awake` for local initialization and `Start` for initialization that depends on scene wiring.
- Subscribe and unsubscribe events symmetrically, usually in `OnEnable` and `OnDisable`.
- Keep `Update`, `LateUpdate`, and `FixedUpdate` small and allocation-free.
- Avoid `FindObjectOfType`, `GameObject.Find`, tag lookups, and broad singleton lookup in gameplay code.
- Avoid per-frame LINQ, string formatting, closure allocations, and repeated `GetComponent` in hot paths.
- Validate serialized references in `OnValidate`, targeted editor validation, or startup validation.

## Async And Coroutines

- Avoid `async void` except Unity event handlers where unavoidable.
- Prefer cancellation-aware async flows when using UniTask or `Task`.
- Stop coroutines or cancel async work when objects are disabled or destroyed.
- Do not update destroyed Unity objects after awaits, scene unloads, or cancellation.

## Tests And Validation

- Add or update EditMode tests for pure logic, ScriptableObject constraints, and editor validation.
- Add or update PlayMode tests for scene loading, lifecycle, UI, and gameplay flows.
- Before finishing, run `./tools/validate-unity.sh` when Unity is available.
- If validation fails, fix the cause rather than bypassing checks.

## Final Response Checklist

- Mention any serialized field, asset, scene, prefab, or asmdef impact.
- Mention tests run and validation status.
- Mention if Unity was unavailable and which checks remain for the user.
````

### `.agents/skills/unity-editor-tooling/SKILL.md`

````markdown
---
name: unity-editor-tooling
description: Use when creating or changing Unity Editor scripts, custom inspectors, menu items, asset processors, validation commands, build scripts, importers, and CI-facing `-executeMethod` entry points.
---

# Unity Editor Tooling

Use this skill for code under `Assets/_Project/Scripts/Editor` and editor-only assemblies.

## Read First

- `AGENTS.md`
- `docs/asset-pipeline.md`
- `docs/build-and-release.md`
- `docs/testing.md`
- The editor `.asmdef` and the runtime assemblies it references

## Boundaries

- Editor code must be in an `Editor/` folder or an Editor-only asmdef.
- Runtime assemblies must not reference `UnityEditor`.
- Do not put editor utilities in runtime namespaces.
- Do not modify Project Settings unless the task explicitly requires it.
- Do not create editor automation that silently rewrites scenes, prefabs, or assets without a dry-run or clear log.

## Validation Commands

- Prefer static methods callable by `-executeMethod` for CI-facing validation.
- Throw `BuildFailedException` for validation failures so Unity exits non-zero in batchmode.
- Log enough asset paths and object names for agents and humans to fix failures.
- Keep validation deterministic. Do not depend on current scene selection, inspector focus, or editor window state.

## AssetDatabase Rules

- Use `AssetDatabase.FindAssets` with scoped folders, not whole-project scans by default.
- Use GUIDs and `AssetDatabase.GUIDToAssetPath` for stable asset lookup.
- Call `AssetDatabase.SaveAssets` only when the tool intentionally mutates assets.
- Do not call `AssetDatabase.Refresh` repeatedly inside loops.
- Never delete assets automatically unless the command is explicitly destructive and documented.

## Custom Inspectors And Windows

- Keep editor UI thin; put validation and transformation logic in testable helper methods.
- Support multi-object editing only when intentionally implemented.
- Use `Undo.RecordObject` or `Undo.RegisterCompleteObjectUndo` for inspector-driven mutations.
- Mark changed assets dirty only when values actually change.

## Final Response Checklist

- Mention whether the tool reads assets, mutates assets, or changes Project Settings.
- Mention the menu path or `-executeMethod` entry point.
- Mention validation run status.
````

### `.agents/skills/unity-performance/SKILL.md`

````markdown
---
name: unity-performance
description: Use when changing hot gameplay paths, Update loops, physics, UI refresh, pooling, allocations, Addressables loading, mobile performance, memory, frame time, or build size.
---

# Unity Performance

Use this skill when performance, memory, GC, frame time, loading, or build size could be affected.

## Read First

- `AGENTS.md`
- `docs/performance.md`
- `docs/architecture.md`
- Relevant profiler captures or benchmark notes if present

## Runtime Rules

- Keep per-frame code allocation-free unless the allocation is intentional and documented.
- Avoid per-frame LINQ, closures, string interpolation, boxing, and repeated component lookups.
- Cache component references when access is repeated.
- Prefer object pooling for frequently spawned short-lived objects.
- Do not add new global `Update` loops when existing systems can own the work.
- Prefer event-driven UI refresh over polling.

## Physics And Animation

- Use `FixedUpdate` for physics mutation and `Update` for input sampling.
- Avoid changing physics settings globally for a local feature.
- Do not call expensive Animator or hierarchy operations every frame without measuring.
- Be explicit about world space vs local space operations.

## Loading And Memory

- Release Addressables handles consistently with the ownership model.
- Avoid loading large assets synchronously on gameplay paths.
- Do not keep scene-only references in persistent services after scene unload.
- Avoid mutating imported ScriptableObject assets at runtime; copy runtime state when needed.
## Validation

- Add tests for logic changes, but do not pretend unit tests prove performance.
- If profiler data is available, compare before and after.
- For risky changes, document expected frame-time, allocation, memory, or loading impact.

## Final Response Checklist

- Mention hot paths touched.
- Mention expected allocation and frame-time impact.
- Mention tests run and any profiling not performed.
````

### `.agents/skills/unity-ui-ugui/SKILL.md`

````markdown
---
name: unity-ui-ugui
description: Use when editing Unity UI built with uGUI, Canvas, RectTransform, prefabs, view controllers, UI events, navigation, layout, localization, and accessibility.
---

# Unity UI uGUI

Use this skill when editing uGUI screens, prefabs, panels, widgets, and UI-facing C# code.

## Read First

- `AGENTS.md`
- `docs/architecture.md`
- `docs/asset-pipeline.md`
- `docs/scenes-and-entrypoints.md`
- Existing UI prefabs and controllers in the same feature area

## UI Code Rules

- Keep views thin. UI components bind data, raise user intents, and delegate behavior to services or presenters.
- Do not put gameplay, save, network, or economy logic directly in button handlers.
- Use serialized references for UI widgets. Avoid runtime hierarchy searches except during controlled setup.
- Subscribe and unsubscribe UI events symmetrically.
- Avoid rebuilding layouts every frame.
- Do not leave unused serialized fields, stale branches, hidden panels, or abandoned animation hooks.

## Prefab And Layout Rules

- Preserve prefab references and serialized field names.
- Do not unpack or restructure large UI prefabs unless explicitly asked.
- Prefer editing existing UI prefabs/controllers over creating parallel variants.
- Be careful with anchors, pivots, safe areas, scaling, and dynamic text length.
- Keep localization expansion in mind; avoid hard-coded widths for text-heavy UI.

## Accessibility And Input

- Preserve keyboard/controller navigation when changing selectable UI.
- Keep interactable state and visual state consistent.
- Avoid color-only state communication when the UI needs accessibility.

## Validation

- Add EditMode tests for formatting/presenter logic when possible.
- Add PlayMode tests for critical UI flows when practical.
- Run relevant scenes or validation commands when Unity is available.

## Final Response Checklist

- Mention UI prefabs, scenes, or serialized fields touched.
- Mention input/navigation impact.
- Mention tests or manual validation performed.
````

### `.agents/skills/unity-addressables/SKILL.md`

````markdown
---
name: unity-addressables
description: Use when changing Unity Addressables, remote content, asset references, labels, groups, loading/release flows, catalogs, or asset bundle build validation.
---

# Unity Addressables

Use this skill when editing Addressables configuration, code paths that load assets, or assets intended for remote/local bundles.

## Read First

- `AGENTS.md`
- `docs/asset-pipeline.md`
- `docs/build-and-release.md`
- Existing Addressables groups, labels, and key conventions

## Key And Group Rules

- Addressable keys must be stable. Do not rename keys casually.
- Do not hard-code the same key in multiple places; centralize keys in constants or config.
- Use labels intentionally. Do not add broad labels that pull too many assets into memory.
- Keep group schema, compression, remote/local settings, and update restrictions consistent with existing groups.
- Do not move assets between groups without documenting build and patching impact.

## Loading Rules

- Track ownership of Addressables handles.
- Release handles according to the same layer that loaded them.
- Avoid synchronous loading in gameplay paths.
- Handle load failure at real boundaries with actionable logs.
- Do not keep references to unloaded scene assets or released Addressables.

## Validation

- Run Addressables Analyze or project validation when available.
- For build-impacting changes, run an Addressables content build or explain why it was not run.
- Add tests around key resolution, catalog-facing config, or loader behavior when possible.

## Final Response Checklist

- Mention keys, labels, groups, or assets changed.
- Mention load/release ownership.
- Mention Addressables validation/build status.
````

## Unity Docs That Agents Need to Maintain

Unity project documentation should describe more than code. It should also describe assets and editor conventions. Recommended documents:

```plaintext
docs/
  architecture.md            # modules, dependency direction, startup flow
  scenes-and-entrypoints.md  # scene purposes, load order, entry scripts
  asset-pipeline.md          # Prefabs, ScriptableObjects, Addressables, naming conventions
  conventions.md             # C#, Unity lifecycle, Inspector rules
  testing.md                 # EditMode, PlayMode, CI, coverage
  build-and-release.md       # platforms, build arguments, version numbers, signing
  performance.md             # GC, object pooling, Update budget, Profiler workflow
  cleanup.md                 # boundaries, prompts, and PR rules for AI garbage collection
```

Example `docs/scenes-and-entrypoints.md`:

```markdown
# Scenes And Entry Points

## Boot.unity

Purpose: initializes project-wide services and loads the next scene.

Main scripts:

- `Project.Core.Bootstrap.GameBootstrapper`
- `Project.Core.SceneLoading.SceneLoader`

Rules:

- Do not place gameplay objects in Boot scene.
- Do not make Boot depend on UI or Gameplay assemblies.
- New global systems must be registered through the bootstrap flow, not via hidden scene searches.

## Gameplay.unity

Purpose: contains the playable game loop.

Main roots:

- `GameplayRoot`
- `PlayerSpawnRoot`
- `LevelRuntimeRoot`

Rules:

- Scene references should be wired in prefabs or root installers.
- Do not add duplicate manager GameObjects without checking existing roots.
```

Example `docs/asset-pipeline.md`:

```markdown
# Asset Pipeline

## Prefabs

- Runtime prefabs live under `Assets/_Project/Prefabs`.
- UI prefabs live under `Assets/_Project/Prefabs/UI`.
- Do not unpack third-party prefabs in place.
- Do not rename serialized fields without migration notes.

## ScriptableObjects

- Balance configs live under `Assets/_Project/Configs/GameBalance`.
- ScriptableObject types live under `Project.Core.Config` or the owning feature namespace.
- Runtime systems may read config assets but must not mutate imported asset instances during play.

## Addressables

- Addressable keys must be stable.
- Do not hard-code Addressable keys in multiple places; centralize them in constants or config.
- Validate Addressables before release builds.
```

## Unity C# Coding Rules: Reducing AI Slop

AI slop in Unity projects is usually not a syntax error. It is code that "looks like it runs while the project keeps getting worse":

- Many new `Manager`, `Controller`, and `Handler` classes with overlapping responsibilities
- `FindObjectOfType` and `DontDestroyOnLoad` everywhere
- Renamed fields that drop Prefab references
- GC allocations or expensive lookups inside `Update`
- Editor scripts mixed into Runtime assemblies
- Missing tests, with verification done only by clicking through scenes manually
- Leftover old Prefabs, old ScriptableObjects, and old branch logic

You can enforce the opposite in `AGENTS.md`:

```markdown
## Minimal Changes / No Slop

- Re-read your diff before finishing. Delete unused code, stale branches, unused serialized fields, and abandoned helper methods.
- Do not add narration comments. Comments must explain non-obvious Unity constraints, lifecycle ordering, serialization migration, or platform-specific behavior.
- Do not create a new manager/service if an existing system owns the responsibility.
- Do not create a helper class for a single caller.
- Do not add speculative null checks for serialized fields everywhere. Validate required references in `OnValidate` or targeted startup validation.
- Do not use `FindObjectOfType`, `GameObject.Find`, or global singleton lookup as a first choice.
- Do not rename serialized fields without `[FormerlySerializedAs]` and a migration note.
- Do not move assets unless the task requires it.
- Do not edit generated files, package cache files, or third-party assets.
```

Field rename example:

```csharp
using UnityEngine;
using UnityEngine.Serialization;

namespace Project.Gameplay.Player
{
    public sealed class PlayerMovement : MonoBehaviour
    {
        // Keeps values serialized under the old name "moveSpeed" after the rename.
        [FormerlySerializedAs("moveSpeed")]
        [SerializeField] private float movementSpeed = 5f;
    }
}
```

This rule matters a lot in Unity. Without `[FormerlySerializedAs]`, one ordinary rename by an agent can reset the value on every Prefab to its default.

## Build Guardrails with OnValidate and Editor Validation

In a Unity project, many errors are invisible to the C# compiler: a Prefab with a missing reference, an invalid config value, an Addressable key that does not exist, a scene without a required entry object. `OnValidate` can enforce local constraints:

```csharp
using UnityEngine;

namespace Project.Gameplay.Weapons
{
    public sealed class WeaponConfig : ScriptableObject
    {
        [SerializeField] private float damage = 10f;
        [SerializeField] private float cooldownSeconds = 0.25f;

        public float Damage => damage;
        public float CooldownSeconds => cooldownSeconds;

        // Called in the Editor when the asset loads or a value changes in the Inspector.
        private void OnValidate()
        {
            damage = Mathf.Max(0f, damage);
            cooldownSeconds = Mathf.Max(0.01f, cooldownSeconds);
        }
    }
}
```

You can also write a project-level validation menu for agents and CI to call:

```csharp
#if UNITY_EDITOR
using Project.Gameplay.Weapons;
using UnityEditor;
using UnityEditor.Build;
using UnityEngine;

namespace Project.Editor.Validation
{
    public static class ProjectValidator
    {
        [MenuItem("Project/Validation/Run All")]
        public static void RunAll()
        {
            ValidateWeaponConfigs();
            Debug.Log("Project validation passed.");
        }

        private static void ValidateWeaponConfigs()
        {
            // Scope the search to the config folder instead of scanning the whole project.
            string[] guids = AssetDatabase.FindAssets("t:WeaponConfig", new[] { "Assets/_Project/Configs" });

            foreach (string guid in guids)
            {
                string path = AssetDatabase.GUIDToAssetPath(guid);
                var config = AssetDatabase.LoadAssetAtPath<WeaponConfig>(path);

                if (config == null)
                {
                    throw new BuildFailedException($"Invalid WeaponConfig asset at {path}");
                }
            }
        }
    }
}
#endif
```

A more complete project can expose validation as a static entry point callable through `-executeMethod`:

```csharp
#if UNITY_EDITOR
namespace Project.Editor.Validation
{
    public static class ProjectValidationCommand
    {
        public static void RunAll()
        {
            ProjectValidator.RunAll();
        }
    }
}
#endif
```

CI or a local script then runs:

```bash
Unity \
  -batchmode \
  -quit \
  -projectPath . \
  -executeMethod Project.Editor.Validation.ProjectValidationCommand.RunAll \
  -logFile Logs/unity-validation.log
```

## Make Bad Code Hard to Commit

Unity projects should use pre-commit hooks too. You can use `pre-commit`, Husky, lefthook, or plain Git hooks. Example `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-yaml
      - id: check-json
      - id: end-of-file-fixer
      - id: trailing-whitespace
        # Leave Unity-serialized files untouched.
        exclude: '\.(unity|prefab|asset|meta)$'

  - repo: local
    hooks:
      - id: dotnet-format
        name: dotnet format
        entry: dotnet format ProjectName.sln --verify-no-changes
        language: system
        files: "\\.cs$"
        pass_filenames: false
```

Write the solution path explicitly here instead of relying on `dotnet format` to guess it from the repository root. If the project ignores `.sln` / `.csproj`, open Unity first or use Unity batch mode to generate the C# project files, then run:

```bash
dotnet format ProjectName.sln --verify-no-changes
```

If the team uses Rider/ReSharper, you can also add InspectCode to CI:

```bash
jb inspectcode ProjectName.sln --output=inspectcode.xml
```

Do not count on agents remembering to format and run tests every time. Turn these into hooks and CI instead of verbal agreements.

## One Command to Validate Everything

Agents need a clear finish line. For a Unity project, prepare `tools/validate-unity.sh` or `tools/validate-unity.ps1`. Example `tools/validate-unity.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

UNITY_PATH="${UNITY_PATH:-Unity}"
PROJECT_PATH="$(pwd)"
LOG_DIR="$PROJECT_PATH/Logs"

mkdir -p "$LOG_DIR"

# -quit is omitted for -runTests: the test runner closes the Editor itself
# when the run finishes, and -quit can end the process before tests execute.
"$UNITY_PATH" \
  -batchmode \
  -projectPath "$PROJECT_PATH" \
  -runTests \
  -testPlatform EditMode \
  -testResults "$LOG_DIR/editmode-results.xml" \
  -logFile "$LOG_DIR/editmode.log"

"$UNITY_PATH" \
  -batchmode \
  -projectPath "$PROJECT_PATH" \
  -runTests \
  -testPlatform PlayMode \
  -testResults "$LOG_DIR/playmode-results.xml" \
  -logFile "$LOG_DIR/playmode.log"

"$UNITY_PATH" \
  -batchmode \
  -quit \
  -projectPath "$PROJECT_PATH" \
  -executeMethod Project.Editor.Validation.ProjectValidationCommand.RunAll \
  -logFile "$LOG_DIR/project-validation.log"
```

Then state it clearly in `AGENTS.md`:

````markdown
Before considering a task complete, run:

```bash
./tools/validate-unity.sh
```

If it fails, fix the errors rather than working around checks.
Do not delete tests, weaken assertions, or bypass Unity validation unless explicitly asked.
````

Windows teams can provide a PowerShell version:

```powershell
$ErrorActionPreference = "Stop"

$UnityPath = if ($env:UNITY_PATH) { $env:UNITY_PATH } else { "Unity.exe" }
$ProjectPath = Get-Location
$LogDir = Join-Path $ProjectPath "Logs"
New-Item -ItemType Directory -Force -Path $LogDir | Out-Null

# -quit is omitted for -runTests: the test runner closes the Editor itself when the run finishes.
& $UnityPath -batchmode -projectPath $ProjectPath -runTests -testPlatform EditMode -testResults "$LogDir/editmode-results.xml" -logFile "$LogDir/editmode.log"
if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }

& $UnityPath -batchmode -projectPath $ProjectPath -runTests -testPlatform PlayMode -testResults "$LogDir/playmode-results.xml" -logFile "$LogDir/playmode.log"
if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }

& $UnityPath -batchmode -quit -projectPath $ProjectPath -executeMethod Project.Editor.Validation.ProjectValidationCommand.RunAll -logFile "$LogDir/project-validation.log"
if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
```

## Always Use Test-Driven Development

TDD in Unity can be split into three layers:

1. **Pure C# unit tests**: no dependency on Unity scenes; test rules, calculations, state machines, and save formats.
2. **EditMode tests**: test ScriptableObjects, editor validation, asset import rules, and Prefab structure.
3.
   **PlayMode tests**: test scene loading, MonoBehaviour lifecycle, input, UI, and gameplay flows.

Agents work best when they write the test specification first and the implementation second. For example, have the agent generate the tests first:

```csharp
using NUnit.Framework;
using Project.Gameplay.Combat;

namespace Project.Tests.EditMode.Gameplay.Combat
{
    public sealed class DamageCalculatorTests
    {
        [Test]
        public void CalculateDamage_DoesNotReturnNegativeDamage()
        {
            int damage = DamageCalculator.CalculateDamage(baseDamage: 5, armor: 999);

            Assert.That(damage, Is.EqualTo(0));
        }

        [Test]
        public void CalculateDamage_AppliesCriticalMultiplierAfterArmorReduction()
        {
            // (20 - 5) * 2 = 30
            int damage = DamageCalculator.CalculateDamage(baseDamage: 20, armor: 5, criticalMultiplier: 2f);

            Assert.That(damage, Is.EqualTo(30));
        }
    }
}
```

A human reviews the tests against the requirements first. Once they are confirmed, have the agent write the implementation:

```csharp
namespace Project.Gameplay.Combat
{
    public static class DamageCalculator
    {
        public static int CalculateDamage(int baseDamage, int armor, float criticalMultiplier = 1f)
        {
            // Armor is subtracted first and clamped at zero; the multiplier is applied afterwards.
            int reducedDamage = System.Math.Max(0, baseDamage - armor);
            return (int)(reducedDamage * criticalMultiplier);
        }
    }
}
```

Good tests are more than quality assurance. They are also a supervision signal for the agent. Without tests, an agent drifts toward "looks reasonable"; with tests, it is forced to match explicitly specified behavior.

## Unity CI: When Local Guardrails Are Not Enough

Local hooks catch only some problems. A Unity project should ideally run the following in CI:

- C# compilation
- EditMode tests
- PlayMode tests
- Custom project validation
- Addressables build or analysis
- Target platform builds
- Static analysis
- Secret scanning
- License and third-party dependency audits

GitHub Actions example:

```yaml
name: Unity Validate

on:
  pull_request:
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true

      - uses: actions/cache@v4
        with:
          path: Library
          key: Library-${{ hashFiles('Assets/**', 'Packages/**', 'ProjectSettings/**') }}
          restore-keys: |
            Library-

      - name: Run EditMode tests
        uses: game-ci/unity-test-runner@v4
        env:
          UNITY_LICENSE: ${{ secrets.UNITY_LICENSE }}
          UNITY_EMAIL: ${{ secrets.UNITY_EMAIL }}
          UNITY_PASSWORD: ${{ secrets.UNITY_PASSWORD }}
        with:
          testMode: EditMode
          artifactsPath: test-results/editmode

      - name: Run PlayMode tests
        uses: game-ci/unity-test-runner@v4
        env:
          UNITY_LICENSE: ${{ secrets.UNITY_LICENSE }}
          UNITY_EMAIL: ${{ secrets.UNITY_EMAIL }}
          UNITY_PASSWORD: ${{ secrets.UNITY_PASSWORD }}
        with:
          testMode: PlayMode
          artifactsPath: test-results/playmode

      - name: Run project validation
        uses: game-ci/unity-builder@v4
        env:
          UNITY_LICENSE: ${{ secrets.UNITY_LICENSE }}
          UNITY_EMAIL: ${{ secrets.UNITY_EMAIL }}
          UNITY_PASSWORD: ${{ secrets.UNITY_PASSWORD }}
        with:
          targetPlatform: StandaloneWindows64
          buildMethod: Project.Editor.Validation.ProjectValidationCommand.RunAll

      - name: Build player
        uses: game-ci/unity-builder@v4
        env:
          UNITY_LICENSE: ${{ secrets.UNITY_LICENSE }}
          UNITY_EMAIL: ${{ secrets.UNITY_EMAIL }}
          UNITY_PASSWORD: ${{ secrets.UNITY_PASSWORD }}
        with:
          targetPlatform: StandaloneWindows64
```

Larger projects can also add:

- Addressables Analyze
- AssetBundle build validation
- Platform-specific smoke tests
- IL2CPP builds
- Android/iOS signing checks
- Memory and build-size budget checks

## "Garbage Collection" for Unity: Periodically Clean Up AI Slop

Even with rules in place, AI slop still appears. A Unity project can have an agent open small maintenance PRs on a schedule, but its permissions and scope must be limited.

Suitable for automated cleanup:

- Deleting unreferenced private methods and classes
- Removing unused `using` directives
- Keeping `AGENTS.md` and `docs/` in sync
- Finding duplicate Manager/Service classes
- Finding Editor code that leaked into Runtime assemblies
- Finding empty directories, stale tests, and broken documentation references
- Finding suspicious field renames that lack `[FormerlySerializedAs]`

Not suitable for automated cleanup:

- Moving assets at scale
- Batch-renaming Prefabs, scenes, or ScriptableObjects
- Changing Project Settings automatically
- Upgrading the Unity version or package versions automatically
- Rewriting the core architecture automatically

Example garbage-collection prompt:

```markdown
Read `AGENTS.md` and `docs/cleanup.md`.
Make one small, safe maintenance improvement only.
Prefer C# cleanup over asset moves.
Do not rename serialized fields.
Do not move `.unity`, `.prefab`, `.asset`, or `.meta` files.
Do not modify third-party assets.
If no safe cleanup exists, leave the repository unchanged.
Before finishing, run `./tools/validate-unity.sh` if Unity is available.
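```

## Security and Supply Chain in Unity Projects

Unity projects carry security risks too, especially:

- Third-party plugins that introduce malicious code
- API keys written into ScriptableObjects, scenes, or `Resources`
- Mobile clients that ship server-side secrets that should never be there
- Asset Store plugins with overly broad permissions
- Abuse of Addressables or remote config
- In-app purchase, economy, and save-validation logic implemented only on the client

Recommendations:

- Use secret scanning on `.cs`, `.asset`, `.prefab`, `.unity`, and `.json` files.
- Never put server-side secrets in the Unity client.
- Validate in-app purchases, currency, leaderboards, and account permissions on the server.
- Review third-party plugin updates.
- Add dependency and license audits to CI.

## Always Consider What the Unity Project Gives the Agent

The output of a coding agent depends heavily on the constraints its repository environment provides. If a Unity project is one giant `Assets/` folder with no asmdefs, no tests, no docs, and no validation script, the agent will tend to produce code that "compiles but is dirty". If the project provides:

- A clear directory structure
- Explicit asmdef boundaries
- Runnable tests
- A project-level validation command
- A strict AGENTS.md
- Asset and scene documentation
- Commit hooks and CI

then the agent is far more likely to write maintainable Unity C# code. The point is not to make the AI "more obedient" but to make the engineering environment itself harder to break.

## Appendix B: Full Project Documentation Templates

Below are ready-to-use versions of the `AGENTS.md` and `docs/*.md` files mentioned above. They are not meant to replace real project documentation. They give coding agents a stable starting point. When you adopt them, replace the placeholders with your real project name, Unity version, scenes, assemblies, and platforms.

### `AGENTS.md`

````markdown
# Project Unity Coding Rules

## Overview

This is a Unity project built with Unity 2022.3 LTS.
Runtime code lives in `Assets/_Project/Scripts/Runtime`.
Editor tools live in `Assets/_Project/Scripts/Editor`.
Project assets live under `Assets/_Project`.
Third-party assets live under `Assets/ThirdParty` or `Assets/Plugins`.

## Read Before Editing

- `docs/architecture.md`
- `docs/conventions.md`
- `docs/scenes-and-entrypoints.md`
- `docs/asset-pipeline.md`
- `docs/testing.md`
- Relevant `.agents/skills/*/SKILL.md` files for the task

## Entry Points

- Boot scene: `Assets/_Project/Scenes/Boot.unity`
- Main menu scene: `Assets/_Project/Scenes/MainMenu.unity`
- Gameplay scene: `Assets/_Project/Scenes/Gameplay.unity`
- Game bootstrap: `Project.Core.Bootstrap.GameBootstrapper`

## Mandatory Rules

- Do not create scripts in `Assets/` root.
- Do not put project code in global namespace.
- Do not edit `.meta` files manually unless the task explicitly requires asset GUID work.
- Do not move or rename assets casually; Unity references depend on GUIDs.
- Do not rename serialized fields without `[FormerlySerializedAs]` and a migration note.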
- Do not modify `ProjectSettings/` unless the task is about build settings, input, quality, graphics, packages, or player settings.
- Do not modify third-party plugin code; wrap it from project code instead.
- Prefer editing existing components over adding duplicate manager classes.
- Before adding a singleton, search for existing services, installers, bootstrap code, and ScriptableObject configs.

## C# Style

- Use explicit namespaces matching assembly/module names.
- Use `private` fields with `[SerializeField]` for inspector references.
- Avoid public mutable fields.
- Avoid `FindObjectOfType`, `GameObject.Find`, and tag-based lookup in gameplay code unless there is no stable reference path.
- Avoid `async void` except Unity event handlers where unavoidable.
- Prefer cancellation-aware async flows when using UniTask or Tasks.
- Do not catch and rethrow exceptions without adding useful context.
- Do not add comments that narrate obvious code.

## Unity Lifecycle Rules

- Keep `Awake` for local initialization.
- Keep `Start` for cross-object initialization that depends on scene setup.
- Keep `Update`, `LateUpdate`, and `FixedUpdate` small and allocation-free.
- Unsubscribe events in `OnDisable` or `OnDestroy` consistently with where they were subscribed.
- Do not mutate imported ScriptableObject assets as runtime state.

## Asset Rules

- `.meta` files must be committed.
- Unity serialization must use Force Text.
- Version Control mode must use Visible Meta Files.
- Do not unpack, restructure, or rename prefabs unless the task explicitly requires it.
- Do not modify scenes unless the requested behavior requires scene wiring.
- Mention prefab, scene, ScriptableObject, Addressables, or asmdef impact in the final response.

## Minimal Changes / No Slop

- Re-read your diff before finishing and remove unused code, stale branches, unused serialized fields, and abandoned helpers.
- Do not add single-use abstractions.
- Do not add speculative null checks everywhere; validate at boundaries and targeted validation points.
- Do not leave commented-out code or tombstone comments.
- Do not perform broad cleanup around an unrelated bug fix.

## Validation

Before considering a task complete, run:

```bash
./tools/validate-unity.sh
```

If validation fails, fix the cause. Do not weaken tests, remove analyzers, delete assertions, or bypass compilation unless explicitly asked.

If Unity is unavailable, state that validation was not run and list the exact command the user should run.
````

### `docs/architecture.md`

````markdown
# Architecture

## Purpose

This document describes the stable module boundaries, dependency direction, and runtime ownership model for the Unity project. Coding agents must read this before changing cross-module behavior.

## Project Layout

```plaintext
Assets/_Project/
  Configs/
  Prefabs/
  Scenes/
  Scripts/
    Runtime/
      Core/
      Gameplay/
      UI/
      Infrastructure/
    Editor/
    Tests/
      EditMode/
      PlayMode/
```

## Assemblies

| Assembly | Path | Responsibility | May Reference |
| --- | --- | --- | --- |
| `Project.Core` | `Assets/_Project/Scripts/Runtime/Core` | Shared runtime primitives, bootstrap contracts, scene loading interfaces | Unity runtime packages only |
| `Project.Gameplay` | `Assets/_Project/Scripts/Runtime/Gameplay` | Gameplay rules, player, enemies, combat, level systems | `Project.Core` |
| `Project.UI` | `Assets/_Project/Scripts/Runtime/UI` | UI screens, widgets, view controllers | `Project.Core` |
| `Project.Infrastructure` | `Assets/_Project/Scripts/Runtime/Infrastructure` | Save, platform, analytics, remote config, adapters | `Project.Core` |
| `Project.Editor` | `Assets/_Project/Scripts/Editor` | Editor tools, validators, build scripts | Runtime assemblies as needed |
| `Project.Tests.EditMode` | `Assets/_Project/Scripts/Tests/EditMode` | EditMode tests | Runtime and editor assemblies as needed |
| `Project.Tests.PlayMode` | `Assets/_Project/Scripts/Tests/PlayMode` | PlayMode tests | Runtime assemblies |

## Dependency Rules

- `Core` must not reference `Gameplay`, `UI`, or `Infrastructure`.
- `Gameplay` must not call UI directly; communicate through events, state, or application services.
- `UI` must not implement gameplay, save, payment, or network business logic.
- Runtime assemblies must not reference `UnityEditor`.
- Editor assemblies may reference runtime assemblies, but runtime assemblies may not reference editor assemblies.
- New modules require an `.asmdef` and an entry in this document.

## Runtime Ownership

- Bootstrapping starts in `Project.Core.Bootstrap.GameBootstrapper` from `Boot.unity`.
- Persistent services are created by the bootstrap flow, not by hidden scene search.
- Scene-specific objects are owned by scene roots and must be released on scene unload.
- Imported ScriptableObject assets are configuration, not mutable runtime state.
- Runtime state that derives from ScriptableObjects must be copied into plain C# objects or scene-owned components.

## Scene Loading

- `Boot.unity` initializes global services and loads the next scene.
- `MainMenu.unity` owns menu UI and user entry flow.
- `Gameplay.unity` owns gameplay scene roots and gameplay-only objects.
- Additive scenes must document ownership, loading trigger, and unload trigger in `docs/scenes-and-entrypoints.md`.

## Adding A Feature

1. Identify the owning assembly.
2. Search for existing services, roots, prefabs, configs, and tests.
3. Add the smallest code change in the owning module.
4. Add or update tests.
5. Update docs if scenes, public contracts, assets, or assembly boundaries change.
6. Run `./tools/validate-unity.sh`.
````

### `docs/conventions.md`

````markdown
# Conventions

## C# Naming

- Namespaces start with `Project` and follow the owning module, for example `Project.Gameplay.Combat`.
- MonoBehaviour class names describe behavior, not vague roles like `Manager` unless the existing architecture already owns that concept.
- Serialized private fields use camelCase. - Public properties use PascalCase and expose read-only state unless mutation is intentional. - Async methods end with `Async` unless they are Unity event handlers. ## File And Folder Rules - One top-level public type per `.cs` file unless types are tightly coupled small value types. - Runtime code goes under `Assets/_Project/Scripts/Runtime`. - Editor code goes under `Assets/_Project/Scripts/Editor` or an Editor-only asmdef. - Tests go under `Assets/_Project/Scripts/Tests/EditMode` or `Assets/_Project/Scripts/Tests/PlayMode`. - Do not place project scripts directly under `Assets/`. ## Serialized Fields - Prefer `[SerializeField] private` over public fields. - Preserve serialized field names. - If a serialized field is renamed, add `[FormerlySerializedAs("oldName")]`. - Do not remove serialized fields without checking prefabs and scenes that may use them. - Required scene or prefab references should be validated in `OnValidate` or project validation. ## Unity Lifecycle - `Awake`: initialize local state and cache local components. - `OnEnable`: subscribe to events or start object-scoped work. - `Start`: perform initialization that depends on other scene objects being awake. - `Update`: keep small and allocation-free. - `OnDisable`: unsubscribe from events or cancel object-scoped work. - `OnDestroy`: release resources not already handled by disable flow. ## Comments - Do not write comments that restate code. - Do write comments for hidden Unity constraints, serialization migration, platform-specific behavior, and surprising ordering requirements. - Do not leave commented-out code. ## Error Handling - Validate at true boundaries: user input, save data, network responses, external SDK callbacks, Addressables loads, and editor validation. - Do not catch and rethrow without adding context. - Use actionable logs that include asset paths, scene names, object names, or keys when relevant. 
## Prohibited Defaults - No new global singleton without checking existing bootstrap/service patterns. - No `FindObjectOfType` or `GameObject.Find` as a first choice. - No runtime dependency on `UnityEditor`. - No broad asset moves during code-only tasks. - No speculative architecture rewrites. ```` ### `docs/scenes-and-entrypoints.md` ````markdown # Scenes And Entry Points ## Scene List | Scene | Path | Purpose | Owner | | --- | --- | --- | --- | | Boot | `Assets/_Project/Scenes/Boot.unity` | Initializes global services and loads the next scene | Core | | MainMenu | `Assets/_Project/Scenes/MainMenu.unity` | Menu flow, profile selection, settings entry | UI | | Gameplay | `Assets/_Project/Scenes/Gameplay.unity` | Main playable loop | Gameplay | ## Boot.unity Purpose: initializes project-wide services and loads the next scene. Main scripts: - `Project.Core.Bootstrap.GameBootstrapper` - `Project.Core.SceneLoading.SceneLoader` Rules: - Do not place gameplay objects in Boot scene. - Do not make Boot depend on UI or Gameplay assemblies. - New global systems must be registered through the bootstrap flow, not via hidden scene searches. - Persistent objects must have clear ownership and teardown rules. ## MainMenu.unity Purpose: contains menu UI and non-gameplay user flow. Main roots: - `MainMenuRoot` - `MenuCanvas` - `SettingsPanelRoot` Rules: - UI controllers delegate to services; they do not implement save, economy, or network logic. - Button events must be backed by serialized references or explicit binding code. - Preserve controller/keyboard navigation when changing menu UI. ## Gameplay.unity Purpose: contains the playable game loop. Main roots: - `GameplayRoot` - `PlayerSpawnRoot` - `LevelRuntimeRoot` - `GameplayCanvas` Rules: - Scene references should be wired in prefabs or root installers. - Do not add duplicate manager GameObjects without checking existing roots. - Gameplay scene objects must not assume MainMenu objects still exist. 
- Scene-owned services must release subscriptions and Addressables handles on unload. ## Adding Or Changing Scenes When adding or changing a scene: 1. Add the scene path and purpose to this document. 2. Document root GameObjects and entry scripts. 3. Update build settings or Addressables docs if loading changes. 4. Add PlayMode tests for critical loading behavior when practical. 5. Mention scene impact in the final response. ```` ### `docs/asset-pipeline.md` ````markdown # Asset Pipeline ## Unity Serialization Project Settings must use: ```plaintext Version Control Mode: Visible Meta Files Asset Serialization Mode: Force Text ``` Rules: - `.meta` files must be committed. - Do not manually edit `.meta` files unless fixing GUID or merge issues. - Do not move or rename assets casually; references depend on GUIDs and paths. - Use Git LFS for large binary assets. - Use UnityYAMLMerge for `.unity`, `.prefab`, `.asset`, `.controller`, `.anim`, and `.mat` files. ## Prefabs - Runtime prefabs live under `Assets/_Project/Prefabs`. - UI prefabs live under `Assets/_Project/Prefabs/UI`. - Gameplay prefabs live under `Assets/_Project/Prefabs/Gameplay`. - System prefabs live under `Assets/_Project/Prefabs/Systems`. - Do not unpack third-party prefabs in place. - Do not rename serialized fields without `[FormerlySerializedAs]` and migration notes. - Prefer prefab variants when a variant is truly needed; do not create duplicate prefabs for small temporary differences. ## ScriptableObjects - Balance configs live under `Assets/_Project/Configs/GameBalance`. - ScriptableObject types live under `Project.Core.Config` or the owning feature namespace. - Runtime systems may read config assets but must not mutate imported asset instances during play. - If runtime mutation is needed, copy ScriptableObject data into runtime state. - Required fields should be validated by `OnValidate` or project validation. ## Addressables - Addressable keys must be stable. 
- Do not hard-code the same key in multiple places; centralize keys in constants or config. - Labels must have clear loading purpose. - Do not move assets between groups without documenting bundle, patch, and remote content impact. - Validate Addressables before release builds. ## Art And Audio - Source art files such as `.psd` and `.blend` use Git LFS. - Runtime textures, models, and audio should follow platform import presets. - Do not change compression, max size, or platform overrides unless the task is about asset optimization or visual/audio quality. - Document large asset additions in the final response. ## Generated Assets - Do not edit generated assets manually. - Generated file locations and regeneration commands must be documented near the generator or in `docs/build-and-release.md`. - If generated assets change, mention the generator command that produced them. ```` ### `docs/testing.md` ````markdown # Testing ## Test Layers | Layer | Location | Purpose | | --- | --- | --- | | Pure C# tests | `Assets/_Project/Scripts/Tests/EditMode` | Rules, calculators, state machines, save formats | | EditMode tests | `Assets/_Project/Scripts/Tests/EditMode` | ScriptableObjects, editor validation, import rules, prefab structure | | PlayMode tests | `Assets/_Project/Scripts/Tests/PlayMode` | Scene loading, MonoBehaviour lifecycle, UI, gameplay smoke tests | ## Commands Run all local validation: ```bash ./tools/validate-unity.sh ``` Run EditMode tests only: ```bash Unity -batchmode -quit -projectPath . -runTests -testPlatform EditMode -testResults Logs/editmode-results.xml -logFile Logs/editmode.log ``` Run PlayMode tests only: ```bash Unity -batchmode -quit -projectPath . -runTests -testPlatform PlayMode -testResults Logs/playmode-results.xml -logFile Logs/playmode.log ``` Run project validation: ```bash Unity -batchmode -quit -projectPath . 
-executeMethod Project.Editor.Validation.ProjectValidationCommand.RunAll -logFile Logs/project-validation.log ``` ## Testing Rules For Agents - Write or update tests before implementation when behavior is non-trivial. - Human review should confirm generated tests describe the intended behavior before implementation begins. - Do not delete tests to make validation pass. - Do not weaken assertions unless the requirement changed. - Prefer pure C# tests for deterministic logic. - Use PlayMode tests for lifecycle, scene, input, and UI behavior that cannot be proven in EditMode. ## What To Test - Calculators, state machines, cooldowns, timers, and progression rules. - Save/load serialization and migration. - ScriptableObject validation and default constraints. - Addressables key resolution and loader ownership. - Scene boot and critical smoke flows. - UI presenter formatting and critical button flows. ## What Not To Over-Test - Unity engine internals. - Trivial property pass-through. - Visual polish that needs human review unless screenshot tooling exists. - Performance claims without profiler or benchmark support. ## CI Expectations Pull requests should run: - EditMode tests - PlayMode tests for critical flows - Project validation command - Build for at least one representative target platform If a check cannot run locally, the final response must say so and name the CI check that should cover it. ```` ### `docs/build-and-release.md` ````markdown # Build And Release ## Unity Version Unity version: `2022.3.x LTS` The Unity version must be pinned in `ProjectSettings/ProjectVersion.txt`. Do not upgrade Unity versions unless explicitly asked. 
## Supported Targets | Platform | Target | Notes | | --- | --- | --- | | Windows | `StandaloneWindows64` | Default CI build target | | macOS | `StandaloneOSX` | Add if release requires it | | Android | `Android` | Requires keystore and signing setup | | iOS | `iOS` | Requires Xcode signing pipeline | ## Local Build Command ```bash Unity -batchmode -quit -projectPath . -executeMethod Project.Editor.Build.BuildCommand.BuildWindows -logFile Logs/build-windows.log ``` Replace the build method with the real project build entry point. ## CI Build Requirements - Checkout must include Git LFS assets. - CI must use the pinned Unity version. - CI must run tests and project validation before release builds. - Build logs must be uploaded as artifacts on failure. - Release builds must not contain development-only scenes or test assets. ## Versioning - Update application version through the documented build pipeline, not by ad-hoc Project Settings edits. - Keep bundle identifiers stable. - Platform-specific version codes must be monotonic. - Mention version or build setting changes in the final response. ## Addressables And Remote Content - Validate Addressables before release. - Document content update restrictions before moving assets between groups. - Catalog and remote paths must match the target environment. - Do not change remote load paths without release owner approval. ## Secrets And Signing - Do not commit keystores, provisioning profiles, API keys, or service account files. - Signing credentials must come from CI secrets or local secure storage. - Unity clients must not contain server-only secrets. ## Release Checklist - `./tools/validate-unity.sh` passes. - Target platform build passes. - Addressables validation/build passes if Addressables changed. - Version and build number are correct. - Scenes in build settings are correct. - No development-only flags are enabled for production. 
```` ### `docs/performance.md` ````markdown # Performance ## Budgets Define real budgets per project and platform. Starting placeholders: | Area | Budget | | --- | --- | | Frame time | 16.6 ms for 60 FPS target | | GC allocations during gameplay | 0 B/frame in hot paths | | Scene load | Project-specific | | Memory | Platform-specific | | Build size | Platform-specific | ## Hot Path Rules - No per-frame LINQ in gameplay or UI hot paths. - No repeated string formatting in `Update`, `LateUpdate`, `FixedUpdate`, or UI polling loops. - Cache repeated component lookups. - Avoid broad hierarchy searches during gameplay. - Prefer event-driven UI updates over polling. - Use pooling for frequently spawned short-lived objects. ## Profiling Workflow 1. Reproduce the performance issue in a stable scene or test flow. 2. Capture baseline profiler data. 3. Make the smallest targeted change. 4. Capture after-change data using the same scenario. 5. Document the measured result in the PR or final response. ## Memory Rules - Release Addressables handles consistently. - Do not keep scene object references in persistent services after scene unload. - Avoid mutating imported ScriptableObject assets at runtime. - Be cautious with large texture, audio, and animation imports. - Use platform import overrides intentionally. ## UI Performance - Avoid forcing layout rebuilds every frame. - Avoid enabling/disabling large UI hierarchies repeatedly in hot flows. - Prefer pooled list items for long or frequently refreshed lists. - Do not leave invisible panels doing work. ## Physics Performance - Use physics layers intentionally. - Avoid unnecessary collision matrix expansion. - Avoid expensive queries every frame without measuring. - Keep `FixedUpdate` logic focused and deterministic. ## Agent Rules - Do not claim performance improved without a measurement or clear mechanical reason. - If profiling was not run, say so. - Mention hot paths touched and expected allocation/frame-time impact. 
```` ### `docs/cleanup.md` ````markdown # Cleanup Rules For AI Garbage Collection ## Purpose This document defines safe maintenance work for periodic AI cleanup. The goal is to remove obvious slop without creating risky asset, scene, or architecture churn. ## Allowed Cleanup Agents may make one small, focused cleanup PR for: - Removing unused private methods, fields, or using directives. - Removing dead code proven unreachable by current callers. - Removing stale comments or commented-out code. - Fixing documentation references to renamed or removed code. - Updating factual sections of `AGENTS.md` and `docs/*.md` to match the current codebase. - Moving editor-only code out of runtime assemblies when the fix is obvious and small. - Adding missing validation around existing documented invariants. ## Disallowed Cleanup Agents must not do these during automated cleanup unless explicitly asked: - Move or rename `.unity`, `.prefab`, `.asset`, or `.meta` files. - Rename serialized fields. - Rewrite core architecture. - Upgrade Unity, packages, or third-party plugins. - Modify Project Settings broadly. - Reorganize folders for aesthetics. - Delete assets just because text search does not find references. - Change public gameplay behavior without a test or explicit requirement. ## Required Process 1. Read `AGENTS.md` and this file. 2. Pick at most one small maintenance concern. 3. Search for references before deleting code. 4. Avoid asset moves and serialized field renames. 5. Run `./tools/validate-unity.sh` if Unity is available. 6. If no safe cleanup exists, leave the repository unchanged. ## Documentation Sync Scope Agents may update factual sections only: - Project structure - Scene list - Assembly list - Entry points - Test commands - Build commands - Asset locations - Validation commands Agents must not rewrite style, tone, architecture principles, or coding rules unless explicitly asked. ## Cleanup Prompt ```markdown Read `AGENTS.md` and `docs/cleanup.md`. 
Make one small, safe maintenance improvement only. Prefer C# cleanup over asset moves. Do not rename serialized fields. Do not move `.unity`, `.prefab`, `.asset`, or `.meta` files. Do not modify third-party assets. If no safe cleanup exists, leave the repository unchanged. Before finishing, run `./tools/validate-unity.sh` if Unity is available. ``` ## Final Response Checklist - State the single cleanup concern addressed. - State files changed. - State validation run status. - State any skipped validation and why. ````
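The dependency rules in the `docs/architecture.md` template above are written for agents to read, but the same table could also be checked mechanically by project validation. The following is a minimal TypeScript sketch of that idea, not part of the original templates. It assumes the `.asmdef` files have already been read into plain objects with `name` and `references` fields, and that references are stored by assembly name rather than as `GUID:` strings. `allowedReferences`, `AsmdefInfo`, and `findDependencyViolations` are hypothetical names.

```typescript
// Allowed project references per runtime assembly, transcribed from the
// "May Reference" column of the Assemblies table. Editor and test assemblies
// are omitted because the table allows them to reference runtime code as needed.
const allowedReferences: Record<string, readonly string[]> = {
  "Project.Core": [],
  "Project.Gameplay": ["Project.Core"],
  "Project.UI": ["Project.Core"],
  "Project.Infrastructure": ["Project.Core"],
};

// Minimal, assumed view of a parsed .asmdef file.
interface AsmdefInfo {
  name: string;
  references: string[];
}

interface DependencyViolation {
  assembly: string;
  reference: string;
}

// Returns every project-to-project reference that the table does not allow.
function findDependencyViolations(asmdefs: AsmdefInfo[]): DependencyViolation[] {
  const violations: DependencyViolation[] = [];
  for (const asmdef of asmdefs) {
    const allowed = allowedReferences[asmdef.name];
    if (allowed === undefined) continue; // assembly not covered by this subset

    for (const reference of asmdef.references) {
      // Only references to other Project.* assemblies are checked here;
      // Unity packages and third-party assemblies are out of scope.
      const isProjectAssembly = reference.indexOf("Project.") === 0;
      if (isProjectAssembly && allowed.indexOf(reference) === -1) {
        violations.push({ assembly: asmdef.name, reference });
      }
    }
  }
  return violations;
}
```

A validation step could fail whenever the returned list is non-empty, which turns "Core must not reference Gameplay, UI, or Infrastructure" from a convention into a check.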

I Built an AI Content Pipeline on Cloudflare's Free Tier, and It Actually Runs

Sun, 03 May 2026 16:12:27 GMT

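![Image](https://pbs.twimg.com/media/HHOT0grbkAA-dHu?format=jpg&name=large)

We scroll through so much information every day, yet the useful parts are hard to remember, hard to store, and hard to find again. A good article goes into bookmarks and gathers dust. An industry weekly means manual copy-paste, running text through a translator, and doing layout every week, and most people give up within two weeks. In an age of information overload, the biggest pain point for an individual creator is not a lack of content but a lack of automated workflow.

Can one person, without buying a cloud server or paying a monthly fee, build an "automated content workbench" entirely on a big vendor's free tier? Yes. In this post we use five off-the-shelf Cloudflare services to build an **"AI tech weekly generator"**. Every day it fetches the sources you care about, has an AI produce a first pass of Chinese summaries and categories, stores the categorized results in a database, and publishes them as a public web page. The setup is not limited to a weekly: you can adapt it into a personal second-brain feed or an aggregator for a vertical niche.

## Architecture at a glance: what each of the five services does

With a Cloudflare account you effectively own a full-stack cloud data center. Our pipeline uses five of its core services:

![Image](https://pbs.twimg.com/media/HHOQSmybIAAGV4W?format=jpg&name=large)

In plain terms, here is the division of labor:

1. **Workers + Cron**: the worker plus an alarm clock. It fetches articles from Hacker News and RSS feeds on a schedule every day.
2. **Workers AI**: outsourced brainpower. It calls open models such as Llama directly to produce Chinese summaries, recommendation reasons, and tags from the title, link, or a body excerpt. If you add body extraction, it can go further and produce in-depth summaries.
3. **R2**: the warehouse. It stores fetched cover images and large raw JSON data (no egress fees).
4. **D1**: the ledger. A handy database that stores the cleaned-up article titles, links, and summaries.
5. **Pages**: the storefront. It renders the data in D1 as a static website that others can visit.

Cost is not a worry: for a personal project running this flow, Cloudflare's free allowance is usually enough (see the cost estimate near the end). Let's get the pipeline running step by step.

## Step 1: Fetch on a schedule with Workers + Cron

First we need a "worker" that collects material for us at a fixed time every day. Once the project is created locally, all configuration lives in `wrangler.toml`. We use `[triggers]` to make the Worker run automatically at 8 a.m. every day (Beijing time; the cron expression itself is evaluated in UTC).

```text
name = "ai-content-pipeline"
main = "src/index.js"
compatibility_date = "2026-04-30"

# Cron Triggers run in UTC: 00:00 UTC is 08:00 in UTC+8
[triggers]
crons = ["0 0 * * *"]
```

The matching code is very simple and goes in `src/index.js`:

```text
export default {
  // Entry point for the Cron trigger
  async scheduled(event, env, ctx) {
    // 1. Fetch from the source (Hacker News as an example)
    const hnResponse = await fetch("https://hacker-news.firebaseio.com/v0/topstories.json");
    const storyIds = await hnResponse.json();

    // Take the first 5 stories
    const top5Ids = storyIds.slice(0, 5);
    const articles = [];
    for (const id of top5Ids) {
      const itemRes = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
      const item = await itemRes.json();
      articles.push(item);
    }

    // Hand the articles to the next step (see below)
    await processArticles(articles, env);
  }
};
```

That is all it takes. Without spending a cent, you have a crawler that never oversleeps. The screenshots below come from one real run of the same flow: the Worker fetches items from Hacker News, the AI produces the Chinese write-up, D1 stores the structured results, R2 keeps the raw JSON, and Pages displays everything.

Cloudflare dashboard: Worker cron trigger

![Image](https://pbs.twimg.com/media/HHOQb6JakAEXhzI?format=jpg&name=large)

Result of manually triggering the pipeline

![Image](https://pbs.twimg.com/media/HHOQf8GaEAAHe8-?format=png&name=large)

## Step 2: Summarize, translate, and classify with Workers AI

The articles are fetched, but reading raw English titles and links still takes time. This is where the AI comes in, producing a first readable Chinese write-up. Add the AI binding to `wrangler.toml`:

```text
[ai]
binding = "AI"
```

Cloudflare hosts a range of open models (such as the Llama 3 family) on its edge nodes. You can call them directly, with no need to register an OpenAI account and top up API credit. We add a `processArticles` function to the code above:

```text
async function processArticles(articles, env) {
  for (const article of articles) {
    if (!article.url) continue;

    const prompt = `
Turn the following news title and link into strict JSON. Do not output Markdown. Fields:
- title_zh: the title in Chinese
- summary_zh: a 2-3 sentence summary in Chinese, judging conservatively from the title and link only
- why_it_matters: 1-2 sentences on why it deserves a creator's or developer's attention
- tags: an array of 2-3 category tags
Original title: ${article.title}
Original link: ${article.url}
`;

    // Call the free Llama 3.1 8B model
    const aiResponse = await env.AI.run('@cf/meta/llama-3.1-8b-instruct-fast', {
      messages: [{ role: "user", content: prompt }]
    });
    const aiResult = aiResponse.response;

    // Assume the AI returned the structured data we asked for
    console.log("AI result:", aiResult);

    // Next, store it
    await saveToDatabase(article, aiResult, env);
  }
}
```

This uses the most basic prompt to demonstrate title-and-link level processing. For deeper processing, the Worker can also fetch the HTML of the original page, extract the body text, and feed that to the AI for an in-depth summary.

## Step 3: Store images and raw material in R2

Structured data goes into the database, but where do the bulky extras go, such as fetched article images and very long raw JSON? D1 would be wasteful; R2 is a good fit.

R2 is Cloudflare's object storage, suited to large files such as raw JSON, images, and page snapshots. One important advantage is that it has none of the egress fees common to traditional cloud storage. Create an R2 bucket named content-store from your terminal and bind it:

```text
npx wrangler r2 bucket create content-store
```

```text
[[r2_buckets]]
binding = "BUCKET"
bucket_name = "content-store"
```

In code, archive the raw fetched data for later analysis:

```text
// Continues the code above
async function saveToStorage(article, env) {
  const fileKey = `raw/${article.id}.json`;
  // Put it straight into R2
  await env.BUCKET.put(fileKey, JSON.stringify(article));
  return fileKey;
}
```

If the frontend needs images later, fetch them and store them with `env.BUCKET.put()`. Note that files in R2 are not public by default. For a web page to load images directly, you need to enable public access on the bucket, bind a custom domain, or have a Worker generate temporary access links.

Raw JSON data in R2

![Image](https://pbs.twimg.com/media/HHOQso0bgAAHHhr?format=png&name=large)

## Step 4: Store articles, links, and tags in D1

Now for the core part. The refined content produced by the AI goes into a D1 database. D1 is essentially serverless SQLite: very lightweight, with no operations work. First create the table by writing a `schema.sql`:

```text
DROP TABLE IF EXISTS articles;
CREATE TABLE articles (
  id INTEGER PRIMARY KEY,
  title TEXT,
  url TEXT,
  summary_zh TEXT,
  why_it_matters TEXT,
  tags TEXT,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```

Run the commands to create the database and the table:

```text
npx wrangler d1 create content-db
# Apply the schema to the local database
npx wrangler d1 execute content-db --local --file=schema.sql
# Apply the schema to the remote database
npx wrangler d1 execute content-db --remote --file=schema.sql
```

Update `wrangler.toml`:

```text
[[d1_databases]]
binding = "DB"
database_name = "content-db"
database_id = "the ID printed in your terminal"
```

Back in the Worker code, insert the data the AI has processed:

```text
async function saveToDatabase(article, aiResult, env) {
  // Assume the AI result has been parsed; placeholders are shown here
  const zhTitle = "AI-generated Chinese title";
  const summary = "2-3 sentence Chinese summary";
  const whyItMatters = "Why this item is worth reading";
  const tags = "AI, automation, news";

  // Write to D1 with standard SQL
  await env.DB.prepare(
    `INSERT INTO articles (id, title, url, summary_zh, why_it_matters, tags)
     VALUES (?, ?, ?, ?, ?, ?)
     ON CONFLICT(id) DO NOTHING`
  ).bind(article.id, zhTitle, article.url, summary, whyItMatters, tags).run();
}
```

At this point a minimal **background fetch pipeline** runs end to end. The code above is simplified to keep the flow clear: the AI result still needs to be split into Chinese title, multi-sentence summary, recommendation reason, and tags, and the R2 archive function still needs to be wired into the main flow. Once those are filled in, content flows into your database automatically at 8 a.m. every day.

Cloudflare dashboard: D1 table

![Image](https://pbs.twimg.com/media/HHOQ3fFbsAAGCow?format=png&name=large)

Response from the D1 data endpoint

![Image](https://pbs.twimg.com/media/HHOQ55UbAAAePku?format=jpg&name=large)

## Step 5: Publish as a public site with Pages

A database alone is not enough; people need to see it. As the last step we build a minimal frontend with Cloudflare Pages.

Pages ships with Functions (built on Workers), so there is no separate backend API to deploy. Just create `functions/api/articles.js` inside the Pages project. Requests to such endpoints count against usage under the Workers rules, so do not assume that "Pages page views are unlimited, so the API is unlimited too".

```text
// Runs when /api/articles is requested
export async function onRequest(context) {
  // Query the bound D1 database
  const { results } = await context.env.DB.prepare(
    "SELECT * FROM articles ORDER BY created_at DESC LIMIT 20"
  ).all();
  return Response.json(results);
}
```

Then write an ordinary `index.html` alongside it that calls this endpoint with a little basic JS:

```text
My AI Tech Weekly

Daily AI Tech Picks
```

Create a Pages project in the Dashboard and upload these two files, or deploy from the command line. Connect the D1 binding, and your personal content site is live.

Final page

![Image](https://pbs.twimg.com/media/HHOQ9_tbsAAtfxF?format=jpg&name=large)

## Is the free allowance really enough?

When people hear "the whole suite", the first reaction is often "won't running this bankrupt me?" Let's do the math. Suppose your weekly fetches **50 articles** a day for AI translation and summarization, and your frontend gets **1,000 visits** a day.

![Image](https://pbs.twimg.com/media/HHORDvoagAAkEoq?format=jpg&name=large)

The conclusion is clear: **as long as you are not running a commercial-scale crawler, Cloudflare's free plan is usually enough to get started with content automation for an individual or a small team. Before going live, recalculate with your own fetch frequency, AI output length, and traffic.**

## Pitfalls to avoid and ways to go further

1. **Cron timeouts**: free-tier Workers have a CPU time limit. Fetching 100 articles in one go and feeding them all to the AI will likely time out. Use several short, frequent scheduled runs (for example 5 articles every 2 hours), or use Queues to split fetching and AI analysis into asynchronous steps.
2. **Neurons accounting for AI**: the Workers AI free allowance is currently measured in Neurons. Prefer small, fast models such as `@cf/meta/llama-3.1-8b-instruct-fast`, which are cheap and good enough.
3. **An extra: semantic search with Vectorize**: to add "search articles by meaning" to the page, run the AI-generated text through an embedding model and store the vectors in Cloudflare Vectorize, which turns the site into a more advanced RAG setup.
4. **Email delivery**: for an email edition of the weekly, connect an email sending service or Cloudflare's email features. Note that forwarding incoming mail, processing incoming mail, and sending mail are different things, so confirm which sending options your account actually has before implementing.

## Closing

Fetching, cleaning, AI distillation, storage, and publishing: with fewer than two hundred lines of code and zero server cost, we assembled a complete "editorial desk" in the cloud. That is the productivity Serverless + AI makes possible. You no longer need to be stuck in daily Ctrl+C and Ctrl+V; repetitive information filtering can be left to machines working quietly on edge nodes you never see.

If you want to build something similar, start by registering an account on the official site, prepare your API sources, and spin up a D1 database to try it out. Which content stream do you most want to automate right now? Xiaohongshu notes? Competitors' WeChat official accounts? Investment research reports? Share it in the comments; your pain point may well be a common one.

**More AI content is also published on the WeChat official account 雨哥聊AI.**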
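The `saveToDatabase` example in Step 4 uses placeholder strings where the parsed AI output should go, and the article notes that the AI result still has to be split into fields. A minimal TypeScript sketch of that missing step could look like the following. It assumes the model returns the JSON fields requested by the Step 2 prompt (`title_zh`, `summary_zh`, `why_it_matters`, `tags`), possibly wrapped in extra prose or code fences. The `AiArticleSummary` type and the `parseAiResult` helper are hypothetical names, not part of the original code or of any Cloudflare API.

```typescript
// Shape of the record the pipeline wants to store; field names follow the
// Step 2 prompt.
interface AiArticleSummary {
  title_zh: string;
  summary_zh: string;
  why_it_matters: string;
  tags: string[];
}

// Hypothetical helper: extract and validate the JSON object in raw model text.
// It returns null instead of throwing, so one bad response does not stop a batch.
function parseAiResult(raw: string): AiArticleSummary | null {
  // Models sometimes surround JSON with prose or fences, so take the outermost {...}.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const fields = parsed as Record<string, unknown>;

  // Read a required, non-empty string field.
  const readText = (key: string): string | null => {
    const value = fields[key];
    return typeof value === "string" && value.trim() !== "" ? value.trim() : null;
  };

  const titleZh = readText("title_zh");
  const summaryZh = readText("summary_zh");
  const whyItMatters = readText("why_it_matters");
  if (titleZh === null || summaryZh === null || whyItMatters === null) return null;

  // Tags are optional: keep only non-empty strings.
  const rawTags = fields["tags"];
  const tags = Array.isArray(rawTags)
    ? rawTags
        .filter((tag): tag is string => typeof tag === "string")
        .map((tag) => tag.trim())
        .filter((tag) => tag !== "")
    : [];

  return { title_zh: titleZh, summary_zh: summaryZh, why_it_matters: whyItMatters, tags };
}
```

Inside `saveToDatabase`, the placeholders could then be replaced by the parsed fields, with `tags.join(", ")` producing the text stored in the `tags` column and a `null` result treated as "skip this article".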

Porting a 30,000-Line Go Project to Rust with Claude Code: Agent Team in Practice and Harness Efficiency Tuning

Sun, 03 May 2026 16:01:34 GMT

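## Background

[mihomo](https://github.com/MetaCubeX/mihomo) (Clash Meta) is a rule-based proxy core written in Go. It supports Shadowsocks, Trojan, VLESS, and other protocols, and is widely deployed on routers and VPSes. I decided to rewrite it in Rust, not out of a "rewrite everything in Rust" obsession but for practical reasons: a smaller binary, lower memory usage, and the safety guarantees Rust's type system brings to network protocol implementations.

The result, [mihomo-rust](https://github.com/madeye/mihomo-rust), contains 11 workspace crates, 31,000+ lines of Rust, 40 technical spec documents, 2 architecture decision records (ADRs), and a full CI pipeline covering unit tests, integration tests, and end-to-end TProxy tests. From the first commit to the M1 milestone being mostly complete, the whole process relied heavily on Claude Code's Agent Team mechanism.

This is not an "AI is amazing" promotional piece. It is an engineering record: what worked, what went wrong, and how tuning the harness configuration made Claude Code genuinely usable on a large project.

![mihomo-rust crate architecture: 31,178 lines of code across 11 crates](https://maxlv.net/assets/2026/chart-mihomo-crate-breakdown.svg)

## Agent Team: four roles and their division of labor

Claude Code's Agent Team lets you run several specialized agents in one session, each with its own responsibilities. In the mihomo-rust project I used four roles:

| Role | Model | Responsibilities |
| --- | --- | --- |
| **PM** (project manager) | Sonnet | Owns the roadmap, sets priorities, writes milestone exit criteria, maintains `roadmap.md` |
| **Architect** | Opus | Writes gap analysis reports and ADRs, makes architecture decisions, reviews technical proposals |
| **Engineer** | Sonnet | Implements code, writes tests, handles CI fixes |
| **QA** | Haiku | Writes test plans, reviews test coverage, maintains the CI status report |

### Why the models are assigned this way

The assignment is not arbitrary. Opus goes to the Architect because architecture decisions need the strongest reasoning, for example deciding whether the gRPC transport should hand-write "gun" frames or pull in tonic (hand-writing won, because the upstream Go code has no protobuf schema to begin with, and tonic would add about 30 dependencies and 2 MB of binary size).

Sonnet handles PM and Engineer because both roles lean toward structured execution: the PM fills in roadmap tables from a fixed template, and the Engineer implements code from specs. Haiku handles QA: test plans are highly templated work, so the fastest and cheapest model is enough.

### Information flow between roles

The four agents do not work in isolation. They share state through the file system:

![Agent Team collaboration model and information flow across the four roles](https://maxlv.net/assets/2026/chart-mihomo-agent-team.svg)

```text
docs/vision.md              ← owned by PM; defines goals and non-goals
docs/gap-analysis.md        ← produced by Architect, consumed by PM
docs/roadmap.md             ← owned by PM; references the Architect's analysis
docs/adr/*.md               ← owned by Architect; non-negotiable architecture decisions
docs/specs/*.md             ← PM owns the format; Architect reviews the technical content
docs/specs/*-test-plan.md   ← produced by QA
docs/ci-status.md           ← owned by QA
```

The key principle: **ADRs decide architecture (non-negotiable), specs fill in details (open to discussion), and test plans verify specs**. This layering prevents decision loops between agents.

## A milestone-driven development cadence

The project is divided into four milestones:

- **M0 (correctness fixes)**: 10 small items fixing security holes, missing wiring, and CI gaps. For example, Bearer authentication for the REST API had been sitting behind `#[allow(dead_code)]`, and GEOIP rule parsing simply returned an error
- **M1 (usable by users)**: filling out protocols, transports, rules, DNS, and the API
- **M2 (performance optimization)**: benchmarks, allocator audit, feature flag pruning
- **M3 (operational maturity)**: hot reload, OpenTelemetry, config validation

M0 and M1 ran in parallel: M0 items are small-scope fixes, so the Engineer could slot them in while waiting for M1 spec reviews.

![Development velocity: commit density rose markedly once Agent Team was fully involved](https://maxlv.net/assets/2026/chart-mihomo-dev-velocity.svg)

### A concrete example: developing the transport layer

The transport layer (M1.A) is a prerequisite for the rest of M1: the VLESS protocol needs a reusable TLS/WebSocket/gRPC transport layer, otherwise every new protocol would copy and paste TLS handshake code. The process went as follows:

1. **Architect** writes ADR-0001, establishing `mihomo-transport` as a standalone leaf crate, defining the `Transport` trait interface, and choosing `Box` trait objects over generics (because transport layer chains must be composed dynamically at runtime from YAML config)
2. **PM** translates the ADR into four ordered roadmap tasks (A-1 to A-4) and annotates dependencies: "VMess unlocks after A-2 is complete"
3. **Engineer** implements in order: first the crate skeleton and TLS layer, migrating Trojan; then the WebSocket layer, migrating v2ray-plugin; then hand-written gRPC gun frames; finally HTTP/2 and HTTPUpgrade
4. **QA** verifies at each step that integration tests still pass: `trojan_integration` and `v2ray_plugin_integration` must not break because of the migration

This process looks heavy: four roles to create one crate. But that structure is what guaranteed several things: gRPC did not pull in unnecessary dependencies (Architect decision), the build order was not scrambled (PM control), and tests stayed green throughout the migration (QA verification).

![Spec-driven development pipeline, using the transport layer as the example](https://maxlv.net/assets/2026/chart-mihomo-spec-workflow.svg)

## CLAUDE.md: the core lever for harness efficiency

CLAUDE.md is the guidance file Claude Code loads automatically at the start of every session. It is the most important tool for improving harness efficiency: written well, it saves the agent from re-exploring the project structure every time.

mihomo-rust's CLAUDE.md is only 101 lines, but it is information-dense:

```markdown
## Build Commands
cargo build --release
cargo test --lib
cargo test --test rules_test              # 78 rule matching tests
cargo test --test trojan_integration      # embedded mock server
cargo test --test shadowsocks_integration # requires ssserver

## Architecture
Listeners → Tunnel (routing) ←→ DNS Resolver
                |
          Rule Matching
                |
  Proxy Adapters / Groups → Remote Server
REST API (Axum) → Runtime control

## Key Patterns
- ProxyAdapter trait — all protocols implement this
- Rule trait — all rule types implement this
- Tunnel — Arc-shared routing engine
```

### Principles for writing a good CLAUDE.md

**Write only what cannot be inferred from the code.** Do not list the path of every file; the agent can find them with Glob. What belongs here: which traits form the architectural skeleton, which tests need external dependencies (ssserver), and which build commands take special arguments.

**Spell out the extension points.** "How to add a new protocol" and "how to add a new rule type" take three lines each and tell the agent which three files to change. This is more effective than a long architecture description, because what the agent needs is actionable instructions.

**Do not keep stale information.** CLAUDE.md is not a changelog. If a decision has already landed in the code (for example, fake-ip has been removed), there is no need to explain in CLAUDE.md why it was removed.

## The Memory system: accumulating experience across sessions

Claude Code's Memory system persists information between sessions. The mihomo-rust project accumulated 7 memories, all of type `feedback`, that is, corrections or confirmations of agent behavior. A few representative ones:

### "Do not add CatchPanic to the router"

```text
prohibits adding CatchPanic or panic-absorbing middleware to axum router. Task #26 requires panics in spawned tokio tasks to abort the process so failures are detectable.
```

This memory comes from a specific incident: the Engineer agent tried to add `tower::catch_panic` to the Axum router to "improve robustness". But QA's test plan required a panic to terminate the process so that it would be detected in the soak test. After this memory was saved, the Engineer did not repeat the mistake in later sessions.

### "tokio::time::pause() does not virtualize system calls"

```text
tokio::time::pause()/advance() only affects sleep/Instant futures, not kernel syscalls like TcpStream::peek(), read(), recv().
```

The Engineer ran into this while writing sniffer tests. `tokio::time::pause()` looks like a way to speed up timeout tests, but it only affects tokio's own timers, not real socket IO. With this saved, the same trap was avoided outright when the boring-tls tests were written later.

### "All teammates must be restarted when a milestone completes"

```text
Mandatory shutdown and respawn all four teammates at milestone completion. Respawn with model assignment: architect=opus, pm/engineer=sonnet, qa=haiku. Do not clear mid-milestone or if any state isn't saved.
```

This is the most important operational rule. Agent Team context windows are finite: after a whole milestone of discussion, the context is full of stale intermediate state. Restarting all agents at milestone boundaries, so that they re-read the documents in the file system from a clean state, is more efficient than carrying the old context forward.

## Upstream divergence policy: the practical value of ADR-0002

One of the thorniest questions in a porting project is whether to replicate upstream bugs. ADR-0002 defines a simple two-class scheme:

- **Class A (security / privacy / routing intent)**: hard error, refuse to load. A user reading the config file would believe they are getting X, while what they actually get, Y, is less secure
- **Class B (performance / compatibility)**: warn once and keep running. Traffic reaches the correct destination, just along a slower path

![Upstream divergence policy: a two-class decision framework](https://maxlv.net/assets/2026/chart-mihomo-divergence.svg)

Concrete cases:

| Scenario | Upstream behavior | mihomo-rust | Class |
| --- | --- | --- | --- |
| VMess `cipher: zero` | Accepted, plaintext transport | Error at parse time | A |
| `alterId > 0` | Runs the deprecated MD5 key derivation | Warns and forces 0 | B |
| sniffer peek IO error | Silently skipped | Logs, keeps the original metadata | A |
| `default-nameserver` contains `tls://` | Accepted, bootstrap loops forever at runtime | Error at load time | A |

The value of this scheme is that when the Engineer agent hits an edge case the spec did not foresee, it has a clear default rule: "when unsure, choose Class A (hard error) and flag it in the PR description". That is far more efficient than pausing for an Architect decision every time. For QA, citing the divergence class in test cases (`Class A per ADR-0002: upstream accepts, we reject`) lets a reviewer see a test's intent at a glance.

## Spec-driven development: 40 documents is not bureaucracy

The project produced 40 spec documents and matching test plans. This looks like over-engineering, but in the agent team model, specs are the key tool for coordinating four agents. Every spec has a fixed structure:

1. **YAML schema**: field definitions in the config file
2. **Struct shapes**: fields and types of the Rust structs
3. **Error types**: an enumeration of all error cases
4. **Divergences table**: divergences from upstream, citing ADR-0002 classes
5. **Test plan**: the test matrix (a separate file)

Why is a spec more efficient than telling the Engineer "go implement VLESS"? Because **a spec is the interface contract between agents**. The Architect defines type signatures in the struct shapes section, the Engineer implements them, and QA generates test cases from the error types. Without a spec, every agent would have to read the upstream Go code itself to work out what to do, and three agents would end up with three understandings of the same problem.

A concrete number: the transport-layer.md spec covers all four M1.A subtasks, because ADR-0001 had already settled the architecture. The spec only needs to fill in the YAML schema, struct shapes, and per-layer tests, about 200 lines. From those 200 lines the Engineer produced the entire `mihomo-transport` crate.

## Efficiency tuning: pitfalls and lessons

### 1\. The context window is the scarcest resource

Each agent in the team has its own context window. Long-running sessions fill the context with early exploration, failed attempts, and intermediate state. The remedies:

- Put key information in CLAUDE.md so the agent does not have to re-explore every time
- Restart all agents at milestone boundaries
- Pass state through the file system (docs/, specs/) rather than the context window

### 2\. Documentation is written for agents, not only for people

In traditional software projects, documentation is written for the next person who reads the code. In the agent team model, documentation also serves as the agents' "system prompt": they read `docs/` to understand project state and decision history. This changes how documents should be written:

- **Use tables instead of prose.** Agents parse tables more efficiently than paragraphs
- **Make references precise.** "See ADR-0001" beats "see the earlier architecture discussion", because the agent can locate the file directly
- **Make status explicit.** Mark every work item "completed / in-progress / blocked" rather than "we discussed this before"

### 3\. Memory should be lean and actionable

The trap with the Memory system is storing too much. mihomo-rust keeps only 7 memories, all of the feedback type, that is, rules of the form "do not do X" or "watch out for Z when doing Y". What is not stored:

- Code patterns and conventions (they can be inferred from the code)
- Git history (`git log` is more authoritative)
- Debugging solutions (the fix is already in the code)
- Temporary task state (use the task system, not memory)

### 4\. Tests are the only reliable way to verify the quality of agent work

Agent-generated code may look correct, but "looks correct" is not "runs correctly".

![Test infrastructure: 619 test functions across 5 layers](https://maxlv.net/assets/2026/chart-mihomo-test-coverage.svg)

The mihomo-rust CI pipeline includes:

- 100+ unit tests
- 82 API integration tests
- 78 rule matching tests
- 5 protocol-level integration tests (Trojan, Shadowsocks, v2ray-plugin, VLESS, boring-tls)
- Dockerized TProxy end-to-end tests
- An MSRV check (making sure the claimed minimum Rust version is real)

Running the full test suite after every Engineer agent commit is a step that cannot be skipped. During ECH/uTLS development, 31 test cases (including C13-C15, real end-to-end handshakes against a BoringSSL server) were the only criterion for deciding "this feature can be merged".

### 5\. Let agents manage their own status documents

ECH/uTLS feature development showed an effective pattern: the PM agent maintains an `ech-utls-status.md` that records the status of 16 tasks, each task's owner, the commit hash that completed it, and key decisions (why boring rather than rustls was chosen as the ECH backend, and why the `random` profile is resolved in `TlsLayer::new` rather than on every connection). This status document is both the collaboration surface for the agent team and a quick reference for human review.

## Numbers and cost

Some objective data:

| Metric | Value |
| --- | --- |
| Total Rust code | 31,178 lines (117 source files) |
| Workspace crates | 11 |
| Largest crate | mihomo-proxy (9,797 lines, 27 files) |
| Git commits | 106 |
| Commits authored directly by Claude | 10 |
| Spec documents | 40 (largest is 695 lines) |
| ADRs | 2 |
| Test functions | 619 (408 sync + 211 async) |
| Integration test suites | 24 |
| CI jobs | 5 (lint, test, tproxy, msrv, macos) |
| Cargo dependencies | 375 |
| Development span | ~4 weeks (2026-02-21 to 2026-04-12) |
| Most commits in one day | 27 (2026-04-08, M0 sweep + 6 specs) |

Only 10 commits are authored directly by Claude (mainly CI fixes and the simple-obfs plugin), but that does not mean Claude contributed only 10 commits' worth of work. Most commits carry my name as author, but the code was produced collaboratively in Claude Code sessions: I reviewed it, edited it, and committed it under my own name. Claude's contribution shows up more in writing specs, generating first drafts of code, carrying out refactors, and maintaining documentation.

## Summary: when Agent Team is worth using

Agent Team is not a silver bullet. It is worth using when:

- **The project is too large for one context window.** mihomo-rust has 11 crates, 31K lines of code, and 40 documents. A single agent cannot hold both the global architecture and local implementation details at once
- **Decisions are needed at different levels.** Architecture decisions (tonic or not), project management decisions (what M1 tackles first), and implementation decisions (the field types of a struct) call for different modes of thinking
- **There is a clear document-driven process.** Agent team collaboration is built on the file system. If your team is not in the habit of writing specs, the agent team's efficiency drops sharply
- **Consistency has to hold across milestones.** The Memory system and the documents keep knowledge from being lost between sessions

It is not worth using for:

- Small projects (< 5K lines), where a single agent is enough
- Exploratory prototyping, where a structured process is a burden
- Projects without test infrastructure, where you cannot verify the quality of agent output

What Claude Code changes is not the question "can AI write code" but "can AI-written code be verified and integrated in an engineered way". Agent Team + CLAUDE.md + Memory + spec-driven development form a complete harness that turns AI assistance from "let's see if it runs" into a repeatable, reviewable, and scalable engineering process.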
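The two-class rule from ADR-0002 is easiest to see on the two VMess rows of the divergence table. The sketch below is written in TypeScript purely for illustration; mihomo-rust itself is Rust, and none of these names come from its codebase. `VmessOptions`, `LoadOutcome`, and `loadVmessOptions` are hypothetical, and the only behavior taken from the article is "reject `cipher: zero`" (Class A) and "warn and force `alterId` to 0" (Class B).

```typescript
// Hypothetical config shape, reduced to the two fields from the table.
interface VmessOptions {
  cipher: string;
  alterId: number;
}

// Class A divergences reject the config; Class B divergences load it with warnings.
type LoadOutcome =
  | { kind: "rejected"; reason: string }
  | { kind: "loaded"; options: VmessOptions; warnings: string[] };

function loadVmessOptions(input: VmessOptions): LoadOutcome {
  // Class A (security): upstream accepts `cipher: zero` and sends plaintext.
  // A user reading the config would expect encryption, so refuse to load.
  if (input.cipher === "zero") {
    return {
      kind: "rejected",
      reason: "Class A per ADR-0002: cipher 'zero' would send traffic in plaintext",
    };
  }

  // Class B (compatibility): alterId > 0 selects the deprecated key derivation.
  // Traffic still reaches the right destination, so warn and force it to 0.
  const warnings: string[] = [];
  let alterId = input.alterId;
  if (alterId > 0) {
    warnings.push(`Class B per ADR-0002: alterId ${alterId} is deprecated, forcing 0`);
    alterId = 0;
  }

  return { kind: "loaded", options: { cipher: input.cipher, alterId }, warnings };
}
```

Returning warnings as data rather than logging them directly leaves it to the caller to emit each one once, which matches the "warn once, keep running" wording of Class B.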

Harness Engineering: Preparing TypeScript Codebases for Coding Agents

Sun, 03 May 2026 15:58:15 GMT

![Image](https://pbs.twimg.com/media/HHXAPmMaMAAaqVa?format=jpg&name=large)

At [@HacktronAI](https://x.com/@HacktronAI), I pushed our product team to fully embrace vibe coding with Claude Code and Codex, while enforcing guardrails and code quality standards.

Vibe coding works best when the codebase has strong **affordances**, a concept in design that describes the possible actions an actor (in this case, a coding agent) can take in relation to an object (in this case, the codebase):

> Affordance: a use or purpose that a thing can have, that people notice as part of the way they see or experience it.

For a coding agent like Claude Code or Cursor to produce productive code instead of "AI slop" that becomes expensive to maintain and clean up later, it is important to build a codebase with obvious structure and automated guardrails. Even the smartest models today cannot reason about every edge case without a good harness. And even with coding agents like Claude Code, designing repositories thoughtfully goes a long way toward improving the quality of the code.

A repository should be treated less like a pile of code that can be executed, and more like an **execution environment for agents**.
Good vibe coding, therefore, would mean that the environment provides:

- Fast validation against "bad engineering"
- A constrained blast radius
- Guardrails that enforce invariants before committing
- Tests and scripts that the agent can use to "vibe-check" itself

## Make the repository legible to agents

Use and set up a monorepo with [PNPM](https://pnpm.io/). If you want to handle frontend and backend microservices across multiple repositories, you need your coding agents to switch context between those repositories, or you have to give them overly broad permissions so they can access every repository in the same session. That is not great. So just use a monorepo.

```plaintext
apps/
  frontend/
  backend/
docs/
  architecture.md
  conventions.md
packages/
  eslint-config/
  shared-utils/
  shared-types/
  typescript-config/
CLAUDE.md
package.json
pnpm-lock.yaml
pnpm-workspace.yaml
turbo.json
```

The monorepo structure lets you create multiple applications that use shared packages. These can be utilities and type definitions. I have also found it useful to standardize the ESLint and TypeScript configuration as shared packages, so they can easily be imported into new applications and packages. For example, once you export an ESLint configuration like this from a shared package:

```typescript
// packages/eslint-config/base.js
import js from '@eslint/js'
import eslintConfigPrettier from 'eslint-config-prettier'
import turboPlugin from 'eslint-plugin-turbo'
import tseslint from 'typescript-eslint'
import onlyWarn from 'eslint-plugin-only-warn'

/**
 * A shared ESLint configuration for the repository.
 *
 * @type {import("eslint").Linter.Config[]}
 * */
export const config = [
  js.configs.recommended,
  eslintConfigPrettier,
  ...tseslint.configs.recommended,
  {
    plugins: {
      turbo: turboPlugin,
    },
    rules: {
      'turbo/no-undeclared-env-vars': 'warn',
      '@typescript-eslint/no-unused-expressions': 'off',
      '@typescript-eslint/no-unused-vars': [
        'warn',
        {
          argsIgnorePattern: '^_',
          varsIgnorePattern: '^_',
          caughtErrorsIgnorePattern: '^_',
        },
      ],
    },
  },
  {
    plugins: {
      onlyWarn,
    },
  },
  {
    ignores: ['dist/**'],
  },
]
```

Every application and package can import directly from this configuration.

```typescript
// apps/frontend/eslint.config.mjs
import { config } from '@my-project/eslint-config/base'

export default config
```

## Skills encode best practices

Skills, for example [NestJS Best Practices](https://github.com/Kadajett/agent-nestjs-skills) and [typescript-advanced-types](https://github.com/sickn33/antigravity-awesome-skills/blob/main/plugins/antigravity-awesome-skills-claude/skills/typescript-advanced-types/SKILL.md), help coding agents produce idiomatic code. Of course, much of this is strongly opinionated, so we also write our own skills to capture the best practices we have learned over the years. If you don't know what those best practices should be, search for them, or point the agent at an example repository that demonstrates strong software engineering principles and let it design the skills itself.

Our engineers use a variety of agents (Claude, Codex, Cursor, and so on), so if we want these skills to be useful and shared across the team, every coding agent needs to use the same set of skills. That is why the skills are stored in **.agents**, while **.codex**, **.claude**, and so on contain symbolic links to the skills in the main **.agents** directory.

```plaintext
.agents/
  skills/
    typescript-expert/
      SKILL.md
    typescript-advanced-types/
      SKILL.md
    [...]
.codex/
  skills/
    typescript-expert -> ../../.agents/skills/typescript-expert
    typescript-advanced-types -> ../../.agents/skills/typescript-advanced-types
    [...]
.claude/
  skills/
    typescript-expert -> ../../.agents/skills/typescript-expert
    typescript-advanced-types -> ../../.agents/skills/typescript-advanced-types
    [...]
[...]
```

## Documentation that agents read and maintain

If written well, CLAUDE.md (or an equivalent file) goes a long way toward providing self-evolving documentation. It can outline the architecture, the tech stack and, more importantly, the rules AI agents should follow.

```plaintext
# VibeSlop - The Best Vibe Coded Application

## Overview

VibeSlop has a NestJS backend and a Nuxt frontend. It is a B2B AI SaaS.

[...]

## Notion Documentation

**IMPORTANT**: VibeSlop has comprehensive documentation in Notion that should be kept in sync with code changes.

**Main page**: https://notion.so/[...]

### Documentation Structure

| Section        | Page ID    | Description                    |
| -------------- | ---------- | ------------------------------ |
| Authentication | `DEADBEEF` | Auth guards, token types, RBAC |
| [...]          | [...]      | [...]                          |

### When to Update Notion Docs

Update the relevant Notion page when:

- Adding new API endpoints → Update API Reference
- Adding/modifying entities → Update Database & Entities
- Changing auth guards or token handling → Update Authentication

[...]

### How to Update

Use the Notion MCP tools:

- `mcp__notionMCP__notion-fetch` - Read existing page content
- `mcp__notionMCP__notion-update-page` - Update page content
- `mcp__notionMCP__notion-create-pages` - Create new nested pages

[...]

## AI Coding Rules (MANDATORY)

These rules are non-negotiable. Every code change, whether new feature, bugfix, or refactor, must comply. Violations must be fixed before committing.

### DTO & OpenAPI Contract

[...]

### TypeScript Strictness

- **No casting** except `as const`. No `as unknown as X`, `as any`, `as SomeType`, `@ts-ignore`, `// @ts-expect-error`.
- **Use enums** instead of magic strings. If a value has a fixed set of options, define an enum.
- **Use optional fields sparingly**: prefer union types (`string | null`) over optional (`string?`) when the field is semantically required but may be absent.
- **No re-declaring types** that already exist in `@my-project/shared-types`, entity definitions, or generated code.
- `pnpm check-types` must pass before committing.

### Architecture

[...]

### Minimal Changes / No Slop

AI-generated code accumulates: narration comments, single-use helpers, dead code from earlier iterations, error handling for cases that can't happen. Before declaring done, re-read your own diff with a hostile eye and cut everything the current implementation doesn't need.
The principle is that a bug fix does not need surrounding cleanup, a one-shot change does not need a helper, and previous iterations are obsolete the moment a later iteration supersedes them.

- **Re-read the diff end-to-end before finishing.** After several iterations, files carry leftovers: replaced methods, unused imports, stale branches, helpers that nothing calls anymore. Delete them. Git has the history; the codebase does not need a tombstone.
- **No narration comments.** Don't explain WHAT (names do that) or reference the task ("added for X", "used by Y flow", "handles issue Z"). Only write a comment when the WHY is non-obvious: a hidden constraint, a workaround, a surprising invariant.
  - ✗ `// Loop through findings and send feedback to Slack`
  - ✗ `// Added for the unfurl flow` / `// TODO: remove old logic once migrated`
  - ✓ `// Stripe retries webhooks on 5xx; dedupe on event.id before mutating state`
- **No commented-out code, no "removed X" tombstones, no backwards-compat shims for code you just deleted in the same PR.** If it's gone, it's gone. Don't keep a renamed `_oldMethod` "just in case".
- **No single-use abstractions.** Don't create a helper, wrapper, base class, or custom decorator until a second caller exists. Three similar lines beats a premature abstraction. `packages/shared-utils/src/status-mapper.ts` is what justified extraction looks like: used across `scan/`, `findings/`, and `cost-estimation/`. Don't manufacture that bar; let duplication prove it.
- **No speculative error handling.** Trust internal callers and framework guarantees. DTOs already validate controller input via `class-validator`, so a service that receives a typed `SendFeedbackDto` (`src/findings/dto/send-feedback.dto.ts`) does not re-check that `reaction` is a string. Validate only at true boundaries: HTTP input, webhook payloads, external API responses, untyped env vars.
  - ✗ `try { return await this.repo.findOne(...) } catch (e) { throw e }`
  - ✗ `if (!user) throw new Error('user required')` where the parameter type is `User`, not `User | undefined`
  - ✗ Wrapping a single `repo.save()` in a try/catch that logs and rethrows
- **Prefer editing existing files and reusing existing types.** Search `src/utils/`, `src/services/`, `src/dto/`, and `@my-project/shared-utils` before writing a new helper. Reuse `PaginationDto` (`src/dto/pagination.dto.ts`) for paginated endpoints instead of defining `page`/`limit` again. Reuse entity types from `@my-project/shared-types` instead of redeclaring shapes. Don't split a 200-line service into four files unless there's an actual reason.
- **Keep the shape minimal.** Controllers stay thin: validate → service → return, no branching, no queries (see `src/findings/findings.controller.ts`). DTOs carry request/response fields only, decorated with `@ApiProperty` + `class-validator`, nothing more (see `src/findings/dto/send-feedback.dto.ts`, `src/dto/pagination.dto.ts`). Entities stay as columns + relations, with no computed getters or lifecycle hooks unless actually needed (see `src/seat/organization-developer.entity.ts`).
- **Frontend caveat:** UI iteration is where slop compounds fastest: unused props, stale Tailwind classes, dead conditional branches from designs two revs ago, state nothing reads. Same rule applies with more force: read the component top-to-bottom against the current design before declaring done, and delete anything the current design doesn't use.

### Quality Gates

- Tests must pass (`pnpm test`) before committing.
- Linter must pass (`pnpm lint`) before committing.
- Type-checker must pass (`pnpm check-types`) before committing.
```

A few points here:

1. We enforce self-documenting development by instructing the agent to update the Notion docs. This assumes the Notion MCP is in use.
2.
We enforce AI coding rules based on behavior we have observed in the past. For example, we saw that frontend code accumulates a lot of clutter because of the nature of UI iteration: it is meant to produce many different variations until the developer is happy. This means coding agents often leave behind a lot of stale and dead code. We found that enforcing the "Minimal Changes" rule helps a lot.

## "Garbage collection" for slop

Even with our best efforts, "slop" is still inevitable. Humans wrote sloppy code before AI too. But AI lets us counter the problem by periodically auditing the codebase for issues such as unreferenced functions and outdated documentation. We do this with a GitHub Actions workflow that runs Claude Code every 24 hours and prompts it to:

1. Clean up poor-quality code according to a set of rules we maintain as documentation in the repository.
2. Update the **CLAUDE.md** described above based on the latest code changes.

```yaml
name: Claude Garbage Collection

on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * *'

concurrency:
  group: claude-garbage-collection
  cancel-in-progress: false

jobs:
  cleanup:
    strategy:
      fail-fast: false
      matrix:
        target_branch:
          - staging
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
      issues: write
      id-token: write
      actions: read
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 1
          ref: ${{ matrix.target_branch }}

      - name: Setup pnpm
        uses: pnpm/action-setup@v3
        with:
          version: 10

      - id: auth
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
          token_format: access_token

      - name: Set NPM_TOKEN for Artifact Registry
        run: echo "NPM_TOKEN=${{ steps.auth.outputs.access_token }}" >> "$GITHUB_ENV"

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '24.x'
          cache: 'pnpm'

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Run Claude garbage collection task
        id: claude-cleanup
        uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          base_branch: ${{ matrix.target_branch }}
          prompt: |
            Read `CLAUDE.md` and `docs/cleanup/README.md`.
            Use `docs/cleanup/` as the source of truth for this garbage collection pass.
            Work only against `${{ matrix.target_branch }}` and keep the change scoped to that branch's current state.
            You may make multiple improvements, but each PR must stay focused on one small, safe maintenance concern.
            Leave the repository unchanged if there is no clear cleanup to make.
          additional_permissions: |
            actions: read
          claude_args: "--allowedTools 'Edit,MultiEdit,Write,Read,Glob,Grep,LS,Bash(git:*),Bash(bun:*),Bash(npm:*),Bash(npx:*),Bash(pnpm:*),Bash(gh:*)'"

  sync-claude-md:
    strategy:
      fail-fast: false
      matrix:
        target_branch:
          - staging
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
      issues: write
      id-token: write
      actions: read
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 1
          ref: ${{ matrix.target_branch }}

      - name: Sync CLAUDE.md with codebase
        id: claude-md-sync
        uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          base_branch: ${{ matrix.target_branch }}
          prompt: |
            Your sole task is to update all `CLAUDE.md` files so they accurately reflect the current codebase on the `${{ matrix.target_branch }}` branch.

            Steps:
            1. Read every `CLAUDE.md` file in the repo (root `.claude/CLAUDE.md` and any nested ones like `apps/my-app/CLAUDE.md`, etc.).
            2. Audit each section against the actual codebase:
               - **Project structure**: list directories under `apps/my-app/src/` and update the tree if modules were added, renamed, or removed.
               - **Key entities**: check `apps/my-app/src/**/entities/*.entity.ts` and update the entity table.
               - **API namespaces**: check all `@Controller()` decorators and update the namespace table.
               - **Key commands**: verify each command in `package.json` scripts still exists.
               - **Environment variables**: check `.env.example` and update the env var list.
               - **Path aliases**: check `tsconfig.json` path mappings.
               - **Shared packages**: check `packages/*/package.json` names.
               - **Guards & auth**: check `src/guards/` and `src/middleware/` for current guard list.
            3. Remove references to files, modules, entities, or endpoints that no longer exist.
            4. Add entries for new modules, entities, or endpoints that are missing from the docs.
            5. Do NOT change style, tone, or conventions sections; only factual/structural sections.
            6. If nothing is out of date, make no changes and do not open a PR.

            Keep the PR focused: only `CLAUDE.md` file changes, nothing else.
          additional_permissions: |
            actions: read
          claude_args: "--allowedTools 'Edit,MultiEdit,Write,Read,Glob,Grep,LS,Bash(git:*),Bash(bun:*),Bash(npm:*),Bash(npx:*),Bash(pnpm:*),Bash(gh:*)'"
```

This has often produced easy-to-merge pull requests and saved us countless hours of manual refactoring and cleanup. It works almost like a garbage collection engine, cleaning up dead code and stale documentation in the background, with almost no manual work from us beyond reviewing the (mostly clean) PRs.

## Make bad code hard to commit

It is a little embarrassing to have Claude Code run **git commit** for you... but this is now the norm and plenty of people do it. So the best approach is to use hooks that enforce quality at commit time.

![Image](https://pbs.twimg.com/media/HHXB961a4AAxcF2?format=jpg&name=large)

You can set this up easily:

```bash
pnpm add -D husky lint-staged
pnpm exec husky init
```

In **package.json**:

```json
{
  "lint-staged": {
    "*.{ts,tsx}": ["eslint --fix", "prettier --write"],
    "*.{json,md,yml,yaml}": ["prettier --write"]
  }
}
```

This ensures that all code at least passes the linting and formatting rules before it reaches GitHub. What about tests and type checks? It is time to go one step further and give the agent...

## One command to validate everything

Agents need a finish line. Once a feature is complete, functional testing can easily be done with Playwright or Cursor's built-in browser. But how does the agent know whether the code is ready for review? You can create a script like this that type-checks, lints, runs the unit tests, and creates a production build:

```json
{
  "scripts": {
    "validate": "pnpm typecheck && pnpm lint && pnpm test && pnpm build"
    [...]
  }
}
```

Then instruct the agent to use the command, for example via **CLAUDE.md**.

```markdown
Before considering a task complete, run:

pnpm validate

If it fails, fix the errors rather than working around the checks.
Do not remove tests or weaken types unless explicitly asked.
```

## Always test-driven development

Agents only have an edge when they can run the "code → test/verify → code again" loop with high confidence that the test/verify step truly reflects what the developer wants. This is where the tried-and-tested TDD methodology really shines.

First, you describe the expected specification to the agent. You can write a Markdown file for this. Next, the agent generates test cases. Now you manually inspect these test cases to see whether they reflect the behavior you want:

```typescript
it('does not charge customers twice for the same billing period', () => {
  // ...
})
```

If they don't, then the agent should change the tests. Once you're satisfied with the test spec, then (and only then) get the agent to start doing the real coding work. For coding agents, a good test suite is not only good documentation, but also serves as great supervision.
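To make the test-first loop concrete, here is a minimal sketch of what an implementation satisfying the billing test above could look like. Everything in it is hypothetical: `BillingLedger`, its methods, and the `runTest` stand-in (used in place of a real runner's `it` so the sketch runs without a framework) are illustrative names, not part of any real billing system.

```typescript
// Hypothetical charge record, used only to illustrate the test-first loop.
interface Charge {
  customerId: string;
  period: string; // e.g. "2026-05"
  amountCents: number;
}

class BillingLedger {
  private readonly charges: Charge[] = [];

  // Idempotent: a second charge for the same customer and period is rejected.
  charge(customerId: string, period: string, amountCents: number): boolean {
    const alreadyCharged = this.charges.some(
      (c) => c.customerId === customerId && c.period === period,
    );
    if (alreadyCharged) return false;
    this.charges.push({ customerId, period, amountCents });
    return true;
  }

  // Sum of everything charged to one customer, in cents.
  totalFor(customerId: string): number {
    return this.charges
      .filter((c) => c.customerId === customerId)
      .reduce((sum, c) => sum + c.amountCents, 0);
  }
}

// Minimal stand-in for a test runner; a real project would use the runner's `it`.
function runTest(name: string, fn: () => void): void {
  fn();
  console.log(`ok - ${name}`);
}

runTest("does not charge customers twice for the same billing period", () => {
  const ledger = new BillingLedger();
  ledger.charge("cust_1", "2026-05", 4900);
  ledger.charge("cust_1", "2026-05", 4900);
  if (ledger.totalFor("cust_1") !== 4900) {
    throw new Error("customer was charged twice for the same period");
  }
});
```

The point of reviewing the test before any implementation exists is visible here: the test name states the business rule, and the assertion pins it down to a number the agent cannot reinterpret.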
## CI where local harness engineering isn't enough

Local hooks can only catch the more obvious problems. In the end, CI is where many bugs are found before they make it to production. One area where CI checks are most useful is security. It's no secret that vibe coding has produced a lot more software vulnerabilities in recent months! When agents generate code quickly, they also generate more places for auth checks to be skipped, dependencies to sprawl, and business logic assumptions to break.

For example, tools like [GitGuardian](https://www.gitguardian.com/) can catch accidentally committed secrets, and [Socket](https://socket.dev/) can catch vulnerable or suspicious dependencies to stop supply-chain attacks. For deeper application security issues, especially the kinds generic scanners struggle with, you can also use AI-native tools like [Hacktron](https://www.hacktron.ai/blog/introducing-hacktron-review) in CI to review pull requests for real code-level vulnerabilities: broken authorization, unsafe business logic, and other security regressions that require more context than simple pattern matching.

Traditional scanners still rely on known syntactic patterns, and AI reviewers mostly cover functional testing and code quality issues. The advantage of tools like [Hacktron](https://www.hacktron.ai/blog/introducing-hacktron-review) is that they use context-aware analysis to find real security vulnerabilities introduced over the lifetime of your organisation, including the security issues that Claude and Codex miss.

## Always think about affordance

I hope this article has been helpful to you. I've outlined some techniques and ways that we think about vibe coding while enforcing code quality and security. The key thing to bear in mind is to always think about what your codebase and development environment are affording to the model. The output of your coding agent will depend heavily on that, because the environment dictates the constraints within which these agents operate.

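Many people asked about my AI UI workflow, so I broke it down into one complete pipeline: requirements and goals first, then the UI Spec, then visual exploration, then component-based implementation, and finally fixes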

Sat, 02 May 2026 14:45:09 GMT

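Max.King *May 2, 2026, 13:36*

MaxKing宝藏: full-stack developer × quantitative trading × heavy AI user. This is where I record how I actually use AI to **improve efficiency, solve problems, and optimize workflows**, and share the judgment, pitfalls, and reusable methods behind the tools.

I recently published an article: [Stop using Codex only to write code: it is better suited to running the whole UI production line](https://mp.weixin.qq.com/s?__biz=MzkwNzU5OTI0OA==&mid=2247483929&idx=1&sn=4615c470a1bc93064accaf47f625201d&scene=21#wechat_redirect)

My intent there was to use one tool fewer wherever possible: if Codex can handle everything, why add another? But many readers left comments about the workflow of generating an image with AI and then generating code from it. From those comments I found that the most common mistake is not picking the wrong tool but getting the order wrong from the very start.

If all you need is one image, many tools can do that. Generating a piece of code from an image is also something quite a few tools can attempt now. But a page in a real project is not a static image. It has business goals, module priorities, data states, responsive adaptation, component reuse, and ongoing maintenance cost.

So I prefer to break AI UI delivery into one complete pipeline: requirements and goals first, then the UI Spec, then visual exploration, then component-based implementation, and then screenshot comparison to pull the deviations back round by round. This article follows that order.

I call this pipeline the **AI UI workflow**. It is not as simple as "produce an image, then turn the image into code"; it is a set of steps that moves a page from an idea to a runnable, maintainable state. The pipeline has roughly six steps:

1. Requirements and goals: make clear who the page serves, what problem it solves, and what the user's core action is.
2. UI Spec: break the page into a structured description covering modules, components, states, responsive behavior, and acceptance criteria.
3. Visual exploration: generate visual reference images from the UI Spec to check information hierarchy, module relationships, and visual style.
4. Component-based implementation: use Codex / Cursor to build the React page from the UI Spec plus the reference image, reusing the component library first.
5. Screenshot comparison and correction: compare browser screenshots with the reference image (a script can generate the screenshots automatically) and fix layout, spacing, density, and states item by item.
6. Delivery and consolidation: turn the prompts, UI Spec, component structure, mock data, and fix list into templates.

AI UI delivery is not a direct image-to-code conversion. The easiest place to get burned is when the image looks fine and only at implementation time do you discover that the structure and the states are wrong.

![Image](https://mmbiz.qpic.cn/sz_mmbiz_png/76wHjqI4MlRcpBVsibrbiazbKSsicNCfy6tFL4cTdM4mVVwicnw0TNtNcJBx6dRZkEDmYbVSrwX7Qv61QWnQM5ByZFgDpNuczYZZriaWZLtJM1j0/640?wx_fmt=png&from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=0)

## Why do you need a workflow?

Many people building pages with AI jump straight to the tool layer. Which tool generates the image? Which tool turns the image into code? Should it go through Figma? Should I just drop in a screenshot? How should Codex, Cursor, and Claude Code be combined?

These questions matter, but without a workflow, more tools only means more mess, because **producing the image, writing the code, and fixing the page are three different problems**.

- Producing the image settles the visual direction. It answers: roughly what should this page look like?
- Writing the code settles the engineering implementation. It answers: how is the page split into components, how is data wired in, how is it maintained?
- Fixing the page settles the delivery deviation. It answers: where does the generated result differ from the expectation, and how do we converge step by step?

![Figure 1](https://mmbiz.qpic.cn/mmbiz_png/76wHjqI4MlQ8JxwKbtPSvesorsUZ05j9zPQwfUtib1ib3ic2ibEkp1bY1icMvMT5s6Wpuoy9oe1hZCs7J3VL4EkIFxaxKAOjaKdpgXnPnaDstnZ0/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=1)

At first you have AI generate an image and it looks great; then you convert the image to code and find the structure is wrong; then you ask the AI to change the code and it starts changing things at random; in the end you tune it by hand anyway. This kind of problem usually does not mean a particular tool is bad. It means the steps before and after are not aligned.

So my judgment is direct: AI UI delivery cannot rely on single-point tools; it needs a workflow with explicit inputs and outputs. For every step you need to know: what goes in? What comes out? Who judges it? What is the bar for moving to the next step? Only then is AI no longer "randomly generating something for you" but truly part of the development process.

## Step 1: break down requirements and goals

When I build an AI page now, the first step is neither opening an image generator nor opening a code editor. The first step is to ask: what problem is this page actually supposed to solve?

Take a trading dashboard page. You cannot just say "I want a premium-looking trading back office". That requirement is too empty. A better breakdown starts by making the people and the goals explicit: who is this page for? What is the most important thing after the user opens it? What is the most important information on the page? Does the user have a key action? At what point does the page count as meeting the requirement?

A trading dashboard is not simply "a good-looking back office". A more accurate description would be: this is an account home page for individual traders / professional traders, and its goal is to let users quickly check account risk, current positions, trading signals, and recent activity after logging in. The page priority is: risk alerts > account overview > positions table > signal panel > recent activity.

This paragraph looks ordinary, but it determines every later result. If this step is unclear, the AI fills in the blanks itself. It may make the page flashy while the risk module does not stand out. It may add lots of charts while the truly critical position information is unclear. It may produce something that looks like a showcase rather than a usable business page. Only once the goal, the user, the main path, and the information priority are stated clearly does the later work have a coordinate system.

## Step 2: turn the requirements into a UI Spec

Once the requirements are clear, I do not immediately ask AI to draw. I first put together a **UI Spec**.

A UI Spec is a structured page specification written for the AI and for the engineering implementation. It is not concerned with "does it look good" but with page purpose, modules, components, states, and layout. In other words, it breaks a still-vague page idea into a structure that later steps can execute directly. At a minimum, a page needs to state: what is the page's purpose? Who is the target user? What is the core action? Which modules does the page have? What component type is each module? Which states exist? How does it adapt between desktop and mobile? How is the page accepted?

Using the trading dashboard again, it can start out like this:

```yaml
page:
  name: Trading Dashboard
  purpose: Help users quickly check account risk, positions, and trading signals
  target_user: Individual traders / professional traders
  primary_action: Check current account risk
  layout_type: dashboard
sections:
  - name: Account overview
    component_type: Metric Cards
    priority: high
  - name: Risk alerts
    component_type: Alert Card
    priority: high
  - name: Positions table
    component_type: Data Table
    priority: medium
  - name: Signal panel
    component_type: Signal Cards
    priority: medium
states:
  - loading
  - empty
  - error
  - normal
```

![Figure 2](https://mmbiz.qpic.cn/sz_mmbiz_png/76wHjqI4MlRGoJI09TrIcQlHpj7R8iaRvyuh4XD7lk8UxXdx3eibUwPvVE0oroAcibCIM8A6Eam22Xl18UB6VVDLjeY2etjqXoialDeBqBVOmjY/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=2)

This document is very valuable, because the later visual generation, code implementation, and screenshot correction can all revolve around it. Without a UI Spec, the AI can only guess the structure from one sentence or one image. With a UI Spec, the AI at least knows how the page should be organized. The UI Spec is the most critical middle layer in this workflow. It solves the problem of getting from a "vague idea" to an "executable page structure".

## Step 3: then do visual exploration

Only with the UI Spec in hand do I move to visual exploration. Here you can use gpt-image-2 or another image generation tool. But the goal of this step is not to let AI draw an arbitrary "premium page". I tell it explicitly:

- This image is only a visual reference, not the final design.
- First make sure the page structure, information hierarchy, and module relationships are clear.
- The page should look like a real SaaS product interface, not a concept poster.
- No excessive sci-fi, no complex 3D, no meaningless decoration.
- It must later be implementable with React + Tailwind + a component library.

In other words, visual exploration mainly checks four things. First, is the information hierarchy clear: can the user see the most important information at first glance? Second, are the module relationships reasonable: is the relationship between account overview, risk alerts, positions table, and signal panel clear? Third, is the visual density appropriate: a trading dashboard cannot be too empty, nor a jumble. Fourth, is it suitable for component-based implementation: can the cards, tables, buttons, badges, and status hints be split into real components?

There is one very important judgment here: the visual image is not the source; the UI Spec is the source. The image only helps us confirm the visual direction. It cannot decide the product structure, nor replace engineering constraints. If some details in the generated image are good, absorb them. If the generated image conflicts with the UI Spec, I trust the UI Spec first.

## Step 4: component-based implementation

Only after the visual direction is confirmed does code implementation begin. For this step I usually use a coding agent such as Codex / Cursor. But I do not just hand it an image. I give it all of the following together:

- UI Spec
- Visual reference image
- Tech stack
- Component library constraints
- Page states
- Mock data requirements
- Acceptance criteria

For example, the tech stack can be constrained up front to:

```
React
TypeScript
Tailwind CSS
shadcn/ui
```

At the same time I require it to:

- Reuse base components such as Card, Table, Badge, Button, Tabs, and Alert first.
- Not write a pile of unmaintainable code just to reproduce a visual effect.
- Not put everything into one big component.
- Keep mock data in one place, in `mockData.ts`.
- Support the four states loading, empty, error, and normal.
- Support at least desktop and mobile in the responsive layout.

The goal of the implementation stage is to build components according to the UI Spec while staying as close as possible to the visual reference. You have to accept one reality here: the first version of the code is usually not perfect. The layout may be roughly right but the spacing not good enough. The component structure may be right but the visual density still needs tuning. Desktop may look fine while mobile still needs work. That is fine. What matters most in the first version is that it runs, the structure is right, the component boundaries are clear, no state is missing, and later screenshot correction is possible.

## Step 5: screenshot comparison and correction

Many people are disappointed with AI pages because they expect a perfect result from a single generation. I no longer expect that. I care more about whether it can enter a stable correction loop. The loop is:

```
Reference image
↓
Code implementation
↓
Browser screenshot
↓
Compare differences
↓
Local fixes
↓
Screenshot again
```

This step is very important, because the real page in the browser will always differ from the static visual image. The real page has to deal with width, data length, font rendering, different screens, and the loading, empty, and error states. So I have the AI, or I do it myself, compare: is the layout structure consistent? Is the module order correct? Is the primary versus secondary information clear? Is the card spacing too loose or too tight? Is the table density appropriate? Do the risk alerts stand out? Does the mobile layout break?

Then fix only 3 to 5 problems at a time. I do not recommend saying "this page doesn't look right, rewrite it". That makes it very easy for the AI to break the parts that were already correct. A better way is: "Please do not rewrite the whole page. Fix only the following 3 problems: the risk alerts lack visual weight, the positions table is too cramped, and the card spacing on mobile is too large. After the change, state which components and styles were touched." An AI page is not generated in one shot; it converges round by round.

## Step 6: delivery and consolidation

Many people stop once the page is usable. But I now care more about the final step: consolidation. What is truly valuable is not that this one page was generated successfully, but whether the next one can be faster. After a page is done, I try to consolidate the following:

- Requirement breakdown for the page
- UI Spec
- Visual generation prompt
- Code implementation prompt
- Component breakdown
- Mock data
- Screenshot fix list
- Acceptance criteria

For example, once this trading dashboard is done, it can later be reused for account home pages, data boards, risk control pages, strategy monitoring pages, and admin home pages. You only need to replace the business fields, module priorities, and visual style to quickly generate the next version. Instead of asking the AI from scratch every time, turn each success into a template. That is the real compounding return of an AI workflow.

I have organized this process into an "AI UI delivery workflow resource pack".

What to do next: save this article, and next time you build an AI page, check the workflow order before opening any tool. If you are also working on page delivery, say in the comments which step blocks you most: requirements, UI Spec, visuals, code, or correction. If you want to read the next article, follow me; I will keep breaking down the judgments that come before building an AI page.

\- END -

About MaxKing宝藏: I am MaxKing, a full-stack developer, quantitative trading practitioner, and heavy AI user. What I share here is not distant concepts but the judgments left over from real use, building, and mistakes. If this article helped you, feel free to **like and share** it, and feel free to add me as a contact to talk about AI tools and automation practice.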
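The requirement from step four that a page must support the loading, empty, error, and normal states can be sketched as a discriminated union. This is one possible way to enforce it, under stated assumptions: the `Position` and `FetchResult` shapes and the function names are hypothetical and chosen only for the trading dashboard example.

```typescript
// Hypothetical row type for the positions table in the dashboard example.
interface Position {
  symbol: string;
  quantity: number;
}

// One variant per state required by the UI Spec.
type PageState =
  | { kind: "loading" }
  | { kind: "empty" }
  | { kind: "error"; message: string }
  | { kind: "normal"; positions: Position[] };

// Assumed shape of a data-fetching result; real hooks will differ.
interface FetchResult {
  loading: boolean;
  error?: string;
  data?: Position[];
}

// Map a fetch result onto exactly one of the four page states.
function toPageState(result: FetchResult): PageState {
  if (result.loading) return { kind: "loading" };
  if (result.error) return { kind: "error", message: result.error };
  if (!result.data || result.data.length === 0) return { kind: "empty" };
  return { kind: "normal", positions: result.data };
}

// Exhaustive switch: adding a fifth state makes the `never` assignment fail to compile.
function renderLabel(state: PageState): string {
  switch (state.kind) {
    case "loading":
      return "Loading positions...";
    case "empty":
      return "No open positions";
    case "error":
      return `Failed to load: ${state.message}`;
    case "normal":
      return `${state.positions.length} open position(s)`;
    default: {
      const unreachable: never = state;
      return unreachable;
    }
  }
}

console.log(renderLabel(toPageState({ loading: false, data: [] })));
```

With this shape, a component cannot forget a state silently, which is the same guarantee the screenshot correction loop tries to recover after the fact.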

Stop using Codex only to write code: it is better suited to running the whole UI production line

Sat, 02 May 2026 14:45:04 GMT

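Max.King *April 27, 2026, 21:29*

When building pages recently, one impression has become clearer and clearer: asking AI to write the code directly is already fast, but the page often turns into a "usable back-office form". The buttons are there, the layout is there, the logic runs, yet at a glance it does not look like a carefully polished product page.

At that point the problem is no longer coding ability. You can ask Codex to keep changing colors, spacing, and cards, and it will. But you will notice that you are constantly playing translator between product manager, designer, and frontend developer. What really wastes time is not writing code but thinking the page through.

This is what I find interesting about gpt-image-2. It is not just there to "draw an image". More precisely, it can become the visual production node in a Codex workflow: Codex breaks down the plan first, gpt-image-2 produces the UI, and Codex then turns the UI into code.

## UI problems are often not code problems

When I used to build internal tools, the usual approach was to just ask: write me an admin page with a list, filters, details, and export. The AI writes it quickly. The problem is that the first version is usually a collection of features: all the information is there but the hierarchy is weak; all the buttons are there but it is unclear which action matters most; the page runs but has no product feel.

The difficulty of UI is not only placing features on the page. What it really has to solve is: what does the user see first? Which action is the main path? How are abnormal states shown? Which information should be de-emphasized? If these questions are not thought through first, the faster the code is written, the faster the rework comes.

![Figure 1](https://mmbiz.qpic.cn/sz_mmbiz_png/76wHjqI4MlRiaoficRsrxS3l0vf5gDc9M4gc3VjwOKZBEjibg2sEfssaeTfJf9qTibiaR1hjamL5icB5o7gmjw9Gn1d2qbQy31tjnxibXNqtFibAqNw/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=0)

This is the value of putting Codex and gpt-image-2 together.

## Codex breaks down the plan first, then gpt-image-2 provides the visuals

Codex first does what it is good at: breaking business goals down into page structure. For example, is this page for operations staff to look at data, or for users to complete a configuration? Does it need to emphasize trends or actions? How many states should it have, and how are the empty, loading, and error states handled?

Then gpt-image-2 takes over and turns this structure into a UI direction you can look at. This step is critical, because once there is a visual draft you are no longer imagining the page in your head. You can judge directly: is the page too heavy? Is the primary button too dominant? Is the information density right for the user? The visual draft is not the end point; it is something to discuss.

In the past, small teams without design resources were the most likely to skip this step. Everyone wrote code directly and then changed styles in the code. After enough changes, nobody could say whether the problem was the business, the design, or the components. Now the pipeline can run differently: Codex produces the page plan, gpt-image-2 produces the visual draft, people only judge and choose, and Codex then breaks the agreed direction into components.

## What is really saved is the cost of switching roles

This change is especially noticeable for solo developers. When one person builds a product, the most painful part is not being unable to write some component but having to switch back and forth between roles. You are thinking about the user path and a second later you have to think about CSS; you have just decided the page hierarchy and now you have to handle form state.

The Codex + gpt-image-2 combination is essentially about reducing role switching. Codex is like a collaborator who understands engineering boundaries. It knows the page eventually has to become components, state, APIs, and styles. gpt-image-2 is like a visual amplifier that turns abstract descriptions into observable proposals. Only together do the two start to resemble a UI production line.

Do not misread this. It does not mean designers are unimportant, nor that frontend developers are unimportant. On the contrary, human judgment matters more. You have to judge whether the page fits the business goal, whether the visuals serve the main path, which details are worth doing, and which are only decoration.

![Figure 2](https://mmbiz.qpic.cn/sz_mmbiz_png/76wHjqI4MlRsG2T2gibVfZNXicvJTKUhB2RsG6AzUHd73eHfax8pofUx0uvhEHUicXWsd1TJySRnA6ylVPCAPTa1YbszG3EiaYOK6UJPtyxCADE/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=1)

AI is responsible for laying out the possibilities quickly; people are responsible for deciding which path is worth taking.

## The three scenarios this pipeline suits best

I recommend using this process in three kinds of scenarios.

The first is the first version of a new product's pages. At this stage there is no need to write heavy engineering code right away. Let Codex first write down the page goals, user actions, and information hierarchy, then let gpt-image-2 produce a few visual directions. You will quickly discover that much of your own requirement was never thought through.

The second is improving the experience of internal tools. Many internal systems are not unusable; they just look too much like stacked tables. Using Codex to break down roles and tasks first, and then letting gpt-image-2 propose a clearer way to organize information, often finds a more effective change than "switching the theme color".

The third is page generation within automated delivery. Say you already have a content system, an operations back office, or a data board, and in the past every new page required someone to think through the structure by hand. From now on you can hand the business constraints to Codex, let it generate the page plan, the visual prompt, and the component list, and then move into code implementation.

The more this pipeline runs, the more the returns compound. Each run gives you more than one page. You also accumulate a set of page structure templates, a visual prompt convention, and a way of breaking down components. The next time you build a similar page you are not starting from zero. What is truly valuable is not any single generated result but a reusable production method.

## Next time you build a page, don't rush to ask for code

If you want to try this now, do not start by asking "generate a beautiful page for me". A better way to ask is: who does this page serve? What action does the user come in to complete? What is the most important information? Which states exist? What content can be de-emphasized? Which modules will be reused in the future?

Hand these questions to Codex and let it produce a plan first. Once the plan is stable, let gpt-image-2 generate the UI. Once the visual direction is chosen, let Codex return to the code.

This is where I think gpt-image-2 really changes UI production: Codex generating UI, an AI UI production line, and a gpt-image-2 frontend workflow. It is not about writing a few fewer lines of CSS. It turns UI from "imagination in your head" and "trial and error in the code" into a pipeline that can be broken down, observed, and reused.

\- END -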

Stop converting UI images straight into code: look at this UI Spec template first

Sat, 02 May 2026 14:44:54 GMT

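Max.King *May 1, 2026, 22:07*

Someone posted a freshly generated back-office image in a group chat and casually asked: this image already looks decent, can I just hand it to an image-to-code tool? I stared at the image for two seconds, and my first reaction was not "does it look good" but whether it told the tool who the page is for, what to look at first, how the states are filled in, and how it folds on mobile.

The image really is smooth. Shadows, whitespace, and hierarchy are all there, and at first glance it is easy to lower your guard. But once you put it in a browser, long data squeezes the cards; when the screen shrinks, buttons and lists push against each other; with a few state pages missing, the page looks like it has lost its skeleton. At that moment it becomes very concrete: **an image can give AI a visual result, but it cannot give it engineering judgment.**

That comment actually hit the point precisely: a UI Spec is not one more step of writing; it is stating the structure clearly first, and then letting the tools draw, write, and deliver.

![Figure 1](https://mmbiz.qpic.cn/sz_mmbiz_png/76wHjqI4MlRsmV4wrvnGOBgOfqw6QXo7TqlUIOM2X7MwibsicHUoejb2jLS5CxiabhPZlJtib9ibTs2ca7g3m49to6xZERaXyIVxztjic0ZMgON8U/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=0)

## Why not convert the image straight into code?

The most valuable thing about that comment is not its opinion of image2 but that it exposes the real issue: what you have to deliver is not an image, it is a page.

Many people, on first seeing an AI-generated UI image, are fooled by its appearance. Dark background, soft shadows, tidy cards, decent icons: inside a single image it does look finished. But as soon as you put it into a real business context, problems appear quickly: long data stretches the cards; a narrow screen collapses the layout; too many buttons blur the hierarchy. This is not simply "AI isn't good enough". More precisely, the input you gave it was never enough for it to make engineering judgments.

## What exactly is a UI Spec?

Many people treat "an image that looks like an interface" as a UI design. The confusion is common because at first glance they really do look alike. But the two solve different things. An image mainly answers "what does it look like"; a design also has to answer "how is it organized, how does it respond, how does it switch states, how is it reused". The former leans toward appearance; the latter is closer to the skeleton of the page.

An image can tell you color and mood, but it can hardly tell you hierarchy, constraints, and states. The same image may look complete on a large screen and start to crowd in a mobile preview; placeholder copy looks comfortable, but once real data goes in, the whitespace and card boundaries tighten immediately. That feeling of "the image is solid, the page is shaky" is usually exposed right here. And these pieces of engineering information are exactly what the frontend has to deliver.

Think of it as building a house from a rendering. The rendering shows the general style but does not tell you how load-bearing walls, pipes, floor slabs, and circulation are arranged. A page is the same: with only an image, the code tool can only guess. The image is not the source; it is only a visual expression. Whether the page can run is decided by structure, not by pixels.

![Figure 2](https://mmbiz.qpic.cn/mmbiz_png/76wHjqI4MlS4icyT7lvUTPF8XLRrEaibRibHia4qRmOz3Zsh2ILIRBBY1mCOzVpUjaia8Uf7okkymsYdKDKvYeprPAaEubC9CDFsm565e99OciaNQ/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=1)

## Minimum viable UI Spec template

| Information the frontend really needs | Provided by a single image? | Result when missing |
| --- | --- | --- |
| **Component hierarchy and nesting** | ❌ | The page tends to be flattened into a pile of `div` elements; it looks right but the structure is loose. |
| **Interaction and page states** | ❌ | Without Hover, Loading, Empty, and Error, the page is like a static poster. |
| **Responsive rules** | ❌ | The layout breaks on mobile, text is squeezed, and modules collide. |
| **Design tokens** | ❌ | Colors, font sizes, and spacing can only be guessed, which makes it hard to plug into a unified system later. |
| **Real data boundaries** | ❌ | As soon as placeholders are replaced with long text, cards overflow. |

The point of this table is simple: **a single image is not unusable, but it cannot serve as the source on its own.** It can help you settle the style, the mood, and the visual density, but the hardest part of delivering a page is not "does it look good" but "can it be implemented, can it be maintained, can it keep being extended".

## Why image-to-code always drifts

The real difficulty of image-to-code is not "translating the image into code" but "guessing the structure from the visual surface". It sees three very similar cards but does not necessarily know they should be extracted into one component. It sees a very bright button but does not necessarily know whether it is the primary business action or a secondary one. It sees a glow effect but does not necessarily know it is only decoration and should not become complex absolute positioning.

So you get a very typical result: generation is fast, and rework is just as fast. At first glance it seems usable, at second glance you start changing it, and by the third you are refactoring by hand. The problem is not that the tool cannot draw but that it does not know what to understand first. It is guessing structure, not translating structure.

## The key to breaking out is to add the UI Spec first

If you only jump back and forth between "image" and "code", you will most likely end up hand-editing again. A more stable approach is to add a UI Spec layer in between. This is not concept packaging. It addresses exactly the layer of structural judgment that AI-generated UI images most often lack at delivery time.

This layer does not need to be written in a mysterious way either. It is essentially a structured interface specification, which can be YAML or JSON. The format is not the point; the point is to write the engineering judgments down first:

1. What problem the page has to solve.
2. What the primary and secondary paths are.
3. Which modules can be reused.
4. Which states must be covered.
5. How the responsive rules change.

Write the structure first, then talk about looks. Below is the minimal template I now write first:

```yaml
page:
  name: Page name
  purpose: Page purpose
  target_user: Target user
  primary_action: The user's core action
  layout_type: dashboard / form / list / detail / landing
sections:
  - name: Module name
    purpose: What the module is for
    priority: high / medium / low
    component_type: Card / Table / Form / Tabs / Chart
states:
  - loading
  - empty
  - error
  - normal
responsive:
  desktop: Desktop layout
  tablet: Tablet layout
  mobile: Mobile layout
visual_tokens:
  color_style: Visual style
  density: Information density
  radius: Corner radius rules
  spacing: Spacing rules
acceptance:
  - Acceptance criterion 1
  - Acceptance criterion 2
```

With this layer in place, when a model such as gpt-image-2 produces an image, it is no longer "drawing a pretty picture by feel" but doing visual exploration around an already constrained structure. The role of the image becomes lighter too: it is more like a mood board, not the only standard.

## An example with a trading dashboard

Say I want to build a trading dashboard. I would not just say "generate a premium trading back-office page for me". That is too empty, and the AI can easily produce a page that looks cool but is not necessarily usable. I would first write it like this:

```yaml
page:
  name: Trading Dashboard
  purpose: Help users quickly check account risk, positions, and trading signals
  target_user: Individual traders / professional traders
  primary_action: Check current account risk
  layout_type: dashboard
sections:
  - name: Account overview
    component_type: Metric Cards
    priority: high
  - name: Risk alerts
    component_type: Alert Card
    priority: high
  - name: Positions table
    component_type: Data Table
    priority: medium
  - name: Signal panel
    component_type: Signal Cards
    priority: medium
states:
  - loading
  - empty
  - error
  - normal
```

This way the AI at least knows that the page is not simply displaying data but has to help the user judge account risk quickly. Risk alerts and account overview are high priority, the positions table and the signal panel are medium priority, and the page must also account for the four states loading, empty, error, and normal.

Generating the visual image from this UI Spec afterwards gives much more stable results than "generate some trading back office". Generating code from this UI Spec plus the reference image is also easier to deliver than handing a single image to an image-to-code tool.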
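Because the UI Spec is structured data, the "have the AI check the spec" step can also be partly mechanized. The sketch below assumes the spec has already been parsed from YAML or JSON into an object; the `UiSpec` field names mirror the template above in camelCase, and the specific checks (non-empty purpose, at least one high-priority section, all four states present) are illustrative choices rather than a fixed standard.

```typescript
type Priority = "high" | "medium" | "low";
type PageStateName = "loading" | "empty" | "error" | "normal";

interface SpecSection {
  name: string;
  componentType: string;
  priority: Priority;
}

// Hypothetical in-memory form of the UI Spec template.
interface UiSpec {
  page: { name: string; purpose: string; targetUser: string; primaryAction: string; layoutType: string };
  sections: SpecSection[];
  states: PageStateName[];
}

const REQUIRED_STATES: PageStateName[] = ["loading", "empty", "error", "normal"];

// Return a list of human-readable problems; an empty list means the spec passes.
function validateUiSpec(spec: UiSpec): string[] {
  const problems: string[] = [];
  if (spec.page.purpose.trim() === "") problems.push("page.purpose is empty");
  if (spec.sections.length === 0) problems.push("no sections defined");
  if (!spec.sections.some((s) => s.priority === "high")) problems.push("no high-priority section");
  for (const state of REQUIRED_STATES) {
    if (spec.states.indexOf(state) === -1) problems.push(`missing state: ${state}`);
  }
  return problems;
}

const tradingDashboard: UiSpec = {
  page: {
    name: "Trading Dashboard",
    purpose: "Help users quickly check account risk, positions, and trading signals",
    targetUser: "Individual traders / professional traders",
    primaryAction: "Check current account risk",
    layoutType: "dashboard",
  },
  sections: [
    { name: "Account overview", componentType: "Metric Cards", priority: "high" },
    { name: "Risk alerts", componentType: "Alert Card", priority: "high" },
    { name: "Positions table", componentType: "Data Table", priority: "medium" },
    { name: "Signal panel", componentType: "Signal Cards", priority: "medium" },
  ],
  states: ["loading", "empty", "error", "normal"],
};

console.log(validateUiSpec(tradingDashboard));
```

A check like this does not replace human review of priorities and wording, but it catches a missing state before any image or code is generated from the spec.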
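![Figure 3](https://mmbiz.qpic.cn/sz_mmbiz_png/76wHjqI4MlSF2jEicib1dZgZKydhBKJ43IKhrZFkZic39ueIicJ6iax44Jhr1YUtwf5n0qvCiaEicNqL8zCicTXricTnRpovpH7Um51PEZpdHYRrLNe8/640?from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=2)

## How do you use this template?

My advice is simple: four steps are enough.

**First fill in a version yourself.** Do not aim for perfection; write down the page goal, the core user, and the main modules.

**Then have AI check it.** You can ask it directly: does this UI Spec lack modules, states, or responsive rules? Is anything in it unsuitable for engineering delivery?

**Then use the UI Spec to generate the visual image.** Only now do you let AI do visual exploration, rather than letting it decide the page structure out of thin air.

**Finish by generating code from the UI Spec plus the reference image.** At the code stage, do not aim only at reproducing the image; build components according to the UI Spec.

In my experience, as long as this earlier step is written clearly, AI-generated pages become much more stable. The result may not be perfect in one pass, but at least it will not drift completely.

Page requirements → UI Spec → visual reference → code implementation → screenshot correction

![Image](https://mmbiz.qpic.cn/mmbiz_png/76wHjqI4MlTmYzqso2xz6qBJIHibSP0Suy42aXYDBVQnL49JyR242Z2EQ5ibpJOZTwthuibZaaXt5TEzmOEjEZ1tibhviaqST5hBOPT65ibK8OVQ4/640?wx_fmt=png&from=appmsg&watermark=1&tp=webp&wxfrom=5&wx_lazy=1#imgIndex=3)

What to do next: if you are also working on AI UI, save this UI Spec template, and next time do not start from a single image. If you have already hit the "the image looks great, the page drifts as soon as it is built" problem, leave the step that blocks you most in the comments; describing the problem clearly is enough. If you want the follow-up breakdowns, forward this to a colleague who is building pages; the next article continues with how to get the structure right. The complete UI Spec template is in the resource pack: follow the official account and reply `UI工作流` to get it.

\- END -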

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Sat, 02 May 2026 14:44:50 GMT

Jiahang Lin¹∗‡, Shichun Liu¹∗‡, Chengjun Pan²∗‡, Lizhi Lin³, Shihan Dou¹, Xuanjing Huang¹, Hang Yan³, Zhenhua Han³†, Tao Gui¹†

¹ Fudan University ² Peking University ³ Shanghai Qiji Zhifeng Co., Ltd

###### Abstract

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: ❶ *component observability* gives every editable harness component a file-level representation so the action space is explicit and revertible; ❷ *experience observability* distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and ❸ *decision observability* pairs every edit with a self-declared prediction, later verified against the next round’s task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at $12\%$ fewer tokens than the seed, and on Terminal-Bench 2 it yields $+5.1$ to $+10.1$ pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning.
Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving alongside their base models.

![Refer to caption](https://arxiv.org/html/2604.25850v3/x1.png)

Figure 1: AHE evolves a bash-only seed past every human-designed and self-evolving baseline on Terminal-Bench 2. All three role agents share one base model, isolating the gain to harness edits rather than analyzer or editor capability.

## 1 Introduction

Coding agents are increasingly deployed on long-horizon software-engineering tasks, with measurable progress on issue resolution over real-world code repositories [^14] [^46] [^7] and multi-step terminal workflows [^21]. In practice, such progress relies not only on the underlying language model, but equally on the surrounding engineering components: the system prompt that shapes work style, the tools that expose the file system and shell, and the middleware that controls context, execution, and recovery. This collection of model-external, editable components is collectively referred to as the agent’s *harness* [^30] [^18] [^42] [^45] [^33] [^31]. Harness design materially shifts task completion on long-horizon coding benchmarks, even with the base model held fixed [^40] [^42], making harness engineering a first-class lever for improving coding agents. Moreover, the optimal harness is model-specific: a harness tuned for one base model often underperforms on another and must be re-adapted as the base model changes. In current practice, this adaptation is performed manually: developers inspect trajectories, identify recurring failure patterns, and hand-craft edits across prompts, tools, middleware, and skills.
Yet as base models advance rapidly [^39] [^38] [^44] [^6] [^36] [^35], this manual loop struggles to keep pace, creating a widening gap between model capability and the harness needed to realize it [^33]. An intuitive direction is to automate this loop with an evolution agent that optimizes harness components based on experience [^1] [^49] [^4]. However, few existing approaches jointly evolve the full set of editable components [^16]; most focus on a single component, typically the prompt [^32] [^50] [^20], skills [^19] [^43], or an in-context playbook [^49]. Jointly evolving multiple components end-to-end faces two structural obstacles: long, unstructured trajectories yield little actionable signal, and tightly coupled harness frameworks make edits beyond the prompt error-prone. This leaves the central question of agent-driven harness evolution open: How can an evolution agent jointly and stably evolve all editable components of a coding agent’s harness? Our central insight is that this question is bottlenecked by *observability*, not by agent capability: once the evolution agent receives structured context over a clear action space, it can reliably converge on better harness designs [^34] [^53]. We implement this in Agentic Harness Engineering (AHE, Figure 2), a closed loop driven by three observability pillars: ❶ *component observability* via a decoupled harness that exposes seven editable component types as files, so each failure pattern maps cleanly to a single component class; ❷ *experience observability* via a layered, drill-down evidence corpus distilled from millions of raw trajectory tokens, so the evolver consumes structured root causes rather than raw logs; and ❸ *decision observability* via a change manifest that pairs every edit with a self-declared prediction, later verified against the next round’s task-level outcomes, so each edit becomes a falsifiable contract and ineffective ones are reverted at file granularity. 
We empirically validate AHE on Terminal-Bench 2 [^21]: ten iterations lift pass@1 from 69.7% to 77.0%, surpassing the human-designed Codex CLI [^25] and the self-evolving baselines ACE [^49] and TF-GRPO [^4]. Without further evolution, the frozen harness transfers to SWE-bench-verified [^14], and across three alternate base-model families it yields consistent pass@1 gains of $+5.1$ to $+10.1$ pp, with the largest on bases further from saturation, suggesting that AHE encodes coordination patterns that less-saturated models lean on more heavily. A component ablation pinpoints where this gain lives: tools, middleware, and long-term memory each carry the improvement on their own, while the system prompt alone regresses, indicating that factual harness structure transfers across tasks and models whereas prose-level strategy does not.

This paper makes three contributions:

- We formulate *agent-driven harness evolution* for coding agents and propose AHE, which identifies *observability across components, trajectories, and decisions* as the design pivot and turns every harness edit into a falsifiable, file-level contract through three observability pillars: a decoupled component substrate, a layered trajectory-distillation pipeline, and a change manifest whose self-declared predictions are verified by next-round task deltas.
- We empirically show that AHE lifts pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpasses human-designed and automated baselines, and produces a frozen harness that transfers across benchmarks and base-model families.
- Our analysis reveals two limits of agent-driven evolution: harness components interact non-additively, so stacking effective edits caps the aggregate gain; and the loop’s self-attribution is reliable for fixes but blind to regressions, pinpointing regression foresight as the clearest direction for future self-evolution loops.
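To make the idea of a prediction-carrying change manifest concrete, here is a minimal sketch of how one manifest entry could be checked against next-round task deltas. It is an illustration under stated assumptions, not the paper's implementation: task outcomes are simplified to one pass/fail flag per task, the names `ManifestEntry` and `judge_edit` are hypothetical, and the keep/revert rule (keep an edit if at least one predicted fix materialized) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ManifestEntry:
    """One edit's self-declared prediction (field names are illustrative)."""
    edit_id: str
    predicted_fixes: set[str] = field(default_factory=set)
    predicted_regressions: set[str] = field(default_factory=set)


def observed_deltas(before: dict[str, bool], after: dict[str, bool]) -> tuple[set[str], set[str]]:
    """Return (tasks that flipped fail -> pass, tasks that flipped pass -> fail)."""
    fixed = {task for task, ok in after.items() if ok and not before.get(task, False)}
    broken = {task for task, ok in after.items() if not ok and before.get(task, False)}
    return fixed, broken


def judge_edit(entry: ManifestEntry, before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Intersect the predicted sets with the observed deltas to get a per-edit verdict."""
    fixed, broken = observed_deltas(before, after)
    confirmed = entry.predicted_fixes & fixed
    return {
        "edit_id": entry.edit_id,
        "confirmed_fixes": confirmed,
        "missed_fixes": entry.predicted_fixes - fixed,
        "predicted_regressions_hit": entry.predicted_regressions & broken,
        "unpredicted_regressions": broken - entry.predicted_regressions,
        # Assumed rule for this sketch only: keep the edit if any predicted fix came true.
        "keep": bool(confirmed),
    }


# Toy outcomes for three hypothetical tasks, before and after one edit.
round_before = {"task-a": False, "task-b": True, "task-c": False}
round_after = {"task-a": True, "task-b": False, "task-c": False}
entry = ManifestEntry("edit-001", predicted_fixes={"task-a", "task-c"},
                      predicted_regressions={"task-b"})
print(judge_edit(entry, round_before, round_after))
```

With a single edit per round the attribution is unambiguous; when several edits land together, the same observed delta can match more than one entry, which is one reason the verdict is kept per edit rather than per round in this sketch.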
## 2 Related Work ### 2.1 Harness Engineering and Evaluation for Coding Agents Harness engineering refers to the practice of designing the system surrounding the model, including its tools, interfaces, memory, execution constraints, and feedback loops, which together shape what an agent can do on long-horizon tasks [^30] [^18] [^40] [^3] [^33] [^31]. Concretely, the harness mediates how the model perceives and acts on its environment: it exposes the action and observation interfaces over which tool-augmented reasoning unfolds [^3], custom agent-computer interfaces for repository navigation, file editing, and command execution [^45], as well as sandboxed execution and orchestration support that keep long-horizon runs reproducible [^42]. Verifying that such systems actually help has driven the parallel maturation of coding-agent evaluation along two axes: task horizon and environmental realism. Coverage extends from short-horizon function-level benchmarks focused on contamination and freshness control [^52] [^12], through repository-scale executable patch resolution [^14] [^46] [^7], to multi-hour, terminal-driven workflows that exercise long-horizon, realistic execution [^22] [^5] [^21]. A parallel infrastructure track packages executable runtimes and verifiers around these benchmarks [^28] [^13] [^47], whose attention to reproducible, traceable, and verifiable execution directly motivates the observation system AHE builds on. ### 2.2 Automated Optimization of LLM Agents Approaches to automated agent optimization differ in what evidence the optimizer observes and what it can edit. Some revise the agent’s own outputs through episodic critique and reflection [^20] [^32] [^9]. Others target prompts and instructions [^15]: structured playbooks [^49], semantic-advantage priors [^4], jointly optimized instruction-demonstration pipelines for multi-stage programs [^27], and reflective updates driven by Pareto-frontier traces [^1]. 
A separate line edits program structure itself, in the form of skill libraries [^41], scored program and agent archives evolved through mutation [^24] [^11], and graph-structured workflows searched or learned from rollouts [^48] [^51]. AHE tunes the full harness as a combinatorial whole rather than a single editable surface, so cross-component trade-offs become legible to the optimizer. It also keeps the human prior minimal, leaving methodology for the optimizer to discover from rollouts rather than fixing it by hand. We describe the substrate, trajectory analysis, and iteration that realize these choices in Section 3. ## 3 Method AHE turns harness optimization into a closed loop driven by another agent, with the base model held fixed and only the explicit harness edited. Our design principle is that every phase of this loop must be *observable*: AHE faithfully records the artifacts each phase produces (the harness components an iteration writes, the rollout trajectories it generates, the edit decisions it commits) and represents them in structured, layered forms that another agent can read and act on. Three observability layers implement this principle. Component observability (§3.1) is realized by a decoupled, file-level harness substrate that maps each failure pattern to a single component class. Experience observability (§3.2) is realized by a layered evidence corpus distilled from raw rollouts and indexed for drill-down access. Decision observability (§3.3) is realized by a change manifest that pairs every edit with a self-declared prediction the next round verifies. The three layers compose into the iteration of Algorithm 1, which runs unattended round after round. ### 3.1 NexAU: an editable, decoupled harness substrate ![Refer to caption](https://arxiv.org/html/2604.25850v3/x2.png) Figure 2: The AHE pipeline links three observable surfaces into one closed loop. 
Components, rollout experience, and edit decisions each surface as structured artifacts another agent reads, and every edit becomes a falsifiable prediction the next round verifies.

We instantiate the harness $H$ on the NexAU framework [^23] [^37], which exposes seven orthogonal component types as explicit files at fixed mount points in a single workspace: system prompt, tool description, tool implementation, middleware, skill, sub-agent configuration, and long-term memory. The component types are loosely coupled, so adding a middleware does not require editing the system prompt, and adding a skill does not require touching any tool. This decoupling is what realizes component observability: each failure pattern maps to a single component class, giving the Evolve Agent a clean action space and localizing every pass-rate change to one file rather than scattering it across hundreds of lines of unstructured prompt prose. Each logical edit becomes one commit on the workspace’s git history, which yields file-level diffs and rollback granularity for free. Our seed harness $H_{0}$ is deliberately minimal: a single shell-execution tool, no middleware, no skills, no sub-agents. A seed already fitted to the target benchmark would contaminate every subsequent edit’s attribution, since we could not tell whether a gain came from the loop or from the seed. The minimal seed forces every component AHE adds to earn its place against measured rollouts.

### 3.2 Agent Debugger: layered trajectory evidence

We generate $k$ traces for each task in a benchmark using a harness $H$. These traces may contain errors that stem from actionable deficiencies of the harness, but such errors are scattered across millions of tokens of raw messages.
To extract insights from agent trajectories and enable experience observability, we apply the Agent Debugger [^17] framework, in which an agent explores trajectories framed as a navigable, file-based environment: each trajectory message lives in its own file and is reached through generic shell and scripting tools. Traces with the same query are placed in one environment, and the debugger is required to analyze the root cause of the failure or the success pattern, which is stored in a *per-task analysis* report for each task. The analysis also includes the pass/fail status of the task to ground the Evolve Agent. Finally, a *benchmark-level overview* is aggregated from every report into a single document that serves as the entry point for every iteration. In addition to these reports, we also provide the *original* traces in case the agents need to verify the claims in the reports. The traces are provided both in raw form and in a lightly processed form with unnecessary content removed. All of this content is provided as files, allowing progressive disclosure [^29], which saves tokens and enables better agent decisions.

### 3.3 Evolve Agent: evidence-driven, auditable edits

The Evolve Agent closes the AHE loop. In each round it reads the layered evidence corpus produced by the Agent Debugger, decides which harness components to add, modify, or remove, applies those edits to the workspace, and records the reasoning behind every edit. Two constraints govern these edits, and together they realize decision observability: every edit becomes a falsifiable, file-level claim recorded in a versioned manifest, and the next round’s verdict either confirms or reverts it. The first constraint is controllability: the Evolve Agent writes only inside the harness workspace, while the runs directory, tracer, verifier, and LLM configuration are read-only, and the seed system prompt (Appendix B.1) is marked non-deletable.
These restrictions block the shortcuts an unconstrained self-modifier would take, such as disabling the verifier, swapping the model, or raising the reasoning budget, and keep every recorded gain attributable to harness edits. The second constraint is that every change is evidence-driven and ships with a recorded prediction. Each edit attaches a manifest entry that names the failure evidence, the inferred root cause, the targeted fix, and a predicted impact comprising both expected fixes and at-risk regressions; this manifest is the loop’s evidence ledger (see Appendix B.2). In the next round, the loop intersects the predicted-fix and predicted-regression sets with the observed task-level deltas to produce a per-edit verdict. Each edit thereby becomes falsifiable by the next evaluation, which replaces rationale-driven self-justification with a measurable contract between rounds.

**Algorithm 1** AHE outer loop.

**Input:** seed harness $H_{0}$, base model $M$, benchmark $D$, rollouts per task $k$, max iterations $N$

- $H_{\text{best}}\leftarrow H_{0}$
- **for** $t=1$ **to** $N$ **do**
  - $T_{t}\leftarrow\textsc{Rollout}(M,H_{t-1},D,k)$ $\triangleright$ phase 1: $k$ rollouts per task
  - $\widetilde{T}_{t}\leftarrow\textsc{Clean}(T_{t})$ $\triangleright$ phase 2: drop base64, dedup tool output
  - **if** $t\geq 2$ **then** $\triangleright$ phase 3: attribute prior manifest, then rollback
    - $V_{t}\leftarrow\textsc{Attribute}(C_{t-1},T_{t-1},T_{t})$
    - $H_{t-1}\leftarrow\textsc{Rollback}(H_{t-1},V_{t})$
  - **else**
    - $V_{t}\leftarrow\emptyset$
  - **end if**
  - $R_{t}\leftarrow\textsc{AgentDebugger}(\widetilde{T}_{t})$ $\triangleright$ phase 4: layered distillation
  - $(H_{t},C_{t})\leftarrow\textsc{Evolve}(H_{t-1},R_{t},V_{t})$ $\triangleright$ phase 5: workspace edits + new manifest
  - $\textsc{Commit}(H_{t},C_{t},t)$ $\triangleright$ phase 6: tag iteration in git
  - **if** $\textsc{Pass@1}(T_{t})>\textsc{Pass@1}(H_{\text{best}})$ **then**
    - $H_{\text{best}}\leftarrow H_{t}$
  - **end if**
- **end for**
- **return** $H_{\text{best}}$

Algorithm 1 composes the three substrates into
one iteration: rollout, clean, attribute the prior manifest and revert rejected edits, distill, edit, commit. We run $k\geq 2$ rollouts per task so each task carries a pass-rate signal, which stabilizes pass@1 and lets partial-pass tasks anchor comparative diagnosis. Attribution runs *before* distillation, so its verdict lands inside the evidence corpus and binds each prior manifest entry as a contract rather than a rationale. A one-shot explore agent (Appendix B.3) runs in parallel with iteration $1$ to seed a small number of reusable skills from the NexAU source and public coding-agent references. These skills receive no special protection: from iteration $2$ onward the Evolve Agent may keep, refine, or remove them based on observed rollouts. ## 4 Experiments We organize our empirical study around three questions: where AHE sits on the map of existing approaches to harness design, whether what it produces is portable beyond its optimization target, and what inside the loop drives the gain. ### 4.1 Setup ##### Evaluation. We drive evolution on the full 89 tasks of Terminal-Bench 2 [^21], split as 4 easy, 55 medium, and 30 hard, with per-task timeout extended to 1 hour. For cross-benchmark transfer we evaluate the AHE harness on SWE-bench-verified [^14], 500 tasks across seven repositories. We report two metrics per configuration: pass@1, the mean binary success rate over $k$ rollouts per task; and tokens/trial, the mean per-trial total of prompt plus completion tokens across all LLM calls, in thousands. Infrastructure-aborted or timed-out trials count as failures under pass@1 (matching the official terminal-bench leaderboard) and are excluded from token means to avoid truncated figures. Runtime infrastructure (framework, dispatcher, sandbox, tracer, and concurrency) is detailed in Appendix A. ##### Models. 
For both the evolution loop and the main experiment of §4.2, all three role agents (the Code Agent, the Agent Debugger, and the Evolve Agent) share one base model, GPT-5.4 [^26] at the high reasoning setting. For cross-model transfer (§4.3), we re-evaluate the Code Agent on five alternate bases: GPT-5.4 at medium and xhigh reasoning, qwen-3.6-plus [^38] [^44], gemini-3.1-flash-lite-preview [^8], and deepseek-v4-flash [^6]. ### 4.2 RQ1: Main Results Table 1: Pass@1 on Terminal-Bench 2 across 89 tasks, by official difficulty. NexAU ₀ is the shared seed; ACE, TF-GRPO, and AHE are three self-evolution loops layered on top of it. Bold marks the best per column; ties are all bold.

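| Method | All | Easy | Med. | Hard |
| --- | --- | --- | --- | --- |
| # tasks | 89 | 4 | 55 | 30 |
| *Human-designed harness* | | | | |
| opencode | 47.2% | 75.0% | 52.7% | 33.3% |
| terminus-2 | 62.9% | 75.0% | 74.5% | 40.0% |
| Codex | 71.9% | 75.0% | 80.0% | **56.7%** |
| *Self-evolved from NexAU ₀* | | | | |
| NexAU ₀ | 69.7% | 87.5% | 78.2% | 51.7% |
| ACE | 68.9% | 91.7% | 78.2% | 48.9% |
| TF-GRPO | 72.3% | **100.0%** | 79.4% | 55.6% |
| AHE | **77.0%** | **100.0%** | **88.2%** | 53.3% |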
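
The pass@1 and tokens/trial metrics defined in §4.1 can be sketched as follows. This is a minimal illustration of the stated definitions, not the authors' evaluation code: the `Trial` record and its field names are hypothetical, and the sample trials are invented purely to exercise the functions. Aborted or timed-out trials count as failures for pass@1 and are left out of the token mean, as the setup describes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One rollout of one task (hypothetical record layout)."""
    task_id: str
    passed: bool
    prompt_tokens: int
    completion_tokens: int
    aborted: bool = False  # infrastructure abort or timeout


def pass_at_1(trials: list[Trial]) -> float:
    """Mean over tasks of the per-task success rate; aborted trials count as failures."""
    by_task: dict[str, list[float]] = {}
    for trial in trials:
        success = 1.0 if (trial.passed and not trial.aborted) else 0.0
        by_task.setdefault(trial.task_id, []).append(success)
    return mean(mean(results) for results in by_task.values())


def tokens_per_trial_k(trials: list[Trial]) -> float:
    """Mean prompt + completion tokens per non-aborted trial, in thousands."""
    totals = [t.prompt_tokens + t.completion_tokens for t in trials if not t.aborted]
    return mean(totals) / 1000


# Invented sample: two tasks, k = 2 rollouts each, one rollout aborted.
sample = [
    Trial("task-a", True, 100_000, 20_000),
    Trial("task-a", True, 90_000, 10_000),
    Trial("task-b", False, 200_000, 50_000),
    Trial("task-b", False, 5_000, 0, aborted=True),
]
print(f"pass@1 = {pass_at_1(sample):.1%}")
print(f"tokens/trial = {tokens_per_trial_k(sample):.1f}k")
```

Excluding aborted trials from the token mean keeps a run that died after a few thousand tokens from making a configuration look cheaper than it is, while still charging that run as a failure.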

We run a single AHE campaign of ten iterations from the bash-only NexAU ₀ seed (§3.1), with $k{=}2$ rollouts per task per iteration on Terminal-Bench 2, finishing in roughly 32 hours; the best resulting configuration is reported as AHE. The two self-evolve baselines ACE [^49] and TF-GRPO [^4] start from the same NexAU ₀ seed. ##### AHE outperforms both human-designed and self-evolve baselines. AHE outperforms every baseline on our panel: three human-designed harnesses, opencode [^2], terminus-2 [^10], and Codex-CLI [^25], and the two self-evolve baselines ACE and TF-GRPO. Figure 1 shows the gain accumulates across iterations, with continued evolution pushing pass@1 further above the NexAU ₀ seed. By difficulty, the only exception is the Hard tier, where AHE marginally trails Codex-CLI. We trace this gap to interference between AHE’s components on long-horizon tasks rather than to a missing capability: swapping AHE’s long-term memory alone into the NexAU ₀ seed, without the other AHE components, already surpasses Codex-CLI on Hard (§4.4.1). ##### Prompt-only self-evolution misses the components that carry AHE’s gain. The gaps to ACE and TF-GRPO trace to a layer mismatch. ACE distills natural-language playbooks the agent reads in-context, and TF-GRPO is a trajectory-feedback variant of GRPO that reinforces successful tool sequences; starting from the same NexAU ₀ seed as AHE, neither method opens the surrounding scaffolding to edits. AHE jointly evolves system prompt, tools, middleware, and long-term memory across iterations, and §4.4.1 quantifies which of these layers carries the improvement: swapping in AHE’s tools, middleware, or long-term memory alone yields $+3.3$, $+2.2$, and $+5.6$ pp, while the system prompt alone is $-2.3$ pp. The harness components ACE and TF-GRPO never edit are exactly where the gain lives. ### 4.3 RQ2: Transfer to Unseen Tasks and Base Models AHE’s harness is evolved on Terminal-Bench 2 with GPT-5.4 high. 
We probe whether it encodes general coding-agent experience or overfits to that target by re-using the workspace as-is, without further evolution, in two off-target settings: a different task surface (SWE-bench-verified) and four alternate base models. Table 2: Cross-benchmark transfer on SWE-bench-verified. ACE, TF-GRPO, and AHE share the NexAU ₀ seed and differ only in their self-evolution loop; all four columns run on GPT-5.4. AHE and the two self-evolve baselines are evolved on Terminal-Bench 2 and evaluated without in-domain re-evolution. Per-column bold marks the best; ties are all bold.

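| Repo | $N$ | Success rate ↑: ACE | Success rate ↑: TF-GRPO | Success rate ↑: NexAU ₀ | Success rate ↑: AHE | Tokens (k) ↓: ACE | Tokens (k) ↓: TF-GRPO | Tokens (k) ↓: NexAU ₀ | Tokens (k) ↓: AHE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All | 500 | 74.6% | 74.2% | 75.2% | 75.6% | 679 | 582 | 526 | 461 |
| django | 231 | 79.2% | 78.8% | 79.2% | 81.0% | 707 | 583 | 527 | 484 |
| sympy | 75 | 69.3% | 68.0% | 70.7% | 70.7% | 602 | 572 | 494 | 479 |
| sphinx-doc | 44 | 61.4% | 65.9% | 68.2% | 70.5% | 990 | 848 | 731 | 656 |
| matplotlib | 34 | 70.6% | 70.6% | 73.5% | 73.5% | 622 | 530 | 486 | 391 |
| scikit-learn | 32 | 93.8% | 93.8% | 93.8% | 87.5% | 451 | 378 | 307 | 257 |
| pydata | 22 | 77.3% | 77.3% | 77.3% | 72.7% | 563 | 516 | 386 | 338 |
| astropy | 22 | 59.1% | 59.1% | 54.5% | 50.0% | 546 | 470 | 667 | 277 |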
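
The "12% fewer tokens than the seed" figure can be reproduced from the token columns of Table 2. The snippet below is a small worked calculation, assuming the four token columns follow the same order as the success-rate columns (ACE, TF-GRPO, NexAU ₀, AHE); the function name is hypothetical and only three rows are copied in for illustration.

```python
def relative_token_saving(seed_tokens_k: float, evolved_tokens_k: float) -> float:
    """Fraction of tokens saved by the evolved harness relative to the seed."""
    return (seed_tokens_k - evolved_tokens_k) / seed_tokens_k


# (NexAU_0 seed, AHE) tokens per trial in thousands, copied from Table 2.
tokens_k = {
    "All": (526, 461),
    "django": (527, 484),
    "astropy": (667, 277),
}

for repo, (seed, ahe) in tokens_k.items():
    print(f"{repo}: {relative_token_saving(seed, ahe):.1%} fewer tokens")

aggregate_saving = relative_token_saving(*tokens_k["All"])  # 65 / 526, about 12.4%
success_delta_pp = round(75.6 - 75.2, 1)  # aggregate success: AHE minus seed, in pp
```

On the aggregate row this gives 65/526, roughly 12.4%, alongside a 0.4 pp gain in success rate, so the transfer result is mainly a token-efficiency gain at near-equal accuracy.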

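## Dark Factory Diaries #2: Stop Building Robot Teams

I staffed my AI pipeline the way I would staff a human team: product manager, engineer, QA, reviewer. It recreated bureaucracy. The fix was to drop the human analogy entirely.

When I first tried to implement a "software dark factory", an unattended pipeline in which agents build, test, and ship, I made a predictable mistake. I treated autonomy as a staffing problem. I created a "PM agent", an "engineer agent", a "QA agent", and a "reviewer agent", and then rebuilt a human team's workflow around them. It looked orderly, but it ran like a bureaucracy: too many handoffs, too much waiting, too much time spent switching context between steps.

## The bottleneck is not code generation

The more I studied teams that actually ship products in an agent-first mode, the clearer it became: the hard part is not code generation. It is harness engineering, that is, designing the environment that makes agent work reliable. The engineer's role is being redefined. Humans steer: they set priorities, define acceptance criteria, and validate outcomes. Agents execute: they generate code, tests, documentation, CI configuration, and even review loops. The codebase becomes the system of record for intent and constraints, structured so that agents can find what matters and ignore what does not.

## We are still imitating ourselves

This led me to a second realization: we are still in V1 of autonomous systems because we keep modeling them by analogy to ourselves. We ask, "What would a great engineer do?" and then try to simulate that person. But machines do not need meetings or persuasion. They need explicit intent, fast feedback, and enforceable constraints. At high throughput, traditional norms invert. Waiting becomes waste. Corrections become cheap. Processes must be built for iteration, parallelism, and mechanical enforcement: tests, custom linters, structural checks, and telemetry that agents can read and act on without ceremony.

## The factory is a verification pipeline

The "software factory" approach I have been studying is unromantic about this: specs and scenarios drive non-interactive agents that write code, run the harness, and converge on their own, without human review. That sounds reckless until you see the infrastructure behind it. Behavioral clones of third-party services let you run large numbers of scenarios, explore edge cases safely, and validate behavior without touching live systems. The factory is not a conversation. It is a verification pipeline that turns scenarios into evidence and evidence into releases.

## Trust must rest on evidence

Trust is the real obstacle. Lack of trust blocks production deployment, and the answer is automated reasoning. Neuro-symbolic approaches, which combine generative models with symbolic methods, can prove or disprove safety properties and correctness claims instead of relying only on "it passed our tests." In a dark factory, trust cannot be a feeling. It has to be evidence: checkable constraints, reproducible behavior, diagnosable failures.

So stop trying to make machines work like people. Start from first principles and redesign around what machines are good at. Treat specs, scenarios, harnesses, telemetry, and automated checks as first-class production assets. Optimize your codebase to be legible to agents. Encode architecture and "taste" as invariants. Fight entropy with continuous cleanup rather than heroic Friday overtime. A dark factory is not a team. It is a system.

## Further reading

* OpenAI on harness engineering: shipping a real product with zero human-written code
* StrongDM's software factory: the "Digital Twin Universe" approach to validation at scale
* Werner Vogels and Byron Cook on automated reasoning: why trust must come from proof, not feeling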

Dark Factory Diaries #1: Organization Theory, Error Math, and the Architecture of Autonomous Software Factories

Tue, 21 Apr 2026 06:17:35 GMT

Brightly Lit

Tue, 21 Apr 2026 06:09:39 GMT

Dark Factory Diaries #6: Why Pi as the Coding Agent

Tue, 21 Apr 2026 06:08:35 GMT

Dark Factory Diaries #4: Managing Context Within a Single Agent Run

Tue, 21 Apr 2026 06:07:57 GMT

Dark Factory Diaries #5: One Task per Context Window

Tue, 21 Apr 2026 06:07:19 GMT

Dark Factory Diaries #3: Mistake-Proofing the Factory

Tue, 21 Apr 2026 06:06:56 GMT