How I made "I watched it" a 90-second job

I have 41 YouTube videos bookmarked right now. I've watched maybe four of them all the way through.

This is not a discipline problem. It's a math problem. A good 45-minute interview costs 45 minutes (or about 23 even at 2x speed), and "I'll watch it later" is where good videos go to die. Meanwhile the actual signal in most of them is about six minutes long: two frameworks, three numbers, one slide worth screenshotting. The other 39 minutes are intro, sponsor read, and the host saying "that's so interesting" between the parts that matter.

So I built a small Claude Code skill that does the obvious thing. I paste a YouTube link. About 90 seconds later, there's a markdown note in my vault with the TL;DR, the 5 to 10 takeaways that actually earned a spot, each one deep-linked to the exact second it happens, screenshots of the slides the speaker pointed at, and the full transcript collapsed at the bottom for when I want to go deeper. It costs roughly four cents a video.

It's called yt-digest, and this week it's a public repo you can fork. Here's what it does, how it works, and the one design decision that took three tries to get right.


What this is not

There is a whole genre of AI content tooling right now built around one move: take a transcript, fan it out into 30 social posts. Reel scripts, carousels, LinkedIn posts, a newsletter, the works. One video in, a month of content out.

yt-digest does the opposite. It is not a repurposing factory and it does not produce a single thing you'd publish. It turns a video into knowledge you keep, not content you ship. The output goes into your second brain, gets indexed alongside everything else you've saved, and sits there searchable. Six months from now you grep "unlimited offer construct" and land on the exact 17-second clip that taught it to you, screenshot and all.

That distinction matters because the two jobs pull in opposite directions. Repurposing optimizes for volume and polish on the way out. A research digest optimizes for fidelity and recall on the way in. If you've ever watched a great talk, nodded along, and then completely failed to recall the one number that mattered when you needed it three weeks later, you already understand the problem this solves.


What it targets, and what you get back

The input is any YouTube URL. Long-form interview, conference talk, tutorial, Shorts link: it doesn't matter, as long as the video has captions (auto-generated is fine).

The output is two things written into your vault:

vault/research/youtube/<slug>.md          the digest
vault/attachments/youtube/<slug>/          frame-MM-SS.png screenshots
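If it helps to see that layout as code, here is a minimal sketch of how the two paths and a frame filename could be derived. The function names are mine, not the repo's, and how the skill names frames past the one-hour mark is an assumption.

```python
from pathlib import Path


def digest_paths(vault: Path, slug: str) -> tuple[Path, Path]:
    """Return (digest note, screenshot folder) for one video."""
    note = vault / "research" / "youtube" / f"{slug}.md"
    shots = vault / "attachments" / "youtube" / slug
    return note, shots


def frame_name(seconds: int) -> str:
    """Name a screenshot after its timestamp, e.g. 428 s -> frame-07-08.png.

    Assumption: minutes simply keep counting past 59 for long videos.
    """
    minutes, secs = divmod(seconds, 60)
    return f"frame-{minutes:02d}-{secs:02d}.png"
```

Calling digest_paths(Path("vault"), "my-talk") gives vault/research/youtube/my-talk.md and vault/attachments/youtube/my-talk.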

The digest has a fixed structure: frontmatter with the source, channel, length, and dates so it's filterable later; a TL;DR of 3 to 5 standalone bullets; a Key Takeaways section where each numbered point carries a timestamp deep-link back to the source video and, if the speaker was pointing at a slide or dashboard or chart, an embedded screenshot of that exact frame; an Open Questions section for the threads worth pulling; and the full transcript in a collapsed block at the bottom, every line timestamped so future search lands you at the right second.

The deep-links are the part that surprised me with how much I use them. A takeaway isn't just "the speaker said pricing should be outcome-based." It's that, plus a link that opens the video at 7:08 where they said it, plus the verbatim quote. When I'm skeptical of my own digest, one click and I'm watching the source. Trust, but verify, with no friction.
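The mechanics of a deep-link are small enough to sketch. This is an illustration, not the skill's own code: it assumes the short youtu.be link form with a t parameter in seconds, and the markdown shape of the bullet is my guess at a reasonable format.

```python
def to_seconds(stamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" to a number of seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total


def deep_link(video_id: str, stamp: str) -> str:
    """Build a link that opens the video at the given timestamp."""
    return f"https://youtu.be/{video_id}?t={to_seconds(stamp)}"


def takeaway_line(video_id: str, stamp: str, text: str) -> str:
    """Render one takeaway as a markdown bullet with a timestamp link."""
    return f"- [{stamp}]({deep_link(video_id, stamp)}) {text}"
```

For the 7:08 example, to_seconds("7:08") is 428, so the link ends in ?t=428.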

Here's a real one. I ran it on a 47:55 interview, Greg Isenberg talking to Nick from Orgo about the solo AI-agent business. Out came 8 TL;DR bullets, 10 numbered takeaways, and 6 screenshots of the slides where Nick laid out his tool stack and pricing tiers. Ninety seconds. Four cents. I read the whole thing in under three minutes and pulled two ideas straight into my own notes without ever opening the video. The video is still there if I want it. I just rarely need it now.


How it works

The whole thing is four files and two local tools. No SaaS, no API keys beyond the model you already run Claude Code on. yt-dlp pulls the metadata, the captions, and (when screenshots are on) a 720p copy of the video. ffmpeg grabs single frames at the timestamps that matter. A short Python script cleans up the transcript. Claude does the judgment part, reading the full transcript and deciding which 5 to 10 moments earn a bullet and which of those need a screenshot.
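To make the two local tools concrete, here is a rough sketch of the kind of commands involved, built as argument lists so nothing runs until you hand them to subprocess.run. The exact flags the skill passes are not shown in this post, so treat these as my assumptions: English captions, VTT format, a video-only stream capped at 720p, and an output template keyed on the video ID.

```python
from pathlib import Path


def caption_cmd(url: str, workdir: Path) -> list[str]:
    """yt-dlp call that fetches metadata and English captions, no video."""
    return [
        "yt-dlp", "--skip-download",
        "--write-info-json",                  # metadata as JSON
        "--write-subs", "--write-auto-subs",  # manual or auto captions
        "--sub-langs", "en", "--sub-format", "vtt",
        "-o", str(workdir / "%(id)s.%(ext)s"),
        url,
    ]


def video_cmd(url: str, workdir: Path) -> list[str]:
    """yt-dlp call that downloads a copy capped at 720p.

    Video-only is enough here because the file is only used for frames.
    """
    return [
        "yt-dlp",
        "-f", "bestvideo[height<=720]/best[height<=720]",
        "-o", str(workdir / "%(id)s.%(ext)s"),
        url,
    ]


def frame_cmd(video: Path, seconds: int, out_png: Path) -> list[str]:
    """ffmpeg call that writes the single frame at `seconds` to a PNG."""
    return [
        "ffmpeg", "-y",          # overwrite an existing screenshot
        "-ss", str(seconds),     # seek on the input side, before decoding
        "-i", str(video),
        "-frames:v", "1",        # stop after one video frame
        str(out_png),
    ]
```

Each list can be run with subprocess.run(cmd, check=True). Skipping video_cmd and frame_cmd altogether is what a captions-only run amounts to.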

That last step is where the value is, and it's worth saying why a human-written selection brief beats "summarize this video." The skill tells the model exactly what counts as a key moment: concrete frameworks, specific numbers and benchmarks, counterintuitive claims with their reasoning, and any moment where the speaker references something on screen. It also tells it what to skip: intros, outros, sponsor reads, and filler restatements. A generic summary flattens everything to the same gray paste. A brief with opinions about what's worth keeping produces a digest that reads like notes a sharp colleague took for you.


The bug that took three tries

Here's the one part I'd flag if you fork this. YouTube auto-captions roll. Each caption cue doesn't replace the previous one. It extends or overlaps it: "People are charging" becomes "People are charging $5,000 a month per" becomes "$5,000 a month per customer to build." If you naively dedupe by exact match, none of those lines are duplicates, so you keep all of them. My first run turned an 8,000-word video into a 25,000-word transcript with 2,283 cues, most of them the same sentence at six different stages of being typed out.

The fix is a rolling-merge pass. Walk the cues, and for each new one, check whether it's an extension of the line you're building (does it start with what you have, or does the tail of what you have overlap the head of the new one). If yes, splice them. If no, flush the line and start fresh. Then a second pass splits the merged blocks back into sentences and interpolates a timestamp for each one by word position, so the deep-links still point at the right second. It's about 40 lines of Python and it's the difference between a clean transcript and an unreadable one. The code is in parse_vtt.py if you want to lift it.
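Here is a minimal sketch of that two-pass idea. It is not the code from parse_vtt.py. It assumes the cues have already been parsed into (start, end, text) tuples in seconds with inline tags stripped, and that the text carries sentence punctuation. It also matches on whole words, so a single repeated word at a boundary will be treated as overlap.

```python
Cue = tuple[float, float, str]
Block = tuple[float, float, list[str]]


def overlap(have: list[str], new: list[str]) -> int:
    """Largest k where the last k words of `have` equal the first k of `new`."""
    for k in range(min(len(have), len(new)), 0, -1):
        if have[-k:] == new[:k]:
            return k
    return 0


def rolling_merge(cues: list[Cue]) -> list[Block]:
    """Pass 1: splice overlapping cues into blocks of (start, end, words)."""
    blocks: list[Block] = []
    words: list[str] = []
    start = end = 0.0
    for cue_start, cue_end, text in cues:
        new = text.split()
        if not new:
            continue
        k = overlap(words, new)
        if words and k == 0:          # no overlap: flush and start fresh
            blocks.append((start, end, words))
            words = []
        if not words:
            start = cue_start
        words = words + new[k:]       # keep only the part we have not seen
        end = cue_end
    if words:
        blocks.append((start, end, words))
    return blocks


def split_sentences(start: float, end: float, words: list[str]) -> list[tuple[int, str]]:
    """Pass 2: split a block into sentences with interpolated start times."""
    out, first = [], 0
    for i, word in enumerate(words):
        if word[-1] in ".?!" or i == len(words) - 1:
            # Place the sentence by the position of its first word.
            t = start + (end - start) * first / len(words)
            out.append((round(t), " ".join(words[first:i + 1])))
            first = i + 1
    return out
```

Fed the three cues from the example above, rolling_merge produces a single block reading "People are charging $5,000 a month per customer to build." instead of three near-duplicates.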


Running it from your phone

The skill takes all its input from command-line arguments, never stops to ask a question mid-run, and writes to a predictable path. Its last line of output is always DIGEST: <path>. That design is deliberate. Anything that can run a shell command can drive it: a cron job, a CI step, or a chat bot.
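A driver only needs to read that final line. Below is a small sketch of what the wrapper side could look like. The function names are hypothetical, and the command that launches the run is left to the caller, since how you start a headless run depends on your setup.

```python
import subprocess
from pathlib import Path
from typing import Optional

PREFIX = "DIGEST: "


def digest_path(output: str) -> Optional[Path]:
    """Return the path from a final 'DIGEST: <path>' line, or None if absent."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if lines and lines[-1].startswith(PREFIX):
        return Path(lines[-1][len(PREFIX):].strip())
    return None


def run_and_locate(cmd: list[str]) -> Optional[Path]:
    """Run any command that drives the skill and read the path it reports."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return digest_path(result.stdout)
```

A bot can post the note's TL;DR when digest_path returns a path, and report a failed run when it returns None.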

The version I actually use is wired to a chat bot. I drop a YouTube link in a channel from my phone while I'm away from my desk, a headless Claude Code run picks it up, and 90 seconds later the digest is in my vault and the bot posts the TL;DR back into the thread. Watching a 45-minute video became a thing I do at a bus stop.

One warning if you build that part: in a headless run, a small or cheap model will sometimes describe what the skill would do instead of actually running the scripts. Zero tool calls, a few paragraphs of confident narration, nothing written to disk. The interactive run is fine on a cheap model. The unattended path needs a more capable one. I lost a couple of runs to this before I figured out it was a model-tier issue, not a skill bug.
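One cheap guard against that failure is to check the disk instead of trusting the model's text. The sketch below is an assumption about what "a real digest" looks like: a file that exists and contains the section names described earlier. The exact heading strings in the skill's output may differ, so adjust REQUIRED to match.

```python
from pathlib import Path

# Assumed section names; change these to match the real digest headings.
REQUIRED = ("TL;DR", "Key Takeaways")


def run_wrote_digest(note: Path) -> bool:
    """True only if the note exists on disk and has the expected sections."""
    if not note.is_file():
        return False  # narrated run: nothing was written
    text = note.read_text(encoding="utf-8")
    return all(section in text for section in REQUIRED)
```

If this returns False after an unattended run, retrying on a more capable model is a sensible next step before debugging the skill itself.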


How to tune it for your own research

Three knobs, all in SKILL.md:

First, where digests land. The defaults are vault/research/youtube/ and vault/attachments/youtube/, which assume Obsidian. Change them to wherever your notes live. The skill writes plain markdown, so it works in any editor.

Second, what counts as a key moment. The selection brief is the soul of the thing, and it's just a list you can edit. If you research pricing, tell it to capture every dollar figure and offer structure. If you research code, tell it to screenshot every terminal and editor pane. The skill is generic out of the box; it gets sharp when you point it at your domain.

Third, whether to grab screenshots at all. Digesting a talking-head podcast with nothing on screen? Pass --no-screenshots and it skips the video download entirely, which drops the cost to about a penny and the time to a few seconds.

How to run it

Install the prerequisites (brew install yt-dlp ffmpeg), drop the skill into .claude/skills/yt-digest/, then inside Claude Code:

/yt-digest https://youtu.be/BI-MNjm1tTQ

That's the whole interface. The repo has the full install steps, the four files, and the MIT license. Fork it, change the paths, point it at the videos clogging your own bookmarks.

The bookmarks were never the problem. The 45 minutes were. This is how I got them back.

This is the kind of small, self-contained system I build for fun and then end up using every day. If you want the rest of them as they ship, the library is below, and the newsletter is where they land first.

Until next week,
The GTM Architects
