Give me an ETA for you and me together to write a Node.js app that runs locally on a PC, takes as input the HTML file of a Perplexity thread conversation page, and outputs a concise Markdown document containing only the conversation text, with each question post and response post preceded by a # or ## (to facilitate a TOC). I can assist with the HTML/CSS analysis by giving you examples of post texts to locate in the HTML.
You and I can get to a working first version in roughly 30–60 minutes of focused time, assuming you can provide one example HTML export of a thread and we iterate once or twice on selectors and formatting.
Agree on input/output format (5–10 min)
You confirm how you’ll get the HTML (Save Page As, browser “View Source”, etc.).
We define the exact Markdown structure:
One file out, UTF‑8 text.
Each user question as # Question N: ... (or similar).
Each assistant reply as ## Answer N directly after its question.
No citations, sidebars, or UI text, just conversation turns.
Inspect HTML structure (10–15 min)
You paste short snippets of the raw HTML for one user message and one assistant message so we can find stable CSS classes/attributes around:
Thread container
Individual “message bubbles”
Who is speaker (you vs assistant)
From that, we define the selectors (e.g., “all .message inside .thread”, “role=user/assistant” attributes, etc.).
Build basic Node.js parser (15–25 min)
Using Node 18+ and a DOM parser like cheerio:
Read HTML file from disk (fs.readFile).
Load into cheerio and select all message nodes in order.
For each message:
Detect speaker (user vs assistant) via class/attribute/text.
Extract the visible text only (strip buttons, citations, links unless you want to keep link text).
Accumulate turns and format Markdown headings (# for user, ## for assistant) into a string.
Write to a .md file next to the input.
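Writing the .md file next to the input only needs a small path derivation. A minimal sketch, assuming the input name ends in .html or .htm; defaultOutputPath is a hypothetical helper name, not something we have agreed on:

```javascript
// Hypothetical helper: derive the Markdown path that sits next to the input file.
// Assumes the input name ends in .html or .htm; any other name just gets ".md" appended.
function defaultOutputPath(inputPath) {
  const stripped = inputPath.replace(/\.html?$/i, '');
  return `${stripped}.md`;
}

console.log(defaultOutputPath('threads/perp_test.html')); // prints "threads/perp_test.md"
console.log(defaultOutputPath('notes.txt'));              // prints "notes.txt.md"
```

Keeping this separate from the parsing code means an explicit output path on the command line can simply override it.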
Refine and edge cases (10–20 min)
Fix any noise (e.g., “Citations”, “Related”, “Sources”) by refining selectors or ignoring certain sub‑elements.
Handle multiline content and inline formatting (e.g., keep lists and code blocks by preserving their text content).
Optionally add:
CLI arguments (node export.js input.html output.md)
Per‑thread title as a top‑level # Title.
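The noise cleanup from the refinement step could start as a plain line filter. This is only a sketch: the label set is taken from the examples above, and whether those labels really appear on lines of their own in an export is an assumption we would check against the real file. stripNoiseLines is a hypothetical name.

```javascript
// UI labels named above as typical noise; the exact set in a real export is an assumption.
const NOISE_LABELS = new Set(['citations', 'related', 'sources']);

// Hypothetical helper: drop lines that consist only of a known UI label.
// Lines that merely contain one of the words are kept.
function stripNoiseLines(text) {
  return text
    .split('\n')
    .filter(line => !NOISE_LABELS.has(line.trim().toLowerCase()))
    .join('\n');
}

const sample = 'First answer line.\nSources\nSecond answer line.\nRelated';
console.log(stripNoiseLines(sample));
```

Refining the selectors is still the cleaner fix; a text filter like this is a fallback for labels that survive extraction.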
Send, in the next message:
Node.js version you’re using (e.g., node -v).
A small, real snippet of the Perplexity thread HTML including:
One full user message block.
The assistant response right after it.
The minimal wrapping container around both.
Once I see that, I’ll draft the initial package.json and export-thread.js (or similar) file contents, and you can run it locally and report back what the output looks like.
Sounds good. You tell me which download method/format you think would be most conducive to locating the post text accurately.
For our use case, the best option is:
Save Page As with the HTML-only option (a single .html file, no assets).
That keeps the original markup structure intact and avoids the browser rewriting parts of the DOM for local assets, which makes the selectors more stable and predictable for parsing with Node + Cheerio.
If your browser offers multiple save options, choose the one equivalent to “HTML only”, not “Web page, complete”, MHTML, or PDF. Those other formats either change the markup more or wrap it in extra layers that add noise for parsing.
Once you’ve saved one thread like that, you can either:
Paste a trimmed HTML snippet (one user message + one assistant message + their container), or
Tell me the browser you’re using and the exact save option name you chose.
It would be great if we could visually distinguish between questions and answers in the output doc, but I don’t think it would be correct to nest one type inside the other, since both types are logically at the same level, with conversations always alternating between the two. I believe that answers in the input may contain various types of text formatting, tables, etc., but since questions are entered in HTML text areas they should be plain text. So maybe style questions as italic #s? It would be good if this formatting also gets used for the TOC entries generated by Local Markdown Viewer, to easily distinguish Q from A entries. Any other ideas for how to distinguish Qs from As?
A reliable way to distinguish Q vs A (and have that show in TOCs) is to encode the type directly in the heading text, not just via italics.
Here are a few workable patterns:
Prefix labels in headings (most robust)
Questions: # Q1 · your question text
Answers: ## A1 · answer summary or same question text
TOC entries will show the leading Q1 / A1, so you get clear visual grouping even in plain TOC lists.
Italics plus labels
Questions: # _Q1 · your question text_
Answers: ## A1 · answer summary
This keeps the “Q” headings visually lighter while still readable if a TOC or viewer strips formatting.
Symbol-based cue
Questions: # ❓ your question text
Answers: ## 💡 answer summary
Many TOC generators preserve emoji and symbols, so you get easy scanning even if italics/bold are stripped.
Level + label (no nesting)
Always alternate, but keep them at adjacent levels, e.g.:
Question: ## Q: your question text
Answer: ### A: immediately after
They’re not logically nested in the conversation sense, but many TOC tools visually indent ### under ##, giving an at-a-glance pair without actually merging types.
Given what you want:
I’d propose we default to:
## Q1 · your question text
### A1 · answer summary or same title
Or, if you strongly want same level:
## ❓ Q1 · your question text
## 💡 A1 · answer
Both will show Q vs A clearly in Local Markdown Viewer’s TOC because the distinction is encoded in the literal heading text, not in structure alone.
If you tell me:
Which of these styles you’d like, and
Whether your viewer shows heading markup (e.g., italics) in the TOC or strips it,
I’ll bake that scheme directly into how the Node script generates the headings.
#4
We’ll go with option 4: different heading levels plus Q/A labels, alternating but not nesting logically.
Concretely, for each turn pair we’ll generate:
Question heading and body:
## Q1: <question text>
(followed by any extra question lines, if present)
Answer heading and body:
### A1:
full answer content (with whatever Markdown formatting we can preserve from the HTML)
So the pattern for turn n will be:
## Q<n>: <question text>
### A<n>:
<answer content...>
This gives you:
Clear visual distinction in the document (Q as higher-level, A as “under” it visually).
TOC entries where Q and A are visually grouped, but questions and answers remain separate heading types.
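The per-turn pattern above could be emitted by a small formatter along these lines. This is a sketch, not the final script: formatTurn is a hypothetical name, and it assumes the question arrives as plain text and the answer has already been converted to Markdown.

```javascript
// Hypothetical helper: emit one question/answer pair in the "## Q<n>: / ### A<n>:" pattern.
// Assumes questionText is plain text and answerMarkdown is already Markdown.
function formatTurn(n, questionText, answerMarkdown) {
  const [firstLine, ...rest] = questionText.trim().split('\n');
  const parts = [`## Q${n}: ${firstLine}`];
  // Any extra question lines go below the heading, not inside it.
  const extra = rest.join('\n').trim();
  if (extra) parts.push(extra);
  parts.push(`### A${n}:`, answerMarkdown.trim());
  return parts.join('\n\n') + '\n';
}

console.log(formatTurn(1, 'How do I export a thread?\nDetails follow.', 'Save the page first.'));
```

Only the first question line lands in the heading, so the TOC entry stays short even when the question runs to several paragraphs.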
Next step: once you send a minimal HTML snippet for one question and its answer, I’ll draft the Node script that emits headings in exactly this format.
I have another feature wish (but not a hard requirement, in case it’s not feasible). The threads that these HTML pages are for have names/titles which don’t show anywhere on the visible HTML page, but are the identifiers used in Perplexity’s thread listings. I suspect that this name/label may occur in the HTML content somewhere even though it’s not displayed. If so, please try to extract it and use it as the top-level Markdown heading (or a similar appropriate Markdown directive for signifying the page title). Node version 25.6.1. I don’t think we’ve mentioned it explicitly, but the TOC would not be useful if the entry labels just reproduced the question and answer texts in their entirety. Please think up an algorithm to generate a useful short label for each question and answer, such as the first sentence or phrase from each question and answer. I am attaching the exported single HTML file. To help you identify the different elements in it:
* The thread name/title is precisely “Rendering web PDF files in Brave”.
* The single question consists of 2 text paragraphs, one beginning with “Usually or always” and ending with “than render them.” and the other beginning with “If it would help” and ending with “let me know.”
* The single response contains multiple display formats, but begins with the text “This behavior is” and ends with the text “and browsers respect it.” (with a “brave” hyperlink label).
It looks like the HTML file didn’t actually come through; I don’t see it attached in the conversation. Please try again, either by re-attaching it or pasting the raw HTML of the relevant sections (the <head>, one question block, and one answer block are enough to start).
In the meantime, here’s the design work I can do now:
Based on how Perplexity threads work, the title almost certainly lives in one or more of these locations in the <head>, which the script will try in priority order:
<title> tag (most reliable)
<meta property="og:title">
<meta name="twitter:title">
A <script type="application/ld+json"> block with a name or headline field
A __NEXT_DATA__ JSON blob (common in Next.js apps like Perplexity)
If none are found, the output file gets # (Untitled Thread) and continues gracefully.
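The priority-order fallback above can be sketched as a first-non-empty pick. This assumes the candidate strings have already been pulled out of the HTML by whatever selectors we settle on; pickTitle is a hypothetical name.

```javascript
// Hypothetical helper: return the first usable title from candidates listed in priority order.
// The caller is assumed to have already extracted each candidate string from the HTML.
function pickTitle(candidates, fallback = '(Untitled Thread)') {
  for (const candidate of candidates) {
    const cleaned = (candidate ?? '').replace(/\s+/g, ' ').trim();
    if (cleaned) return cleaned;
  }
  return fallback;
}

// Order mirrors the list above: <title>, og:title, twitter:title, JSON-LD, __NEXT_DATA__.
const exampleTitle = pickTitle(['', undefined, '  Example thread  ']);
console.log(`# ${exampleTitle}`); // prints "# Example thread"
console.log(`# ${pickTitle([])}`); // prints "# (Untitled Thread)"
```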
For each heading, a makeLabel() function will:
Normalize: collapse all whitespace and strip leading/trailing space.
First full sentence: find the first ., ?, or ! followed by whitespace; use everything up to and including it, if within 100 chars.
First phrase: if there is no sentence break, use the text up to the first ,, ;, or — break.
Hard truncation: if still too long, cut at the last word boundary before 100 chars and append ….
Applied to what you described:
| Turn | Raw start | Generated label |
|---|---|---|
| Q1 | “Usually or always…” | Usually or always (up to first sentence break) |
| A1 | “This behavior is…” | This behavior is (up to first sentence break) |
So your TOC will look like:
## Q1: Usually or always…
### A1: This behavior is…
Once you re-send the HTML, I’ll lock in the exact CSS selectors and write the full script. If you can paste just these three pieces from the raw file it’s enough to start:
The <head> block (for title extraction).
The outermost container of one user question, with its inner text visible.
The outermost container of the answer that follows it.
I forgot to attach the export. Here it is.
The diagnosis is clear. Here’s the situation:
The saved file is only 9,109 bytes and contains <main id="root"></main>, a completely empty root element. Perplexity is a React single-page application, so the conversation content is rendered entirely by JavaScript at runtime. A plain “Save Page As HTML Only” captures the static shell before JavaScript runs, so none of your conversation text, thread title, or any other content ends up in the file. The <title> and og:title both just say "Perplexity", not the thread name.
The correct capture tool for this is SingleFile (available for Chrome/Brave/Firefox). It specifically waits for the JavaScript to finish rendering, then serializes the live DOM, including all your conversation content, into one clean .html file. That’s the file format our Node script needs.
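To catch the empty-shell problem early, the script could check its input before parsing. A rough sketch, assuming the root element is always <main id="root"> as in the file you sent; it is a regex heuristic rather than real HTML parsing, and looksLikeEmptyShell is a hypothetical name.

```javascript
// Hypothetical check: does the saved file contain only the empty app shell?
// Assumes the root element is <main id="root">, as in the export examined above.
function looksLikeEmptyShell(html) {
  const match = html.match(/<main\b[^>]*\bid="root"[^>]*>([\s\S]*?)<\/main>/i);
  // An existing root element with nothing but whitespace inside means nothing was rendered.
  return match !== null && match[1].trim() === '';
}

console.log(looksLikeEmptyShell('<body><main id="root"></main></body>'));             // prints true
console.log(looksLikeEmptyShell('<body><main id="root"><h1>Hi</h1></main></body>')); // prints false
```

If this returns true, the script can stop with a message suggesting a SingleFile re-export instead of writing an empty Markdown file.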
Install it from the Chrome Web Store in Brave:
Search “SingleFile” in the Chrome Web Store and install it.
Navigate to your Perplexity thread and wait for the full page to load.
Click the SingleFile toolbar icon → it will save a complete, self-contained .html file.
Feed that file to our script.
Once SingleFile captures the rendered DOM, the thread title will likely appear in a heading element (an <h1> or an element with a data-* attribute), and the <title> tag will also be populated with the thread-specific name rather than just "Perplexity". We’ll verify that with the new file.
Give it a try with SingleFile and re-export the same thread — once you attach that version, we can immediately identify all the selectors and start writing the actual Node script.
perp_test-sf.html
Excellent, the SingleFile export is perfect at 1.2MB. I have everything I need. Here’s what I found:
Title: In <title> tag → "Rendering web PDF files in Brave" ✓
Questions: h1[class*="group/query"] span[class*="select-text"] → plain text, newlines preserved
Answers: div[id^="markdown-content-"] (numbered 0, 1, 2...) → rich HTML content
Here is the complete, ready-to-run script:
package.json
{
  "name": "perp-thread-exporter",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "cheerio": "^1.0.0",
    "turndown": "^7.2.0",
    "turndown-plugin-gfm": "^1.0.2"
  }
}
convert.js
import { load } from 'cheerio';
import TurndownService from 'turndown';
import gfmPlugin from 'turndown-plugin-gfm';
import { readFileSync, writeFileSync } from 'fs';
import { resolve } from 'path';
const [,, inputPath, outputPath] = process.argv;
if (!inputPath) {
  console.error('Usage: node convert.js <input.html> [output.md]');
  process.exit(1);
}
const html = readFileSync(resolve(inputPath), 'utf-8');
const $ = load(html);
// ── Title ─────────────────────────────────────────────────────────────────────
const title = $('title').text().trim() || '(Untitled Thread)';
// ── Turndown (HTML → Markdown) ────────────────────────────────────────────────
const td = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
const { gfm } = gfmPlugin;
td.use(gfm); // enables Markdown tables
// Strip Perplexity inline citation spans and nbsp spacers
td.addRule('removeCitations', {
  filter: node =>
    node.nodeName === 'SPAN' && (
      node.hasAttribute?.('data-pplx-citation') ||
      (node.getAttribute?.('class') ?? '').includes('citation-nbsp')
    ),
  replacement: () => ''
});
// ── Label generator ───────────────────────────────────────────────────────────
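function makeLabel(text, maxLen = 80) {
  // Drop leading Markdown heading markers, then collapse whitespace.
  const s = text.replace(/^#+\s*/, '').replace(/\s+/g, ' ').trim();
  // Prefer the first full sentence; the length limit now follows maxLen instead of a hardcoded 80.
  const sentence = s.match(new RegExp(`^(.{1,${maxLen}}?[.?!])(?:\\s|$)`));
  if (sentence) return sentence[1];
  // Otherwise fall back to the first phrase break.
  const phrase = s.match(new RegExp(`^(.{1,${maxLen}}?[,;—])(?:\\s|$)`));
  if (phrase) return phrase[1];
  if (s.length <= maxLen) return s;
  // Hard truncation at the last word boundary, with an ellipsis.
  const cut = s.slice(0, maxLen);
  const space = cut.lastIndexOf(' ');
  return (space > 0 ? cut.slice(0, space) : cut) + '…';
}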
// ── Extract questions ─────────────────────────────────────────────────────────
// h1[class*="group/query"] is Perplexity's user-question heading element.
// NOTE: verify this selector still matches follow-up questions in multi-turn exports.
const questions = [];
$('h1[class*="group/query"]').each((_, el) => {