Perplexity-html-to-md script

Q1: Give me an ETA for you and me together to write a…

Give me an ETA for you and me together to write a Node.js app that runs locally on a PC, takes as input the HTML file of a Perplexity thread conversation page, and outputs a concise Markdown document containing only the conversation text, with each question post and response post preceded by a # or ## heading (to facilitate a TOC). I can assist with the HTML/CSS analysis by giving you examples of post texts to locate in the HTML.

A1: You and I can get to a working first version in…

You and I can get to a working first version in roughly 30–60 minutes of focused time, assuming you can provide one example HTML export of a thread and we iterate once or twice on selectors and formatting.

Plan and rough timeline

  1. Agree on input/output format (5–10 min)

  2. Inspect HTML structure (10–15 min)

  3. Build basic Node.js parser (15–25 min)
    Using Node 18+ and a DOM parser like cheerio:

  4. Refine and edge cases (10–20 min)

What I need from you to start

Send, in the next message:

  1. Node.js version you’re using (e.g., node -v).

  2. A small, real snippet of the Perplexity thread HTML including:

Once I see that, I’ll draft the initial package.json and export-thread.js (or similar) file contents, and you can run it locally and report back what the output looks like.

Q2: Sounds good.

Sounds good. You tell me which download method/format you think would be more conducive to accurately locating post text.

A2: For our use case,

For our use case, the best option is:

That keeps the original markup structure intact and avoids the browser rewriting parts of the DOM for local assets, which makes the selectors more stable and predictable for parsing with Node + Cheerio.

If your browser offers multiple save options, choose the one equivalent to “HTML only”, not “Web page, complete”, MHTML, or PDF. Those other formats either change the markup more or wrap it in extra layers that add noise for parsing.

Once you’ve saved one thread like that, you can either:

Q3: It would be great if we could visually distinguish…

It would be great if we could visually distinguish between questions and answers in the output doc, but I don’t think it would be correct to nest one type inside the other, since both types are logically at the same level, with conversations always alternating between the two. I believe that answers in the input may contain various types of text formatting, tables, etc., but since questions are entered in HTML text areas, they should be plain text. So maybe style questions as italic #s? It would be good if this formatting were also used for the TOC entries generated by Local Markdown Viewer, to easily distinguish Q from A entries. Any other ideas for how to distinguish Qs from As?

A3: A reliable way to distinguish Q vs A (and have…

A reliable way to distinguish Q vs A (and have that show in TOCs) is to encode the type directly in the heading text, not just via italics.

Here are a few workable patterns:

  1. Prefix labels in headings (most robust)

  2. Italics plus labels

  3. Symbol-based cue

  4. Level + label (no nesting)

Given what you want:

Both will show Q vs A clearly in Local Markdown Viewer’s TOC because the distinction is encoded in the literal heading text, not in structure alone.

If you tell me:

I’ll bake that scheme directly into how the Node script generates the headings.

Q4: 4

#4

A4: We’ll go with option 4: different heading levels…

We’ll go with option 4: different heading levels plus Q/A labels, alternating but not nesting logically.

Concretely, for each turn pair we’ll generate:

So the pattern for turn n will be:

## Q<n>: <question text>

### A<n>:
<answer content...>

This gives you:

Next step: once you send a minimal HTML snippet for one question and its answer, I’ll draft the Node script that emits headings in exactly this format.
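
As a rough illustration of the heading pattern above, here is a minimal sketch of a function that emits one turn in that layout. The name formatTurn and its parameters are hypothetical and do not come from the script discussed in this thread; it assumes the answer has already been converted to Markdown.

```javascript
// Hypothetical helper: formatTurn is an illustrative name, not part of the
// actual script. It assumes answerMarkdown is already Markdown text.
function formatTurn(n, questionText, answerMarkdown) {
  // Collapse whitespace so a multi-paragraph question stays on one heading line.
  const question = questionText.replace(/\s+/g, ' ').trim();
  return [
    `## Q${n}: ${question}`,
    '',
    `### A${n}:`,
    answerMarkdown.trim(),
    ''
  ].join('\n');
}

console.log(formatTurn(1, 'Why does Brave download PDFs?', 'This behavior is expected.'));
```

Because the Q/A label is part of the heading text itself, a TOC built from headings would show the distinction even if the viewer ignores heading levels.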

Q5: I have another feature wish (but not a hard…

I have another feature wish (but not a hard requirement, in case it’s not feasible). The threads that these HTML pages are for have names/titles which don’t show anywhere on the visible HTML page, but are the identifiers used in Perplexity’s thread listings. I suspect that this name/label may occur somewhere in the HTML content even though it’s not displayed. If so, please try to extract it and use it as the top-level markdown heading (or a similar appropriate markdown directive for signifying the page title). Node version 25.6.1. I don’t think we’ve mentioned it explicitly, but the TOC would not be useful if the entry labels just reproduced the question and answer texts in their entirety. Please think up an algorithm to generate a useful short label for each question and answer, such as the first sentence or phrase from each. I am attaching the exported single HTML file. To help you identify the different elements in it:

* The thread name/title is precisely “Rendering web PDF files in Brave”.
* The single question consists of 2 text paragraphs, one beginning with “Usually or always” and ending with “than render them.” and the other beginning with “If it would help” and ending with “let me know.”
* The single response contains multiple display formats but begins with the text “This behavior is” and ends with the text “and browsers respect it.” (with a “brave” hyperlink label).

A5: It looks like the HTML file didn’t actually come…

It looks like the HTML file didn’t actually come through — I don’t see it attached in the conversation. Please try again, either by re-attaching it or pasting the raw HTML of the relevant sections (the <head>, one question block, and one answer block are enough to start).

In the meantime, here’s the design work I can do now:

Title Extraction Strategy

Based on how Perplexity threads work, the title almost certainly lives in one or more of these locations in the <head>, listed in the priority order the script will try:

  1. <title> tag (most reliable)

  2. <meta property="og:title">

  3. <meta name="twitter:title">

  4. A <script type="application/ld+json"> block with a name or headline field

  5. A __NEXT_DATA__ JSON blob (common in Next.js apps like Perplexity)

If none are found, the output file gets # (Untitled Thread) and continues gracefully.
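
The priority order above amounts to a simple fallback chain. Here is a minimal sketch, assuming the candidate strings have already been pulled out of the document in that order. The name pickTitle and the optional ignore list are hypothetical, added only for illustration.

```javascript
// Hypothetical helper: pickTitle is an illustrative name. It expects the
// candidate strings in priority order (e.g. <title>, og:title, twitter:title,
// JSON-LD name, __NEXT_DATA__ value) and returns the first usable one.
function pickTitle(candidates, ignore = []) {
  // Assumption: callers may pass generic site names to skip, lower-cased.
  const ignored = new Set(ignore.map(s => s.toLowerCase()));
  for (const raw of candidates) {
    const value = (raw ?? '').replace(/\s+/g, ' ').trim();
    if (value && !ignored.has(value.toLowerCase())) return value;
  }
  // Fallback described above when no source yields a title.
  return '(Untitled Thread)';
}

console.log(pickTitle(['', undefined, 'Rendering web PDF files in Brave']));
```

Keeping the extraction of each candidate separate from the selection makes it easy to add or reorder sources once the real HTML is available.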

Label Generation Algorithm

For each heading, a makeLabel() function will:

  1. Normalize — collapse all whitespace, strip leading/trailing space.

  2. First full sentence — find the first ., ?, or ! followed by whitespace; use everything up to and including it, if within 100 chars.

  3. First phrase — if no sentence break, use up to the first ,, ;, or — break.

  4. Hard truncation — if still too long, cut at the last word boundary before 100 chars and append ….

Applied to what you described:

| Turn | Raw start | Generated label |
| --- | --- | --- |
| Q1 | “Usually or always…” | Usually or always (up to first sentence break) |
| A1 | “This behavior is…” | This behavior is (up to first sentence break) |

So your TOC will look like:

## Q1: Usually or always…
### A1: This behavior is…

Once you re-send the HTML, I’ll lock in the exact CSS selectors and write the full script. If you can paste just these three pieces from the raw file it’s enough to start:

  1. The <head> block (for title extraction).

  2. The outermost container of one user question, with its inner text visible.

  3. The outermost container of the answer that follows it.

Q6: I forgot to attach the export.

I forgot to attach the export. Here it is.

A6: The diagnosis is clear.

The diagnosis is clear. Here’s the situation:

The Core Problem: Perplexity is a SPA

The saved file is only 9,109 bytes and contains <main id="root"></main>, a completely empty root element. Perplexity is a React single-page application, so the conversation content is rendered entirely by JavaScript at runtime. A plain “Save Page As HTML Only” captures the static shell before JavaScript runs, so none of your conversation text, thread title, or other content ends up in the file. The <title> and og:title both just say "Perplexity", not the thread name.
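
The empty-root symptom described above could be detected up front, so the converter can warn instead of silently producing an empty document. This is a minimal sketch under the assumption that the root container is always <main id="root">, as in this saved file. The function name looksLikeEmptyShell is hypothetical.

```javascript
// Hypothetical check: looksLikeEmptyShell is an illustrative name. The
// id="root" container is taken from the saved file described above and may
// not hold for other exports.
function looksLikeEmptyShell(html) {
  const root = html.match(/<main\b[^>]*\bid=["']root["'][^>]*>([\s\S]*?)<\/main>/i);
  // No root container found: this heuristic cannot tell, so do not flag the file.
  if (!root) return false;
  // An unrendered SPA shell has nothing but whitespace inside the root.
  return root[1].trim() === '';
}

const shell = '<html><body><main id="root"></main></body></html>';
if (looksLikeEmptyShell(shell)) {
  console.warn('Input looks like an unrendered page shell; re-export the rendered DOM.');
}
```

A regex is used here only to keep the sketch dependency-free; inside the actual script, the same check could be done with the DOM parser that is already loaded.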

The Fix: Use the SingleFile Extension

The correct capture tool for this is SingleFile (available for Chrome/Brave/Firefox). It specifically waits for the JavaScript to finish rendering, then serializes the live DOM — including all your conversation content — into one clean .html file. That’s the file format our Node script needs.

Install it from the Chrome Web Store in Brave:

  1. Search “SingleFile” in the Chrome Web Store and install it.

  2. Navigate to your Perplexity thread and wait for the full page to load.

  3. Click the SingleFile toolbar icon → it will save a complete, self-contained .html file.

  4. Feed that file to our script.

What This Changes for the Script

Once SingleFile captures the rendered DOM, the thread title will likely appear in a heading element (an <h1> or element with a data-* attribute) and the <title> tag will also be populated with the thread-specific name rather than just "Perplexity". We’ll verify that with the new file.

Give it a try with SingleFile and re-export the same thread — once you attach that version, we can immediately identify all the selectors and start writing the actual Node script.

Q7: perp-sf.html

perp_test-sf.html

A7: Excellent -

Excellent — the SingleFile export is perfect at 1.2MB. I have everything I need. Here’s what I found:​

Here is the complete, ready-to-run script:


package.json

{
  "name": "perp-thread-exporter",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "cheerio": "^1.0.0",
    "turndown": "^7.2.0",
    "turndown-plugin-gfm": "^1.0.2"
  }
}

convert.js

import { load } from 'cheerio';
import TurndownService from 'turndown';
import gfmPlugin from 'turndown-plugin-gfm';
import { readFileSync, writeFileSync } from 'fs';
import { resolve } from 'path';

const [,, inputPath, outputPath] = process.argv;

if (!inputPath) {
  console.error('Usage: node convert.js <input.html> [output.md]');
  process.exit(1);
}

const html = readFileSync(resolve(inputPath), 'utf-8');
const $ = load(html);

// ── Title ─────────────────────────────────────────────────────────────────────
const title = $('title').text().trim() || '(Untitled Thread)';

// ── Turndown (HTML → Markdown) ────────────────────────────────────────────────
const td = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
const { gfm } = gfmPlugin;
td.use(gfm); // enables Markdown tables

// Strip Perplexity inline citation spans and nbsp spacers
td.addRule('removeCitations', {
  filter: node =>
    node.nodeName === 'SPAN' && (
      node.hasAttribute?.('data-pplx-citation') ||
      (node.getAttribute?.('class') ?? '').includes('citation-nbsp')
    ),
  replacement: () => ''
});

// ── Label generator ───────────────────────────────────────────────────────────
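function makeLabel(text, maxLen = 80) {
  // Drop leading heading markers and collapse all whitespace.
  const s = text.replace(/^#+\s*/, '').replace(/\s+/g, ' ').trim();
  // Build the patterns from maxLen so the parameter is honored (was hard-coded to 80).
  const sentence = s.match(new RegExp(`^(.{1,${maxLen}}?[.?!])(?:\\s|$)`));
  if (sentence) return sentence[1];
  const phrase = s.match(new RegExp(`^(.{1,${maxLen}}?[,;—])(?:\\s|$)`));
  if (phrase) return phrase[1];
  if (s.length <= maxLen) return s;
  // Hard truncation at the last word boundary, with an ellipsis.
  const cut = s.slice(0, maxLen);
  const space = cut.lastIndexOf(' ');
  return (space > 0 ? cut.slice(0, space) : cut) + '…';
}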

// ── Extract questions ─────────────────────────────────────────────────────────
// h1[class*="group/query"] is Perplexity's user-question heading element.
// NOTE: verify this selector still matches follow-up questions in multi-turn exports.
const questions = [];
$('h1[class*="group/query"]').each((_, el) => {