
Give me an ETA for you and me together to write a Node.js app that runs locally on a PC, takes as input the HTML file from a Perplexity thread conversation page, and outputs a concise Markdown document containing only the conversation text, with each question post and response post preceded by a # or ## (to facilitate a TOC). I can assist with the HTML/CSS analysis by giving you examples of post texts to locate in the HTML.

A1: You and I can get to a working first version in…

You and I can get to a working first version in roughly 30–60 minutes of focused time, assuming: you can provide one example HTML export of a thread, and we iterate once or twice on selectors and formatting.

Plan and rough timeline

What I need from you to start

Once I see that, I’ll draft the initial package.json and export-thread.js (or similar) file contents, and you can run it locally and report back what the output looks like.

Q2: sounds good.

sounds good. you tell me what download method/format you think would be more conducive for accurate post text location.

A2: For our use case,

That keeps the original markup structure intact and avoids the browser rewriting parts of the DOM for local assets, which makes the selectors more stable and predictable for parsing with Node + Cheerio.

If your browser offers multiple save options, choose the one equivalent to “HTML only”, not “Web page, complete”, MHTML, or PDF. Those other formats either change the markup more or wrap it in extra layers that add noise for parsing.

Q3: it would be great if we could visually in the…

It would be great if we could visually distinguish between questions and answers in the output doc, but I don’t think it would be correct to nest one type inside the other, since both types are logically at the same level, with conversations always alternating between the two. I believe that in the input, answers may contain various types of text formatting, tables, etc., but since questions are entered in HTML text areas they should be plain text. So maybe style questions as italic #s? It would be good if this formatting also gets used for the TOC entries generated by Local Markdown Viewer, to easily distinguish Q from A entries. Any other ideas for how to distinguish Qs from As?

A3: A reliable way to distinguish Q vs A (and have…

A reliable way to distinguish Q vs A (and have that show in TOCs) is to encode the type directly in the heading text, not just via italics.

Both will show Q vs A clearly in Local Markdown Viewer’s TOC because the distinction is encoded in the literal heading text, not in structure alone.

I’ll bake that scheme directly into how the Node script generates the headings.

Q4: 4

A4: We’ll go with option 4: different heading levels…

We’ll go with option 4: different heading levels plus Q/A labels, alternating but not nesting logically.
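A minimal sketch of how such headings could be built, assuming questions sit at `##` and answers at `###`. The helper name `formatHeading` and the exact levels are assumptions for illustration, not a settled design.

```javascript
// Hypothetical helper: the Q/A type is written into the heading text itself,
// so a TOC still shows the distinction even if it flattens heading levels.
function formatHeading(kind, n, label) {
  // Assumed levels: questions at ##, answers at ###.
  const hashes = kind === 'Q' ? '##' : '###';
  return `${hashes} ${kind}${n}: ${label}`;
}

console.log(formatHeading('Q', 1, 'Usually or always'));
console.log(formatHeading('A', 1, 'This behavior is'));
```

Because the label carries the `Q1:` / `A1:` prefix, the two entry types stay distinguishable in any viewer, while the different levels only add visual contrast.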

Next step: once you send a minimal HTML snippet for one question and its answer, I’ll draft the Node script that emits headings in exactly this format.

Q5: I have another feature wish (but not a hard…

I have another feature wish (but not a hard requirement, in case it’s not feasible). The threads that these HTML pages are for have names/titles which don’t show anywhere on the visible HTML page, but are the identifiers used in Perplexity’s thread listings. I suspect that this name/label may occur in the HTML content somewhere even though it’s not displayed. If so, please try to extract it and use it as the top-level markdown heading (or a similar appropriate markdown directive for signifying the page title). Node version 25.6.1.

I don’t think we’ve mentioned it explicitly, but the TOC would not be useful if the entry labels just reproduced the question and answer texts in their entirety. Please think up an algorithm to generate a useful short label for each question and answer, such as the first sentence or phrase from each.

I am attaching the exported single HTML file. To help you identify the different elements in it:

* The thread name/title is precisely “Rendering web PDF files in Brave”.
* The single question consists of 2 text paragraphs, one beginning with “Usually or always” and ending with “than render them.” and the other beginning with “If it would help” and ending with “let me know.”
* The single response contains multiple display formats but begins with the text “This behavior is” and ends with the text “and browsers respect it.” (with a “brave” hyperlink label).

A5: It looks like the HTML file didn’t actually come…

It looks like the HTML file didn’t actually come through — I don’t see it attached in the conversation. Please try again, either by re-attaching it or pasting the raw HTML of the relevant sections (the <head>, one question block, and one answer block are enough to start).

Title Extraction Strategy

Based on how Perplexity threads work, the title almost certainly lives in one or more of these locations in the <head>, in priority order the script will try:

If none are found, the output file gets # (Untitled Thread) and continues gracefully.
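A rough sketch of that fallback chain, under stated assumptions: the two candidate sources (`og:title`, then `<title>`), their order, and the helper name `extractTitle` are illustrative guesses, and matching raw HTML with regular expressions is fragile compared with a DOM parser such as Cheerio. It also assumes double-quoted attributes with `property` before `content`.

```javascript
// Hypothetical sketch: try candidate title sources in an assumed priority order.
function extractTitle(html) {
  const candidates = [
    // <meta property="og:title" content="...">
    /<meta[^>]*property="og:title"[^>]*content="([^"]*)"/i,
    // <title>...</title>
    /<title[^>]*>([^<]*)<\/title>/i,
  ];
  for (const re of candidates) {
    const value = html.match(re)?.[1]?.trim();
    // Skip empty values and the generic site name, which says nothing about the thread.
    if (value && value !== 'Perplexity') return value;
  }
  return '(Untitled Thread)';
}

console.log(extractTitle('<title>Rendering web PDF files in Brave</title>'));
console.log(extractTitle('<title>Perplexity</title>'));
```

Skipping the bare site name keeps a generic `<title>` from masking a more specific source further down the list.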

Label Generation Algorithm

Once you re-send the HTML, I’ll lock in the exact CSS selectors and write the full script. If you can paste just these three pieces from the raw file it’s enough to start:

| Turn | Raw start | Generated label |
| --- | --- | --- |
| Q1 | “Usually or always…” | `Usually or always` (up to first sentence break) |
| A1 | “This behavior is…” | `This behavior is` (up to first sentence break) |

Q6: I forgot to attach the export.

A6: The diagnosis is clear.

The Core Problem: Perplexity is a SPA

The saved file is only 9,109 bytes and contains <main id="root"></main>, a completely empty root element. Perplexity is a React single-page application, so the conversation content is rendered entirely by JavaScript at runtime. A plain “Save Page As HTML Only” captures the static shell before JavaScript runs, so none of your conversation text, thread title, or other content ends up in the file. The <title> and og:title both just say "Perplexity", not the thread name.
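A converter could detect this situation up front and fail with a helpful message instead of producing an empty document. The check below is only a sketch: `looksLikeEmptyShell` is a hypothetical name, and the assumption that the shell always has an empty element with `id="root"` is based on this one saved file.

```javascript
// Hypothetical guard: true when the HTML contains an empty root element,
// e.g. <main id="root"></main> or <div id="root"></div> (whitespace allowed inside).
function looksLikeEmptyShell(html) {
  return /<(main|div)\b[^>]*\bid="root"[^>]*>\s*<\/\1>/i.test(html);
}

const shell = '<body><main id="root"></main></body>';
const rendered = '<body><main id="root"><h1>Thread</h1></main></body>';

console.log(looksLikeEmptyShell(shell));
console.log(looksLikeEmptyShell(rendered));
```

If the guard fires, the script can exit early and suggest re-exporting the page after it has fully rendered.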

The Fix: Use the SingleFile Extension

The correct capture tool for this is SingleFile (available for Chrome/Brave/Firefox). It specifically waits for the JavaScript to finish rendering, then serializes the live DOM — including all your conversation content — into one clean .html file. That’s the file format our Node script needs.

What This Changes for the Script

Once SingleFile captures the rendered DOM, the thread title will likely appear in a heading element (an <h1> or an element with a data-* attribute), and the <title> tag will also be populated with the thread-specific name rather than just "Perplexity". We’ll verify that with the new file.

Give it a try with SingleFile and re-export the same thread — once you attach that version, we can immediately identify all the selectors and start writing the actual Node script.

Q7: perp-sf.html

A7: Excellent -

Excellent — the SingleFile export is perfect at 1.2MB. I have everything I need. Here’s what I found:

{
  "name": "perp-thread-exporter",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "cheerio": "^1.0.0",
    "turndown": "^7.2.0",
    "turndown-plugin-gfm": "^1.0.2"
  }
}

import { load } from 'cheerio';
import TurndownService from 'turndown';
import gfmPlugin from 'turndown-plugin-gfm';
import { readFileSync, writeFileSync } from 'fs';
import { resolve } from 'path';

const [,, inputPath, outputPath] = process.argv;

if (!inputPath) {
  console.error('Usage: node convert.js <input.html> [output.md]');
  process.exit(1);
}

const html = readFileSync(resolve(inputPath), 'utf-8');
const $ = load(html);

// ── Title ─────────────────────────────────────────────────────────────────────
const title = $('title').text().trim() || '(Untitled Thread)';

// ── Turndown (HTML → Markdown) ────────────────────────────────────────────────
const td = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
const { gfm } = gfmPlugin;
td.use(gfm); // enables Markdown tables

// Strip Perplexity inline citation spans and nbsp spacers
td.addRule('removeCitations', {
  filter: node =>
    node.nodeName === 'SPAN' && (
      node.hasAttribute?.('data-pplx-citation') ||
      (node.getAttribute?.('class') ?? '').includes('citation-nbsp')
    ),
  replacement: () => ''
});

// ── Label generator ───────────────────────────────────────────────────────────
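function makeLabel(text, maxLen = 80) {
  // Drop leading Markdown heading markers and collapse whitespace.
  const s = text.replace(/^#+\s*/, '').replace(/\s+/g, ' ').trim();
  // Prefer the first sentence if it ends within maxLen characters.
  const sentence = s.match(new RegExp(`^(.{1,${maxLen}}?[.?!])(?:\\s|$)`));
  if (sentence) return sentence[1];
  // Otherwise fall back to the first phrase break.
  const phrase = s.match(new RegExp(`^(.{1,${maxLen}}?[,;—])(?:\\s|$)`));
  if (phrase) return phrase[1];
  if (s.length <= maxLen) return s;
  // Last resort: cut at the last word boundary and append an ellipsis.
  const cut = s.slice(0, maxLen);
  const space = cut.lastIndexOf(' ');
  return (space > 0 ? cut.slice(0, space) : cut) + '…';
}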

// ── Extract questions ─────────────────────────────────────────────────────────
// h1[class*="group/query"] is Perplexity's user-question heading element.
// NOTE: verify this selector still matches follow-up questions in multi-turn exports.
const questions = [];
$('h1[class*="group/query"]').each((_, el) => {
  const text = $(el).find('span[class*="select-text"]').text().trim();
  if (text) questions.push(text);
});

// ── Extract answers ───────────────────────────────────────────────────────────
// div[id^="markdown-content-"] holds each AI response, numbered 0, 1, 2, …
const answers = [];
$('[id^="markdown-content-"]').each((_, el) => {
  const md = td.turndown($(el).html() ?? '');
  answers.push(md);
});

// ── Assemble Markdown ─────────────────────────────────────────────────────────
const lines = [`# ${title}`, ''];

const count = Math.max(questions.length, answers.length);
for (let i = 0; i < count; i++) {
  const n = i + 1;

  if (questions[i] != null) {
    lines.push(`## Q${n}: ${makeLabel(questions[i])}`);
    lines.push('');
    lines.push(questions[i]);
    lines.push('');
  }

  if (answers[i] != null) {
    const firstTextLine = answers[i].split('\n').find(l => l.trim()) ?? '';
    lines.push(`### A${n}: ${makeLabel(firstTextLine)}`);</