
pmpc

playwright-multi-page-crawl

Session ID: 6bd840fd-9e4b-4f8d-b02f-3e3dc62d0e67
Date: 2026-03-28 15:12:14


🤖 Claude

Here are the key findings:

File Format by Doc Type

| Doc Type | Best Format | Why |
|---|---|---|
| API reference (method signatures, params, returns) | Markdown | Headers act as retrieval index points; code blocks render well |
| Task/how-to (admin procedures) | Markdown | Step lists, prerequisites, and navigation paths are narrative |
| Concept/overview | Markdown | Descriptive prose with section headings |
| Structured catalogs (field lists, property tables) | JSON | More precise for structured queries; consistent schema |

Markdown is the winner for almost everything we’re crawling. JSON only wins for highly tabular/catalog data we don’t yet have.

ChatGPT Project Source Limits

| Constraint | Value |
|---|---|
| Max files per project | 20 |
| Max tokens per file | 2M (practical: 50k–150k for good retrieval) |
| Practical safe file size | 1–2 MB of text |
| Recommended chunk size | 5–20 KB per topic |
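As a quick sanity check before uploading, file size can be mapped to an approximate token count with the common ~4-characters-per-token heuristic. This is a rough approximation for English prose, not a real tokenizer:

```python
from pathlib import Path

def estimate_tokens(path: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 chars/token heuristic."""
    text = Path(path).read_text(encoding="utf-8")
    return int(len(text) / chars_per_token)
```

By this heuristic, a 1 MB text file is on the order of 250k tokens, well above the 50k–150k range where retrieval is said to work best.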

The Critical Problem

The thread’s own advice says “one method per file, 5-20 KB each” for optimal retrieval — that’s potentially hundreds of files. But ChatGPT Projects caps at 20 files. So we must consolidate: group related content into ~10 files per role (admin, dev), each file being a topical bundle with clear markdown headings as internal index points.

For example, a developer project might have files like:
- server-api-GlideRecord.md (all GlideRecord methods in one file)
- server-api-GlideSystem.md
- client-api-reference.md
- rest-api-reference.md

What I Recommend We Do Next

  1. Don’t design the file-splitting logic yet — we don’t know how ChatGPT actually performs with different file sizes in your specific use case.

  2. I’ll write you a small test-file generator that produces markdown files at specified sizes (e.g., 50KB, 100KB, 500KB, 1MB, 2MB) with realistic ServiceNow-style API doc content. You upload these to a ChatGPT project and test retrieval at each size.

  3. Once you have those numbers, we design the consolidation strategy and output format.
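A minimal sketch of what that generator could look like. The entry template and the `GlideZypher`/`ZypherResult` names are placeholders for the fictional API content, and sizes are approximate (it pads with method entries until the target byte count is reached):

```python
from pathlib import Path

# Placeholder fictional-API entry; the real generator would vary
# method names, parameters, and return types per entry.
METHOD_TEMPLATE = """## GlideZypher{idx}.compute{idx}()

Runs a fictional computation cycle (test content only).

Parameters:
- tableName (String): Target table sys_name
- level (Number): Intensity, 1-10

Returns: ZypherResult{idx}

"""

def generate_file(target_bytes: int, out_path: str) -> int:
    """Write fake API entries until the file reaches target_bytes.

    Returns the number of entries written.
    """
    chunks, size, idx = [], 0, 0
    while size < target_bytes:
        entry = METHOD_TEMPLATE.format(idx=idx)
        chunks.append(entry)
        size += len(entry.encode("utf-8"))
        idx += 1
    Path(out_path).write_text("".join(chunks), encoding="utf-8")
    return idx
```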

Want me to write that test-file generator?


👤 User

Yes. And advise me how I can judge “retrieval quality”.


🤖 Claude

Good question. Here’s what I’d recommend for the test:

What “retrieval quality” means practically: You ask the ChatGPT project a question whose answer is in the uploaded source file, and check whether it (a) finds the right content, (b) answers accurately, and (c) doesn’t hallucinate details that aren’t in the source.

Test protocol:

  1. Upload a test file to a ChatGPT project
  2. Ask 3-5 questions whose answers are only in that file (not in the model’s training data) — this is critical, otherwise you can’t tell if it’s retrieving vs. remembering
  3. Grade each answer: Did it find the content? Was the answer correct? Did it cite the right section?

How to ensure answers must come from retrieval, not training data:

The test files should contain fictional ServiceNow API content — invented class names, made-up method signatures, fake parameter names. Something like:

```markdown
## GlideVortex.spinUp()

Initiates a vortex computation cycle on the specified table.

Parameters:
- tableName (String): Target table sys_name
- torqueLevel (Number): Spin intensity, 1-10

Returns: GlideVortexResult
```

If ChatGPT correctly answers “What does GlideVortex.spinUp() return?” with “GlideVortexResult” — that’s successful retrieval. If it says “I don’t see that in my sources” — retrieval failed. If it invents an answer — hallucination.

The test matrix I’ll generate files for:

| # | File Size | Content Density |
|---|---|---|
| 1 | 50 KB | ~20 fake API methods, heavy headings |
| 2 | 200 KB | ~80 methods |
| 3 | 500 KB | ~200 methods |
| 4 | 1 MB | ~400 methods |
| 5 | 2 MB | ~800 methods |

For each file, I’ll embed 5 “needle” entries at known positions (start, 25%, middle, 75%, end) with distinctive names you can query for. You test whether ChatGPT finds needles at all positions across all file sizes. When retrieval starts missing needles, you’ve found the practical ceiling.
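The needle-placement idea can be sketched like this, assuming the file is assembled from a list of markdown entries and the five target positions are start/25%/50%/75%/end:

```python
def place_needles(entries: list[str], needles: list[str]) -> list[str]:
    """Insert one needle entry at each target fraction of the document."""
    fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
    assert len(needles) == len(fractions)
    n = len(entries)
    positions = [round(f * n) for f in fractions]
    out = list(entries)
    # Insert from the highest position down so earlier indices stay valid.
    for pos, needle in sorted(zip(positions, needles), reverse=True):
        out.insert(pos, needle)
    return out
```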

Want me to write the generator now?


👤 User

yes


🤖 Claude

Let me verify the needles are actually in the files and properly distributed.


🤖 Claude

All 5 needles properly distributed across all 5 files. Files are at data/test-docs/.
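That verification step can be sketched as a script that confirms each needle string appears exactly once in each generated file. The needle strings are passed in, since the exact set depends on what the generator emitted:

```python
from pathlib import Path

def verify_needles(doc_dir: str, needles: list[str]) -> dict[str, list[str]]:
    """Return {file_name: [needles that are missing or duplicated]}."""
    problems = {}
    for path in sorted(Path(doc_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        bad = [n for n in needles if text.count(n) != 1]
        if bad:
            problems[path.name] = bad
    return problems
```

An empty result dict means every needle appears exactly once in every file.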

Here’s your testing procedure:

For each file (start with 50k.md, work up):

  1. Create a ChatGPT project (or use an existing one), upload the file as a source
  2. Ask these 5 questions — each targets a needle at a different position in the file:
| # | Question | Expected Answer |
|---|---|---|
| 1 | What does GlideZypherAlpha.quantumLeapAlpha() return? | ZypherResult (with leapDistance, resonanceFreq, stabilized) |
| 2 | What is the secret test code for GlideZypherBravo? | NEEDLE_Bravo_25PCT |
| 3 | What error code does GlideZypherCharlie throw? | ZF-CharlieCharlieCharlie |
| 4 | What parameters does quantumLeapDelta take? | warpFactor (Number) and crystalKey (String) |
| 5 | What range of warpFactor requires admin role in GlideZypherEcho? | Values above 50 |

What to record for each file size:
- How many of the 5 questions got correct answers
- Which needle positions failed (start/25%/50%/75%/end)
- Whether ChatGPT cited the source file or appeared to guess
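One lightweight way to record those observations is a small CSV logger; the column layout here is an assumption, adjust to whatever you actually want to track:

```python
import csv

def log_result(csv_path: str, file_size: str, needle_pos: str,
               correct: bool, cited_source: bool) -> None:
    """Append one row per question asked: size, position, correctness, citation."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(
            [file_size, needle_pos, int(correct), int(cited_source)]
        )
```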

When retrieval starts failing or getting unreliable, that’s your practical ceiling.