←
LIB-14
Created: 2025-12-25
•
Updated: 2025-12-25
Relationships
Loading...
Attachments
Loading...
Comments (5)
Claude
·
2025-12-25
## Verification Complete
All acceptance criteria verified:
1. **ReaderLM-v2 running on GPU 4** - Container `readerlm-v2` up for 6+ hours
2. **OpenAI-compatible API** - Responding at `http://localhost:10010/v1`
3. **HTML-to-Markdown conversion tested**:
- Simple HTML: `<h1>Test</h1>` → `# Test` ✓
- Complex Notion HTML (900KB): Successfully chunked and converted to 21KB markdown ✓
4. **Documentation complete** - CLAUDE.md updated with usage instructions
Scripts verified:
- `scripts/start_readerlm.sh` - Starts vLLM container with health checks
- `scripts/stop_readerlm.sh` - Stops and removes container
- `scripts/html_to_markdown.py` - CLI with --inplace, --dry-run support
agent
·
2025-12-25
## Implementation Complete
### What was done
1. **Deployed ReaderLM-v2 via vLLM** on GPU 4 (RTX 3090) on chungus
- Model: jinaai/ReaderLM-v2 (1.54B params, bfloat16)
- Port: 10010 (OpenAI-compatible API)
- VRAM usage: ~22GB (with KV cache pre-allocation)
2. **Created start/stop scripts**
- `scripts/start_readerlm.sh`: Starts vLLM container with health checks
- `scripts/stop_readerlm.sh`: Stops and removes container
3. **Created conversion utility**
- `scripts/html_to_markdown.py`: CLI tool for batch processing
- Supports chunking for large files (tested on 900KB Notion HTML -> 20KB markdown)
- Detects HTML vs markdown to skip already-clean files
### Test Results
- Simple HTML: Works perfectly (headings, lists, links, images)
- Full Notion HTML pages: Successfully extracted content from 900KB HTML
- Weaviate MDX files: Converted JSX/HTML to markdown
### Performance
- Prompt throughput: ~1500 tokens/s
- Generation throughput: ~150 tokens/s
- Conversion time: ~30-60 seconds for large files with chunking
### Files Changed
- `scripts/start_readerlm.sh`: vLLM startup script
- `scripts/stop_readerlm.sh`: vLLM shutdown script
- `scripts/html_to_markdown.py`: CLI conversion utility
- `CLAUDE.md`: Updated with documentation
### How to Use
```bash
# Start the service
./scripts/start_readerlm.sh
# Convert files
python scripts/html_to_markdown.py input.html output.md
python scripts/html_to_markdown.py file.md --inplace
# Stop the service
./scripts/stop_readerlm.sh
```
### Notes
- Service is currently running on GPU 4 (port 10010)
- Uses `restart: unless-stopped` so it will survive reboots
- Ready for integration with d2m or batch processing scripts
agent
·
2025-12-25
## Reference: Jina AI docs available
Jina AI documentation is available in llm-code-docs for reference when implementing:
`~/github/llm-code-docs/docs/*/jina*/`
This includes ReaderLM usage examples, API documentation, and best practices.
agent
·
2025-12-25
## Priority Test Data: Pure HTML Files
**98 files are full HTML pages** incorrectly saved as .md - these are the priority conversion targets.
### Location
`~/github/llm-code-docs/docs/llms-txt/notion/` - 97 files
`~/github/llm-code-docs/docs/llms-txt/huggingface-hub/` - 1 file
### Example
`docs/llms-txt/notion/webhooks.md` starts with:
```html
<!DOCTYPE html><html lang="en" style="" data-color-mode="light"...
```
These are full scraped web pages with:
- Complete `<html>`, `<head>`, `<body>` structure
- CSS/JS includes (should be stripped)
- Actual content buried in the DOM
### Why these are ideal test cases
1. Real-world messy HTML (not clean examples)
2. Need to extract content from complex page structure
3. Must strip scripts, styles, navigation
4. Preserving code blocks and API documentation
This is exactly what ReaderLM-v2 was designed for.
agent
·
2025-12-25
## Test Data
**910 markdown files in llm-code-docs contain raw HTML** that should be converted to clean markdown.
### Location
`~/github/llm-code-docs/docs/github-scraped/` - especially weaviate docs (MDX with JSX/HTML)
### Example files to test
- `docs/github-scraped/weaviate/docs-cloud-manage-clusters-status.md`
- `docs/github-scraped/weaviate/docs-cloud-platform-billing.md`
- `docs/github-scraped/weaviate/docs-agents-personalization-tutorial-recipe-recommender.md`
### What needs converting
These files contain:
- `<div style={{...}}>` JSX blocks
- `<iframe>` embeds
- `<details>`/`<summary>` HTML
- `<script>` tags (should be removed)
- `<br />` tags
### Success criteria
Convert HTML elements to markdown equivalents or remove non-content elements (scripts, iframes, style blocks).