# MVP Specification: Patent Ingestion and Structuring
## 🎯 Objective
Create a system that ingests user-curated folders of patent documents (Markdown and PDF), validates structure, parses content, normalizes formatting, and produces a clean, structured JSON object per patent. These objects populate a database and enable searching, filtering, and LLM-based summaries.

---
## 🧭 User Flow
1. User opens the interface (desktop or web-based MVP).
2. User clicks the “Add Patent Folder” button, which launches a file/folder picker.
3. User selects a folder named after a patent ID (e.g., `US_2017250687_A1/`) containing:
   - `Claims.md`
   - `Description.md`
   - `Bibliographic_Data.md`
   - `Legal_Events.md`
   - `Patent_Family.md`
   - PDF (named to match patent ID)
4. The system validates the folder structure:
   - Verifies that the expected files exist
   - Reads files to check that contents match type (e.g., `Claims.md` starts with a claim)
   - Attempts soft corrections (e.g., removes headers like “CLAIMS”)
   - Flags problems (e.g., wrong file type, duplicate or missing files)
5. The system displays a summary screen:
   - Extracted metadata (title, author, filing date, etc.)
   - Status of each file (✅ or ⚠️)
   - Suggested corrections, if applicable
6. On confirmation, the system generates a structured JSON representation.
7. The JSON object is stored in a local or cloud database (e.g., SQLite for the MVP).
8. (Optional) An LLM generates a plain-language summary block.
9. The system refreshes and displays the newly ingested patent in the list view.
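
The validation in step 4 could be sketched roughly as follows. This is a minimal illustration, not the specified implementation: the function names `validate_folder` and `strip_claims_header` are hypothetical, and it assumes the PDF is named exactly `<folder name>.pdf` and that an empty file counts as a problem. The returned status per file could feed the ✅/⚠️ column of the summary screen.

```python
from pathlib import Path

# Expected Markdown files, taken from the user flow above.
REQUIRED_MD = [
    "Claims.md",
    "Description.md",
    "Bibliographic_Data.md",
    "Legal_Events.md",
    "Patent_Family.md",
]


def validate_folder(folder: Path) -> dict[str, str]:
    """Return a status per expected file: "ok", "missing", or "empty"."""
    status = {}
    # Assumption: the PDF shares its name with the folder (the patent ID).
    for name in REQUIRED_MD + [f"{folder.name}.pdf"]:
        path = folder / name
        if not path.is_file():
            status[name] = "missing"
        elif path.stat().st_size == 0:
            status[name] = "empty"
        else:
            status[name] = "ok"
    return status


def strip_claims_header(text: str) -> str:
    """Soft correction: drop a leading "CLAIMS" header line if present."""
    lines = text.lstrip().splitlines()
    if lines and lines[0].strip().strip("#").strip().upper() == "CLAIMS":
        lines = lines[1:]
    return "\n".join(lines).lstrip()
```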

---
## 📁 Expected Folder Format
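Each patent is stored in its own folder, named after the patent ID:

- `US_2017250687_A1/`
  - `Claims.md`
  - `Description.md`
  - `Bibliographic_Data.md`
  - `Legal_Events.md`
  - `Patent_Family.md`
  - `US_2017250687_A1.pdf`

---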
## 🧾 `patent.json` Schema (V1)
```json
{
  "id": "US_2017250687_A1",
  "friendly_name": "Interactive Touch Book",
  "family_name": "US_2017250687_A1",
  "authors": ["Dr. Kate Stone"],
  "assignee": "Novalia Ltd.",
  "status": "Granted",
  "jurisdiction": "US",
  "filing_date": "2017-02-28",
  "priority_date": "2014-10-10",
  "publication_date": "2017-09-07",
  "grant_date": "2021-06-15",
  "application_number": "15/445,678",
  "publication_number": "US2017250687A1",
  "patent_number": "US11000000B2",
  "files": {
    "claims_md": "Claims.md",
    "description_md": "Description.md",
    "biblio_md": "Bibliographic_Data.md",
    "legal_events_md": "Legal_Events.md",
    "family_md": "Patent_Family.md",
    "pdf": "US_2017250687_A1.pdf"
  },
  "summary": {
    "plain_language": "This patent describes an interactive book...",
    "use_case": "Educational books, museum guides",
    "novelty": "Capacitive touch through overlaid pages",
    "component_breakdown": ["Touch sensors", "Pages", "Speakers", "MCU"],
    "illustrative_metaphor": "Like a talking book that responds to touch",
    "review_status": "pending"
  },
  "tags": ["printed electronics", "interactive media", "education", "touch sensor"]
}
```
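
Before a record shaped like the schema above is stored, a lightweight check might catch missing keys and malformed dates. The sketch below is only one possible approach: `check_record` is a hypothetical helper, and the choice of which keys are mandatory is an assumption, since the schema does not say which fields are required.

```python
from datetime import date

# Keys from the V1 schema above; which ones are mandatory is an assumption.
REQUIRED_KEYS = {"id", "friendly_name", "authors", "status", "filing_date", "files"}
DATE_KEYS = ("filing_date", "priority_date", "publication_date", "grant_date")


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record passed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    for key in DATE_KEYS:
        value = record.get(key)
        if value is None:
            continue  # assumed optional, e.g. grant_date before a patent is granted
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{key} is not an ISO 8601 date: {value!r}")
    return problems
```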
## 🧠 Backend Processing
- Validation Module
  - Confirms folder and file structure
  - Soft-fixes filenames and markdown structure
  - Flags missing or malformed data
- Extraction Module
  - Pulls metadata from `Bibliographic_Data.md` and others
  - Normalizes author/assignee names, dates, and IDs
- Summarization Module (optional)
  - Uses LLM to generate plain-language patent summaries
  - Prompt includes instructions for clarity, tone, novelty, etc.
- Database Interface
  - Stores validated JSON in searchable format (e.g. SQLite, TinyDB)
  - Links to physical files for browsing and future NLP ingestion
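
The date and ID normalization in the Extraction Module might look something like the sketch below. The helper names and the list of accepted input date formats are assumptions made for illustration, since the format of the bibliographic files is not specified here. Note that `%B` (full month name) depends on the active locale. The ID helper simply removes separators, which turns the folder name `US_2017250687_A1` into the `publication_number` form `US2017250687A1` used in the schema.

```python
import re
from datetime import datetime

# Input date formats the bibliographic files are assumed to use (hypothetical).
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d %B %Y")


def normalize_date(raw: str) -> str:
    """Convert a date string in one of DATE_FORMATS to ISO 8601 (YYYY-MM-DD)."""
    cleaned = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"unrecognized date format: {raw!r}")


def normalize_publication_number(raw: str) -> str:
    """Remove whitespace, underscores, slashes, commas, and hyphens; uppercase."""
    return re.sub(r"[\s_/,-]", "", raw).upper()
```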

---
## 🔍 Enables
- Full local control of patent data
- No reliance on external APIs
- Flexible downstream filtering (e.g. granted only, by date, by inventor)
- Visual interface for file correction and ingestion
- Foundation for semantic and LLM-powered features
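
To illustrate the local storage and filtering listed above, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout (a few filter columns plus the full JSON as text) and the function names are assumptions, not part of the specification. Because dates are stored as ISO 8601 strings, plain string comparison orders them correctly. Filtering by inventor is left out of this sketch.

```python
import json
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) patents table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patents ("
        "id TEXT PRIMARY KEY, status TEXT, filing_date TEXT, record TEXT)"
    )
    return conn


def store(conn: sqlite3.Connection, record: dict) -> None:
    """Insert or replace one patent, keeping the full JSON next to filter columns."""
    conn.execute(
        "INSERT OR REPLACE INTO patents VALUES (?, ?, ?, ?)",
        (record["id"], record["status"], record["filing_date"], json.dumps(record)),
    )
    conn.commit()


def granted_since(conn: sqlite3.Connection, iso_date: str) -> list[dict]:
    """Return granted patents filed on or after iso_date, oldest first."""
    rows = conn.execute(
        "SELECT record FROM patents WHERE status = 'Granted' AND filing_date >= ? "
        "ORDER BY filing_date",
        (iso_date,),
    )
    return [json.loads(row[0]) for row in rows]
```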

---
## 🔜 Future Enhancements
- API ingestion from Espacenet or other databases
- PDF-to-markdown converter to extract claims/description directly
- Timeline view for legal events and status
- Patent family linking
- Collaborative tagging and annotation
