# MVP Specification: Patent Ingestion and Structuring

## 🎯 Objective

Create a system that ingests user-curated folders of patent documents (markdown and PDF), validates their structure, parses their content, normalizes formatting, and produces one clean, structured JSON object per patent. These objects populate a database and enable searching, filtering, and LLM-based summaries.

---

## 🧭 User Flow

1. **User opens the interface** (desktop or web-based MVP).
2. **Clicks an "Add Patent Folder" button**, which launches a file/folder picker.
3. **User selects a folder** (named after a patent ID, e.g., `US_2017250687_A1/`) containing:
   - `Claims.md`
   - `Description.md`
   - `Bibliographic_Data.md`
   - `Legal_Events.md`
   - `Patent_Family.md`
   - PDF (named to match patent ID)
4. The system **validates folder structure**:
   - Verifies expected files exist
   - Reads files to check that contents match type (e.g. `Claims.md` starts with a claim)
   - Attempts soft corrections (e.g. removes headers like "CLAIMS")
   - Flags problems (e.g. wrong file type, duplicate or missing files)
5. **Displays summary screen**:
   - Extracted metadata (title, author, filing date, etc.)
   - Status of each file (✅ or ⚠️)
   - Suggested corrections if applicable
6. **On confirmation**, generates a structured JSON representation.
7. The JSON object is stored in a local or cloud DB (e.g., SQLite for the MVP).
8. (Optional) An LLM generates a plain-language summary block.
9. The system refreshes and displays the newly ingested patent in the list view.

---

## 📁 Expected Folder Format

Each patent is stored in its own folder, named after the patent ID and containing the five markdown files and the PDF listed in step 3 of the User Flow.

---

## 🧾 `patent.json` Schema (V1)

```json
{
  "id": "US_2017250687_A1",
  "friendly_name": "Interactive Touch Book",
  "family_name": "US_2017250687_A1",
  "authors": ["Dr. Kate Stone"],
  "assignee": "Novalia Ltd.",
  "status": "Granted",
  "jurisdiction": "US",
  "filing_date": "2017-02-28",
  "priority_date": "2014-10-10",
  "publication_date": "2017-09-07",
  "grant_date": "2021-06-15",
  "application_number": "15/445,678",
  "publication_number": "US2017250687A1",
  "patent_number": "US11000000B2",
  "files": {
    "claims_md": "Claims.md",
    "description_md": "Description.md",
    "biblio_md": "Bibliographic_Data.md",
    "legal_events_md": "Legal_Events.md",
    "family_md": "Patent_Family.md",
    "pdf": "US_2017250687_A1.pdf"
  },
  "summary": {
    "plain_language": "This patent describes an interactive book...",
    "use_case": "Educational books, museum guides",
    "novelty": "Capacitive touch through overlaid pages",
    "component_breakdown": ["Touch sensors", "Pages", "Speakers", "MCU"],
    "illustrative_metaphor": "Like a talking book that responds to touch",
    "review_status": "pending"
  },
  "tags": [
    "printed electronics",
    "interactive media",
    "education",
    "touch sensor"
  ]
}
```

---

## 🧠 Backend Processing

- Validation Module
  - Confirms folder and file structure
  - Soft-fixes filenames and markdown structure
  - Flags missing or malformed data
- Extraction Module
  - Pulls metadata from `Bibliographic_Data.md` and the other files
  - Normalizes author/assignee names, dates, and IDs
- Summarization Module (optional)
  - Uses an LLM to generate plain-language patent summaries
  - Prompt includes instructions for clarity, tone, novelty, etc.
- Database Interface
  - Stores validated JSON in a searchable format (e.g. SQLite, TinyDB)
  - Links to physical files for browsing and future NLP ingestion

---

## Enables

- Full local control of patent data
- No reliance on external APIs
- Flexible downstream filtering (e.g. granted only, by date, by inventor)
- Visual interface for file correction and ingestion
- Foundation for semantic and LLM-powered features

---

## 🔜 Future Enhancements

- API ingestion from Espacenet or other databases
- PDF-to-markdown converter to extract claims/description directly
- Timeline view for legal events and status
- Patent family linking
- Collaborative tagging and annotation

  * [[wiki:projects:patents:ingest|Patent Ingest]]
  * [[wiki:projects:patents:mvp1|MVP 1]]
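
---

## 🧪 Illustrative Sketch: Validation and Storage

The folder check in step 4 of the User Flow and the SQLite storage in step 7 could be prototyped roughly as below. This is a minimal sketch, not a committed design: the function names `validate_patent_folder` and `store_patent`, the report keys, and the single-table layout (`patents` with `id` and `data` columns) are all assumptions made for illustration. Content checks and soft corrections are left out.

```python
import json
import sqlite3
from pathlib import Path

# Markdown file names taken from the folder contents listed in this spec.
REQUIRED_MD = (
    "Claims.md",
    "Description.md",
    "Bibliographic_Data.md",
    "Legal_Events.md",
    "Patent_Family.md",
)


def validate_patent_folder(folder: Path) -> dict:
    """Report missing and unexpected files for one patent folder.

    Hypothetical helper: the report layout is an assumption, not part of the spec.
    """
    patent_id = folder.name  # the folder is named after the patent ID
    expected = list(REQUIRED_MD) + [f"{patent_id}.pdf"]
    present = {p.name for p in folder.iterdir() if p.is_file()}
    report = {
        "id": patent_id,
        "missing": [name for name in expected if name not in present],
        "unexpected": sorted(present - set(expected)),
    }
    # Only missing files block ingestion here; extras are merely flagged.
    report["ok"] = not report["missing"]
    return report


def store_patent(conn: sqlite3.Connection, patent: dict) -> None:
    """Store one patent JSON object keyed by its id (assumed table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patents (id TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    # Re-ingesting the same patent overwrites the earlier row.
    conn.execute(
        "INSERT OR REPLACE INTO patents (id, data) VALUES (?, ?)",
        (patent["id"], json.dumps(patent)),
    )
    conn.commit()
```

A caller would run `validate_patent_folder(Path("US_2017250687_A1"))`, show the report on the summary screen, and call `store_patent` only after the user confirms. Storing the whole object as a JSON string keeps the sketch short; filtering by status or date would likely need dedicated columns or SQLite's JSON functions.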