# MVP Specification: Patent Ingestion and Structuring

## 🎯 Objective

Create a system that ingests user-curated folders of patent documents (markdown and PDF), validates structure, parses content, normalizes formatting, and produces a clean, structured JSON object per patent. These objects populate a database and enable searching, filtering, and LLM-based summaries.

---

## 🧭 User Flow

1. **User opens the interface** (desktop or web-based MVP).
2. **User clicks an "Add Patent Folder" button**, which launches a file/folder picker.
3. **User selects a folder** (named after a patent ID, e.g., `US_2017250687_A1/`) containing:
   - `Claims.md`
   - `Description.md`
   - `Bibliographic_Data.md`
   - `Legal_Events.md`
   - `Patent_Family.md`
   - PDF (named to match patent ID)
4. The system **validates folder structure**:
   - Verifies expected files exist
   - Reads files to check contents match type (e.g. `Claims.md` starts with a claim)
   - Attempts soft corrections (e.g. removes headers like "CLAIMS")
   - Flags problems (e.g. wrong file type, duplicate or missing files)
5. **Displays summary screen**:
   - Extracted metadata (title, author, filing date, etc.)
   - Status of each file (✅ or ⚠️)
   - Suggested corrections if applicable
6. **On confirmation**, the system generates a structured JSON representation.
7. The JSON object is stored in a local or cloud DB (e.g., SQLite for the MVP).
8. (Optional) LLM generates a plain-language summary block.
9. System refreshes and displays the newly ingested patent in the list view.
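The validation in step 4 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names (`validate_folder`, `strip_claims_header`), the report layout, and the "starts with claim 1" heuristic are all assumptions made for the example.

```python
from pathlib import Path
import re

# Expected markdown files, taken from the folder listing in this spec.
EXPECTED_MD = [
    "Claims.md",
    "Description.md",
    "Bibliographic_Data.md",
    "Legal_Events.md",
    "Patent_Family.md",
]


def strip_claims_header(text: str) -> str:
    """Soft correction: drop a leading 'CLAIMS' header line, if present."""
    return re.sub(r"\A\s*#*\s*CLAIMS\s*\n+", "", text, flags=re.IGNORECASE)


def validate_folder(folder: Path) -> dict:
    """Return a per-file status report for one patent folder (hypothetical layout)."""
    patent_id = folder.name
    report = {"patent_id": patent_id, "files": {}, "problems": []}

    # The PDF is expected to be named after the patent ID.
    for name in EXPECTED_MD + [f"{patent_id}.pdf"]:
        if (folder / name).is_file():
            report["files"][name] = "ok"
        else:
            report["files"][name] = "missing"
            report["problems"].append(f"missing file: {name}")

    claims = folder / "Claims.md"
    if claims.is_file():
        text = strip_claims_header(claims.read_text(encoding="utf-8"))
        # Heuristic (assumption): a claims file starts with a numbered claim 1.
        if not re.match(r"\s*1\s*[.)]", text):
            report["files"]["Claims.md"] = "warning"
            report["problems"].append("Claims.md does not start with claim 1")
    return report
```

Calling `validate_folder(Path("US_2017250687_A1"))` returns a dictionary whose `files` map could drive the ✅/⚠️ column of the summary screen, and whose `problems` list could feed the suggested corrections.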

---

## 📁 Expected Folder Format

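Each patent is stored in its own folder, named after the patent ID (e.g. `US_2017250687_A1/`), containing:

- `Claims.md`
- `Description.md`
- `Bibliographic_Data.md`
- `Legal_Events.md`
- `Patent_Family.md`
- `US_2017250687_A1.pdf` (PDF named to match the patent ID)

---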

## 🧾 `patent.json` Schema (V1)

```json
{
  "id": "US_2017250687_A1",
  "friendly_name": "Interactive Touch Book",
  "family_name": "US_2017250687_A1",
  "authors": ["Dr. Kate Stone"],
  "assignee": "Novalia Ltd.",
  "status": "Granted",
  "jurisdiction": "US",
  "filing_date": "2017-02-28",
  "priority_date": "2014-10-10",
  "publication_date": "2017-09-07",
  "grant_date": "2021-06-15",
  "application_number": "15/445,678",
  "publication_number": "US2017250687A1",
  "patent_number": "US11000000B2",

  "files": {
    "claims_md": "Claims.md",
    "description_md": "Description.md",
    "biblio_md": "Bibliographic_Data.md",
    "legal_events_md": "Legal_Events.md",
    "family_md": "Patent_Family.md",
    "pdf": "US_2017250687_A1.pdf"
  },

  "summary": {
    "plain_language": "This patent describes an interactive book...",
    "use_case": "Educational books, museum guides",
    "novelty": "Capacitive touch through overlaid pages",
    "component_breakdown": ["Touch sensors", "Pages", "Speakers", "MCU"],
    "illustrative_metaphor": "Like a talking book that responds to touch",
    "review_status": "pending"
  },

  "tags": [
    "printed electronics",
    "interactive media",
    "education",
    "touch sensor"
  ]
}
```

---

## 🧠 Backend Processing

- **Validation Module**
  - Confirms folder and file structure
  - Soft-fixes filenames and markdown structure
  - Flags missing or malformed data
- **Extraction Module**
  - Pulls metadata from `Bibliographic_Data.md` and others
  - Normalizes author/assignee names, dates, and IDs
- **Summarization Module** (optional)
  - Uses LLM to generate plain-language patent summaries
  - Prompt includes instructions for clarity, tone, novelty, etc.
- **Database Interface**
  - Stores validated JSON in searchable format (e.g. SQLite, TinyDB)
  - Links to physical files for browsing and future NLP ingestion
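
The normalization step of the Extraction Module might look something like the sketch below. The helper names, the list of accepted date formats, and the ID pattern are assumptions for illustration; the spec does not define which input variants must be supported.

```python
from datetime import datetime
from typing import Optional
import re

# Date formats to try, in order. This list is an assumption, not part of the spec.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d %B %Y", "%B %d, %Y", "%Y%m%d")

# Country code, serial number (possibly with separators), kind code such as A1 or B2.
ID_PATTERN = re.compile(r"^([A-Z]{2})[\s_-]*(\d[\d/,.\s]*\d)[\s_-]*([A-Z]\d?)$")


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date (YYYY-MM-DD), or None if no known format matches."""
    cleaned = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_patent_id(raw: str) -> Optional[dict]:
    """Derive the folder-style `id` and compact `publication_number` from one string."""
    match = ID_PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    country, number, kind = match.groups()
    digits = re.sub(r"\D", "", number)  # keep only the digits of the serial number
    return {
        "id": f"{country}_{digits}_{kind}",
        "publication_number": f"{country}{digits}{kind}",
    }


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace in an author or assignee name."""
    return " ".join(raw.split())
```

Returning `None` for unparseable input, rather than raising, lets the Validation Module flag the field as malformed and show it on the summary screen instead of aborting the ingest.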

---

## 🔍 Enables

- Full local control of patent data
- No reliance on external APIs
- Flexible downstream filtering (e.g. granted only, by date, by inventor)
- Visual interface for file correction and ingestion
- Foundation for semantic and LLM-powered features
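
As one possible illustration of local storage and downstream filtering, the sketch below keeps each `patent.json` object in SQLite using only the Python standard library. The table layout, column names, and the functions `open_db`, `store_patent`, and `find_patents` are hypothetical; the spec only names SQLite as a candidate store.

```python
import json
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the patents table. The schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS patents (
            id          TEXT PRIMARY KEY,
            status      TEXT,
            filing_date TEXT,          -- ISO 8601, so text comparison orders by date
            doc         TEXT NOT NULL  -- the full patent.json object as text
        )
        """
    )
    return conn


def store_patent(conn: sqlite3.Connection, patent: dict) -> None:
    """Insert a patent, replacing any earlier row with the same id."""
    conn.execute(
        "INSERT OR REPLACE INTO patents (id, status, filing_date, doc) "
        "VALUES (?, ?, ?, ?)",
        (patent["id"], patent.get("status"), patent.get("filing_date"),
         json.dumps(patent)),
    )
    conn.commit()


def find_patents(conn, status=None, filed_from=None, author=None) -> list:
    """Filter by status, earliest filing date (inclusive), and/or author name."""
    query = "SELECT doc FROM patents WHERE 1=1"
    params = []
    if status is not None:
        query += " AND status = ?"
        params.append(status)
    if filed_from is not None:
        query += " AND filing_date >= ?"
        params.append(filed_from)
    query += " ORDER BY filing_date"
    patents = [json.loads(doc) for (doc,) in conn.execute(query, params)]
    if author is not None:
        # Authors live inside the JSON document, so filter them in Python.
        patents = [p for p in patents if author in p.get("authors", [])]
    return patents
```

For example, `find_patents(conn, status="Granted", filed_from="2017-01-01")` would return granted patents filed on or after that date. Duplicating `status` and `filing_date` into their own columns keeps the common filters in plain SQL while the full object stays intact in `doc`.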

---

## 🔜 Future Enhancements

- API ingestion from Espacenet or other databases
- PDF-to-markdown converter to extract claims/description directly
- Timeline view for legal events and status
- Patent family linking
- Collaborative tagging and annotation

  * [[wiki:projects:patents:ingest|Patent Ingest]]
  * [[wiki:projects:patents:mvp1|MVP 1]]