# MVP Specification: Patent Ingestion and Structuring

## 🎯 Objective

Create a system that ingests user-curated folders of patent documents (markdown and PDF), validates structure, parses content, normalizes formatting, and produces a clean, structured JSON object per patent. These objects populate a database and enable searching, filtering, and LLM-based summaries.

---

## 🧭 User Flow

1. User opens the interface (desktop or web-based MVP).
2. User clicks the “Add Patent Folder” button, which launches a file/folder picker.
3. User selects a folder (named after a patent ID, e.g., `US_2017250687_A1/`) containing:
   - `Claims.md`
   - `Description.md`
   - `Bibliographic_Data.md`
   - `Legal_Events.md`
   - `Patent_Family.md`
   - A PDF named to match the patent ID
4. The system validates the folder structure:
   - Verifies that the expected files exist
   - Reads each file to check that its contents match its type (e.g. `Claims.md` starts with a claim)
   - Attempts soft corrections (e.g. removes headers like “CLAIMS”)
   - Flags problems (e.g. wrong file type, duplicate or missing files)
5. The system displays a summary screen:
   - Extracted metadata (title, author, filing date, etc.)
   - Status of each file (✅ or ⚠️)
   - Suggested corrections, if applicable
6. On confirmation, the system generates a structured JSON representation.
7. The JSON object is stored in a local or cloud DB (e.g., SQLite for MVP).
8. (Optional) An LLM generates a plain-language summary block.
9. The system refreshes and displays the newly ingested patent in the list view.
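
The validation in step 4 could be sketched roughly as follows. This is a minimal illustration, not the specified implementation: the function names `validate_patent_folder` and `strip_section_header` and the shape of the returned report are assumptions made for the example; only the file names come from the folder listing in step 3.

```python
import re
from pathlib import Path

# File names taken from the folder listing in step 3.
EXPECTED_MD = [
    "Claims.md",
    "Description.md",
    "Bibliographic_Data.md",
    "Legal_Events.md",
    "Patent_Family.md",
]


def strip_section_header(text: str, header: str) -> str:
    """Soft correction: drop a leading line such as 'CLAIMS' or '# Claims'."""
    pattern = rf"\A\s*#*\s*{re.escape(header)}\s*:?\s*\n"
    return re.sub(pattern, "", text, count=1, flags=re.IGNORECASE)


def validate_patent_folder(folder: Path) -> dict:
    """Return a per-file status report (hypothetical shape) for one patent folder."""
    report = {"patent_id": folder.name, "files": {}, "problems": []}
    # The PDF is expected to carry the same name as the folder (the patent ID).
    for name in EXPECTED_MD + [f"{folder.name}.pdf"]:
        present = (folder / name).is_file()
        report["files"][name] = "ok" if present else "missing"
        if not present:
            report["problems"].append(f"missing file: {name}")
    # Flag anything unexpected, e.g. duplicates such as 'Claims (1).md'.
    expected = set(report["files"])
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name not in expected:
            report["problems"].append(f"unexpected file: {path.name}")
    return report
```

The summary screen in step 5 could render `report["files"]` as the ✅/⚠️ column and `report["problems"]` as the list of suggested corrections. Content checks (for example, that `Claims.md` really starts with a claim) are left out of this sketch.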

---

## 📁 Expected Folder Format

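Each patent is stored in its own folder, named after the patent ID and containing the five markdown files and the matching PDF listed under User Flow.

---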

## 🧾 `patent.json` Schema (V1)

```json
{
  "id": "US_2017250687_A1",
  "friendly_name": "Interactive Touch Book",
  "family_name": "US_2017250687_A1",
  "authors": ["Dr. Kate Stone"],
  "assignee": "Novalia Ltd.",
  "status": "Granted",
  "jurisdiction": "US",
  "filing_date": "2017-02-28",
  "priority_date": "2014-10-10",
  "publication_date": "2017-09-07",
  "grant_date": "2021-06-15",
  "application_number": "15/445,678",
  "publication_number": "US2017250687A1",
  "patent_number": "US11000000B2",

  "files": {
    "claims_md": "Claims.md",
    "description_md": "Description.md",
    "biblio_md": "Bibliographic_Data.md",
    "legal_events_md": "Legal_Events.md",
    "family_md": "Patent_Family.md",
    "pdf": "US_2017250687_A1.pdf"
  },

  "summary": {
    "plain_language": "This patent describes an interactive book...",
    "use_case": "Educational books, museum guides",
    "novelty": "Capacitive touch through overlaid pages",
    "component_breakdown": ["Touch sensors", "Pages", "Speakers", "MCU"],
    "illustrative_metaphor": "Like a talking book that responds to touch",
    "review_status": "pending"
  },

  "tags": [
    "printed electronics",
    "interactive media",
    "education",
    "touch sensor"
  ]
}
```
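
A record could be checked against this schema before it is stored. The sketch below is illustrative only: the spec does not say which fields are mandatory, so `REQUIRED_KEYS` is an assumption, as is the function name `check_patent_record`. The `YYYY-MM-DD` date format is inferred from the example values above.

```python
from datetime import date

# Assumption: the spec does not list mandatory fields; this set is illustrative.
REQUIRED_KEYS = {
    "id", "friendly_name", "authors", "status",
    "jurisdiction", "filing_date", "files",
}
DATE_KEYS = ("filing_date", "priority_date", "publication_date", "grant_date")


def check_patent_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    for key in DATE_KEYS:
        value = record.get(key)
        if value is None:
            continue  # treated as optional here, e.g. grant_date before grant
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{key} is not an ISO date (YYYY-MM-DD): {value!r}")
    return problems
```

Returning a list of problems rather than raising on the first one fits the summary screen in the user flow, which shows all flagged issues at once.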

## 🧠 Backend Processing

- Validation Module
  - Confirms folder and file structure
  - Soft-fixes filenames and markdown structure
  - Flags missing or malformed data
- Extraction Module
  - Pulls metadata from `Bibliographic_Data.md` and others
  - Normalizes author/assignee names, dates, and IDs
- Summarization Module (optional)
  - Uses LLM to generate plain-language patent summaries
  - Prompt includes instructions for clarity, tone, novelty, etc.
- Database Interface
  - Stores validated JSON in searchable format (e.g. SQLite, TinyDB)
  - Links to physical files for browsing and future NLP ingestion
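
As one possible shape for the Database Interface, the sketch below stores each record in SQLite using Python's standard `sqlite3` module. The table name `patents`, the choice of indexed columns, and the function names are assumptions for illustration; the spec only says that validated JSON is stored in a searchable format with links to the physical files.

```python
import json
import sqlite3
from typing import Optional


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and ensure the hypothetical `patents` table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS patents (
               id TEXT PRIMARY KEY,
               status TEXT,
               filing_date TEXT,
               folder TEXT,   -- link back to the physical files
               record TEXT    -- the full patent.json document
           )"""
    )
    return conn


def store_patent(conn: sqlite3.Connection, record: dict, folder: str) -> None:
    """Insert or overwrite one validated record, keyed by patent ID."""
    conn.execute(
        "INSERT OR REPLACE INTO patents VALUES (?, ?, ?, ?, ?)",
        (record["id"], record.get("status"), record.get("filing_date"),
         folder, json.dumps(record)),
    )
    conn.commit()


def load_patent(conn: sqlite3.Connection, patent_id: str) -> Optional[dict]:
    """Return the stored record for `patent_id`, or None if it is not in the database."""
    row = conn.execute(
        "SELECT record FROM patents WHERE id = ?", (patent_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping the full JSON in one column and duplicating a few fields as plain columns is a simple compromise: the record round-trips unchanged, while `status` and `filing_date` remain available to ordinary SQL `WHERE` clauses.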

---

## 🔍 Enables

- Full local control of patent data
- No reliance on external APIs
- Flexible downstream filtering (e.g. granted only, by date, by inventor)
- Visual interface for file correction and ingestion
- Foundation for semantic and LLM-powered features
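
The downstream filtering mentioned in the list above could look something like this over a list of loaded records. The function name `filter_patents` and its parameters are hypothetical; the field names (`status`, `filing_date`, `authors`) are taken from the `patent.json` schema.

```python
from typing import Iterable, Optional


def filter_patents(
    records: Iterable[dict],
    status: Optional[str] = None,
    filed_after: Optional[str] = None,
    inventor: Optional[str] = None,
) -> list:
    """Keep the records that match every criterion that is not None."""
    result = []
    for rec in records:
        if status is not None and rec.get("status") != status:
            continue
        # ISO dates (YYYY-MM-DD) sort correctly when compared as plain strings.
        if filed_after is not None and rec.get("filing_date", "") <= filed_after:
            continue
        if inventor is not None and inventor not in rec.get("authors", []):
            continue
        result.append(rec)
    return result
```

For example, `filter_patents(records, status="Granted", filed_after="2016-12-31")` would keep only granted patents filed in 2017 or later.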

---

## 🔜 Future Enhancements

- API ingestion from Espacenet or other databases
- PDF-to-markdown converter to extract claims/description directly
- Timeline view for legal events and status
- Patent family linking
- Collaborative tagging and annotation