
# 📁 Patent Folder Ingest & Validation Module (MVP Part 1)

## 🎯 Purpose

To scan a local directory of patent folders, validate the file structure and content of each, and generate a structured JSON representation of each patent to be used for search, tagging, and future analysis.

---

## 🧩 What This Module Does

### 1. Folder Scanning

- Recursively scans a selected root directory.
- Identifies each patent folder by name format:

`[patent number] [patent name]`  
e.g. `US_10101845_B2 Touch Activated Wireless Poster`
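
The scan step could be sketched roughly as follows. The regular expression is an assumption inferred from the single example above (two-letter country code, digits, kind code, then a space and the name), and `parse_folder_name` and `find_patent_folders` are hypothetical helper names, not part of any existing code.

```python
import os
import re

# Assumed pattern, inferred from "US_10101845_B2 Touch Activated Wireless Poster":
# country code, underscore, digits, underscore, kind code, whitespace, then the name.
FOLDER_RE = re.compile(r"^(?P<number>[A-Z]{2}_\d+_[A-Z]\d?)\s+(?P<name>.+)$")


def parse_folder_name(folder_name):
    """Return (patent_number, patent_name), or None if the name does not match."""
    match = FOLDER_RE.match(folder_name)
    if match is None:
        return None
    return match.group("number"), match.group("name").strip()


def find_patent_folders(root):
    """Yield (path, patent_number, patent_name) for matching directories under root."""
    for dirpath, dirnames, _filenames in os.walk(root):
        for dirname in dirnames:
            parsed = parse_folder_name(dirname)
            if parsed is not None:
                yield os.path.join(dirpath, dirname), parsed[0], parsed[1]
```

Folders whose names do not match are simply skipped here; a fuller version would probably log them so that misnamed folders are not silently ignored.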

---

### 2. Expected Files Per Folder

Each folder may contain up to six files:

| File Name | Required? | Description |
|---|---|---|
| `Bibliographic_Data.md` | ✅ | Basic patent metadata |
| `Claims.md` | ✅ | List of patent claims |
| `Description.md` | ✅ | Technical description |
| `Legal_Events.md` | ⚠️ | Grant, expiry, and renewal data |
| `Patent_Family.md` | ⚠️ | List of related patents |
| `*.pdf` | ✅ | Original scanned patent (any `.pdf` in folder) |

- Missing files are logged.
- Optional files (marked ⚠️) improve downstream functionality but aren't required for basic ingestion.
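
A minimal sketch of the completeness check, assuming the ✅ rows are required and the ⚠️ rows are optional. The function name `check_files`, the result keys, and the `"*.pdf"` placeholder used to report a missing PDF are illustrative choices, not a fixed interface.

```python
REQUIRED_FILES = ("Bibliographic_Data.md", "Claims.md", "Description.md")
OPTIONAL_FILES = ("Legal_Events.md", "Patent_Family.md")


def check_files(filenames):
    """Compare a folder's file names against the expected set."""
    present = set(filenames)
    # Any file ending in .pdf satisfies the PDF requirement.
    has_pdf = any(name.lower().endswith(".pdf") for name in filenames)

    missing_required = [name for name in REQUIRED_FILES if name not in present]
    if not has_pdf:
        missing_required.append("*.pdf")  # placeholder label, an assumption
    missing_optional = [name for name in OPTIONAL_FILES if name not in present]

    return {
        "missing_required": missing_required,
        "missing_optional": missing_optional,
        "complete": not missing_required,
    }
```

In practice this would likely be called as `check_files(os.listdir(folder_path))` for each patent folder.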

---

### 3. File Content Verification

Without using LLMs (yet), we validate content using heuristics:

- `Bibliographic_Data.md`: Look for fields like Title, Inventor(s), Applicant, Filing Date, etc.
- `Claims.md`: Should contain numbered claims (e.g., `1.` or `Claim 1.`). If missing or poorly formatted, flag for review.
- `Description.md`: Should contain common patent section headers (e.g., `Background`, `Summary`, `Embodiments`).
- `Legal_Events.md`: Check for date+event pairings or relevant keywords (e.g., `Granted`, `Expired`, `Lapsed`, `Renewed`).
- `Patent_Family.md`: Should contain one or more patent numbers.
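
The heuristics above might be sketched with plain regular expressions, as below. Every pattern and threshold is an assumption to be tuned against real files: the ISO date format, the "at least two fields" rule, the patent-number shape, and every warning string except the `Claims.md` one (which matches the JSON example in the next section) are illustrative only, as is the name `check_content`.

```python
import re

CLAIM_RE = re.compile(r"^\s*(?:Claim\s+)?\d+\.\s", re.IGNORECASE | re.MULTILINE)
SECTION_RE = re.compile(r"\b(Background|Summary|Embodiments?)\b", re.IGNORECASE)
LEGAL_RE = re.compile(r"\b(Granted|Expired|Lapsed|Renewed)\b", re.IGNORECASE)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumes ISO dates
PATENT_NO_RE = re.compile(r"\b[A-Z]{2}[_ ]?\d{5,}[_ ]?[A-Z]?\d?\b")
BIB_FIELDS = ("title", "inventor", "applicant", "filing date")


def check_content(filename, text):
    """Return a warning string for the file, or None if it passes the heuristic."""
    if filename == "Bibliographic_Data.md":
        lowered = text.lower()
        found = [field for field in BIB_FIELDS if field in lowered]
        if len(found) < 2:  # threshold is an arbitrary assumption
            return "Few recognizable bibliographic fields"
    elif filename == "Claims.md":
        if not CLAIM_RE.search(text):
            return "Claims unnumbered or improperly formatted"
    elif filename == "Description.md":
        if not SECTION_RE.search(text):
            return "No common section headers found"
    elif filename == "Legal_Events.md":
        has_keyword = LEGAL_RE.search(text) is not None
        # A line with a date plus at least one other token counts as a pairing.
        has_dated_line = any(
            DATE_RE.search(line) and len(line.split()) > 1
            for line in text.splitlines()
        )
        if not (has_keyword or has_dated_line):
            return "No legal events recognized"
    elif filename == "Patent_Family.md":
        if not PATENT_NO_RE.search(text):
            return "No patent numbers found"
    return None
```

Returning a single warning per file keeps the result easy to drop into a per-file warnings map; files that fail are flagged for review rather than rejected.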

---

### 4. Canonical Output: JSON per Patent

Each valid folder is converted into a JSON object:

```json
{
  "patent_number": "US_10101845_B2",
  "patent_name": "Touch Activated Wireless Poster",
  "folder_path": "/patents/US_10101845_B2 Touch Activated Wireless Poster",
  "files_present": [
    "Bibliographic_Data.md",
    "Claims.md",
    "Description.md",
    "Patent_Family.md",
    "US_10101845_B2.pdf"
  ],
  "validation_flags": {
    "missing_files": ["Legal_Events.md"],
    "format_warnings": {
      "Claims.md": "Claims unnumbered or improperly formatted"
    }
  },
  "extracted_metadata": {
    "priority_date": "2014-06-27",
    "filing_date": "2015-06-25",
    "inventors": ["Kate Stone"],
    "assignee": "Novalia Ltd",
    "status": "Granted",
    "family_id": "GB201411234"
  }
}
```

This JSON can be held in memory, written to disk, or passed directly to a backend database or service layer.
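
One way to assemble and persist such an object is sketched below. The keys mirror the example above; the helper names `build_record` and `write_record`, and the choice of `<patent_number>.json` as the output file name, are assumptions.

```python
import json
import os


def build_record(folder_path, patent_number, patent_name, filenames,
                 missing_files, format_warnings, extracted_metadata=None):
    """Assemble the per-patent dictionary using the keys from the example."""
    return {
        "patent_number": patent_number,
        "patent_name": patent_name,
        "folder_path": folder_path,
        "files_present": sorted(filenames),
        "validation_flags": {
            "missing_files": list(missing_files),
            "format_warnings": dict(format_warnings),
        },
        "extracted_metadata": dict(extracted_metadata or {}),
    }


def write_record(record, out_dir):
    """Write one record to <out_dir>/<patent_number>.json and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, record["patent_number"] + ".json")
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, ensure_ascii=False)
    return out_path
```

Keeping record construction separate from writing means the same dictionary can be held in memory or handed to a service layer without touching the disk.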

---

### 5. Validation Report

A report is generated showing:

- Total patents scanned
- Valid patents vs invalid
- Missing or malformed files
- Common error types (e.g., inconsistent claim formatting)

This is output as a `.json` or `.log` file and optionally shown in the UI.
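
The aggregation could look something like the sketch below, which consumes per-patent records shaped like the JSON example in section 4. Treating a patent as invalid only when a required file is missing is an assumption, as are the report keys, the `"*.pdf"` label for a missing PDF, and the name `build_report`.

```python
from collections import Counter

# Assumed labels for required files; "*.pdf" stands for "no PDF found".
REQUIRED = {"Bibliographic_Data.md", "Claims.md", "Description.md", "*.pdf"}


def build_report(records):
    """Summarize a list of per-patent records into one report dictionary."""
    missing_counts = Counter()
    warning_counts = Counter()
    invalid = []

    for record in records:
        flags = record["validation_flags"]
        missing = flags.get("missing_files", [])
        missing_counts.update(missing)
        warning_counts.update(flags.get("format_warnings", {}).values())
        # Assumption: only a missing required file makes a patent invalid.
        if REQUIRED.intersection(missing):
            invalid.append(record["patent_number"])

    return {
        "total_scanned": len(records),
        "valid": len(records) - len(invalid),
        "invalid": len(invalid),
        "invalid_patents": invalid,
        "missing_file_counts": dict(missing_counts),
        "common_warnings": warning_counts.most_common(5),
    }
```

The resulting dictionary serializes directly with `json.dump` for the `.json` variant, or can be formatted line by line for a `.log` file.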

---

## 🏗️ Future Extensibility (Not part of MVP 1)

- LLM summarization of claims and descriptions
- Natural-language tagging and categorization
- Full patent family visualizations
- File repair / normalization tools
- Integration with Espacenet, USPTO, or Google Patents APIs

---

## 💻 Technical Summary

| Component | Approach |
|---|---|
| Language | Python 3.x |
| Modules | `os`, `re`, `json`, optional: `markdown` parser |
| UI | Start with CLI, later expand to web or desktop UI |
| Output format | One JSON object per folder, written to disk or passed downstream |
| Folder structure | Flat or nested, scanned recursively |
| Error handling | Graceful skips, per-folder logs, cumulative summary |
| Missing file tracking | Per patent + overall |
| Format flexibility | Handles various markdown quirks (e.g., carriage returns, lack of headers) |
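
As an illustration of the "format flexibility" row, a small normalization pass could run before any content heuristic. This is only a sketch of one possible approach; the function name `normalize_markdown` and the exact set of quirks handled (byte-order mark, carriage returns, trailing whitespace) are assumptions.

```python
def normalize_markdown(text):
    """Normalize line endings and stray whitespace before content checks."""
    # Drop a leading byte-order mark if one is present.
    text = text.lstrip("\ufeff")
    # Convert Windows (\r\n) and bare carriage-return line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing whitespace per line, then leading and trailing blank lines.
    lines = [line.rstrip() for line in text.split("\n")]
    return "\n".join(lines).strip() + "\n"
```

Running the heuristics on normalized text means a pattern anchored at the start of a line behaves the same regardless of how the file was exported.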

---

## 🧠 Summary

This foundational module ingests a manually curated set of patent folders, verifies that they are complete and well-structured, and produces standardized, reliable JSON files for each. These form the basis for downstream tools: search, summaries, tagging, analytics, and UI presentation.