# Patent Folder Ingest & Validation Module (MVP Part 1)
## Purpose
This module scans a local directory of patent folders, validates the file structure and content of each, and generates a structured JSON representation of each patent for use in search, tagging, and future analysis.
## What This Module Does
### 1. Folder Scanning
- Recursively scans a selected root directory.
- Identifies each patent folder by its name format, `[patent number] [patent name]`, e.g. `US_10101845_B2 Touch Activated Wireless Poster`.
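The scanning step could be sketched roughly as follows. The regular expression is an assumption inferred from the single example name (two-letter country code, digits, kind code, then a space and the title); the helper names `parse_folder_name` and `find_patent_folders` are hypothetical.

```python
import os
import re

# Assumed naming pattern, inferred from "US_10101845_B2 Touch Activated Wireless Poster".
FOLDER_RE = re.compile(r"^(?P<number>[A-Z]{2}_\d+_[A-Z]\d?)\s+(?P<name>.+)$")


def parse_folder_name(folder_name):
    """Return (patent_number, patent_name), or None if the name does not match."""
    match = FOLDER_RE.match(folder_name)
    if match is None:
        return None
    return match.group("number"), match.group("name").strip()


def find_patent_folders(root_dir):
    """Walk root_dir recursively and yield (path, number, name) for matching folders."""
    for current_dir, subdirs, _files in os.walk(root_dir):
        for subdir in sorted(subdirs):
            parsed = parse_folder_name(subdir)
            if parsed is not None:
                yield os.path.join(current_dir, subdir), parsed[0], parsed[1]
```

Folders whose names do not match are simply skipped here; a real implementation would probably log them so that misnamed patent folders are not silently lost.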
### 2. Expected Files Per Folder
Each folder may contain up to six files:

| File Name | Required? | Description |
| --- | --- | --- |
| `Bibliographic_Data.md` | Yes | Basic patent metadata |
| `Claims.md` | Yes | List of patent claims |
| `Description.md` | Yes | Technical description |
| `Legal_Events.md` | Optional | Grant, expiry, and renewal data |
| `Patent_Family.md` | Optional | List of related patents |
| `*.pdf` | Yes | Original scanned patent (any `.pdf` in folder) |

- Missing files are logged.
- Optional files improve downstream functionality but aren't required for basic ingestion.
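A minimal sketch of the presence check, assuming the required/optional split reads as in the table above. Treating the PDF as required is itself an assumption, so it is exposed as a `require_pdf` switch; the function names are hypothetical.

```python
import os

# Assumed split, read from the table of expected files.
REQUIRED_FILES = ("Bibliographic_Data.md", "Claims.md", "Description.md")
OPTIONAL_FILES = ("Legal_Events.md", "Patent_Family.md")


def check_expected_files(file_names, require_pdf=True):
    """Classify a folder's file names into PDFs, missing required, and missing optional."""
    present = set(file_names)
    pdf_files = sorted(name for name in present if name.lower().endswith(".pdf"))
    missing_required = [name for name in REQUIRED_FILES if name not in present]
    if require_pdf and not pdf_files:
        missing_required.append("*.pdf")
    missing_optional = [name for name in OPTIONAL_FILES if name not in present]
    return {
        "pdf_files": pdf_files,
        "missing_required": missing_required,
        "missing_optional": missing_optional,
    }


def check_folder(folder_path):
    """Convenience wrapper that reads the file names from disk."""
    return check_expected_files(os.listdir(folder_path))
```

Keeping `check_expected_files` free of filesystem access makes it easy to test with plain lists of names.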
### 3. File Content Verification
Without using LLMs (yet), we validate content using heuristics:
- `Bibliographic_Data.md`: Look for fields like Title, Inventor(s), Applicant, Filing Date, etc.
- `Claims.md`: Should contain numbered claims (e.g., `1.` or `Claim 1.`). If missing or poorly formatted, flag for review.
- `Description.md`: Should contain common patent section headers (e.g., `Background`, `Summary`, `Embodiments`).
- `Legal_Events.md`: Check for date+event pairings or relevant keywords (e.g., `Granted`, `Expired`, `Lapsed`, `Renewed`).
- `Patent_Family.md`: Should contain one or more patent numbers.
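These heuristics might look something like the sketch below. All patterns are assumptions chosen for illustration: dates are assumed to be ISO-style (`YYYY-MM-DD`), patent numbers are assumed to start with a two-letter country code, and the warning strings and function names are hypothetical.

```python
import re

# Matches "1." or "Claim 1." at the start of a line.
CLAIM_RE = re.compile(r"^\s*(?:Claim\s+)?\d+\s*\.", re.IGNORECASE | re.MULTILINE)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumed date format
PATENT_NUMBER_RE = re.compile(r"\b[A-Z]{2}[_ ]?\d{5,}[_ ]?[A-Z]?\d?\b")  # assumed number format
BIBLIO_FIELDS = ("title", "inventor", "applicant", "filing date")
DESCRIPTION_HEADERS = ("background", "summary", "embodiments")
LEGAL_KEYWORDS = ("granted", "expired", "lapsed", "renewed")


def check_bibliographic(text):
    missing = [field for field in BIBLIO_FIELDS if field not in text.lower()]
    return "Missing fields: " + ", ".join(missing) if missing else None


def check_claims(text):
    return None if CLAIM_RE.search(text) else "Claims unnumbered or improperly formatted"


def check_description(text):
    found = any(header in text.lower() for header in DESCRIPTION_HEADERS)
    return None if found else "No common section headers found"


def check_legal_events(text):
    has_keyword = any(keyword in text.lower() for keyword in LEGAL_KEYWORDS)
    return None if (has_keyword or DATE_RE.search(text)) else "No dates or legal event keywords found"


def check_patent_family(text):
    return None if PATENT_NUMBER_RE.search(text) else "No patent numbers found"


CHECKS = {
    "Bibliographic_Data.md": check_bibliographic,
    "Claims.md": check_claims,
    "Description.md": check_description,
    "Legal_Events.md": check_legal_events,
    "Patent_Family.md": check_patent_family,
}


def run_content_checks(file_texts):
    """Map {file_name: text} to {file_name: warning} for files that fail their check."""
    warnings = {}
    for file_name, text in file_texts.items():
        check = CHECKS.get(file_name)
        warning = check(text) if check else None
        if warning:
            warnings[file_name] = warning
    return warnings
```

Each checker returns `None` on success and a short warning string otherwise, which keeps the result directly usable as a per-file warning map. Substring and regex checks like these will produce false positives and negatives, so flagged files are candidates for review rather than definite errors.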
### 4. Canonical Output: JSON per Patent
Each valid folder is converted into a JSON object:
```json
{
  "patent_number": "US_10101845_B2",
  "patent_name": "Touch Activated Wireless Poster",
  "folder_path": "/patents/US_10101845_B2 Touch Activated Wireless Poster",
  "files_present": [
    "Bibliographic_Data.md",
    "Claims.md",
    "Description.md",
    "Patent_Family.md",
    "US_10101845_B2.pdf"
  ],
  "validation_flags": {
    "missing_files": ["Legal_Events.md"],
    "format_warnings": {
      "Claims.md": "Claims unnumbered or improperly formatted"
    }
  },
  "extracted_metadata": {
    "priority_date": "2014-06-27",
    "filing_date": "2015-06-25",
    "inventors": ["Kate Stone"],
    "assignee": "Novalia Ltd",
    "status": "Granted",
    "family_id": "GB201411234"
  }
}
```
This JSON object can be held in memory, written to disk, or passed directly to a backend database or service layer.
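As a rough illustration, a record with the shape shown above could be assembled and serialized like this. The function names and the one-file-per-patent write step are assumptions, not part of the specification.

```python
import json


def build_patent_record(patent_number, patent_name, folder_path,
                        files_present, missing_files, format_warnings,
                        extracted_metadata=None):
    """Assemble one patent's canonical record as a plain, JSON-serializable dict."""
    return {
        "patent_number": patent_number,
        "patent_name": patent_name,
        "folder_path": folder_path,
        "files_present": sorted(files_present),
        "validation_flags": {
            "missing_files": list(missing_files),
            "format_warnings": dict(format_warnings),
        },
        "extracted_metadata": dict(extracted_metadata or {}),
    }


def write_patent_record(record, output_path):
    """Write the record to disk as UTF-8 JSON (hypothetical output step)."""
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, ensure_ascii=False)
```

Because the record is a plain dict, the same object can be kept in memory, dumped with `json.dumps`, or handed to a database layer without conversion.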
### 5. Validation Report
A report is generated showing:
- Total patents scanned
- Valid vs. invalid patents
- Missing or malformed files
- Common error types (e.g., inconsistent claim formatting)

The report is written as a `.json` or `.log` file and optionally shown in the UI.
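One way to aggregate per-patent records into such a report is sketched below. The document does not define "valid", so this sketch assumes a patent is valid when it has no format warnings and none of its missing files is in an assumed required set; the report keys are likewise hypothetical.

```python
from collections import Counter

# Assumption: only these files make a patent invalid when missing.
ASSUMED_REQUIRED = frozenset({"Bibliographic_Data.md", "Claims.md", "Description.md"})


def build_validation_report(records, required_files=ASSUMED_REQUIRED):
    """Summarize a list of per-patent records into a cumulative report dict."""
    missing_counts = Counter()
    warning_counts = Counter()
    valid = 0
    for record in records:
        flags = record["validation_flags"]
        missing = flags.get("missing_files", [])
        warnings = flags.get("format_warnings", {})
        missing_counts.update(missing)
        warning_counts.update(warnings.values())
        # Valid here means: no warnings and no required file missing.
        if not warnings and not (set(missing) & set(required_files)):
            valid += 1
    return {
        "total_patents": len(records),
        "valid_patents": valid,
        "invalid_patents": len(records) - valid,
        "missing_file_counts": dict(missing_counts),
        "common_error_types": warning_counts.most_common(),
    }
```

The returned dict contains only lists, tuples, strings, and integers, so it can be passed to `json.dump` to produce the `.json` variant of the report.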
## Future Extensibility (Not part of MVP 1)
- LLM summarization of claims and descriptions
- Natural-language tagging and categorization
- Full patent family visualizations
- File repair / normalization tools
- Integration with Espacenet, USPTO, or Google Patents APIs
## Technical Summary

| Component | Approach |
| --- | --- |
| Language | Python 3.x |
| Modules | `os`, `re`, `json`; optionally a `markdown` parser |
| UI | Start with CLI, later expand to web or desktop UI |
| Output format | One JSON object per folder, written to disk or passed downstream |
| Folder structure | Flat or nested, scanned recursively |
| Error handling | Graceful skips, per-folder logs, cumulative summary |
| Missing file tracking | Per patent + overall |
| Format flexibility | Handles various markdown quirks (e.g., carriage returns, lack of headers) |
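For the format-flexibility row, a small normalization pass before running any content checks is one plausible approach. The sketch below only covers line endings, a leading byte-order mark, and trailing whitespace; the function name `normalize_markdown` is hypothetical.

```python
def normalize_markdown(text):
    """Normalize line endings and whitespace so later regex checks see consistent input."""
    text = text.lstrip("\ufeff")  # drop a leading byte-order mark, if any
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify carriage returns
    lines = [line.rstrip() for line in text.split("\n")]  # strip trailing whitespace
    return "\n".join(lines).strip("\n")
```

Running this once per file keeps the heuristics themselves simple, since they can then assume `\n` line endings.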
## Summary
This foundational module ingests a manually curated set of patent folders, verifies that they are complete and well-structured, and produces standardized, reliable JSON files for each. These form the basis for downstream tools: search, summaries, tagging, analytics, and UI presentation.
