# 📁 Patent Folder Ingest & Validation Module (MVP Part 1)

## 🎯 Purpose

To scan a local directory of patent folders, validate the file structure and content of each, and generate a structured JSON representation of each patent to be used for search, tagging, and future analysis.

---

## 🧩 What This Module Does

### 1. Folder Scanning

- Recursively scans a selected root directory.
- Identifies each patent folder by name format: `[patent number] [patent name]`
  e.g. `US_10101845_B2 Touch Activated Wireless Poster`

---

### 2. Expected Files Per Folder

Each folder may contain up to six files:

| File Name               | Required? | Description                                     |
|-------------------------|-----------|-------------------------------------------------|
| `Bibliographic_Data.md` | ✅        | Basic patent metadata                           |
| `Claims.md`             | ✅        | List of patent claims                           |
| `Description.md`        | ✅        | Technical description                           |
| `Legal_Events.md`       | ⚠️        | Grant, expiry, and renewal data                 |
| `Patent_Family.md`      | ⚠️        | List of related patents                         |
| `*.pdf`                 | ✅        | Original scanned patent (any `.pdf` in folder)  |

- Missing files are logged.
- Optional files (marked ⚠️) improve downstream functionality but aren't required for basic ingestion.

---

### 3. File Content Verification

Without using LLMs (yet), we validate content using heuristics:

- `Bibliographic_Data.md`: Look for fields like Title, Inventor(s), Applicant, Filing Date, etc.
- `Claims.md`: Should contain numbered claims (e.g., `1.` or `Claim 1.`). If missing or poorly formatted, flag for review.
- `Description.md`: Should contain common patent section headers (e.g., `Background`, `Summary`, `Embodiments`).
- `Legal_Events.md`: Check for date+event pairings or relevant keywords (e.g., `Granted`, `Expired`, `Lapsed`, `Renewed`).
- `Patent_Family.md`: Should contain one or more patent numbers.

---

### 4. Canonical Output: JSON per Patent

Each valid folder is converted into a JSON object:

```json
{
  "patent_number": "US_10101845_B2",
  "patent_name": "Touch Activated Wireless Poster",
  "folder_path": "/patents/US_10101845_B2 Touch Activated Wireless Poster",
  "files_present": [
    "Bibliographic_Data.md",
    "Claims.md",
    "Description.md",
    "Patent_Family.md",
    "US_10101845_B2.pdf"
  ],
  "validation_flags": {
    "missing_files": ["Legal_Events.md"],
    "format_warnings": {
      "Claims.md": "Claims unnumbered or improperly formatted"
    }
  },
  "extracted_metadata": {
    "priority_date": "2014-06-27",
    "filing_date": "2015-06-25",
    "inventors": ["Kate Stone"],
    "assignee": "Novalia Ltd",
    "status": "Granted",
    "family_id": "GB201411234"
  }
}
```

This JSON is kept in memory, written to disk, or passed directly to a backend database or service layer.

---

### 5. Validation Report

A report is generated showing:

- Total patents scanned
- Valid patents vs invalid
- Missing or malformed files
- Common error types (e.g., inconsistent claim formatting)

This is output as a `.json` or `.log` file and optionally shown in the UI.

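
A validation report like the one described above could be aggregated directly from the per-patent JSON objects. The sketch below is only an illustration: the function name `build_report`, the report keys, the `"*.pdf"` placeholder for a missing PDF, and the rule that a patent is "invalid" only when a required file is absent are all assumptions, not part of the spec.

```python
import json
from collections import Counter

# Assumption: mirrors the "Required?" column in section 2. "*.pdf" is a
# hypothetical placeholder meaning "no PDF was found in the folder".
REQUIRED_FILES = {"Bibliographic_Data.md", "Claims.md", "Description.md", "*.pdf"}


def build_report(records):
    """Aggregate per-patent records (canonical JSON shape) into one summary dict."""
    missing_counts = Counter()
    warning_counts = Counter()
    invalid = []

    for record in records:
        flags = record.get("validation_flags", {})
        missing = flags.get("missing_files", [])
        warnings = flags.get("format_warnings", {})

        missing_counts.update(missing)            # count each missing file name
        warning_counts.update(warnings.values())  # count each warning message

        # Assumption: only a missing *required* file makes a patent invalid.
        if REQUIRED_FILES.intersection(missing):
            invalid.append(record["patent_number"])

    return {
        "total_scanned": len(records),
        "valid": len(records) - len(invalid),
        "invalid": len(invalid),
        "invalid_patents": invalid,
        "missing_files": dict(missing_counts),
        "common_warnings": warning_counts.most_common(),
    }


if __name__ == "__main__":
    sample = [{
        "patent_number": "US_10101845_B2",
        "validation_flags": {
            "missing_files": ["Legal_Events.md"],
            "format_warnings": {
                "Claims.md": "Claims unnumbered or improperly formatted"
            },
        },
    }]
    print(json.dumps(build_report(sample), indent=2))
```

Keeping the report a plain dict means the same structure can be dumped to a `.json` file, formatted into a `.log`, or handed to a UI without conversion.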

---

## 🏗️ Future Extensibility (Not part of MVP 1)

- LLM summarization of claims and descriptions
- Natural-language tagging and categorization
- Full patent family visualizations
- File repair / normalization tools
- Integration with Espacenet, USPTO, or Google Patents APIs

---

## 💻 Technical Summary

| Component             | Approach                                                                   |
|-----------------------|----------------------------------------------------------------------------|
| Language              | Python 3.x                                                                 |
| Modules               | `os`, `re`, `json`, optional: `markdown` parser                            |
| UI                    | Start with CLI, later expand to web or desktop UI                          |
| Output format         | One JSON object per folder, written to disk or passed downstream           |
| Folder structure      | Flat or nested, scanned recursively                                        |
| Error handling        | Graceful skips, per-folder logs, cumulative summary                        |
| Missing file tracking | Per patent + overall                                                       |
| Format flexibility    | Handles various markdown quirks (e.g., carriage returns, lack of headers)  |

---

## 🧠 Summary

This foundational module ingests a manually curated set of patent folders, verifies that they are complete and well-structured, and produces standardized, reliable JSON files for each. These form the basis for downstream tools: search, summaries, tagging, analytics, and UI presentation.
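
To make the ingest path summarized above more concrete, here is a minimal sketch using only `os` and `re`. It covers folder-name parsing, missing-file detection, and the numbered-claims heuristic. The regular expressions, the function names (`parse_folder_name`, `has_numbered_claims`, `ingest_folder`, `scan`), and the `"*.pdf"` placeholder are assumptions inferred from the single example folder name, not a definitive implementation; metadata extraction is omitted.

```python
import os
import re

# Assumption: patent numbers look like "US_10101845_B2" (country code, digits,
# kind code), followed by whitespace and the patent name.
FOLDER_RE = re.compile(r"^(?P<number>[A-Z]{2}_\d+_[A-Z]\d?)\s+(?P<name>.+)$")
# Lines starting with "1." or "Claim 1." (the two styles named in section 3).
CLAIM_RE = re.compile(r"^\s*(?:Claim\s+)?\d+\.", re.MULTILINE | re.IGNORECASE)

REQUIRED_MD = ("Bibliographic_Data.md", "Claims.md", "Description.md")
OPTIONAL_MD = ("Legal_Events.md", "Patent_Family.md")


def parse_folder_name(name):
    """Return (patent_number, patent_name), or None if the name does not match."""
    match = FOLDER_RE.match(name)
    return (match.group("number"), match.group("name")) if match else None


def has_numbered_claims(text):
    """Heuristic: at least one line begins with a claim number."""
    return CLAIM_RE.search(text) is not None


def ingest_folder(path):
    """Build a partial record for one patent folder; None for other folders."""
    parsed = parse_folder_name(os.path.basename(path))
    if parsed is None:
        return None
    files = sorted(
        f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))
    )
    missing = [f for f in REQUIRED_MD + OPTIONAL_MD if f not in files]
    if not any(f.lower().endswith(".pdf") for f in files):
        missing.append("*.pdf")  # hypothetical marker for "no PDF present"

    warnings = {}
    if "Claims.md" in files:
        claims_path = os.path.join(path, "Claims.md")
        with open(claims_path, encoding="utf-8", errors="replace") as fh:
            if not has_numbered_claims(fh.read()):
                warnings["Claims.md"] = "Claims unnumbered or improperly formatted"

    return {
        "patent_number": parsed[0],
        "patent_name": parsed[1],
        "folder_path": path,
        "files_present": files,
        "validation_flags": {"missing_files": missing, "format_warnings": warnings},
    }


def scan(root):
    """Walk root recursively and collect a record for every patent folder."""
    records = []
    for dirpath, dirnames, _ in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in dirnames:
            record = ingest_folder(os.path.join(dirpath, name))
            if record is not None:
                records.append(record)
    return records
```

Folders whose names do not match the pattern are skipped rather than raising, in line with the "graceful skips" approach listed in the technical summary.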