Automating keyword audits for image archives helps keep visual data consistently tagged and easy to find. With Aspose.OCR for .NET, you can read the visible text in each image, validate it against a controlled keyword list, and report what is missing. This guide walks through the workflow in concrete steps that follow the accompanying gist, then adds optional improvements for reporting, scheduling, and maintenance.

Prerequisites

  • .NET 8 (or .NET 6+) SDK installed.
  • NuGet access to install Aspose.OCR.
  • A folder of images to audit (e.g., C:\Path\To\ImageArchive).
  • (Optional) An Aspose license file if you plan to exceed evaluation limits.

Create the project & add packages

dotnet new console -n ImageArchiveKeywordAudit -f net8.0
cd ImageArchiveKeywordAudit
dotnet add package Aspose.OCR

Step 1 — Prepare Your Keyword List

Decide the canonical keywords your images should contain. In the gist, keywords are hardcoded for simplicity:

// Exact shape used in the gist
List<string> keywords = new List<string>
{
    "mountains", "beaches", "forests", "landscape"
};

Tip (optional): Store keywords in keywords.txt (one per line) and load them into List<string> at runtime to avoid recompiles.
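The tip above could be sketched as follows. The class name KeywordLoader, the '#' comment convention, and the case-insensitive de-duplication are assumptions of this sketch and are not part of the gist.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class KeywordLoader
{
    // Reads one keyword per line, trimming whitespace and skipping blank lines.
    // Lines starting with '#' are treated as comments (an assumption of this sketch).
    public static List<string> Load(string path)
    {
        return File.ReadLines(path)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0 && !line.StartsWith("#", StringComparison.Ordinal))
            .Distinct(StringComparer.OrdinalIgnoreCase) // keeps the first spelling of each keyword
            .ToList();
    }
}
```

In Main, the hardcoded list would then become something like List<string> keywords = KeywordLoader.Load("keywords.txt");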


Step 2 — Initialize Aspose.OCR and Scan the Archive

Match the gist: create an OCR engine, enumerate images, OCR each file, and check for keyword presence.

using System;
using System.Collections.Generic;
using System.IO;
using Aspose.OCR; // C# namespaces are case-sensitive; the package namespace is Aspose.OCR

namespace ImageArchiveKeywordAudit
{
    class Program
    {
        static void Main(string[] args)
        {
            // Path to the image archive directory (edit to your folder)
            string imageDirectory = @"C:\Path\To\ImageArchive";

            // Keyword list for auditing (matches the gist approach)
            List<string> keywords = new List<string>
            {
                "mountains", "beaches", "forests", "landscape"
            };

            // Initialize Aspose.OCR API (license is optional)
            // new License().SetLicense("Aspose.Total.lic");
            using (AsposeOcr api = new AsposeOcr())
            {
                // Process each JPG in the directory (same filter style as the gist)
                foreach (string imagePath in Directory.GetFiles(imageDirectory, "*.jpg"))
                {
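                    // Extract text from the image. The call below follows the gist; if your
                    // Aspose.OCR version does not expose it, check the API reference for the
                    // equivalent recognition method. Fall back to an empty string on null.
                    string extractedText = api.RecognizeImageFile(imagePath) ?? string.Empty;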

                    // Audit the extracted text against the keyword list
                    bool containsKeywords = AuditText(extractedText, keywords);

                    // Output the results
                    Console.WriteLine($"Image: {imagePath} - Contains Keywords: {containsKeywords}");
                }
            }
        }

        // Returns true if the extracted text contains at least one keyword
        // (case-insensitive), as in the gist
        static bool AuditText(string text, List<string> keywords)
        {
            foreach (string keyword in keywords)
            {
                if (text.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                {
                    return true;
                }
            }
            return false;
        }
    }
}

Step 3 — Enhance Reporting and Filtering (Optional)

The gist prints only a boolean per image. You can extend the filtering and reporting while keeping the same OCR core.

3.a Filter Multiple Image Types

// Replace the single GetFiles call with this multi-pattern approach,
// then iterate over imageFiles in the foreach loop
string[] patterns = new[] { "*.jpg", "*.jpeg", "*.png", "*.tif", "*.tiff", "*.bmp" };
var imageFiles = new List<string>();
foreach (var pattern in patterns)
    imageFiles.AddRange(Directory.GetFiles(imageDirectory, pattern, SearchOption.TopDirectoryOnly));
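As an alternative to one GetFiles call per pattern, the directory can be enumerated once and filtered by extension. This is a minimal sketch, assuming a hypothetical helper named ImageFileFinder; it compares extensions case-insensitively, which matters on case-sensitive file systems where "*.jpg" would not match PHOTO.JPG.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ImageFileFinder
{
    // Extensions to audit; compared case-insensitively.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp" };

    // Enumerates the top-level directory once and keeps only image files,
    // sorted by path so that reports come out in a stable order.
    public static List<string> Find(string directory)
    {
        return Directory.EnumerateFiles(directory, "*", SearchOption.TopDirectoryOnly)
            .Where(path => Extensions.Contains(Path.GetExtension(path)))
            .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

The loop in Main would then read foreach (string imagePath in ImageFileFinder.Find(imageDirectory)).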

3.b Capture Which Keywords Matched / Missed

// Inside the foreach loop, after the OCR call:
var matched = new List<string>();
var missing = new List<string>();

foreach (var k in keywords)
    (extractedText.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0 ? matched : missing).Add(k);

Console.WriteLine($"Image: {Path.GetFileName(imagePath)} | Matched: [{string.Join(", ", matched)}] | Missing: [{string.Join(", ", missing)}]");

3.c Write a CSV Report

string reportPath = Path.Combine(imageDirectory, "audit-report.csv");
bool writeHeader = !File.Exists(reportPath);

using (var sw = new StreamWriter(reportPath, append: true))
{
    if (writeHeader)
        sw.WriteLine("Image,ContainsKeywords,Matched,Missing");

    sw.WriteLine($"\"{Path.GetFileName(imagePath)}\",{matched.Count > 0},\"{string.Join(";", matched)}\",\"{string.Join(";", missing)}\"");
}
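The row above quotes fields by hand, which breaks if a value ever contains a double quote. A small escaping helper is one way to make the output more robust. This is a sketch; the Csv class and its method names are hypothetical and do not come from the gist or from Aspose.OCR.

```csharp
using System;
using System.Linq;

static class Csv
{
    // Quotes a field only when needed and doubles any embedded quotes,
    // following the usual CSV convention.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        string escaped = field.Replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins already-stringified fields into one CSV line.
    public static string Row(params string[] fields)
    {
        return string.Join(",", fields.Select(Escape));
    }
}
```

With this helper, the report line could be written as sw.WriteLine(Csv.Row(Path.GetFileName(imagePath), (matched.Count > 0).ToString(), string.Join(";", matched), string.Join(";", missing)));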

Step 4 — Run from PowerShell or Batch

Create a simple PowerShell runner run-audit.ps1:

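# Adjust paths as needed
$solutionRoot = "C:\Path\To\ImageArchiveKeywordAudit"
$imageDir     = "C:\Path\To\ImageArchive"   # only needed if the program accepts a path argument

# Build, then run only if the build succeeded
dotnet build "$solutionRoot" -c Release
if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
& "$solutionRoot\bin\Release\net8.0\ImageArchiveKeywordAudit.exe"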

Optional: If you modify the program to accept arguments, run it as: ImageArchiveKeywordAudit.exe "C:\Images" "C:\keywords.txt"
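One way to accept those two arguments is sketched below. The AuditOptions class, the argument order (image folder first, keyword file second), and the fallback behavior are assumptions of this sketch, since the gist does not read command-line arguments.

```csharp
using System;

sealed class AuditOptions
{
    public string ImageDirectory { get; }

    // Empty string means "no keyword file given; use the built-in list".
    public string KeywordFile { get; }

    private AuditOptions(string imageDirectory, string keywordFile)
    {
        ImageDirectory = imageDirectory;
        KeywordFile = keywordFile;
    }

    // args[0] = image folder (optional), args[1] = keyword file (optional).
    public static AuditOptions Parse(string[] args, string defaultImageDirectory)
    {
        string directory = args.Length > 0 ? args[0] : defaultImageDirectory;
        string keywordFile = args.Length > 1 ? args[1] : "";
        return new AuditOptions(directory, keywordFile);
    }
}
```

In Main, var options = AuditOptions.Parse(args, @"C:\Path\To\ImageArchive"); would replace the hardcoded imageDirectory, and the keyword file would be loaded only when options.KeywordFile is not empty.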


Step 5 — Schedule Recurring Audits (Windows Task Scheduler)

Use schtasks to run the audit daily at 02:00:

schtasks /Create /TN "ImageKeywordAudit" /TR "\"C:\Path\To\ImageArchiveKeywordAudit\bin\Release\net8.0\ImageArchiveKeywordAudit.exe\"" /SC DAILY /ST 02:00

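To log output, wrap the command in a .cmd file that redirects stdout and stderr: ImageArchiveKeywordAudit.exe >> C:\Path\To\Logs\audit.log 2>&1. Avoid putting %DATE% directly in the file name: its format depends on the system locale and often contains characters such as / that are invalid in Windows file names, so sanitize it first if you want one log per day.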


Best Practices

  • Keep a canonical keyword source. Store your list in Git or a CMDB; review quarterly.
  • Normalize OCR text. Trim whitespace, unify hyphens and Unicode look-alikes before matching.
  • Tune performance. Batch by folders; add parallelism only after measuring I/O and CPU.
  • Quality in, quality out. Clean scans (deskew/denoise) markedly improve match rates.
  • Audit scope. Consider separate keyword sets per collection (e.g., “landscape”, “product”, “forms”).
  • Traceability. Keep CSV reports with timestamps for change history and quick diffing.
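The "Normalize OCR text" practice above can be sketched as a small helper applied to the extracted text before matching. The OcrTextNormalizer name and the exact set of dash characters are assumptions of this sketch; adjust them to the look-alikes that actually appear in your scans.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class OcrTextNormalizer
{
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";

        // NFKC folds compatibility characters such as ligatures and
        // full-width letters into their plain equivalents.
        string s = text.Normalize(NormalizationForm.FormKC);

        // Map common Unicode dash variants to an ASCII hyphen.
        s = Regex.Replace(s, "[\u2010\u2011\u2012\u2013\u2014\u2212]", "-");

        // Collapse runs of whitespace (including line breaks) into one space.
        s = Regex.Replace(s, @"\s+", " ");

        return s.Trim();
    }
}
```

Calling OcrTextNormalizer.Normalize(extractedText) before AuditText keeps the matching logic unchanged while reducing misses caused by line breaks or typographic dashes.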

Troubleshooting

  • Empty OCR output: Verify image orientation and contrast; try another format (*.png, *.tif).
  • False negatives: Add plural/stem variants or synonyms to your list (e.g., “beach”, “beaches”).
  • Throughput issues: Limit concurrent runs; avoid scanning network shares over slow links.
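For the false-negative case above, one option is to keep a small table of variants per canonical keyword and count a keyword as present when any variant appears. This is a sketch only; the KeywordVariants class and the example variants are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class KeywordVariants
{
    // Returns true if the text contains the keyword itself or any of its
    // registered variants (case-insensitive substring match).
    public static bool Matches(
        string text,
        string keyword,
        IReadOnlyDictionary<string, string[]> variants)
    {
        if (text.Contains(keyword, StringComparison.OrdinalIgnoreCase))
            return true;

        return variants.TryGetValue(keyword, out string[] alternatives)
            && alternatives.Any(a => text.Contains(a, StringComparison.OrdinalIgnoreCase));
    }
}
```

A variants table might look like new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) { ["beaches"] = new[] { "beach", "shore" } }, so that text mentioning only "beach" still satisfies the "beaches" keyword.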
