---
draft: false
title: "Validating YAML frontmatter with JSONSchema"
aliases: ["Validating YAML frontmatter with JSONSchema"]
series: []
date: "2023-06-01"
author: "Nick Dumas"
cover: ""
keywords: ["", ""]
summary: "As a collection of Markdown documents grows organically, maintaining consistency is important. JSONSchema offers a way to automatically ensure frontmatter stays up to spec."
showFullContent: false
tags:
- yaml
- jsonschema
- golang
- obsidian
---
## Consistency is hard
In my time using Obsidian, I've independently authored around 400 notes, and I've kept a relatively consistent schema for my tags and frontmatter attributes along the way:
```markdown
---
publish: false
summary: ""
aliases: []
title: ""
source: []
tags:
- Status/New
---
```
Getting too deep into what all of these mean is outside the scope of this post. For now, it's enough to know that for any Obsidian note, these properties must be present in order for my pipelines to do their job.
## Manually Managed Metadata
Until now, I managed my note frontmatter by hand, or with `sed`/`grep`. I've got a bit of experience using these tools to manipulate text files, so it's been relatively comfortable but extremely manual.
## Configuration Drift
The problem is that over time, humans get sloppy, forget things, or decide to do things differently. In practice, this doesn't affect how I use my vault in Obsidian; I access most of my notes via the Quick Switcher, so filenames and aliases are the things I really focus on.
A place where consistency does matter is when you're automating tasks. Tools that work with Markdown like static site generators care a lot about frontmatter metadata.
For these tools to work the way I expect and need them to, I need to **guarantee** that my notes are configured correctly.
## What are the options?
This is a project I've been meditating on for a long time. The specific problem I had is that most Markdown frontmatter is YAML. I'd done some cursory searching and come up with no satisfying results for a "YAML schema engine", something to formally validate the structure and content of a YAML document.
I was a fool. For years I'd known that YAML was a superset of JSON, and I'd assumed that the superset part meant that no tool expecting JSON could ever be guaranteed to work on YAML, which isn't acceptable for automation.
The detail that matters is that only the *syntax* is a superset of JSON. The underlying data types (null, boolean, number, string, array, and object) still map onto JSON one to one, at least for the plain scalars, lists, and mappings that frontmatter uses. With that revelation, my work could finally begin.
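To make the shared data model concrete, here is a small sketch that uses only Go's standard library. It decodes a JSON rendering of the note frontmatter from the first section into an `interface{}` and names the JSON type of each value. `jsonType` and the sample document are illustrative, not part of any library. As far as I can tell, `yaml.v3` decodes the equivalent YAML into the same shapes (with whole numbers arriving as `int` rather than `float64`), which is why a JSON Schema validator can consume the result.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// jsonType names the JSON data type of a value decoded into interface{}.
// Illustrative helper; not part of encoding/json or any schema library.
func jsonType(v interface{}) string {
	switch v.(type) {
	case nil:
		return "null"
	case bool:
		return "boolean"
	case float64, int, int64:
		return "number"
	case string:
		return "string"
	case []interface{}:
		return "array"
	case map[string]interface{}:
		return "object"
	default:
		return "unknown"
	}
}

func main() {
	// The note frontmatter, written as JSON instead of YAML.
	doc := `{"publish": false, "summary": "", "aliases": [], "title": "", "source": [], "tags": ["Status/New"]}`

	var m interface{}
	if err := json.Unmarshal([]byte(doc), &m); err != nil {
		panic(err)
	}

	// Sort the keys so the listing is stable between runs.
	fields := m.(map[string]interface{})
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, jsonType(fields[k]))
	}
}
```

Every frontmatter value lands in one of the six JSON types, so nothing is lost by treating the decoded YAML as if it had come from a JSON document.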
## golang and jsonschema
My implementation language of choice is Go, naturally. Speed, type-safety, and cross-compilation all make for a great pipeline.
```go
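import (
	"fmt"
	"io"

	"github.com/santhosh-tekuri/jsonschema/v5"
	// Registers the HTTP(S) loader so schema URLs can be fetched remotely.
	_ "github.com/santhosh-tekuri/jsonschema/v5/httploader"
	"gopkg.in/yaml.v3"
)

// Validate decodes a YAML document from r and checks it against the
// JSON Schema found at schemaURL.
func Validate(schemaURL string, r io.Reader) error {
	// Decode into a generic value; the validator works on plain Go
	// maps, slices, and scalars.
	var m interface{}
	dec := yaml.NewDecoder(r)
	err := dec.Decode(&m)
	if err != nil {
		return fmt.Errorf("error decoding YAML: %w", err)
	}

	// Fetch and compile the schema.
	compiler := jsonschema.NewCompiler()
	schema, err := compiler.Compile(schemaURL)
	if err != nil {
		return fmt.Errorf("error compiling schema: %w", err)
	}

	// Validate the decoded document against the compiled schema.
	if err := schema.Validate(m); err != nil {
		return fmt.Errorf("error validating target: %w", err)
	}
	return nil
}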
```
`Validate()` is basically all you need in terms of Go code. The [full code repo](https://code.ndumas.com/ndumas/obsidian-pipeline) has a bit more complexity because I'm wiring things through Cobra and stuff, but here's some sample output:
```
go run cmd/obp/*.go validate -s https://schemas.ndumas.com/obsidian/note.schema.json -t Resources/blog/published/
2023/06/01 10:31:27 scanning "mapping-aardwolf.md"
2023/06/01 10:31:27 scanning "schema-bad.md"
2023/06/01 10:31:27 validation error: &fmt.wrapError{msg:"error validating target: jsonschema: '' does not validate with https://schemas.ndumas.com/obsidian/note.schema.json#/required: missing properties: 'title', 'summary', 'tags'", err:(*jsonschema.ValidationError)(0xc0000b3740)}
2023/06/01 10:31:27 error count for "schema-bad.md": 1
2023/06/01 10:31:27 scanning "schema-good.md"
```
You get a relatively detailed summary of why validation failed and a non-zero exit code, exactly what you need to prevent malformed data from entering your pipeline.
### how to schema library?
You might notice that when I specify a schema, it's hosted at `schemas.ndumas.com`. [Here](https://code.ndumas.com/ndumas/json-schemas) you can find the repository powering that domain.
It's pretty simple, just a handful of folders and the following Drone pipeline:
```yaml
kind: pipeline
name: publish-schemas
clone:
  depth: 1
steps:
- name: publish
  image: drillster/drone-rsync
  settings:
    key:
      from_secret: BLOG_DEPLOY_KEY
    user: blog
    port: 22
    delete: true
    recursive: true
    hosts: ["schemas.ndumas.com"]
    source: /drone/src/
    target: /var/www/schemas.ndumas.com/
    include: ["*.schema.json"]
    exclude: ["**.*"]
```
and this Caddy configuration block:
```caddy
schemas.ndumas.com {
	encode gzip
	file_server {
		browse
	}
	root * /var/www/schemas.ndumas.com
}
```
Feel free to browse around the [schema site](https://schemas.ndumas.com).
## Success Story???
At time of writing, I haven't folded this into any pipelines. This code is basically my proof of concept for just one small part of a larger rewrite of my pipeline.
### Future Use Cases
The one use-case that seemed really relevant was for users of the Breadcrumbs plugin. That one uses YAML metadata extensively to create complex hierarchies and relationships. Perfect candidate for a schema validation tool.