---
draft: false
title: "Validating YAML frontmatter with JSONSchema"
aliases: ["Validating YAML frontmatter with JSONSchema"]
series: []
date: "2023-06-01"
author: "Nick Dumas"
cover: ""
keywords: ["", ""]
summary: "As a collection of Markdown documents grows organically, maintaining consistency is important. JSONSchema offers a way to automatically ensure frontmatter stays up to spec."
showFullContent: false
tags:
  - yaml
  - jsonschema
  - golang
  - obsidian
---

## Consistency is hard

Over my time using Obsidian, I've independently authored around 400 notes, and along the way I've settled on a relatively consistent schema for my tags and frontmatter attributes:

```markdown
---
publish: false
summary: ""
aliases: []
title: ""
source: []
tags:
  - Status/New
---
```

Getting too deep into what all of these mean is outside the scope of this post. For now, it's enough to know that for any Obsidian note, these properties must be present in order for my pipelines to do their job.

## Manually Managed Metadata

Until now, I've managed my note frontmatter by hand, or with `sed`/`grep`. I've got a bit of experience using these tools to manipulate text files, so it's been relatively comfortable but extremely manual.

## Configuration Drift

The problem is that over time, humans get sloppy, forget things, and decide to do things differently. In practice, this doesn't impact the usage of my vault in Obsidian; I access most of my notes via the Quick Switcher, so filenames and aliases are the things I really focus on.

A place where consistency does matter is automation. Tools that work with Markdown, like static site generators, care a lot about frontmatter metadata. For these tools to work the way I expect and need them to, I need to **guarantee** that my notes are configured correctly.

## What are the options?

This is a project I've been meditating on for a long time. The specific problem I had is that most Markdown frontmatter is YAML. I'd done cursory searching and come up with no satisfying results for a "YAML schema engine", something to formally validate the structure and content of a YAML document.

I was a fool. For years I'd known that YAML was a superset of JSON, and I'd assumed that meant no tool expecting JSON could ever be guaranteed to work on YAML, which isn't acceptable for automation. The detail that matters is that only the *syntax* is a superset of JSON. The underlying data types (null, bool, integer, string, array, and object) still map onto JSON one to one.

With that revelation, my work could finally begin.

## golang and jsonschema

My implementation language of choice is Go, naturally. Speed, type safety, and cross-compilation all make for a great pipeline tool.

```go
import (
	"fmt"
	"io"

	"github.com/santhosh-tekuri/jsonschema/v5"
	// blank import registers a loader so schemas can be fetched over http/https
	_ "github.com/santhosh-tekuri/jsonschema/v5/httploader"
	"gopkg.in/yaml.v3"
)

func Validate(schemaURL string, r io.Reader) error {
	// Decode the YAML into generic Go values (maps, slices, scalars).
	var m interface{}
	dec := yaml.NewDecoder(r)
	err := dec.Decode(&m)
	if err != nil {
		return fmt.Errorf("error decoding YAML: %w", err)
	}

	// Fetch and compile the schema from its URL.
	compiler := jsonschema.NewCompiler()
	schema, err := compiler.Compile(schemaURL)
	if err != nil {
		return fmt.Errorf("error compiling schema: %w", err)
	}

	// Because YAML's data types map onto JSON's, the decoded document can be
	// validated directly against the JSON Schema.
	if err := schema.Validate(m); err != nil {
		return fmt.Errorf("error validating target: %w", err)
	}

	return nil
}
```

`Validate()` is basically all you need in terms of Go code.
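To show roughly how that function gets used, here's a minimal sketch that walks a directory of notes and validates each Markdown file. The directory path is just an example, and it assumes `Validate()` from above lives in the same package; the real CLI does more than this.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	schemaURL := "https://schemas.ndumas.com/obsidian/note.schema.json"
	notesDir := "Resources/blog/published/" // hypothetical notes directory

	err := filepath.WalkDir(notesDir, func(path string, d os.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() || !strings.HasSuffix(path, ".md") {
			return walkErr
		}

		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()

		log.Printf("scanning %q", d.Name())
		// This sketch hands Validate the whole file and relies on the
		// frontmatter being the first YAML document in it; a real tool might
		// extract the frontmatter block explicitly before decoding.
		if err := Validate(schemaURL, f); err != nil {
			log.Printf("validation error: %v", err)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```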
The [full code repo](https://code.ndumas.com/ndumas/obsidian-pipeline) has a bit more complexity because I'm wiring things through Cobra, but here's some sample output:

```
go run cmd/obp/*.go validate -s https://schemas.ndumas.com/obsidian/note.schema.json -t Resources/blog/published/
2023/06/01 10:31:27 scanning "mapping-aardwolf.md"
2023/06/01 10:31:27 scanning "schema-bad.md"
2023/06/01 10:31:27 validation error: &fmt.wrapError{msg:"error validating target: jsonschema: '' does not validate with https://schemas.ndumas.com/obsidian/note.schema.json#/required: missing properties: 'title', 'summary', 'tags'", err:(*jsonschema.ValidationError)(0xc0000b3740)}
2023/06/01 10:31:27 error count for "schema-bad.md": 1
2023/06/01 10:31:27 scanning "schema-good.md"
```

You get a relatively detailed summary of why validation failed and a non-zero exit code, exactly what you need to prevent malformed data from entering your pipeline.

### how to schema library?

You might notice that when I specify a schema, it's hosted at `schemas.ndumas.com`. [Here](https://code.ndumas.com/ndumas/json-schemas) you can find the repository powering that domain. It's pretty simple: just a handful of folders and the following Drone pipeline:

```yaml
kind: pipeline
name: publish-schemas

clone:
  depth: 1

steps:
  - name: publish
    image: drillster/drone-rsync
    settings:
      key:
        from_secret: BLOG_DEPLOY_KEY
      user: blog
      port: 22
      delete: true
      recursive: true
      hosts: ["schemas.ndumas.com"]
      source: /drone/src/
      target: /var/www/schemas.ndumas.com/
      include: ["*.schema.json"]
      exclude: ["**.*"]
```

and this Caddy configuration block:

```caddy
schemas.ndumas.com {
	encode gzip

	file_server {
		browse
	}

	root * /var/www/schemas.ndumas.com
}
```

Feel free to browse around the [schema site](https://schemas.ndumas.com).

## Success Story???

At the time of writing, I haven't folded this into any pipelines. This code is basically a proof of concept for one small part of a larger rewrite of my pipeline.

### Future Use Cases

The one use case that seemed really relevant was for users of the Breadcrumbs plugin, which uses YAML metadata extensively to create complex hierarchies and relationships. That makes it a perfect candidate for a schema validation tool.
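As a closing illustration, here's a rough sketch of what a note schema along the lines of the `note.schema.json` referenced above could look like. It's assembled from the frontmatter shown at the top of this post and the required fields visible in the sample output, not copied from the schema actually hosted at `schemas.ndumas.com`:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "publish": { "type": "boolean" },
    "title":   { "type": "string", "minLength": 1 },
    "summary": { "type": "string" },
    "aliases": { "type": "array", "items": { "type": "string" } },
    "source":  { "type": "array", "items": { "type": "string" } },
    "tags":    { "type": "array", "items": { "type": "string" } }
  },
  "required": ["title", "summary", "tags"]
}
```

A plugin-specific schema, say one describing whatever fields Breadcrumbs expects, would follow the same pattern.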