
Structured Data Linter Project


project to enhance

Structured Data Linter

by Ankita Dhandha

Presentation Notes

 

This is a living document. “Lorem ipsum” is placeholder for future content.

Open links on buttons & in footers.

See footer ↓ for page detail.

Table of Contents is upper right – navigate from there ⭜.

Right‐side content previews target content & usually is scrollable.

Presentation works on 💻 & 📱.


Wikipedia about Living Documents

SDL Announcements

 

       

 

 

SDL Agenda and Product Plan

 

Release–0.3.9:  SDL 3.9 baseline from Gregg Kellogg

Release–1.0:  SDL 3.9 running under AWS/Lambda

Release–2.0:  SDL with new client‐side features and UX

Release–3.0:  SDL running under AWS/EC2

Release–4.0:  SDL with new server‐side support for SHACL/ShEx

Release–5.0:  SDL with new server‐side support for ontologies

Release 1.0 Design Goals

 

Implement SDL on AWS/Lambda

Learn Lambda application limits to configure SDL as a Lambda application

Learn SDL/Lambda processing limits to determine graph size and complexity for SDL analysis

Test SDL/Lambda jobs that are too large/complex to run on SDL/Heroku (Gregg’s native implementation)

Begin to prepare SDL customers for two platforms: one for simple jobs on AWS/Lambda; one for complex jobs (future release)

Example 1-21

 

Define a simple JSON–LD @Graph

Test simple @Graph on Schema.org Markup Validator (SMV)

Use A/B Testing: compare SMV and SDL reports using identical @Graph (SMV/Graph sameAs SDL/Graph) (≡)

On SMV report, click an @Type (PublicationIssue and/or ScholarlyArticle) to see report detail

Click to execute Schema Markup Validator
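
To make the "simple @Graph" concrete, here is a minimal sketch in Python of a two-node JSON–LD document using the two types named above. The @id values and property values are hypothetical placeholders, not the actual test data behind the button.

```python
import json

# Hypothetical identifiers and values, for illustration only.
simple_graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@id": "https://example.org/issue/1",
            "@type": "PublicationIssue",
            "issueNumber": "1",
        },
        {
            "@id": "https://example.org/article/1",
            "@type": "ScholarlyArticle",
            "name": "Sample article",
            "isPartOf": {"@id": "https://example.org/issue/1"},
        },
    ],
}


def graph_types(doc):
    """Return the sorted @type values of the top-level nodes in @graph."""
    return sorted({node["@type"] for node in doc.get("@graph", []) if "@type" in node})


print(json.dumps(simple_graph, indent=2))
print(graph_types(simple_graph))
```

The same serialized document can be pasted into both SMV and SDL, which is what makes the A/B comparison meaningful.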

Example 1-22

 

begin A/B testing

JSON–LD test on AWS/Lambda running SDL

SDL “search results preview” is same information embedded in SMV analysis but served in human-readable format

On SDL report, scroll down to see JSON–LD graph processing and analysis

Click to process the JSON–LD graph on AWS/Lambda

Example 1-31

 

continuing A/B testing

JSON–LD graph defines Natural Languages used to present content to readers (and intelligent devices) in language of their choice

@Language graph is more complex than previous A/B test

Select SMV report about @Types [e.g. Class (33 items) and/or DefinedTerm (3 items)] for detail

Click to execute Schema Markup Validator

Example 1-32

 

continuing A/B testing

@Language graph on AWS/Lambda

SDL “search results preview” is same information in SMV but served in human‐readable format

Scroll down to see JSON–LD graph processing and analysis

When processed on AWS/Lambda, SDL generates report about 2,204 “triples” defined in @Language graph

Click to execute AWS/Lambda report (be patient …)

Example 1-41

 

continuing A/B testing

This JSON–LD graph is the Ontomatica Knowledge Graph

Knowledge Graph uses ~20 @Type objects such as @Corporation, @Product, @Offer and @Dataset

SMV integrates (links) valid @Type and @Property relationships to create a single view: “Corporation”

Click to execute Schema Markup Validator

Example 1-42

 

continuing A/B testing

@Language graph on AWS/Lambda

SDL “search results preview” is same information in SMV but served in human‐readable format

Scroll down to see JSON–LD graph processing and analysis

When processed on AWS/Lambda, SDL generates report about 2,204 “triples” defined in @Language graph

Click to execute AWS/Lambda report (be patient …)

Example 1-51

 

continuing A/B testing

Force Directed Graph (FDG) of Ontomatica’s Knowledge Graph

Facts (entities & relationships) in FDG are identical to JSON–LD facts in SMV & SDL reports

Rotate/zoom/move FDG to see specific entities & relationships

Link highlighted in red features main entities on pages [mainEntityOfPage]

Open Knowledge Graph full page

Release 1.0 Issues

 

AWS/Lambda “duration window” limits file size for SDL processing

AWS/Lambda “size window” limits integration of optional SDL features

SDL/AWS/Lambda will process larger & more complex graphs than SDL/Heroku (Gregg’s SDL platform)

SMV server is faster than default AWS/Lambda server & will process graph files up to 2.5 MB

Add comments about Release 1.0 Issues on GitHub

Release 2.0 Design Goals

 

Use client–side methods to add features to SDL reports

Use CSS grid to create cells for specific SDL features

Use CSS lightbox to preview “cell + content”

Upon cell selection, lightbox displays “cell + content” preview in full screen

Build out “cell + content” design with existing SDL features such as table analysis, error messages and reasoner messages

Build out “cell + content” design with new features such as graph visualization

Lightbox
Lightbox is a JavaScript library that displays images and videos by filling the screen, and dimming out the rest of the web page.
Reasoner
A semantic reasoner, reasoning engine, rules engine, or simply a reasoner, is a piece of software able to infer logical consequences from a set of asserted facts or axioms. The notion of a semantic reasoner generalizes that of an inference engine, by providing a richer set of mechanisms to work with. The inference rules are commonly specified by means of an ontology language, and often a description logic language.
Third item
Lorem ipsum

Release 2.0 Design Notes





 

Jarno van Driel proposed new SDL features

New features are presented & discussed on Google Docs

Preview document using link below

Open Jarno’s Google Docs full page

Example 2-21

 

Jarno–inspired CSS grid with six cells

Feature SDL table (current example injects sample data from Wikidata)

Feature one or more visualized graphs using processors e.g. D3.JS

Feature hierarchical view of structured data, similar to Schema Markup Validator

Feature parser statistics

Feature reasoner analysis (snippets)

Feature warnings & errors (here preview shows ~50% of full page content)

Open Release 2.0 prototype full page

Example 2-22

 

Production version of Jarno design

Sample uses simple case from SDL/AWS/Lambda Example 1-22

Cells feature:  (1) search results preview  (2) RDF  (3) TTL  (4) RDFa  (5) JSON–LD beautified  (6) RDF Grapher  (7) tabular report  (8) parser statistics  (9) linter message from reasoner

Footer includes link to SDL Release 2.0 prototype running on AWS/Lambda

https://t8oykz4bta.execute-api.us-east-1.amazonaws.com/Prod/?url=http:%2F%2Flinter.structured-data.org%2Fexamples%2Fschema.org%2Feg-0399-jsonld.html

Open production copy of Release 2.0 prototype full page
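
The prototype link above passes the page to lint as a percent-encoded `url` query parameter. The sketch below shows one way to build such a link in Python. The function name is hypothetical, and the encoding choice (slashes encoded, colon left literal) is inferred only from the link shown above, not from SDL documentation.

```python
from urllib.parse import quote

# Endpoint copied from the link above; the page to lint goes in ?url=...
SDL_ENDPOINT = "https://t8oykz4bta.execute-api.us-east-1.amazonaws.com/Prod/"


def sdl_request_url(page_url, endpoint=SDL_ENDPOINT):
    """Build an SDL request URL for the given page (hypothetical helper)."""
    # safe=":" leaves the colon literal and encodes "/" as %2F,
    # which matches the form of the link shown above.
    return endpoint + "?url=" + quote(page_url, safe=":")


example = sdl_request_url(
    "http://linter.structured-data.org/examples/schema.org/eg-0399-jsonld.html"
)
print(example)
```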

Example 2-31

 

On following pages are seven views of a single JSON data source

Example 2-31: Circle Packing

Example 2-32: Sunburst

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open Circle Packing diagram full screen

Example 2-32

 

Example 2-32: Sunburst

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open Sunburst diagram full screen

Example 2-33

 

Sunburst Zoom with Labels

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open Sunburst Zoom diagram full screen

Example 2-34

 

Collapsible Boxes

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open Collapsible Boxes diagram full screen

Example 2-35

 

Node-Link Tree

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open Node-Link Tree diagram full screen

Example 2-36

 

Treemap

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open Treemap diagram full screen

Example 2-41

 

A force directed graph (FDG) visualizes schema.org @Type and @Property specifications & relationships

Source data conforms to subject–predicate–object (?s ?p ?o) format

In contrast, flare.json structure (Examples 2-31 to 2-37) uses hierarchical structure based on rdfs:subClassOf

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Open schema.org FDG full page
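
The contrast between flat subject–predicate–object data and a flare.json-style hierarchy can be sketched as follows. This is an illustration only, not SDL code: the triples are a small hand-written sample, the function name is hypothetical, and the sketch assumes the subclass links contain no cycles.

```python
from collections import defaultdict

# Small hand-written sample; a real run would use triples produced by SDL.
triples = [
    ("schema:CreativeWork", "rdfs:subClassOf", "schema:Thing"),
    ("schema:Article", "rdfs:subClassOf", "schema:CreativeWork"),
    ("schema:ScholarlyArticle", "rdfs:subClassOf", "schema:Article"),
    ("schema:Article", "rdfs:label", "Article"),  # ignored: not a subclass link
]


def to_flare(triples, root):
    """Nest rdfs:subClassOf triples into a {name, children} tree (assumes no cycles)."""
    children = defaultdict(list)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            children[o].append(s)

    def build(name):
        node = {"name": name}
        kids = [build(child) for child in sorted(children[name])]
        if kids:
            node["children"] = kids
        return node

    return build(root)


tree = to_flare(triples, "schema:Thing")
print(tree)
```

Only the rdfs:subClassOf statements survive in the tree, which is why the hierarchical diagrams show less than the force directed graph does.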

Item 2-51

 

Develop consensus in SDL community & among interested parties about final design for Release 2.0 interface

Will need SDL server–side changes to generate JSON structure for D3.JS processing

Will need SDL server–side changes to generate JSON structure for Force Directed Graph processing

Add comments about Release 2.0 Issues on GitHub
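
As a rough idea of the server–side change needed for Force Directed Graph processing, the sketch below converts triples into a nodes/links structure of the kind D3 force layouts commonly consume. The sample triples, the function name, and the "label" key are assumptions for illustration; the actual SDL output format is still to be agreed.

```python
# Hypothetical sample triples, for illustration only.
sample_triples = [
    ("ex:Ontomatica", "rdf:type", "schema:Corporation"),
    ("ex:Ontomatica", "schema:makesOffer", "ex:Offer1"),
]


def to_force_graph(triples):
    """Convert (s, p, o) triples into a {nodes, links} dictionary."""
    ids = []
    seen = set()
    for s, _, o in triples:
        for term in (s, o):
            if term not in seen:  # keep first-seen order, one node per term
                seen.add(term)
                ids.append(term)
    nodes = [{"id": term} for term in ids]
    links = [{"source": s, "target": o, "label": p} for s, p, o in triples]
    return {"nodes": nodes, "links": links}


graph = to_force_graph(sample_triples)
print(graph)
```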

Release 3.0 Design Goals

 

Create SDL preparation methods & production platform to analyze large graphs

SDL processor & reasoner objective:
analyze @Graph with 10 million statements (“triples”)

Example 3-21

 

Refactored USDA National Agricultural Library Thesaurus (NALT) in schema.org

NALT/JSON–LD size: 6.84 MB

NALT/JSON–LD exceeds SMV 2.5 MB limit (no A/B analysis)

Alternative: configured SDL on AWS/EC2 server

SDL/NALT report size: 31.6 MB

SDL/NALT/AWS/EC2 processing time: 5 hours

NALT “triples”: 515,530

Open SDL/NALT report full page (be patient …)

Example 3-22

 

Ontomatica’s Web Enabled Directed Graph Engine (WEDGE) Reference Library is an application of National Agricultural Library Thesaurus

Research papers are mapped to schema.org JSON–LD structure in SDL report

Research papers are annotated using schema.org @Type and @Property grammar

WEDGE Reference Library contains information about 200,000+ papers

NALT “triples”: 515,530

Open WEDGE Library full page

Example 3-23

 

Visualization does not include Taxa, which are included in SDL report (Example 3-21)

Visualization uses same JSON–LD structure as used in SDL Release 2.0 design and prototype

Open National Agricultural Library Thesaurus visualization full page

Example 3-31

 

Refactored US NIH National Cancer Institute Thesaurus (NCIT) in schema.org

NCIT/JSON–LD size: 13.7 MB
(no A/B analysis with Schema Markup Validator)

SDL/NCIT report size: 76.9 MB

SDL/NCIT/AWS/EC2 processing time: 9 hours

NCIT “triples”: 946,520

Open SDL National Cancer Institute Thesaurus report full page (be very patient …)
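
The figures in Examples 3-21 and 3-31 allow a back-of-envelope throughput estimate. The sketch below assumes processing time scales linearly with triple count, which is an assumption and not something measured here; the function names are hypothetical.

```python
def triples_per_second(triples, hours):
    """Average throughput over a whole run."""
    return triples / (hours * 3600)


def estimated_hours(triples, rate):
    """Hours needed at a constant rate (linear-scaling assumption)."""
    return triples / rate / 3600


nalt_rate = triples_per_second(515_530, 5)  # Example 3-21
ncit_rate = triples_per_second(946_520, 9)  # Example 3-31

# Release 3.0 objective: 10 million triples at roughly the NCIT rate.
target_hours = estimated_hours(10_000_000, ncit_rate)

print(round(nalt_rate, 1), round(ncit_rate, 1), round(target_hours))
```

Both runs work out to roughly 29 triples per second, so at the current rate a 10 million triple graph would take on the order of 95 hours on a single core.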

Example 3-32

 

ChEMATIC (Chemical Entities with Medical Applications, Therapeutic Indications & Consequences) is an application of data from NIH NCIT & NIH Medical Subject Headings (MeSH)

Several other ontologies complement NCIT & MeSH JSON-LD structures

Biochemicals are mapped to hierarchical JSON-LD structures

Total ChEMATIC “triples” (structures and object maps): 700+ million

Open WEDGE ChEMATIC full page

Release 3.0 Issues

 

SDL/AWS/ECS is configured as a Docker container but improved methods will be needed to install SDL on best–available AWS/EC2 server

To reduce processing duration, need methods to use multiple CPU cores

SDL/AWS/EC2 is expensive to run — need to implement a business model to offset operating expenses

Add comments about Release 3.0 Issues on GitHub

Release 4.0 Design Goals

 

Support Shapes Constraint Language (SHACL) — a specification for validating graph–based data against a set of conditions

Support Shape Expressions (ShEx) — an RDF language for identifying predicates and their associated cardinalities and datatypes

Open Schemarama CORE, SHACL & ShEx discussion on GitHub
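
To illustrate what "validating against a set of conditions" and "cardinalities and datatypes" mean in practice, here is a plain-Python analogy. It is not SHACL or ShEx syntax and does not use a SHACL/ShEx processor; the shape, property names, and function name are hypothetical.

```python
# Hypothetical shape: property -> (min count, max count or None, expected type)
ARTICLE_SHAPE = {
    "name": (1, 1, str),
    "author": (1, None, str),
    "pageStart": (0, 1, int),
}


def validate(node, shape):
    """Return a list of violation messages for one node checked against a shape."""
    violations = []
    for prop, (min_count, max_count, datatype) in shape.items():
        values = node.get(prop, [])
        if not isinstance(values, list):
            values = [values]  # treat a single value as a one-item list
        if len(values) < min_count:
            violations.append(f"{prop}: expected at least {min_count}, found {len(values)}")
        if max_count is not None and len(values) > max_count:
            violations.append(f"{prop}: expected at most {max_count}, found {len(values)}")
        for value in values:
            if not isinstance(value, datatype):
                violations.append(f"{prop}: {value!r} is not {datatype.__name__}")
    return violations


good = {"name": "Sample", "author": ["A. Author", "B. Author"], "pageStart": 12}
bad = {"name": ["One", "Two"], "pageStart": "12"}

print(validate(good, ARTICLE_SHAPE))
print(validate(bad, ARTICLE_SHAPE))
```

The second node fails on all three counts: too many names, no author, and a page number given as a string.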

Item 4-21

 

Tim Berners‐Lee on SHACL & ShEx:

Shapes explain to machines what data should look like, independently of how that data is displayed to a user

Forms are a user interface allowing people to read and write data in a specific shape

Footprints explain to machines where new data should be stored

Open Tim Berners‐Lee discussion full page

Item 4-22

 

Ruben Verborgh on Shapes & Linked Data:

Apps should be coded against shapes [and] Linked Data so other apps can reuse them

[Where] vocabularies provide a list of possible attributes, shapes mandate a specific structure for data, combining attributes from vocabularies in a certain way

Footprints explain to machines where new data should be stored

Open Ruben Verborgh article full page

Item 4-23

 

Key findings in the US PubMed/NCBI article “Automatic Generation of SHACL Shapes from Ontologies”

OWL and SHACL are not equivalent in their interpretation

There are differences in how OWL interprets restrictions (for inferencing) and how SHACL interprets constraints (for validation)

Open PubMed/NCBI SHACL article full page

Item 4-31

 

Glucosinolates are natural components of many pungent plants such as broccoli, mustard, cabbage, and horseradish

US NIH NCI review of links between cruciferous vegetable intake & lung cancer risk concluded that high intake may decrease risk in a range of 17–23%

Other studies report similar risk reductions for colorectal, breast, kidney, esophageal, & oropharyngeal (mouth & throat) cancers

Open National Library of Medicine report full page

Example 4-32

 

American Food Data Systems Institute (AFDSI) & Ontomatica participate in food & agriculture research projects

One WEDGE project integrated & synthesized glucosinolate data from many studies

WEDGE–Glucosinolates enables Principal Investigators & researchers to visualize relationships that otherwise are difficult to understand & analyze

Open WEDGE–Glucosinolates full page

Example 4-33

 

With an objective of creating a Knowledge Graph, glucosinolate data was difficult to synthesize & integrate

Observations & measurement methods were irregular

Plant taxa & genetic variety data was regular, but ‘part of plant’ designations were irregular

Research process would have been easier & more accurate if shape data had been enforced during preparations & observations

Open Glucosinolate “fingerprint data”

Example 4-34

 

Force Directed Graph represents integration of data specifications (from ontologies) & data constraints (to ensure data quality)

“Ontology part” of graph (taxa & ‘part of plant’) is visible in WEDGE–Glucosinolates

“Shape part” of graph (represented as SHACL in TTL format) is in footer

Open Force Directed Graph full page

Item 4-41

 

Diabetes is a debilitating & life‐threatening disease

Research about & remedies for diabetes depend on precise information where “the devil is in the details”

This NCBI article is an overview

Open NCBI Diabetes Type 2 article full page

Example 4-42

 

ChEMATIC is a WEDGE application to visualize relationships among biochemistry, factor inputs & human conditions

ChEMATIC does not document opinions (something is good or bad); it only documents items & their relationships

Medical & nutrition experts use ChEMATIC information to express opinions & advice

This graph visualizes data about Diabetes Mellitus, Type 2

Graphs show relationships among:
human genes and chemicals
diseases (bacterial, digestive, stomatognathic, nervous system, urogenital, cardiovascular, nutritional and metabolic, etc.)
chemistry (inorganic and organic chemicals; heterocyclic and polycyclic compounds; macromolecular substances; hormones; enzymes and coenzymes; carbohydrates, lipids, amino acids, peptides, and proteins; nucleic acids, nucleotides, and nucleosides; and complex mixtures)
pharmaceutical preparations
Open ChEMATIC Diabetes Mellitus graph full page

Example 4-43

 

Diabetes observation & monitoring are key parts of a personalized remedy

First we need to specify the shape of glucose observations

Then we need to integrate observation shapes with monitored glucose data

Example 4-44 illustrates an observation graph for glucose

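Example 4-45 integrates hourly Dexcom observation data

Example 4-46 integrates daily Dexcom observation data

Example 4-47 integrates Dexcom observation data as a histogram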

Example 4-44: Graph of ShEx Observation for glucose

Example 4-45

 

Visualizing Dexcom Observation Data - Hourly

Open visualization full page

Example 4-46

 

Visualizing Dexcom Observation Data - Daily

Open visualization full page

Example 4-47

 

Visualizing Dexcom Observation Data - Histogram

Open visualization full page
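
The hourly, daily, and histogram views in Examples 4-45 to 4-47 amount to three aggregations of the same observations. The sketch below shows one way to compute them in Python. The readings are invented sample values (not actual Dexcom data), and the function names, units, and bin width are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical readings: (ISO timestamp, glucose in mg/dL).
readings = [
    ("2021-01-01T08:05:00", 110),
    ("2021-01-01T08:35:00", 130),
    ("2021-01-01T09:10:00", 150),
    ("2021-01-02T08:20:00", 95),
]


def mean_by(readings, fmt):
    """Average readings grouped by a strftime pattern (hour or day)."""
    groups = defaultdict(list)
    for stamp, value in readings:
        groups[datetime.fromisoformat(stamp).strftime(fmt)].append(value)
    return {key: mean(values) for key, values in groups.items()}


def histogram(readings, bin_width=20):
    """Count readings per bin; each key is the lower edge of its bin."""
    return dict(Counter((value // bin_width) * bin_width for _, value in readings))


hourly = mean_by(readings, "%Y-%m-%d %H:00")
daily = mean_by(readings, "%Y-%m-%d")
bins = histogram(readings)

print(hourly)
print(daily)
print(bins)
```

If each reading were first checked against the observation shape, the aggregations could rely on every value having the same unit and timestamp format.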

Twitter: ShEx SHACL Announcements

 

       

 

Item 4-61

 

Develop specification & design for implementing SHACL & ShEx in SDL

Simplify workflow that involves at least 2 source files (ontology & shape) & possibly more than one data structure (JSON-LD & TTL)

Explain at least three conditions: ontology messages, shape messages, & ontology/shape integration messages

Reconcile irregularity between ontology constraints & shape constraints

Release 5.0 Design Goals

 

Support other ontologies — in addition to schema.org

In addition to @Context registration of vocabulary terms, support reasoning about ontology–specific grammar

Enable vocabulary & reasoning for SKOS–based datasets

Enable vocabulary & reasoning for OWL–based datasets
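
"@Context registration of vocabulary terms" can be pictured as a prefix table. The sketch below expands compact terms against such a table. It is a deliberate simplification of JSON–LD context processing, and the function name is hypothetical. The namespace IRIs are the published ones for schema.org, SKOS, and OWL.

```python
# Prefix table standing in for a JSON-LD @context.
CONTEXT = {
    "schema": "https://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def expand(term, context=CONTEXT):
    """Expand a compact IRI such as skos:broader using the registered prefixes."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term  # unknown prefix or already absolute: leave unchanged


print(expand("skos:broader"))
print(expand("owl:Class"))
print(expand("http://example.org/x"))
```

Registering a prefix only makes the term resolvable. Reasoning about what skos:broader or owl:Class implies is the separate, ontology–specific step named in the goals above.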

Example 5-21

 

UN FAO AgroVoc is a SKOS–based dataset

AgroVoc is a multilingual controlled vocabulary covering all areas of interest to the Food & Agriculture Organization of the United Nations, including food, nutrition, agriculture, fisheries, forestry & the environment.

Open UN FAO description of Brassica full page

Example 5-22

 

US Library of Congress Subject Headings (LCSH) is a SKOS–based dataset

The Library of Congress Subject Headings (LCSH) comprise a thesaurus (controlled vocabulary) of subject headings, maintained by the United States Library of Congress, for use in bibliographic records

Open Library of Congress description of herbicide full page

Example 5-31

 

Plant Ontology is an OWL–based dataset

“archegonium head” is referenced in WEDGE–Glucosinolates

Open Plant Ontology description of “archegonium head” full page

Example 5-32

 

Avocado Ontology is an OWL–based dataset

Avocado is a popular food & popular ingredient in other foods

Open ontology‐based Avocado description full page

Example 5-41

 

US NIH PubChem is a multi–ontology dataset

PubChem is a database of chemical molecules & their activities against biological assays

Author: National Center for Biotechnology Information (NCBI); partOf United States National Institutes of Health (NIH)

More than 80 database vendors contribute to PubChem

Open description of PubChem RDF full page

Example 5-51

 

WEDGE–FNDDS (Food & Nutrient Database for Dietary Studies) is a multi–ontology dataset

FNDDS includes foods & beverages nutrition data reported in “What We Eat in America”

FNDDS is an application of OWL–based ontologies including AFDSI’s Vocal (acronym for the phrase “Vocabularium Alimentarum — Vocabulary of Food”)

Open foods made with brassicas on WEDGE–FNDDS

Release 5.0 Issues

 

Production issues will be more complicated than Release 3.0

May be difficult to load an SDL instance configured with schema.org–based + SKOS–based + OWL–based datasets

Processing duration could be long (days!)

Provenance and Document Properties

author
Ankita Dhandha
organization
Ontomatica
date published
20-12-01
date modified
20-12-10
modification note
Added Section 4
date modified
20-12-16
modification note
Added Section 5
date modified
21-01-01
release note
Added information based on Gregg Kellogg email
date modified
21-07-07
release note
Updated to use Schema.org Markup Validator (SMV)
date modified
21-01-11
release note
more future information