Yavaa · Slide | Sirko Schindler · Yavaa

Yavaa

Supporting Data Workflows
from Discovery to Visualization

Sirko Schindler

From Alpha to Omega (Static Visualization)

  • Search · Access · Interpretation · …
  • Union / Join · Filter · Derived data · …
  • Selection · Mapping · Generation · …
  • Export · Update · Provenance · …

Dataset Search

Dataset Combination

  • Common search returns individual datasets
    • Complete answers might need multiple datasets, though
  • Heterogeneity in vocabularies
    • Example: NUTS PL22 (Śląskie) vs. ISO 3166-2 PL-22 (Pomorskie)
  • Heterogeneity in granularity
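The vocabulary mismatch can be made concrete with a minimal sketch. The codes and labels are the ones from the slide; the two helper functions are hypothetical, and comparing labels stands in for a proper mapping between code lists.

```python
# Codes and labels as shown on the slide; everything else is illustrative.
NUTS = {"PL22": "Śląskie"}
ISO_3166_2 = {"PL-22": "Pomorskie"}


def naive_match(nuts_code: str, iso_code: str) -> bool:
    """Compare codes after stripping punctuation (looks plausible, but is wrong)."""
    return nuts_code.replace("-", "") == iso_code.replace("-", "")


def same_region(nuts_code: str, iso_code: str) -> bool:
    """Compare what the codes denote within their own code list."""
    return NUTS[nuts_code] == ISO_3166_2[iso_code]


print(naive_match("PL22", "PL-22"))  # True: the strings line up
print(same_region("PL22", "PL-22"))  # False: different regions
```

A join on the raw code strings would therefore silently combine observations for two different regions.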

Dataset Modification

  • Data rarely in the required shape
    • Suitable operations?
  • Consistency across operations
    • Units of measurement easily overlooked
Sources:
  • Mars Climate Orbiter Mishap Investigation Board Phase I Report, NASA, 1999
  • André Barro, "Air Crash Investigation(s)" Season 5, Episode 2: "Gimli Glider", via Hellbus @ Wikipedia
  • The Michigan Daily, 1985-06-20
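One way to keep units from being overlooked is to attach them to every value and refuse mixed-unit arithmetic. The following is a minimal sketch; the `Quantity` class and the chosen unit pair (pound-force seconds vs. newton seconds) are illustrative assumptions, not part of Yavaa.

```python
from dataclasses import dataclass

# Conversion factors to newton seconds; 1 lbf*s = 4.4482216152605 N*s.
TO_NEWTON_SECONDS = {"N*s": 1.0, "lbf*s": 4.4482216152605}


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str

    def to(self, unit: str) -> "Quantity":
        """Convert via the base unit (N*s)."""
        base = self.value * TO_NEWTON_SECONDS[self.unit]
        return Quantity(base / TO_NEWTON_SECONDS[unit], unit)

    def __add__(self, other: "Quantity") -> "Quantity":
        # Refuse to combine values whose units differ.
        if self.unit != other.unit:
            raise ValueError(f"unit mismatch: {self.unit} vs. {other.unit}")
        return Quantity(self.value + other.value, self.unit)


impulse = Quantity(1.0, "N*s") + Quantity(1.0, "lbf*s").to("N*s")
```

Adding the two quantities without the explicit `.to("N*s")` raises a `ValueError` instead of producing a plausible-looking but wrong number.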

Visualization

  • Plethora of visualizations available
    • Technically possible? Suitable?
  • Map Data to Visualization
    • Which data to which visual artifact?

Publication

  • Provenance
    • Inputs? Operations?
  • Accessibility
  • (Updates, Changes, &) Re-execution

Thesis Outline

Interlude: OLAP-Cubes

Columns / Variables / ...

  • Dimensions: conditions of observations.
  • Measure(ment)s: observations made.

OLAP-Cube

Measurement Dimensions + Aggregates + Dimension Hierarchies
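A small sketch of how dimensions, measurements, aggregates, and a dimension hierarchy fit together. The region-to-country hierarchy and all values are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical dimension hierarchy: region -> country.
REGION_TO_COUNTRY = {"R1": "X", "R2": "X", "R3": "Y"}

# Each observation has dimension values (region, year) and one measurement (value).
observations = [
    {"region": "R1", "year": 2018, "value": 10},
    {"region": "R2", "year": 2018, "value": 5},
    {"region": "R3", "year": 2018, "value": 7},
    {"region": "R1", "year": 2019, "value": 12},
]


def roll_up_to_country(rows):
    """Aggregate (sum) the measurement one level up the region hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        key = (REGION_TO_COUNTRY[row["region"]], row["year"])
        totals[key] += row["value"]
    return dict(totals)


print(roll_up_to_country(observations))
```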

Requirements to Metadata Schema

General Information

  • Title
  • Description
  • Author / Publisher

Loading Data

  • Download location(s)
  • Media type
  • Inner structure

Primary Data Search

  • Column concepts
  • Data ranges
  • Semantic relationships

Data Integration

  • Codelists
  • Units of measurement
  • Roles of Variables

Metadata Schema

Requirements: General Information · Loading Data · Primary Data Search · Data Integration

  • dcat:Dataset / qb:DataSet
    • dcterms:title
    • dcterms:description
  • dcterms:Agent
    • rdf:label
  • dcat:Distribution
    • dcat:downloadURL
    • dcat:mediaType
  • qb:DataStructureDefinition
  • qb:ComponentSpecification
    • rdfs:label
    • qb:order
  • qb:ComponentProperty
    • yavaa:hasUnit
  • yavaa:TimeFormat
  • skos:Concept
    • skos:prefLabel
    • skos:notation
    • skos:exactMatch
    • skos:closeMatch
  • qb:DimensionProperty / qb:MeasureProperty / qb:AttributeProperty
    • yavaa:hasValue
  • qb:CodedProperty
    • qb:codeList
  • skos:ConceptScheme
  • yavaa:Range
  • rdfs:Datatype
    • owl:onDatatype
    • owl:withRestriction

Prefixes:
  • dcat: http://www.w3.org/ns/dcat#
  • dcterms: http://purl.org/dc/terms/
  • owl: http://www.w3.org/2002/07/owl#
  • qb: http://purl.org/linked-data/cube#
  • rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  • skos: http://www.w3.org/2004/02/skos/core#
  • yavaa: http://yavaa.org/ns/yavaa#

Metadata Usecase - Search

Primary Data Search across Sources

  • Multiple datasets to fulfill a query
  • Search → Search & Integration
    • Identify candidate datasets
    • Select subset to cover query
    • Harmonize data
    • Combine datasets
  • Change in search results
    • List of datasets → Integration Workflow
    • User-adjustable
    • Workflow executed upon user request

Query Structure

  • Keyword search → query by example
    • Description of requested structure
    • Column headers and possibly value ranges
  • Value ranges
    • Categorical: enumeration of values
    • Time & quantitative: lower and/or upper bound
  • Extent of value ranges
    • Finite / bounded: range given
    • Infinite / unbounded: no range given
    • Semi-finite / semi-bounded: only lower or upper bound
      (for time and quantitative columns)
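The query structure above could be modeled roughly as follows. This is a sketch only; the class and field names are assumptions and do not reflect Yavaa's actual data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union


@dataclass(frozen=True)
class CategoricalRange:
    values: FrozenSet[str]  # enumeration of requested values


@dataclass(frozen=True)
class IntervalRange:
    lower: Optional[float] = None  # for time and quantitative columns
    upper: Optional[float] = None


@dataclass(frozen=True)
class Column:
    concept: str  # column header / concept
    value_range: Union[CategoricalRange, IntervalRange, None] = None


def extent(column: Column) -> str:
    """Classify a column's requested range as bounded, semi-bounded, or unbounded."""
    r = column.value_range
    if r is None:
        return "unbounded"  # no range given
    if isinstance(r, CategoricalRange):
        return "bounded"  # an enumeration is always finite
    if r.lower is not None and r.upper is not None:
        return "bounded"
    if r.lower is None and r.upper is None:
        return "unbounded"
    return "semi-bounded"  # only lower or only upper bound


query = [
    Column("country", CategoricalRange(frozenset({"Germany", "Iceland"}))),
    Column("year", IntervalRange(lower=2014, upper=2019)),
    Column("number of sheep"),
]
print([extent(c) for c in query])  # ['bounded', 'bounded', 'unbounded']
```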

Rephrasing the Task

A jigsaw puzzle in higher dimensions

Dataset Combination - Search for Candidates

  1. Search for candidates
  2. Select best candidate
  3. Split regions
  4. Apply Steps 2&3 recursively
  5. Assemble workflow
  6. Get user input
  • Search metadata repository
  • Criteria
    • ≥ 1 matching dimension
    • ≥ 1 matching measurement
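The two criteria amount to a simple filter over the metadata repository. In this sketch the repository is a hypothetical in-memory dictionary and columns are identified by plain concept names.

```python
def is_candidate(dataset, query):
    """A candidate shares at least one dimension and at least one measurement."""
    shares_dimension = bool(dataset["dimensions"] & query["dimensions"])
    shares_measurement = bool(dataset["measurements"] & query["measurements"])
    return shares_dimension and shares_measurement


query = {"dimensions": {"country", "year"}, "measurements": {"sheep", "population"}}

# Hypothetical repository entries.
repository = {
    "sheep_by_country": {"dimensions": {"country", "year"}, "measurements": {"sheep"}},
    "population": {"dimensions": {"country", "year"}, "measurements": {"population"}},
    "sheep_by_farm": {"dimensions": {"farm"}, "measurements": {"sheep"}},
    "rainfall": {"dimensions": {"country", "year"}, "measurements": {"rainfall"}},
}

candidates = [name for name, ds in repository.items() if is_candidate(ds, query)]
print(candidates)  # ['sheep_by_country', 'population']
```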

Dataset Combination - Select best Candidate

  • Order candidates wrt. query
  • Criteria
    • Coverage … overlap in values between dataset and query
    • Support … common columns between dataset and query
    • Excess … additional dimensions in dataset wrt. the query
$$ \text{Score}(s, q) = \begin{pmatrix} \text{Coverage}(s, q) \times \text{Support}(s, q) \\ 1 - \text{Excess}(s, q) \end{pmatrix} $$
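A rough sketch of how such a two-component score could rank candidates. The slide only names the three criteria; the normalizations below (fractions in [0, 1]) and the lexicographic comparison of the two components are assumptions for illustration. Value sets and dimension lists are assumed to be non-empty.

```python
def columns(entry):
    return set(entry["dimensions"]) | entry["measurements"]


def coverage(dataset, query):
    """Assumed: fraction of requested values present, multiplied over shared dimensions."""
    fraction = 1.0
    for dim, wanted in query["dimensions"].items():
        if dim in dataset["dimensions"]:
            fraction *= len(wanted & dataset["dimensions"][dim]) / len(wanted)
    return fraction


def support(dataset, query):
    """Assumed: share of the query's columns that the dataset also has."""
    return len(columns(dataset) & columns(query)) / len(columns(query))


def excess(dataset, query):
    """Assumed: share of the dataset's dimensions that the query did not ask for."""
    extra = set(dataset["dimensions"]) - set(query["dimensions"])
    return len(extra) / len(dataset["dimensions"])


def score(dataset, query):
    # Python compares tuples lexicographically: first component, then second.
    return (coverage(dataset, query) * support(dataset, query), 1 - excess(dataset, query))


query = {"dimensions": {"country": {"DE", "IE", "ES"}, "year": {2018, 2019}},
         "measurements": {"sheep"}}
candidates = {
    "a": {"dimensions": {"country": {"DE", "IE"}, "year": {2018, 2019}},
          "measurements": {"sheep"}},
    "b": {"dimensions": {"country": {"DE", "IE", "ES"}, "year": {2018, 2019},
                         "breed": {"x", "y"}},
          "measurements": {"sheep"}},
}
ranking = sorted(candidates, key=lambda n: score(candidates[n], query), reverse=True)
print(ranking)  # ['b', 'a']
```

Here "b" covers the whole query but carries an extra dimension, while "a" has no excess but misses one country; under the assumed ordering, the first component decides.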

Dataset Combination - Split Regions

  • Split query into regions
    • Already covered
    • So far uncovered
  • Maintain "rectangular" shape
  • Using conflict-avoiding strategy
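Splitting while keeping every piece "rectangular" can be sketched as a hyper-rectangle subtraction. Regions are modeled here as one value set per dimension (their Cartesian product); the split order simply follows the dimension order and does not reproduce the conflict-avoiding strategy mentioned above.

```python
from math import prod


def cells(region):
    """Number of cells in a rectangular region (product of value-set sizes)."""
    return prod(len(values) for values in region.values())


def split(query, dataset):
    """Split `query` into the part covered by `dataset` and uncovered remainders."""
    # A dimension missing from the dataset is treated as fully covered (assumption).
    covered = {d: query[d] & dataset.get(d, query[d]) for d in query}
    if any(not values for values in covered.values()):
        return None, [dict(query)]  # no overlap at all

    uncovered, dims = [], list(query)
    for i, dim in enumerate(dims):
        outside = query[dim] - covered[dim]
        if outside:
            # Earlier dimensions are restricted to the covered values, later
            # ones keep the full query range, so the pieces never overlap.
            piece = {d: covered[d] for d in dims[:i]}
            piece[dim] = outside
            piece.update({d: query[d] for d in dims[i + 1:]})
            uncovered.append(piece)
    return covered, uncovered


query = {"country": {"DE", "IE", "ES"}, "year": {2018, 2019}}
dataset = {"country": {"DE", "IE", "FR"}, "year": {2018}}
covered, uncovered = split(query, dataset)
```

In this example the covered region is {DE, IE} × {2018}; the remainders are {ES} × {2018, 2019} and {DE, IE} × {2019}, which together with the covered part make up all six cells of the query.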

Dataset Combination - Apply Steps 2&3 recursively

  • Reuse candidate list from before
    • Drop candidates with a score of zero
    • No new query to the metadata repository needed
  • Terminate recursion if …
    • Entire (remaining) query is covered
    • No more candidates are left over
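The recursion and its two termination conditions can be sketched on a simplified model. The assumptions: regions are plain sets of cells rather than rectangles, and the score is just the number of still-uncovered cells a candidate provides.

```python
def cover(remaining, candidates):
    """Greedy, recursive cover of `remaining` cells using the given candidates."""
    scored = {name: len(cells & remaining) for name, cells in candidates.items()}
    # Reuse the candidate list, dropping candidates with a score of zero.
    candidates = {n: c for n, c in candidates.items() if scored[n] > 0}
    # Terminate if everything is covered or no candidates are left over.
    if not remaining or not candidates:
        return [], remaining

    best = max(candidates, key=scored.get)
    part = candidates[best] & remaining
    rest = {n: c for n, c in candidates.items() if n != best}
    plan, uncovered = cover(remaining - part, rest)
    return [(best, part)] + plan, uncovered


query = {(c, y) for c in ("DE", "IE", "ES") for y in (2018, 2019)}
candidates = {
    "a": {("DE", 2018), ("DE", 2019), ("IE", 2018), ("IE", 2019)},
    "b": {("ES", 2018)},
    "c": {("FR", 2018)},  # never overlaps the query, so it is dropped
}
plan, uncovered = cover(query, candidates)
print([name for name, _ in plan])  # ['a', 'b']
print(uncovered)                   # {('ES', 2019)}
```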

Dataset Combination - Assemble Workflow

  • Adjust dataset schemata
    • Additional measurements → drop columns
    • Additional dimensions → user interaction
  • Combine partial solutions
    • Union- / join-operators
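A minimal sketch of the assembly step on lists of row dictionaries: additional measurements are dropped by projection, partial solutions for the same columns are stacked with a union, and different measurements are brought together with a join. The function names and all values are placeholders.

```python
def project(rows, columns):
    """Keep only the requested columns (drops additional measurements)."""
    return [{c: row[c] for c in columns} for row in rows]


def union(*parts):
    """Stack partial results that share the same columns."""
    return [row for part in parts for row in part]


def join(left, right, on):
    """Inner join on the shared dimension columns."""
    index = {tuple(row[c] for c in on): row for row in right}
    joined = []
    for row in left:
        key = tuple(row[c] for c in on)
        if key in index:
            joined.append({**row, **index[key]})
    return joined


# Placeholder values, not real statistics.
part_a = [{"country": "DE", "year": 2018, "sheep": 1, "goats": 9}]
part_b = [{"country": "IE", "year": 2018, "sheep": 2}]
population = [{"country": "DE", "year": 2018, "population": 10},
              {"country": "IE", "year": 2018, "population": 20}]

wanted = ["country", "year", "sheep"]
sheep = union(project(part_a, wanted), project(part_b, wanted))
result = join(sheep, population, on=["country", "year"])
```

Additional dimensions cannot be handled by projection alone, since dropping them would leave several rows per key; that case needs an aggregation chosen by the user.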

Dataset Combination - Get User Input

  • Present result to user
    • Coverage wrt. query
    • Included data providers
    • Included datasets
  • Select aggregation functions if necessary, e.g., sum(), max(), avg()
  • Refine search?
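Why the choice of aggregation function is left to the user can be seen in a small sketch: removing an additional dimension (here a hypothetical "breed" column) yields different results depending on the function. Names and values are placeholders.

```python
from collections import defaultdict
from statistics import mean

AGGREGATIONS = {"sum": sum, "max": max, "avg": mean}


def collapse(rows, drop, measure, how):
    """Remove dimension `drop` by aggregating `measure` over it."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in (drop, measure)))
        groups[key].append(row[measure])
    return [{**dict(key), measure: AGGREGATIONS[how](values)}
            for key, values in groups.items()]


rows = [
    {"country": "DE", "year": 2018, "breed": "x", "count": 3},
    {"country": "DE", "year": 2018, "breed": "y", "count": 5},
    {"country": "IE", "year": 2018, "breed": "x", "count": 2},
]
print(collapse(rows, drop="breed", measure="count", how="sum"))
print(collapse(rows, drop="breed", measure="count", how="max"))
```

For a count, sum() is the natural choice; for a measurement such as a temperature, avg() or max() would be more appropriate, which the system cannot decide on its own.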

Dataset Combination - Summary

  1. Search for candidates
  2. Select best candidate
  3. Split regions
  4. Apply recursively
  5. Assemble workflow
  6. Get user input

Evaluation

SQLite Python Yavaa Dataset Combination Keyword Search Yavaa MS Excel LibreOffice

Evaluation - Setup

  • 6,283 in total, 2,943 remaining after exclusions (🛇)
    • Multiple units
    • Unit not in OM*
    • Multiple time formats
    • Download issues
    • Templates
  • t2.medium · 2 vCPU · 4 GB memory · Amazon Linux
Many thanks to Maximilian Stiede for the help here!

Evaluation - Scenario

Your task is to create a dataset that holds the amount of sheep per inhabitant for the following European countries (the shortlist of vacation destinations of your superior - purely coincidental, of course) and period of time (previous five years):

  • Countries: Germany, Iceland, Ireland, Romania, Spain
  • Period of time: 2014 - 2019

After the dataset has been assembled, choose an adequate graph to present your results to your fellow colleagues and the general public. The suggested order of steps is as follows. Your personal workflow might deviate, though.

  1. Identify suitable datasets.
    While in general Eurostat has all the data you need, it is not provided as a single dataset to start with, so you will need to combine multiple ones.
  2. Prepare a single dataset.
    Eurostat's datasets contain more data than needed, so you will have to filter for the requested values. You may also need to join multiple source datasets.
  3. Calculate the desired metric.
    The requested metric is not included in Eurostat's raw data, so you will have to calculate it manually.
  4. Select a proper visualization.
    Once the dataset contains only the requested values, you can choose a suitable visualization.
  5. Export your results.
    Store your results (data and visualization) locally and then upload them on the next page.
  • tag00017: Number of sheep
  • tps00001: Population on 1 January
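The core of the task (filter, join, derive) can be sketched in a few lines. The row layout, the function name, and the `sheep_factor` parameter are assumptions; the factor of 1000 presumes that the sheep figures are reported in thousands of heads, which should be checked against the dataset's metadata. The sample values are made-up placeholders, not Eurostat data.

```python
COUNTRIES = {"Germany", "Iceland", "Ireland", "Romania", "Spain"}
YEARS = range(2014, 2020)  # 2014 - 2019 inclusive


def sheep_per_inhabitant(sheep_rows, population_rows, sheep_factor=1000):
    """Filter to the requested countries and years, join, and derive the metric."""
    population = {(r["country"], r["year"]): r["value"] for r in population_rows}
    result = []
    for r in sheep_rows:
        key = (r["country"], r["year"])
        if r["country"] in COUNTRIES and r["year"] in YEARS and key in population:
            result.append({
                "country": r["country"],
                "year": r["year"],
                "sheep_per_inhabitant": r["value"] * sheep_factor / population[key],
            })
    return result


# Made-up placeholder values for illustration only.
sheep = [{"country": "Iceland", "year": 2018, "value": 400.0},
         {"country": "France", "year": 2018, "value": 7000.0}]
inhabitants = [{"country": "Iceland", "year": 2018, "value": 200000}]
print(sheep_per_inhabitant(sheep, inhabitants))
```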

Evaluation - Anticipated Strategies

Conventional Strategy

  1. Locate Datasets
  2. Import Datasets
  3. Join Datasets
  4. Filter Values & Columns
  5. Add Derived Column
  6. Visualize
  7. Export

Enhanced Strategy

  1. Construct Dataset
  2. Add Derived Column
  3. Visualize
  4. Export

User Evaluation - Setup

  • Within-subject design with counterbalancing
    • Eurostat + Spreadsheet software
      (LibreOffice Calc, Microsoft Excel)
    • Yavaa
  • Tutorials provided for all tools
  • Conducted fully remotely and unsupervised
    in Q1 / Q2 2019
  • Submissions
    • 92 total
    • 16 complete
Self-assessment: Prior experience.

User Evaluation I - Successful Task Execution

  • Manual assessment of submitted artifacts
  • Classes of Issues
    • High-severity: unsuitable for the task.
      Example(s): incorrect joins, missing data.
    • Moderate-severity: suitable, but violating some constraints.
      Example(s): additional data included, no unit conversion.
    • Low-severity: cosmetic issues.
      Example(s): countries referred to by abbreviation.
Issues per submission: Summary.

User Evaluation II - Time Taken

User Evaluation III - Usability

User assessment.
J. Brooke. SUS: a "quick and dirty" usability scale. In: Usability Evaluation in Industry. London: Taylor and Francis, 1996.
J. Brooke. SUS: A Retrospective. In: J. Usability Studies 8.2 (Feb. 2013), pp. 29–40.
A. Bangor, P. T. Kortum, and J. T. Miller. An Empirical Evaluation of the System Usability Scale. In: International Journal of Human-Computer Interaction 24.6 (July 2008), pp. 574–594. DOI: 10.1080/10447310802205776.
A. Bangor, P. Kortum, and J. Miller. Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. In: J. Usability Studies 4.3 (May 2009), pp. 114–123.
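Assuming the usability assessment follows the System Usability Scale (SUS) from the references above, a single participant's score is computed from ten responses on a 1 to 5 scale using the standard scoring rule. The function name is illustrative.

```python
def sus_score(responses):
    """Standard SUS scoring for ten responses, each between 1 and 5."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5  # scale the 0-40 sum to 0-100


print(sus_score([5, 1] * 5))  # 100.0
print(sus_score([3] * 10))    # 50.0
```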

User Evaluation IV - Difficulty

User assessment.
Relative user assessment.

Code & Supplement Availability

Recap

Backup Slides

GFX-Sources

?