Yavaa

Supporting Data Workflows
from Discovery to Visualization

Sirko Schindler

From Alpha to Omega (Static Visualization)

Start: idea or message
End: vis ready for sharing
"in-between" is fuzzy

find data - transform it - convert to vis
sounds easy enough ...

just for static vis
dynamic vis needs UI/UX
will look at some of the obstacles now

For many users, when starting a visualization project only two things are clear: They have an idea in mind - some message or story to be told - and they expect some meaningful result, an image, for sharing or publication. In between there are some more steps, though, that often remain somewhat fuzzy at first:

You need to get your hands on some suitable data, you need to transform it into the required shape, and you need to pick a proper visualization. At first glance, this still sounds easy enough. But when we look into each of those steps, things are not quite as simple.

In order to get the data, you often need to find and access it first. And when you have the file in hand, you need to make sense of its content, its abbreviations, etc.

Next, you need to transform it into something that you can actually use. This can mean combining some datasets, or adding and removing some data.

Once this is done, you can create the visualization to get your message across. You need to pick one type and tell your system which piece of your data goes where in the visualization.

Finally, you're done and can export the result. But here, too, you need to keep a few things in mind, in particular provenance and reproducibility: you may want to be able to document what you did, and you might want to do it again later.

You might need to go back a step or two and redo stuff, but overall, this is a pretty stereotypical workflow for static visualizations. For dynamic ones, there would be one more step adding some interaction. But for this thesis, we'll stick to just static visualizations.

While the workflow seems pretty straightforward, we'll now look at some of the issues that might come up along the way.

Dataset Search

Myriads of providers
- Heterogeneity everywhere

Metadata search only
- Limited support of primary data search (if at all)

where to look for data?
- high number of providers - many opinions
  - single vs multiple datasets
  - categories?
  - what metadata?
  - different terminology
- semantic gap between producer and consumer
meta catalogs sometimes available
- least common denominator

traditional search engines
- index only metadata \cite{googleDatasetSearch}
- No real primary data search
- "not in metadata, not in search"
- rely on author opinion
- index content but losing structure information \cite{Zhang2018,Zhang2021}

major issue for aggregate portals
- information changed or lost

The first issue is already where to look for data. Data is often scattered across a lot of different providers. And everybody has a different way of organizing things, accessing data, and even naming and labeling things. So for government data in the EU alone, we already have at least 27 opinions on how things should be done, and each of them has made different decisions: Should this be one dataset or multiple ones? Which category should this go to? Or what metadata do we actually need?

Of course, there is the attempt to have this one big interface to rule them all. But with all the differences in the sources, this is pretty hard and, in my opinion, still leaves quite some room for improvement.

Then we have the search interface itself. What we often get is only a search over metadata. So, we can look for titles, descriptions, or some keywords. What we do not get is the same thing for the primary data. So if the metadata lacks the detail or extent, we will not be able to find this dataset.

Especially for these aggregating portals, this is an issue. When harvesting from different sources and metadata schemata, some information will always get lost or changed and is thus not accessible anymore.

Dataset Combination

Common search returns individual datasets
- Complete answers might need multiple datasets, though

Heterogeneity in vocabularies

Heterogeneity in granularity

results: single datasets
user needs to combine manually

heterogeneity among sources
- NUTS: Nomenclature of Territorial Units for Statistics
  - EU standard for IDs to regions
- different abbreviations, same meaning (and other way around)
abbreviations vs full labels

different level of granularity
aggregation may be needed

technical challenges (spreadsheet software)

I mentioned before that sometimes you may need more than just one dataset. But search engines usually just provide individual ones. So when you need more, you have to collect all parts individually and then combine them manually on your own machine. When you do that, you have to deal with all kinds of heterogeneity.

For example, there are the codes to represent the values. Usually those look different enough. But as you see here, that is not always the case: ISO and NUTS have rather similar structure for their region codes. And sometimes it happens that the same code stands for different things. On other occasions, this might also be the other way around: Two quite different codes refer to the same thing. So, you really have to know which vocabulary you are dealing with.

Another issue is granularity. Data will be available on different levels. One source might have data on a country level, while the other one provides more detail. When integrating them, you have to aggregate the data and make sure every item goes into the right bucket.
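
The bucketing step can be sketched roughly as follows. This is only an illustration, assuming the finer-grained source uses NUTS region codes (whose first two characters identify the country) and that summing is the right aggregation; the helper name `aggregate_to_country` and the figures are made up.

```python
from collections import defaultdict

def aggregate_to_country(regional_values):
    """Sum regional values into country-level buckets.

    Assumes NUTS region codes, where the first two characters
    identify the country (e.g. "DE21" belongs to "DE").
    """
    totals = defaultdict(float)
    for nuts_code, value in regional_values.items():
        totals[nuts_code[:2]] += value
    return dict(totals)

# Hypothetical regional figures, for illustration only
regional = {"DE21": 10.0, "DE30": 5.0, "FR10": 7.0}
country_level = aggregate_to_country(regional)
```

Even this small step depends on knowing the vocabulary: NUTS uses `EL` for Greece, whereas ISO 3166 uses `GR`, so the resulting country codes cannot be matched blindly against an ISO-coded dataset.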

Dataset Modification

Data barely in the required shape
- Suitable operations?

Consistency across operations
- Units of measurement easily overlooked

transform data to desired shape
- filter; data derivation; ...
- add additional datasets
which operations are suitable / meaningful right now?

units often ignored
accidents
- Mars Climate Orbiter
  - Lockheed Martin system uses pound-force seconds
  - NASA system expects newton seconds
  - off by a factor of 4.45
  - wrong altitude computed
- Gimli Glider
  - while refueling used pound instead of kilogram
  - only about 45% of fuel available
  - airplane systems malfunctioned
  - ran out of fuel halfway, had to make an emergency landing in Gimli (destination was Edmonton)
- Discovery Laser Test
  - carried a reflective mirror for laser test
  - supposed to point towards Mauna Kea (ca 10k feet)
  - system expected nautical miles but crew entered feet
  - mirror oriented upwards instead of downwards
  - conclusion: "laser would only work on a cooperative target" (SF Chronicle)
often not catastrophic
- but still important

So, now we have our hands on the data itself. But we still need to get it into the right shape. In general, this is kind of a cycle that can include multiple steps: filtering to remove some stuff, adding some derived values, adding more datasets, or any other operation one may think of. But often there is no help in deciding which operations actually make sense right now or are even applicable.

Another important thing that easily goes unnoticed is units. Over time we have seen quite a few accidents where a mix-up in units was the underlying cause. In the case of the Mars Climate Orbiter, two systems used different units for propulsion data. So when the spacecraft was supposed to enter orbit, its trajectory had been computed incorrectly and it was lost as a result.

A rather funny accident happened with the Discovery. During a test, a mirror was supposed to look downwards and reflect a laser from Earth. However, the system mistook an elevation in feet for nautical miles. So instead of facing downwards, the mirror pointed upwards and the test failed. The laser was supposed to be an anti-missile weapon, and so the conclusion was that "it would only work on a cooperative target".

Most mistakes don't have such extreme consequences, but unit mistakes still remain a major issue.
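
To make the factor of 4.45 concrete, here is a minimal sketch of what carrying the unit along with a value can look like. The function name and the unit strings are made up for illustration; only the conversion factor itself is a fixed constant.

```python
# Conversion factor: 1 pound-force second expressed in newton seconds
LBF_S_TO_N_S = 4.4482216152605

def to_newton_seconds(value, unit):
    """Convert an impulse value to newton seconds, refusing unknown units."""
    if unit == "N s":
        return value
    if unit == "lbf s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unsupported unit: {unit}")

# The same number means something quite different depending on its unit
impulse = to_newton_seconds(100.0, "lbf s")
```

The point is less the arithmetic than the interface: a value without its unit cannot be passed along, so a mismatch turns into an explicit error instead of a silently wrong number.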

Visualization

Plethora of visualizations available
- Technically possible? Suitable?

Map Data to Visualization
- Which data to which visual artifact?

not every vis for every data
- kind of data needed
- amount of data
hard with no experience
- more fitting data to vis and not vis to data

similar challenge for binding data to vis

same for data to artifact mappings
categorical vs quantitative
other constraints: distinct values?
no black and white

relying mostly on user expertise

two extremes in approaches
- limit the number of choices (spreadsheet, …)
- allow full flexibility to code (coding libraries)

I already said you have to select a fitting visualization. But not every visualization works for every dataset. Each one has certain requirements about what kind of data can be displayed or how much of it. When you lack the overview, you might not know that there is something for your data and end up somehow fitting your data to some visualization you know.

A corresponding challenge is mapping parts of your data to the components of the visualization. Some components can only be used for quantitative values, others only for categorical ones. Or there are constraints like how many distinct values can reasonably be shown. Not all of these criteria are black and white, so what might work in one case, does not in another.
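
Such binding constraints could be checked along the following lines. This is a simplified sketch with hypothetical channel descriptions, and the limit of 10 distinct colors is an arbitrary choice for the example.

```python
def can_bind(column, channel):
    """Check whether a data column satisfies a visual channel's constraints."""
    if column["kind"] not in channel["accepts"]:
        return False
    limit = channel.get("max_distinct")
    return limit is None or column["distinct"] <= limit

# Hypothetical channel descriptions; the limit of 10 colors is arbitrary
color = {"accepts": {"categorical"}, "max_distinct": 10}
y_axis = {"accepts": {"quantitative"}}

country = {"kind": "categorical", "distinct": 27}
population = {"kind": "quantitative", "distinct": 27}

checks = {
    "country on color": can_bind(country, color),
    "population on y axis": can_bind(population, y_axis),
    "population on color": can_bind(population, color),
}
```

A hard yes/no like this only captures the "can be used at all" part. Since many criteria are not black and white, a graded score would be the more natural result for judging whether something is a good fit.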

Nevertheless, even when we ignore details like how to use a specific tool, choosing proper visualizations requires quite some experience. Not everybody has this kind of experience and so we see quite some strange results every now and then.

Publication

Provenance
- Inputs? Operations?

Accessibility

(Updates, Changes, &) Re-execution

keep provenance
- documentation
- repeating
- includes operations applied
info not available
- not collected
- different tools
- docu manually, so dropped

default vis for prov often not useful
too complicated to understand

reproducibility
execute for newer data
not repeat manually

We need to keep track of what we're doing, sometimes to repeat a workflow and sometimes to justify what we did. The problem is that, more often than not, this information is not available. You might be using different tools for different steps, or your tool does not track what you did at all. In both cases, the documentation would need to happen manually. And as with many things done manually, it is easily forgotten.

Another issue can be seen here already. This is the default output of a common provenance visualization tool. In my experience, this only works for showing rather short or simple workflows. For larger or more complex ones, the generated graphs quickly deteriorate and are barely usable. Especially for inexperienced users, we need something they can easily understand to see what happened.

Finally, we want some documentation that can be executed once again. So, I don't want to have to sit down and repeat each and every step manually. Instead, I just want to tell a system "repeat this workflow with new data" and expect a result. This will not only be faster, but also leaves less room for errors like missing one step or entering some wrong value somewhere.
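
The idea of executable documentation can be illustrated with a very small sketch: the workflow is recorded once as a list of labeled operations and can then be replayed on new input. The structure and the two operations are hypothetical and only serve the example.

```python
def run_workflow(steps, data):
    """Apply each recorded step in order and return the final result."""
    for _label, operation in steps:
        data = operation(data)
    return data

# Hypothetical recorded workflow: each entry is (label, operation)
workflow = [
    ("drop missing values", lambda rows: [r for r in rows if r is not None]),
    ("convert to thousands", lambda rows: [r / 1000 for r in rows]),
]

first_result = run_workflow(workflow, [1000, None, 2500])
# "Repeat this workflow with new data" without redoing any step by hand
second_result = run_workflow(workflow, [4000, 500])
```

The labels double as documentation of what was done, in which order.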

Thesis Outline

some of previous issues addressed in thesis

metadata schema
prerequisite for dataset combination
- search result is not a single dataset but a combination

unit consistency during operations
minimize number of conversions in transformations

vis recommender
- based on data characteristics
- "can be used at all" vs "is a good fit"
create vis

tracking provenance along the way
specialized PROV graph layout

focus on metadata and dataset combination

So we have seen that the way to a visualization is not without challenges. In my thesis I tried to develop a system that addresses some of them.

For search I extended metadata schemes with more information to allow for queries that can span multiple datasets.

During data transformations, units are taken care of automatically. This ensures the consistency of units and we can minimize the unit conversions necessary.

When it comes to visualization itself, I developed a recommender to help select visualizations and actually create them later on.

Finally, provenance is tracked across the entire workflow, so all operations are documented and can be repeated if necessary. I also developed a layout algorithm to display this provenance data in a more accessible way.

For lack of time, we cannot talk about all the bits and pieces here, so we'll focus on the first part, dataset search, for now.

Interlude: OLAP-Cubes

Columns / Variables / ...

OLAP-Cube

Requirements to Metadata Schema

General Information

Title
Description
Author / Publisher
…

Loading Data

Download location(s)
Media type
Inner structure

Primary Data Search

Column concepts
Data ranges
Semantic relationships

Data Integration

Codelists
Units of measurement
Roles of Variables

General information
- common attributes
- found in many standards

Loading data
- allows to automatically process
- inner structure: data arrangement in file

Search
- support primary data search
- balance between full indexing and no indexing
- semantic relations for semantic search

Data Integration
- interoperability between datasets
codelists and units to harmonize
roles for how to combine

Now to the search engine. First, we need to define a metadata schema or at least the requirements to one.

Common things in such schemas are title, description, authors and the like. This is pretty much standard in all schemes and we'll need that as well.

Second, we need some information that allows us to actually access the data. This is also rather common and includes stuff like download locations and the filetypes we get. When we want to do automatic processing, however, we'll need to know how the data is arranged within those files, their inner structure.

Then, we wanted to enable primary data search. We can not index all the datasets directly, so we need to summarize them. The things included here are concepts for columns and their ranges. If this should be a semantic search, we also need some relations between the concepts.

And then finally, we need some information to actually combine multiple datasets. Codelists and units allow us to harmonize datasets, whereas roles, dimension or measurement, tell us how to combine them.
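
To give an impression of what a description covering these four groups of requirements might hold, here is a plain sketch. The class and field names are hypothetical and are not the vocabulary of the actual schema; the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ColumnDescription:
    concept: str                         # what the column is about
    role: str                            # "dimension" or "measurement"
    unit: Optional[str] = None           # unit of measurement, if any
    codelist: Optional[str] = None       # vocabulary used for categorical values
    value_range: Optional[tuple] = None  # (min, max) summary for primary data search

@dataclass
class DatasetDescription:
    title: str                           # general information
    publisher: str
    download_url: str                    # loading data
    media_type: str
    columns: list = field(default_factory=list)

    def dimensions(self):
        return [c for c in self.columns if c.role == "dimension"]

    def measurements(self):
        return [c for c in self.columns if c.role == "measurement"]

# Made-up example description
temperature = DatasetDescription(
    title="Monthly mean temperature",
    publisher="Example Office",
    download_url="https://example.org/temperature.csv",
    media_type="text/csv",
    columns=[
        ColumnDescription("country", "dimension", codelist="NUTS"),
        ColumnDescription("month", "dimension", value_range=("2018-01", "2019-12")),
        ColumnDescription("temperature", "measurement",
                          unit="degree Celsius", value_range=(-4.0, 27.5)),
    ],
)
```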

Metadata Schema

color coding as before
reuse
- Dublin Core
- RDF data cubes
- data catalogue
- SKOS
- RDFS / OWL

* left: dataset
* right: columns

* dim vs meas
* yavaa:TimeFormat
  * interpret the format used in datasets
* skos:Concept
  * `skos:notation` used to map between value codes and concepts

qb
- qb:ComponentSpecification - specific to a structure
- qb:ComponentProperty - reusable across structures

And this is what the metadata schema I used looks like. Where possible, I tried to reuse existing stuff.

In particular, the schema uses RDF datacubes rather extensively. However, I dropped the parts that contain the primary data and just keep one for describing the structure of datasets.

The left side here shows the general information about datasets. The right hand side is about individual columns. You can see the distinction between dimensions and measurements.

The time format allows me to define the structure of date or time strings.

Similarly, I use the notation property of concepts to translate from whatever abbreviation is used in the dataset to something I can use.

The rest is rather straightforward.

Metadata Usecase - Search

Semantic Search

Supporting the -nyms
- Hypernyms, hyponyms, synonyms, …
- Disambiguating homonyms
Multilinguality
…

Primary Data Search

Values in context
Mediate between encoding schemes
Execute complex queries on metadata
- No fetching of (much larger) primary data
- Combine criteria

benefits of semantic links
- overlap with FAIR criteria
- hypernym, hyponyms etc
- multi language
not the main focus, though

primary data
complex queries
- temperature above 20
can combine
- temperature above 20 and rain

values in context: keep connection between value and header
- "where was Germany the recipient of development aid?"
- simple indexing loses this
range information for primary data queries without full indexing
mediate: specific notations to general entities

So what do we get from such a schema?

Of course, there is the conventional semantic search stuff. We can support all kinds of hypernyms, hyponyms and the like, we can support multiple languages, etc. Roughly, we're on the level of the FAIR principles.

But we also have access to primary data descriptions. So we're not bound to the keywords the authors specified, but can search over the content.

And here the values are not isolated but put into context. So, we can go beyond plain keyword search and use complex queries. We can search not just for datasets including temperature, but for those with temperatures above 20 degrees. And we can further combine such criteria and look for datasets with temperatures above 20 degrees and some rain.
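
Such a query can be answered from range summaries alone, roughly as in the following sketch. The catalog entries, the concept names and the helper `may_contain_above` are made up for illustration.

```python
def may_contain_above(description, concept, lower_bound):
    """True if the dataset has a column for `concept` whose summarized
    maximum exceeds `lower_bound`; only metadata is inspected."""
    value_range = description["ranges"].get(concept)
    return value_range is not None and value_range[1] > lower_bound

# Made-up dataset descriptions holding (min, max) per column concept
catalog = [
    {"title": "A", "ranges": {"temperature": (-5.0, 18.0)}},
    {"title": "B", "ranges": {"temperature": (2.0, 31.0)}},
    {"title": "C", "ranges": {"temperature": (10.0, 35.0),
                              "precipitation": (0.0, 120.0)}},
]

warm = [d["title"] for d in catalog
        if may_contain_above(d, "temperature", 20)]
warm_and_rainy = [d["title"] for d in catalog
                  if may_contain_above(d, "temperature", 20)
                  and may_contain_above(d, "precipitation", 0)]
```

Note the "may" in the name: a range summary tells us that a dataset can contain matching values, not that a matching row actually exists. That is the price for not indexing the primary data in full.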

Primary Data Search across Sources

Multiple datasets to fulfill a query
Search → Search & Integration
- Identify candidate datasets
- Select subset to cover query
- Harmonize data
- Combine datasets
Change in search results
- List of datasets → Integration Workflow
- User-adjustable
- Workflow executed upon user request

complex queries → need more datasets
change in search workflow
- find data
- filter
- harmonize
- combine
result is workflow
users need to verify
before: done by human
- trial and error
now: done semi-automatically

But the more complex the queries get, the less likely we can answer them from a single dataset.

So we have to move from a mere search to a search and integration process. Where previously we only had to identify proper datasets, we now have to go through multiple steps.

We still have to identify contributing datasets, but that's not the end.

We need to select a subset of those to actually answer the query.

We need to harmonize their content.

And we need to combine them.

The result is no longer a list of datasets that may or may not fit the query, but an entire workflow. Users can then check if that's what they actually wanted and if yes, then just execute the workflow and continue working with the data.

Previously, this involved a lot of manual work and trial and error. Users had to download datasets, check if they fit and then integrate them. But now, we can do this entire process within the search engine without downloading a single dataset.

Query Structure

Keyword search → query by example
- Description of requested structure
- Column headers and possibly value ranges
Value ranges
- Categorical: enumeration of values
- Time & quantitative: lower and/or upper bound
Extent of value ranges
- Finite / bounded: range given
- Infinite / unbounded: no range given
- Semi-finite / semi-bounded: only lower or upper bound
  (for time and quantitative columns)

describe the table you want
- query by example
no keyword search, but table structure
column headings
value ranges
- categorical: list
- quantitative: min / max

finite vs infinite
semi-finite

So what do these complex queries actually look like?

This is somewhat similar to what's commonly known as query-by-example. We're not searching for keywords anymore, but describe table structure. So, we have to specify column headings and potentially value ranges.

These ranges can take two forms: for categorical columns they are lists of values, for quantitative ones they are min and max values.

But these ranges are optional, so we end up with three cases: If a range is given, this column is finite or bounded. Similarly, if no range is given, it's infinite or unbounded. But for quantitative columns, we have a third case where only one of lower or upper bound is given. For now, we'll call that semi-finite or semi-bounded.
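
The three cases can be written down as a small data structure. This is just one possible encoding with hypothetical names, not the query format of the actual system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryColumn:
    concept: str
    values: Optional[frozenset] = None  # categorical: enumeration of values
    lower: Optional[float] = None       # time / quantitative: lower bound
    upper: Optional[float] = None       # time / quantitative: upper bound

    def extent(self):
        """Classify the column as bounded, unbounded or semi-bounded."""
        if self.values is not None:
            return "bounded"
        if self.lower is not None and self.upper is not None:
            return "bounded"
        if self.lower is None and self.upper is None:
            return "unbounded"
        return "semi-bounded"

# Made-up query: one column per requested table column
query = [
    QueryColumn("country", values=frozenset({"DE", "FR"})),
    QueryColumn("year", lower=2014, upper=2019),
    QueryColumn("temperature", lower=20),
    QueryColumn("population"),
]
extents = [column.extent() for column in query]
```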

Rephrasing the Task

A jigsaw puzzle in higher dimensions

simplified view: 2D OLAP cube
given: query
- 2 dim + 1 meas
- bounded dims
given: collection of datasets
jigsaw puzzle
- not all pieces are needed
assumption: no holes
- removes a lot of complexity
no overlap
- less computationally expensive (join, conflicts, ...)
- in contrast: use all suitable datasets
  - much more data to be loaded
  - conflict resolution needed

intuition: cover with disjoint (parts of) datasets
- datasets consistent in themselves
  - any integration may cause issues

To get an intuition for how the approach works, first let's go back to the OLAP cube I mentioned before.

A simple query might look like this. We have two bounded dimensions, and are looking for a single measurement.

We also have a bunch of datasets available. Some cover the query better, some do worse.

What we have to do now, is kind of a jigsaw puzzle.

One after the other, we have to pick the dataset that currently covers our query best. Then we do the same for the remainder and so on. This way, step by step, we are building an answer to the query.

At this point, I have to add a few notes:

First, we assume that datasets have no holes. So, for each combination of dimensions there is also a measurement, or at least we don't care if there's something missing.

Second, there is no overlap in the datasets we pick, or we remove such an overlap beforehand. If we allowed overlap, the workflow would later have to download all datasets that somewhat fit. In that case we would not only have to download much more data, but we would also need to somehow resolve the conflicts. Which value would we use if we get more than one option? To prevent these issues, we'll go with disjoint datasets.

Of course, this leads to the question which datasets to choose, but I'll come back to that in a few slides.

Dataset Combination - Search for Candidates

Search for candidates
Select best candidate
Split regions
Apply Steps 2&3 recursively
Assemble workflow
Get user input

Search metadata repository
Criteria
- ≥ 1 matching dimension
- ≥ 1 matching measurement

repository of descriptions
- using previous schema
- might be federated
  - not implemented though
criteria
- need some content
- no dimensions: no context
- no measurements: no addition to result

Now we'll walk through the process with a little more detail.

In the first step, we need to find proper candidates. For this, we have a repository with descriptions using the schema from before. This might also be a federated repository, but this doesn't matter for now.

Here, we first pick all the datasets that have some overlap with our query. Overlap means they share at least one dimension and one measurement.

So why do we need at least one each?

Without dimensions, we don't have any context for the measurements. So we don't know if they actually fit the query.

And without measurements, we're not adding anything to our result.
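
In code, the candidate criterion boils down to two set intersections. The sketch below works on plain dictionaries with made-up dataset names; in the actual setting this would be a query against the metadata repository.

```python
def is_candidate(dataset, query):
    """A candidate shares at least one dimension and one measurement."""
    shared_dims = dataset["dimensions"] & query["dimensions"]
    shared_meas = dataset["measurements"] & query["measurements"]
    return bool(shared_dims) and bool(shared_meas)

query = {"dimensions": {"country", "year"}, "measurements": {"population"}}

# Made-up repository content
repository = [
    {"name": "pop_by_country", "dimensions": {"country", "year"},
     "measurements": {"population"}},
    {"name": "gdp_by_country", "dimensions": {"country", "year"},
     "measurements": {"gdp"}},          # nothing to add to the result
    {"name": "pop_total", "dimensions": set(),
     "measurements": {"population"}},   # no context for the measurement
]

candidates = [d["name"] for d in repository if is_candidate(d, query)]
```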

Dataset Combination - Select best Candidate

Search for candidates
Select best candidate
Split regions
Apply Steps 2&3 recursively
Assemble workflow
Get user input

Order candidates wrt. query
Criteria
- Coverage … overlap in values between dataset and query
- Support … common columns between dataset and query
- Excess … additional dimensions in dataset wrt. the query

$$ \text{Score}(s, q) = \begin{pmatrix} \text{Coverage}(s, q) \times \text{Support}(s, q) \\ 1 - \text{Excess}(s, q) \end{pmatrix} $$

best dataset to pick
- provides the most values
- most columns in the query
  - closest to the probable definition

criteria
- coverage
  - how much overlap?
  - no holes, so = overlap of dims
- support
  - how many cols shared?
  - more cols, more similar
  - fewer cols, coarser granularity
- excess
  - how many cols to drop?
  - need to aggregate
  - not as important, but still
coverage and support main criteria
excess only to break ties

Next we need to select one candidate. For this, we have to somehow order our list of candidates.

The intuition here is to pick the dataset that covers most of our query with the least effort.

So we define three criteria:

First, coverage - this is the overlap between the dataset and the query. With the assumption that we have no holes in the datasets, this is just a matter of getting the overlap per dimension and then multiplying those together.

Next is support. Here we measure how many columns are shared between query and dataset. The more columns match, the closer we are to the query. If some columns are missing, this means our dataset collected data on a coarser level. This is still better than nothing, but we would like to prevent that if possible.

Last and this time also least, is excess. Here we measure how many additional dimensions are in the dataset. In this case the dataset was collected for a finer detail, so we have to aggregate to match the query. This is better than the other way around, but we would still prefer a better match if possible.

In the final metric we use both coverage and support as the main criteria and excess just as a tie-breaker. As I said before, we can more easily cope with additional data in the dataset than we can with some data missing.
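
The ordering can be sketched with a tuple that is compared lexicographically. Note that the concrete formulas are assumptions made for this example, since the slides only describe the criteria informally: coverage is taken as the product of the per-dimension overlap fractions, support as the fraction of query columns found in the dataset, and excess as the fraction of dataset dimensions the query did not ask for.

```python
def coverage(dataset, query):
    """Assumed: product of per-dimension overlap fractions (no holes)."""
    result = 1.0
    for dim, wanted in query["dimensions"].items():
        available = dataset["dimensions"].get(dim)
        if available is not None:
            result *= len(wanted & available) / len(wanted)
    return result

def support(dataset, query):
    """Assumed: fraction of query columns present in the dataset."""
    query_cols = set(query["dimensions"]) | query["measurements"]
    dataset_cols = set(dataset["dimensions"]) | dataset["measurements"]
    return len(query_cols & dataset_cols) / len(query_cols)

def excess(dataset, query):
    """Assumed: fraction of dataset dimensions not asked for.
    Candidates have at least one dimension, so no division by zero."""
    extra = set(dataset["dimensions"]) - set(query["dimensions"])
    return len(extra) / len(dataset["dimensions"])

def score(dataset, query):
    # Tuples compare element by element: excess only matters on ties
    return (coverage(dataset, query) * support(dataset, query),
            1 - excess(dataset, query))

query = {"dimensions": {"country": {"DE", "FR"}, "year": {2018, 2019}},
         "measurements": {"population"}}
a = {"name": "a", "measurements": {"population"},
     "dimensions": {"country": {"DE", "FR"}, "year": {2018, 2019},
                    "sex": {"f", "m"}}}
b = {"name": "b", "measurements": {"population"},
     "dimensions": {"country": {"DE", "FR"}, "year": {2018, 2019}}}
c = {"name": "c", "measurements": {"population"},
     "dimensions": {"country": {"DE"}, "year": {2018, 2019}}}

best = max([a, b, c], key=lambda d: score(d, query))
```

Here `a` and `b` tie on the first component, and `b` wins because it needs no aggregation. `c` covers only half of the query and ranks last regardless of its excess.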

Dataset Combination - Split Regions

Search for candidates
Select best candidate
Split regions
Apply Steps 2&3 recursively
Assemble workflow
Get user input

Split query into regions
- Already covered
- So far uncovered
Maintain "rectangular" shape
Using conflict-avoiding strategy

one dataset picked
split query
rectangular shape
- preserve the "no holes assumption"
multiple strategies possible for splitting
- symmetric overlap
  - wanted no overlap - see slide before
- asymmetric disjoint
  - needs order of dimensions
  - worst deteriorates more than the others
- symmetric disjoint
  - least decisions to be made
  - predictable behavior
  - more subqueries in this step, but doesn't matter

At this stage we have picked one dataset. Next, we need to figure out which part of the query is still uncovered, that is, where we still have to look for datasets.

The metrics on the previous slide assumed a rectangular shape of the query. For the subqueries we have to maintain this constraint.

In general, there are multiple strategies again.

We can split along each dimension individually and keep the others as is. That's the easiest option, but this would give us an overlap in queries with the same consequences we tried to avoid before.

So we want to go for disjoint queries. This still leaves two options. In one option, we would give preference to specific dimensions and first split according to them, before going to the next. But then we would need to define some order of dimensions which may fail for some cases.

And lastly we have the option shown on the slide. We split each dimension individually and then construct all combinations where the query is still not covered. This results in more subqueries than the other strategies, but gives the most predictable results, and the number of subqueries shouldn't matter that much. So we go with this option.
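
For categorical dimensions, this symmetric disjoint split can be sketched as follows. It is a simplified illustration that represents each dimension as a set of values and assumes the picked dataset has all dimensions of the query; the function name is made up.

```python
from itertools import product

def split_regions(query, dataset):
    """Return disjoint, rectangular subqueries not covered by `dataset`."""
    per_dim = []
    for dim, wanted in query.items():
        covered = wanted & dataset[dim]
        uncovered = wanted - dataset[dim]
        per_dim.append([(dim, covered, True), (dim, uncovered, False)])

    subqueries = []
    for combo in product(*per_dim):
        if all(is_covered for _, _, is_covered in combo):
            continue  # the region answered by the picked dataset
        if any(not values for _, values, _ in combo):
            continue  # empty along one dimension, nothing to ask for
        subqueries.append({dim: values for dim, values, _ in combo})
    return subqueries

query = {"country": {"DE", "FR", "ES"}, "year": {2018, 2019, 2020}}
picked = {"country": {"DE", "FR"}, "year": {2018, 2019}}
remaining = split_regions(query, picked)
```

In this example, the picked dataset answers 4 of the 9 cells. The remaining 5 cells end up in three rectangular subqueries that overlap neither each other nor the covered region.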

Dataset Combination - Apply Steps 2&3 recursively

Search for candidates
Select best candidate
Split regions
Apply Steps 2&3 recursively
Assemble workflow
Get user input

Reuse candidate list from before
- Drop candidates with a score of zero
- No new query to the metadata repository needed
Terminate recursion if …
- Entire (remaining) query is covered
- No more candidates are left over

subset of candidates still valid
- there can not be new candidates
single dataset might be solution to multiple recursive calls
- unify later again
- no redundant loading
Terminate
- query covered: only applies to finite queries

Now we can apply the same process recursively to subqueries as well.

We don't have to get new candidates, though. Any candidate of a subquery has already been a candidate of the initial query, so we already have all the candidates we need. When scoring the remaining candidates against the subqueries, some will return a score of zero as they do not match the subquery anymore. These candidates can not only be ignored for this step, but can also be dropped from further recursions on this path.

There are two ways the recursion can end: there might be no subqueries anymore, so we were able to answer everything, or we might be running out of candidates. As we get no new candidates and only drop some of them in each recursion step, at least the latter condition guarantees that the recursion stops at some point.
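
Put together, the recursion can be sketched like this. It is a strongly simplified version: candidates are ranked by the number of covered cells only (instead of the full score), all dimensions are categorical sets, and every candidate is assumed to have all query dimensions. Names and data are made up.

```python
from itertools import product

def overlap(query, dataset):
    """Number of query cells the dataset can answer."""
    cells = 1
    for dim, wanted in query.items():
        cells *= len(wanted & dataset["dims"][dim])
    return cells

def split(query, dataset):
    """Disjoint, rectangular subqueries not covered by the dataset."""
    parts = [
        [(dim, wanted & dataset["dims"][dim], True),
         (dim, wanted - dataset["dims"][dim], False)]
        for dim, wanted in query.items()
    ]
    return [
        {dim: values for dim, values, _ in combo}
        for combo in product(*parts)
        if not all(c for _, _, c in combo) and all(v for _, v, _ in combo)
    ]

def cover(query, candidates):
    """Return a list of (dataset name, region) pairs answering the query."""
    # Drop candidates that no longer overlap; they cannot help further down
    remaining = [d for d in candidates if overlap(query, d) > 0]
    if not remaining:
        return []  # out of candidates: this region stays unanswered
    best = max(remaining, key=lambda d: overlap(query, d))
    region = {dim: wanted & best["dims"][dim] for dim, wanted in query.items()}
    plan = [(best["name"], region)]
    # No new repository lookup: subqueries reuse the shrinking candidate list
    others = [d for d in remaining if d is not best]
    for subquery in split(query, best):
        plan.extend(cover(subquery, others))
    return plan

query = {"country": {"DE", "FR", "ES"}, "year": {2018, 2019}}
candidates = [
    {"name": "d1", "dims": {"country": {"DE", "FR"}, "year": {2018, 2019}}},
    {"name": "d2", "dims": {"country": {"ES"}, "year": {2018, 2019}}},
    {"name": "d3", "dims": {"country": {"IT"}, "year": {2018}}},
]
plan = cover(query, candidates)
```

Every call removes at least the picked dataset from the list handed down, so the recursion has to stop even if parts of the query remain unanswered.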

Dataset Combination - Assemble Workflow

Search for candidates
Select best candidate
Split regions
Apply Steps 2&3 recursively
Assemble workflow
Get user input

Adjust dataset schemata
- Additional measurements → drop columns
- Additional dimensions → user interaction
Combine partial solutions
- Union- / join-operators

combine parts
- inverting the previous recursion
adjust schema
- add measurements:
  - no context → can be removed
- add dimensions:
  - higher granularity → need aggregation function
  - agg function can not be determined automatically
  - example: money - average over all payments or sum of all payments?
Combine
- union: same measurement for different dim values
- join: different measurement for same dim values

When we have answered all subqueries, we need to stitch the results together again to get a single workflow.

This is also done in two steps:

First, we adjust the schema of each dataset to match the query. If there are additional measurements, we can just drop them. They provide no context and are not asked for, so there is no effect anyways.

If we have additional dimensions, however, it's a little more complex. Additional dimensions mean data was collected on a finer granularity and we have to aggregate. But in general we can not know which aggregation function is the right one. Just imagine some data about sales. Does the user want the average sales or the sum of all sales? So for now, we add just a placeholder and have the user fill it in later.

When the dataset schemas are adjusted, we can combine them in the reverse order of our previous recursion. Depending on whether the schemata already match or not, we'll use either a UNION or an OUTER JOIN to combine them.

The result is a tree-like workflow as shown in the middle here.
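
The two combination operators can be illustrated on plain lists of rows. This is only a sketch with made-up figures; the function names are hypothetical, and a real engine would also have to handle missing values and units.

```python
def union(left, right):
    """Same columns, different dimension values: stack the rows."""
    return left + right

def outer_join(left, right, keys):
    """Different measurements for the same dimension values: merge rows
    on the key columns and keep rows that exist on only one side."""
    merged = {}
    for row in left + right:
        key = tuple(row[k] for k in keys)
        merged.setdefault(key, {}).update(row)
    return list(merged.values())

# Made-up partial results
pop_de = [{"country": "DE", "year": 2019, "population": 83}]
pop_fr = [{"country": "FR", "year": 2019, "population": 67}]
sheep = [{"country": "DE", "year": 2019, "sheep": 1.5}]

population = union(pop_de, pop_fr)
combined = outer_join(population, sheep, ["country", "year"])
```

In this sketch, a measurement that is missing on one side simply does not appear in the merged row, as for the sheep count of `FR`.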

Dataset Combination - Get User Input

Search for candidates
Select best candidate
Split regions
Apply Steps 2&3 recursively
Assemble workflow
Get user input

Present result to user
- Coverage wrt. query
- Included data providers
- Included datasets
Select aggregation functions if necessary
Refine search?

single result (not list of options)
user review
- may exclude:
  - data providers
  - datasets
- needs re-evaluation then
result is executable workflow

So now we have the workflow more or less ready and can present it to the user.

We can also show them how much of their query we could answer and which datasets or providers are involved. If they want, they could now remove some dataset or provider for some reason. In this case we would need to rerun the process and remove the excluded datasets from the candidate list.

If in the previous step we added some placeholders for aggregations, the user would also need to say which one they want to have. This completes the workflow.

The user may now hit some execute button, and the system can start to run the workflow. This would download all involved datasets and combine them. So in the end, the user would get a complete dataset and can go on in their work.

Dataset Combination - Summary

Evaluation

* 3 aspects
* compute engine: performance
  * conventional wisdom: JS is not capable of heavy lifting
  * SQLite ... RDBMS ... specialized software
  * Python ... common scripting language
* search engine: performance
  * compared to traditional keyword search
* user evaluation:
  * usability etc
  * using Eurostat
* focus on user evaluation

Now we come to the evaluation. I implemented the algorithm we just saw and all the other parts I mentioned before in a web app. With this implementation, we can now compare my ideas to others. Overall, I looked at three different aspects.

First is the overall performance of the system: is it actually fast enough? After all, everybody says that JavaScript is too slow for something like this. I compared my system against SQLite and Python, and it turns out I'm only worse by a factor of 2, and the absolute times are still OK.

Second, I looked at the performance of the search itself. We've just seen that the combination requires a few more steps, so how much slower than a plain keyword search is it? Same result as before: compared to keyword search, we can serve about half as many requests per minute, but this is still more than a thousand requests per minute from a single machine.

The last thing was to have a few users actually have a go at it. The default choice for such tasks is still spreadsheet software, so we use Excel and LibreOffice as a baseline here, with the help of the Eurostat search.

For the remainder of the time, we'll now have a closer look at this last part.

Evaluation - Setup

Many thanks to Maximilian Stiede for the help here!

data preparation
Eurostat
- 6k datasets
- 3k picked
exclusion
- multiple units
- unit not in OM - some units manually added
- multiple time formats - e.g., mixed data for months and quarters
- Download Issues - server error 500
to RDF by templates
same setup used for all evaluations
thank Max Stiede

For the evaluation we need some data first.

Here I used Eurostat. At the time, it listed about 6 thousand datasets.

Some of these I had to omit for different reasons: for example, they included multiple units or time formats, something the metadata model can't handle.

Afterwards, about 3 thousand datasets remained. For those I generated the descriptions and put them into a GraphDB triple store.
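
The exclusion rules from the slide can be sketched roughly as follows. This is only an illustration under assumptions: the `DatasetInfo` record, its fields, and the function names are hypothetical and do not reflect the actual crawler code; `known_units` stands for the units available in OM plus the ones added manually.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetInfo:
    """Hypothetical summary of one crawled dataset."""
    code: str
    units: set = field(default_factory=set)         # units found in the data
    time_formats: set = field(default_factory=set)  # e.g. {"year"} or {"month", "quarter"}
    downloaded: bool = True                          # False on e.g. a server error 500


def exclusion_reason(ds, known_units):
    """Return why a dataset is excluded, or None if it can be used."""
    if not ds.downloaded:
        return "download issue"
    if len(ds.units) > 1:
        return "multiple units"
    if not ds.units <= known_units:
        return "unit not in OM"
    if len(ds.time_formats) > 1:
        return "multiple time formats"
    return None


def select_usable(datasets, known_units):
    """Keep the datasets for which no exclusion rule applies."""
    return [ds for ds in datasets if exclusion_reason(ds, known_units) is None]
```

Returning the reason instead of a plain yes/no makes it easy to count afterwards how many datasets were dropped by which rule.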

You can also see the specs of the machine I used. This is not a high-performance machine, but a rather small one. It still worked out just fine.

Quick sidenote: Max Stiede helped quite a lot with the implementation during his internship. So many thanks to him again.

Evaluation - Scenario

Your task is to create a dataset that holds the amount of sheep per inhabitant for the following European countries (the shortlist of vacation destinations of your superior - purely coincidental, of course) and period of time (previous five years):

Countries: Germany, Iceland, Ireland, Romania, Spain
Period of time: 2014 - 2019

After the dataset has been assembled, choose an adequate graph to present your results to your fellow colleagues and the general public. The suggested order of steps is as follows. Your personal workflow might deviate, though.

Identify suitable datasets.
While in general Eurostat has all the data you need, it is not provided as a single dataset to start with, so you will need to combine multiple ones.
Prepare a single dataset.
Eurostat's datasets contain more data than needed, so you will have to filter for the requested values. You may also need to join multiple source datasets.
Calculate the desired metric.
The requested metric is not included in Eurostat's raw data, so you will have to calculate it manually.
Select a proper visualization.
Once the dataset contains only the requested values, you can choose a suitable visualization.
Export your results.
Store your results (data and visualization) locally and then upload them on the next page.
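
To make the data part of the task concrete, here is a minimal sketch of steps 2 and 3 in plain Python. It assumes both source datasets are already loaded as `(country, year, value)` rows in matching units (so any unit conversion has happened before); the function names and the row layout are assumptions for illustration and are not part of the task description.

```python
COUNTRIES = {"Germany", "Iceland", "Ireland", "Romania", "Spain"}
YEARS = range(2014, 2020)  # 2014 - 2019, both inclusive


def filter_rows(rows):
    """Keep only the requested countries and years, keyed by (country, year)."""
    return {
        (country, year): value
        for (country, year, value) in rows
        if country in COUNTRIES and year in YEARS
    }


def sheep_per_inhabitant(sheep_rows, population_rows):
    """Inner join on (country, year), then divide sheep by inhabitants."""
    sheep = filter_rows(sheep_rows)
    population = filter_rows(population_rows)
    return {
        key: sheep[key] / population[key]
        for key in sorted(sheep.keys() & population.keys())
    }
```

Because of the inner join, a country or year that is missing in one of the two sources silently drops out of the result, which is exactly the kind of missing data one would have to check for in a submission.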

Evaluation - Anticipated Strategies

User Evaluation - Setup

Within-subject design with counterbalancing
- Eurostat + Spreadsheet software
  (LibreOffice Calc, Microsoft Excel)
- Yavaa
Tutorials provided for all tools
Conducted fully remote and unsupervised
in Q1 / Q2 2019
Submissions
- 92 total
- 16 complete

Self-assessment: Prior experience.

Within-subject design with counterbalancing
- everybody doing both tasks
- order of systems varied
- was randomized, but all successful submissions placed Yavaa first
  - should favor spreadsheet due to more experience with the subject
tutorials available for all tools: Excel, LibreOffice, Yavaa
Remote execution
- reduce observer bias
- allow participants to pick place and time
low number of responses
- individual interviews might have been better approach

self assessment
- language skills: good enough, so everybody should understand instructions
- remainder: balanced distribution (given number of responses)

For the study itself, I asked everybody to do the task twice: once with my tool and once with Excel or LibreOffice. The order was supposed to be randomized, but all successful submissions had Yavaa first and then the spreadsheets. If anything, this should give an advantage to the spreadsheet software, as people already knew the data somewhat at that point.
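
For illustration, a common way to counterbalance two conditions is to shuffle the participants and then alternate the two possible orders. The sketch below shows that generic approach; it is an assumption about how this is usually done, not the assignment code of the survey, and `assign_orders` is a hypothetical name.

```python
import random

CONDITIONS = ("Yavaa", "Spreadsheet")


def assign_orders(participant_ids, seed=0):
    """Shuffle participants, then alternate the two possible tool orders."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    orders = {}
    for i, pid in enumerate(shuffled):
        first = CONDITIONS[i % 2]
        second = CONDITIONS[(i + 1) % 2]
        orders[pid] = (first, second)
    return orders
```

Note that such a scheme only balances the orders among everybody who starts the survey. It cannot guarantee a balance among those who actually finish it, which is the situation described above.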

For all tools I also created tutorials that included all the stuff needed.

The survey was fully remote. The intention was to let people pick the place and time to do it. This also reduces my own influence on the results.

It ran in early 2019. In hindsight maybe not the best of times, so I only got 16 submissions that I could actually use.

The responses I got cover a rather good spectrum of experience. As you can see here, we have people with next to no experience as well as some with quite a lot of experience.

User Evaluation I - Successful Task Execution

Manual assessment of submitted artifacts
Classes of Issues

High-severity

unsuitable for the task.

Example(s): incorrect joins, missing data.

Moderate-severity

suitable, but violating some constraints.

Example(s): additional data included, no unit conversion.

Low-severity

cosmetic issues.

Example(s): countries referred to by abbreviation.

Issues per submission: Summary.

manual revision of all submitted artifacts
- both visualization and data
- compared to an "ideal solution"
iteratively collected all issues appearing
aggregated to some issue classes
evaluated impact

no "perfect result" (though some artifacts were not uploaded)
generally fewer issues for Yavaa

First I looked at the quality of the results. I had an ideal solution, but of course all submissions were different.

So I made a list of differences and sorted their impact into three categories. High-severity issues make the result unusable: some data was missing or the join was messed up. On the other hand, low-severity issues don't really matter; these are minor cosmetic things like leaving the abbreviations for countries. The moderate issues are in between: the results are more or less correct, but not what the task asked for. For example, some people added more countries than requested.

When we look at the distribution, we can see that there are fewer issues with Yavaa. In particular, we have far fewer results with multiple issues.

So it seems that Yavaa at least helps you avoid quite a few mistakes along the way.
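
The per-submission summary can be thought of as a simple count per severity class. A minimal sketch, assuming every issue found in a submission is recorded under a short label; the labels are taken from the examples on the slide, while the `SEVERITY` mapping and the function name are hypothetical.

```python
from collections import Counter

# Issue labels follow the examples on the slide; the mapping itself is illustrative.
SEVERITY = {
    "incorrect join": "high",
    "missing data": "high",
    "additional data included": "moderate",
    "no unit conversion": "moderate",
    "country abbreviations": "low",
}


def issues_by_severity(issues):
    """Count the issues of one submission per severity class."""
    counts = Counter(SEVERITY[issue] for issue in issues)
    return {level: counts.get(level, 0) for level in ("high", "moderate", "low")}
```

A submission with at least one high-severity issue would count as unsuitable for the task, no matter how few other issues it has.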

User Evaluation II - Time Taken

overall times
- Yavaa only half the time
details more complicated
- Yavaa better at joins and vis
- rest similar
enhanced combines three steps
- clear advantage for enhanced

timing: start of survey page until submission
Yavaa strategies split for first 3 steps
timings: Average (Median)

Yavaa (total)   Conventional   Enhanced    Spreadsheet
22.7 (21.5)     21.7 (21.0)    24.6 (24)   42.4 (35)

splitting: self assessment by participants
- "share of time for each task"
distinguish conventional vs enhanced strategies
- logged all user interactions in Yavaa

detailed timings: Median

                    Yavaa (total)   Conventional   Enhanced   Spreadsheet
Search & Load       -               5.8            9.2        4.8
Filter              -               4.1            0          4.2
Join                -               1.6            0          6.6
Transform & Adapt   4.6             -              -          4.5
Visualization       3.6             -              -          4.8
Export              1.4             -              -          1.7

data preparation: Median

Yavaa (total)   Conventional   Enhanced   Spreadsheet
10.8            10.9           9.2        18.4

Next is the question about time - so how long did it take for people to finish the task.

In the overall summary, we can see that on average it took about twice as long with Excel or LibreOffice.

When we look into the details, it's a little more difficult to see where this advantage comes from. We can see that Yavaa was faster for joining the datasets and for visualizing them but roughly similar or even a little slower for the rest. The enhanced search seems to be especially slow in the search & load part.

But as you recall, that strategy actually combines a few steps, so for a fair comparison we have to combine them.

This is what we can see here. This includes all the steps until the dataset is ready. And here we see a clear advantage for Yavaa and also for the enhanced strategy. So overall, this indicates that this combination of datasets can indeed speed up the process.
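
A side note on reading these numbers: the combined medians are not simply the sums of the step medians (for the conventional strategy, 5.8 + 4.1 + 1.6 = 11.5 versus 10.9). One plausible reason is that the steps are summed per participant first and the median is taken afterwards. The sketch below shows the difference with placeholder values; it assumes that data preparation covers search & load, filter, and join, and all names and numbers in it are made up for illustration.

```python
from statistics import median

# Assumed definition of "data preparation": the three steps the enhanced strategy combines.
PREP_STEPS = ("search_load", "filter", "join")


def median_of_totals(participants):
    """Sum the preparation steps per participant first, then take the median."""
    return median(sum(p[step] for step in PREP_STEPS) for p in participants)


def sum_of_medians(participants):
    """Take the median per step first, then add the medians up."""
    return sum(median(p[step] for p in participants) for step in PREP_STEPS)


# Placeholder values for illustration only, not survey data.
participants = [
    {"search_load": 3, "filter": 4, "join": 9},
    {"search_load": 8, "filter": 3, "join": 4},
    {"search_load": 4, "filter": 9, "join": 3},
]
```

For these placeholder values, the two functions give different results, because each participant is slow at a different step.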

User Evaluation III - Usability

User assessment.

J. Brooke. SUS: a "quick and dirty" usability scale. In: Usability Evaluation in Industry. London: Taylor and Francis, 1996.
J. Brooke. SUS: A Retrospective. In: J. Usability Studies 8.2 (Feb. 2013), pp. 29–40.
A. Bangor, P. T. Kortum, and J. T. Miller. An Empirical Evaluation of the System Usability Scale. In: International Journal of Human-Computer Interaction 24.6 (July 2008), pp. 574–594. DOI: 10.1080/10447310802205776.
A. Bangor, P. Kortum, and J. Miller. Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. In: J. Usability Studies 4.3 (May 2009), pp. 114–123.
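
As a reminder of how SUS turns a questionnaire into a single number, here is a minimal sketch of the standard scoring as described by Brooke: ten items on a scale from 1 to 5, with odd-numbered items worded positively and even-numbered ones negatively. That the survey applied exactly this scoring is an assumption on my side; the function name is hypothetical.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten answers on a 1-5 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses between 1 and 5")
    total = 0
    for index, answer in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: higher is better.
            total += answer - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: lower is better.
            total += 5 - answer
    return total * 2.5
```

The adjective ratings by Bangor et al. then help to interpret what a given score means.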

User Evaluation IV - Difficulty

User assessment.

Relative user assessment.

Code & Supplement Availability

Code: https://github.com/Yavaa-Vis

Yavaa https://github.com/Yavaa-Vis/Yavaa
Eurostat Crawler https://github.com/Yavaa-Vis/Yavaa_Eurostat_Crawler

Yavaa 10.5281/zenodo.5204516
Eurostat Crawler 10.5281/zenodo.5204518
Evaluation Materials 10.5281/zenodo.4589337
Evaluation User Survey Results 10.5281/zenodo.5171103
Evaluation Performance Benchmark 10.5281/zenodo.4514808

