Here is the presentation Martin gave on using the VESPER visual tool to spot data quality issues in Darwin Core Archives at the TDWG 2013 conference.
A paper discussing the design and capabilities of VESPER was recently published in Elsevier’s Ecological Informatics journal.
A final version of the demonstration software is now available. Please take a look and let us know what you think.
A new version of the demonstration software is now available. Please take a look and let us know what you think.
Ok, some quick instructions for the VESPER demo page.
First off, make sure you have an up-to-date web browser. Your best bet is the latest release of Chrome, Opera (now WebKit based), Safari (on Mac), or Firefox, and at a push IE9. What I haven’t tested it on: IE10 or any mobile browser.
Next, choose the DWCA you want to see. There are five examples in the top bar, and if you’re using Chrome/Opera (which have implemented HTML5 local filesystems) there’s a button next to those to load your own DWCA (a file browser should pop up).
Let it load the zip file and it’ll then present you with a small panel showing your next choices. Pick a field you’d like to see used as a “name” in the taxonomy and map components – you may not have a choice here. Then pick which of the visualisations VESPER has deduced, from the DWCA’s meta.xml, that the archive should be able to populate. You can press “advanced” if you want a glimpse of which fields in the meta file your current selections are picking, but I wouldn’t spend too long poking about in there.
In the taxonomy/timeline/taxa distribution components, left-click navigates and right-click selects. The map, powered by leaflet.js, has its own UI interactions, which are pretty standard for most map interfaces – press the circle, square, or polygon icons to start selecting groups. Each visualisation panel has an “x” in the corner to delete that view. Pressing the “x” in the corner of the “master view” kills all associated views for that data set.
Then, have an explore about and spot areas where the data looks good or problematic.
Big break in blog postings there…
Made zip loading as efficient as it can be now. There are still massive differences between browsers: Opera seems to store every single byte it ever comes across (peaking at 1.0GB+ of memory for the ENA dataset, which is, err, 1.0GB unzipped) whereas Chrome doesn’t use even half of that.
However, some of the browsers I’m trying to code to are losing relevance. Safari isn’t getting updates for Windows any more, at least, and Opera is swapping to the WebKit engine (used by Chrome and Safari). So really I have three targets now: Chrome/Safari (Mac), Firefox, and IE. I need to get a machine to test IE10 on too; at the moment I’m testing against IE9.
As far as the visualisation goes, I need to do some work now. Leaflet.js covers the map stuff pretty well, there’s a toolkit out there based on D3 called Rickshaw which may be able to do the graphing elements, and I can pretty much build my own taxonomy browser. After that it really is a case of how much data I can handle.
Continuing to try and crunch the massive 970,000-taxa data set into some HTML5-compliant browsers.
May have to jettison IE9, as it doesn’t support ArrayBuffer and other HTML5 stuff, making it inefficient and, well, just not usable.
Discovered that not turning every de-zipped character into a string ready for potential concatenation saves a bit of memory and is a lot faster. Instead, I hold the de-zipped char code values for each character and knock them into strings only when my parser deems it necessary.
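The trick can be sketched roughly like this (illustrative code, not VESPER’s actual parser):

```javascript
// Hold unzipped output as numeric char codes, and only materialise a
// string when the parser actually wants one, in a single
// String.fromCharCode call rather than thousands of tiny concatenations.
function CharBuffer() {
  this.codes = [];               // raw char codes, no string objects yet
}

CharBuffer.prototype.push = function (code) {
  this.codes.push(code);
};

CharBuffer.prototype.flushToString = function () {
  var s = String.fromCharCode.apply(null, this.codes);
  this.codes.length = 0;         // reuse the same array for the next field
  return s;
};
```

One caveat with this sketch: `Function.prototype.apply` with a very large array can hit the engine’s argument limit, so a real version would build the string in chunks for long fields.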
Trying to use a handmade hashtable for index values didn’t work: it was a lot slower and took as much memory as just using the index values as property names in a big JSON object, so Chrome and Firefox are optimised there at least.
Now I can view the 970,000-taxa tree in Firefox.
Following on from the previous post, I’ve now got the zip library working as well as possible IMO.
I saved some memory initially by reading in the zip file as an ArrayBuffer rather than a String, which uses 1 byte per file byte rather than 2. Then I changed the zip depacking routines to work on an ArrayBuffer rather than a String, which mainly involved mimicking a couple of String functions, indexOf and the like.
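As a rough illustration (the names here are mine, not JSZip’s), an `indexOf` that works over a Uint8Array view instead of a String looks something like:

```javascript
// String.indexOf mimicked over a Uint8Array: scan for a multi-byte
// pattern and return the offset of its first occurrence, or -1.
function bytesIndexOf(bytes, pattern, fromIndex) {
  var start = fromIndex || 0;
  outer:
  for (var i = start; i <= bytes.length - pattern.length; i++) {
    for (var j = 0; j < pattern.length; j++) {
      if (bytes[i + j] !== pattern[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// e.g. find a zip local-file-header signature, PK\x03\x04
var buf = new Uint8Array([0x00, 0x50, 0x4b, 0x03, 0x04, 0x00]);
bytesIndexOf(buf, [0x50, 0x4b, 0x03, 0x04]);   // 1
```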
I built a rudimentary UI that lets me pick which fields in each file I want to read (and which files I want, full stop). When I unzip the files, I now divert to one of my own functions that works specifically on DWCA files (i.e. delimited value files). I build up a temporary String array of unzipped data until I detect a line delimiter, and then pack that array off to another function. There, I look for and swap in UTF-8 codes, ignore header lines, and then search for field delimiters. When the field delimiter count, in conjunction with the selected fields in the UI, tells me I should keep this section of data, I .join a String out of that section of the array (between the last and current field delimiters). This means anything I want to ignore never gets as far as its own string, and the temporary String array keeps getting wiped and reused for each line in the files. Further, if I’ve decided a field is a categorical field composed of a limited set of values, I repoint to a String in a set pool of values if that value already exists, rather than repeatedly store, say, “Species” half a million times.
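The value-pool trick at the end is essentially string interning. A minimal sketch (hypothetical helper, not VESPER’s actual code):

```javascript
// Categorical value pool: every occurrence of a repeated value like
// "Species" points at one shared string, instead of each parsed line
// producing its own fresh string object via fromCharCode/join.
var valuePool = {};

function pooled(value) {
  // hasOwnProperty guards against field values like "constructor"
  // colliding with inherited Object properties
  if (!Object.prototype.hasOwnProperty.call(valuePool, value)) {
    valuePool[value] = value;
  }
  return valuePool[value];
}
```

Every string freshly assembled from char codes is a separate object in memory; routing categorical fields through `pooled` means the data structures all hold references to the single pooled copy.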
In doing so, I can now read in a taxonomy of 970,000 nodes into Chrome at least.
The next step was to visualise this taxonomy, which used a massive amount of memory. Investigation showed that D3’s hierarchy-calculating routines liked to attach multiple coordinate properties (x, y, dx, dy) to each and every node whether it got shown or not. For nearly a million nodes that’s nearly four million properties and values. Analysis showed that the tree layouts could only display a few hundred nodes at any one time, so it was wasting time and memory. I rejigged one of the existing layout algorithms to accommodate this fact, making sure I didn’t calculate any coordinates for objects too small to render (i.e. < 1px in height/width) – we’d previously realised drawing these objects was pretty pointless and slowed everything down tremendously. This saved a massive amount of memory (a monitored >75MB saving in Chrome).
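The pruning idea can be sketched like so (a minimal illustration of the principle, not D3’s or VESPER’s actual layout code):

```javascript
// Walk the hierarchy top-down, assigning coordinates only while a
// node's extent is at least 1px wide; once a subtree's allocation
// drops below a pixel, stop descending - no coordinates are ever
// stored for the nodes that couldn't be rendered anyway.
function layout(node, x, width, depth, rowHeight, out) {
  if (width < 1) return;          // too small to draw: no coords stored
  out.push({ node: node, x: x, y: depth * rowHeight, dx: width, dy: rowHeight });
  var kids = node.children || [];
  var total = 0;
  kids.forEach(function (k) { total += k.size; });
  var cx = x;
  kids.forEach(function (k) {
    var w = total ? width * (k.size / total) : 0;
    layout(k, cx, w, depth + 1, rowHeight, out);
    cx += w;
  });
}
```

With a million-node tree squeezed into a few hundred pixels, the vast majority of recursive calls bail out at the first line, which is where the memory and time savings come from.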
Basically, loading in and de-packing a 10MB file causes some browsers to fall over completely with memory bloat.
I’ve been using a library called JSZip to do the hard work of finding/parsing/decompressing the zip files, and it does its job but has a fair bit of overhead which I’ve been trying to iron out to make it more memory efficient.
Memory efficiency in this case seems to boil down to three concepts:
- Only unzip the files I need to unzip
- Chuck away parts of files I unzip I think I don’t need
- Try to avoid making temporary strings, as they gobble up memory like nobody’s business in some browsers
The third point I’ve been tackling (through premature micro-optimisation, probably) by changing the string-building functions in JSZip to use joins on arrays of characters rather than string concatenation. It has proven to be a better option for older browsers (IE), though. Chrome seems smarter and uses views on existing strings if asked to substring portions of a larger string.
The first point involves changing JSZip so it doesn’t just blindly unzip everything it finds. I now make it read the directory of what’s in the zip file, and then I can pick and choose what to unzip afterwards. In practice for the DWCA data, this means unzipping the meta.xml, and then generating a UI that lets users pick and choose which further extension files and fields they wish to look at and unzip.
The second point follows on from the first. If I let users choose which fields in a file they’re interested in, I can scan for tab delimiters as the unzipping occurs and discard characters: i.e. if I don’t want the Nth column, then I discard all characters between the (N-1)th and Nth field delimiters in a line (end of line is detected by waiting for the line delimiter to turn up). I do this right in the inflate method, so the data doesn’t get a chance to form any large temporary strings.
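Stripped of the inflate plumbing, the column-discarding logic amounts to something like this (a hypothetical helper working on a whole line for clarity; the real version works on char codes as they stream out of the decompressor):

```javascript
// Given one line of a delimited DWCA file and the set of column
// indices the user ticked in the UI, keep only the wanted fields.
// Unwanted columns are skipped over without keeping their text.
function keepColumns(line, wanted, delimiter) {
  var fields = [];
  var col = 0, start = 0;
  for (var i = 0; i <= line.length; i++) {
    if (i === line.length || line[i] === delimiter) {
      if (wanted[col]) fields.push(line.substring(start, i));
      col++;
      start = i + 1;
    }
  }
  return fields;
}

keepColumns("id\tAnimalia\tSpecies\tignored", { 1: true, 2: true }, "\t");
// → ["Animalia", "Species"]
```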
So far it does the job for Chrome, I just have to see how the other browsers buy it.
One of the components in Vesper should be a map component showing plots of specimen occurrences.
I have got Google Visualization’s map component working, but it’s slow, and lacks both scalability and interaction. I have a feeling anything that runs fast may have to blast data onto an HTML5 canvas object – thousands of SVG point plots are going to go bad.
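The reason canvas scales is that each point becomes a pixel fill in one pass, with no per-point DOM/SVG node to create and manage. A minimal sketch (hypothetical function, not VESPER’s code):

```javascript
// Blast a batch of occurrence points onto a 2D canvas context as
// small filled rectangles - one drawing call per point, zero DOM
// nodes, versus one retained SVG element per point.
function drawPoints(ctx, points) {
  ctx.fillStyle = "#c00";
  for (var i = 0; i < points.length; i++) {
    ctx.fillRect(points[i].x, points[i].y, 2, 2);
  }
}
```

In a page you’d call this with `canvas.getContext("2d")` and redraw the whole batch on each pan/zoom, rather than mutating thousands of SVG elements.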
Another approach is to look on StackOverflow to see what has been recommended by others, but the site doesn’t like “what is the best…?” style questions, as they are opinion-based.
Working from what answers are there, leaflet (leafletjs) is recommended, and is still being actively developed, so will have a look at that next.
Leaflet.js seems to work great, integrates with OpenStreetMap data, and has a marker-clustering plugin that stops the interface slowing down trying to draw thousands of markers at once. It works much quicker than Google’s Map visualisation class.