vesper demo

Ok, some quick instructions for the VESPER demo page.

First off, make sure you have an up-to-date web browser. Your best bet is the latest release of Chrome, Opera (now WebKit-based), Safari (on Mac), or Firefox, and at a push IE9. What I haven’t tested it on: IE10 or any mobile browser.

Next, choose the DWCA you want to see. There are five examples in the top bar and if you’re using Chrome/Opera (which have implemented HTML5 local filesystems) there’s a button next to those to load your own DWCA (a file browser should pop up).

Let it load the zip file and then it’ll present you with a small panel showing your next choices. Pick a field you’d like to see used as a “name” in the taxonomy and map components – you may not have a choice here. Then pick which of the visualisations VESPER has deduced from the DWCA meta.xml that the DWCA should be able to populate. You can press “advanced” if you want a glimpse of which fields in the meta file your current selections are picking, but I wouldn’t spend too long poking about in there :-)

When you’re happy, press “Load” and wait anything from a second or two (HIBG) to 30 seconds or so (the ENA file) for the panel to be replaced with another panel (I’ll eventually try and come up with a progress bar but javascript isn’t that amenable.) This is your “master view” for that DWCA. Select any of the visualisation options it offers to spawn a visualisation panel of that type for the DWCA.

In the taxonomy/timeline/taxa distribution components, left-click navigates and right-click selects. The map, powered by leaflet.js, has its own UI interactions, which are pretty standard for most map interfaces – press the circle, square, or polygon icons to start selecting groups. Each visualisation panel has an “x” in the corner to delete that view. Pressing the “x” in the corner of the “master view” kills all associated views for that data set.

Then, have an explore about and spot areas where the data looks good or problematic.

i return

Big break in blog postings there…

Made zip loading as efficient as it can be now. Still massive differences between browsers. Opera seems to store every single byte it ever comes across (peaking at 1.0GB+ of memory for the ENA dataset, which is, err, 1.0GB unzipped) whereas Chrome doesn’t use even half of that.

However, some of the browsers I’m trying to code to are losing relevance. Safari isn’t getting updates for Windows any more at least, and Opera is to swap to the WebKit engine (used by Chrome and Safari). So really I have three targets now: Chrome/Safari(Mac), Firefox, and IE. I need to get a machine to test IE10 on too; at the moment I’m testing against IE9.

As far as the visualisation goes I need to do some work now. Leaflet-js covers the map stuff pretty well, there’s a toolkit out there based on D3 called Rickshaw which may be able to do the graphing elements, and I can pretty much build my own taxonomy browser. After that it really is a case of how much data I can handle.

Other javascript oddities include the fact that I have to hack a small part of D3 so I can access some tree calculation code it has. There may be a way of grabbing that via d3’s rebind command, but we’ll have to see.

more efficiency

Continuing to try and crunch the massive 970,000 taxa data set into some HTML 5 compliant browsers.

May have to jettison IE9 as it doesn’t support ArrayBuffer and other HTML5 features, making it inefficient and, well, just not usable.

Discovered that not turning every unzipped character into a string ready for potential concatenation saves a bit of memory and is a lot faster. Instead, I hold the unzipped charcode values for each character and knock them into strings only when my parser deems it necessary.
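As a minimal sketch of that idea (illustrative names and data, not VESPER’s actual code), the char codes stay as plain numbers until a whole field is wanted:

```javascript
// Hold unzipped characters as raw char codes; only build a real string
// when the parser actually wants a whole field. No per-character string
// objects are ever created along the way.
function collectField(charCodes, start, end) {
  return String.fromCharCode.apply(null, charCodes.slice(start, end));
}

// e.g. the codes for "Aves\tclass" held as plain numbers until needed
const codes = [65, 118, 101, 115, 9, 99, 108, 97, 115, 115];
const field = collectField(codes, 0, 4); // "Aves"
```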

Trying to use a handmade hashtable for index values didn’t work: it was a lot slower and took as much memory as just using the index values as property names in a big JSON object, so Chrome and Firefox are optimised there at least.
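For what it’s worth, the winning approach is just the plain-object one; a rough sketch (the names and data here are mine, not the real code):

```javascript
// Use the numeric index values directly as property names on a plain
// object; Chrome and Firefox optimise this internally far better than
// a hand-rolled hashtable written in javascript.
const index = {};

function addRecord(id, record) {
  index[id] = record; // the numeric id becomes a property name
}

addRecord(970123, { name: "Falco peregrinus", rank: "species" });
```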

Now I can view the 970,000-taxa tree in Firefox.

zip 2 and d3 efficiency

Following on from the previous post, I’ve now got the zip library working as well as possible IMO.

I saved some memory initially by reading in the zip file as an ArrayBuffer rather than a String, which uses 1 byte per file byte rather than 2. Then I changed the zip depacking routines to work on an ArrayBuffer rather than a String, which mainly involved mimicking a couple of String functions, indexOf and the like.
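The mimicked functions look roughly like this (a sketch, not the actual patched library code): an indexOf that walks a Uint8Array view of the buffer instead of a String:

```javascript
// indexOf for a Uint8Array: search for a single byte value from a given
// offset, the way String.prototype.indexOf searches for a character.
function byteIndexOf(bytes, byteValue, fromIndex) {
  for (let i = fromIndex || 0; i < bytes.length; i++) {
    if (bytes[i] === byteValue) return i;
  }
  return -1;
}

// e.g. find the 0x04 in a zip local-file-header signature "PK\x03\x04"
const header = new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x0a, 0x00]);
byteIndexOf(header, 0x04); // 3
```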

I built a rudimentary UI that lets me pick which fields in each file I want to read (and which files I want, full stop). When I unzip the files, I now divert to one of my own functions which works specifically on DWCA files (i.e. delimited-value files).

I build up a temporary String array of unzipped data until I detect a line delimiter, and then pack that array off to another function. There, I look for/swap in UTF-8 codes, ignore header lines, and then search for field delimiters. When the field delimiter count, in conjunction with the selected fields in the UI, tells me I should keep this section of data, I .join a String out of that section of the array (between the last and current field delimiters). This means anything I want to ignore never gets as far as its own string, and the temporary String array keeps getting wiped and reused for each line in the files.

Further, if I’ve decided a field is a categorical field composed of a limited set of values, I repoint to a String in a set pool of values if that value already exists, rather than repeatedly store, say, “Species” half a million times.
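That value-pooling trick at the end is simple string interning; a minimal sketch (the function name is mine):

```javascript
// Pool of categorical values: if a value has been seen before, repoint
// to the pooled string rather than storing a fresh copy of it.
const pool = {};

function intern(value) {
  return pool[value] || (pool[value] = value);
}

const a = intern("Species");
const b = intern("Species"); // same pooled string as a, not a second copy
```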

In doing so, I can now read in a taxonomy of 970,000 nodes into Chrome at least.

The next step was to visualise this taxonomy, which used a massive amount of memory. Investigation showed that D3’s hierarchy calculating routines liked to attach multiple coordinate properties (x, y, dx, dy) to each and every node whether it got shown or not. For nearly a million nodes that’s nearly four million properties and values. Analysis showed that the tree layouts could only display a few hundred nodes at any one time, so it was wasting time and memory. I rejigged one of the existing layout algorithms to accommodate this fact, making sure I didn’t calculate any coordinates for objects too small to render (i.e. < 1px in height/width) – we’d previously realised drawing these objects was pretty pointless and slowed everything down tremendously. This saved a massive amount of memory (monitored >75MB saving in Chrome).
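The culling idea, sketched here as a toy top-down layout rather than the real D3 patch: nodes whose allotted extent is under a pixel simply never get coordinates:

```javascript
// Toy hierarchical layout with culling: skip coordinate assignment for
// any node too small to render (< 1px in height/width), and don't
// recurse into its children either.
function layout(node, x, y, width, height) {
  if (width < 1 || height < 1) return; // too small to draw: no x/y/dx/dy
  node.x = x; node.y = y; node.dx = width; node.dy = height;
  if (!node.children) return;
  const childHeight = height / node.children.length;
  node.children.forEach(function (child, i) {
    layout(child, x + 20, y + i * childHeight, width - 20, childHeight);
  });
}
```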


Trying to load in the DWCA files (which are basically zips) has been an adventure in javascript.

Basically, loading in and de-packing a 10MB file causes some browsers to fall over completely with memory bloat.

I’ve been using a library called JSZip to do the hard work of finding/parsing/decompressing the zip files; it does its job, but has a fair bit of overhead which I’ve been trying to iron out to make it more memory-efficient.

Memory efficiency in this case seems to boil down to three concepts:

  1. Only unzip the files I need to unzip
  2. Chuck away the parts of files I unzip that I think I don’t need
  3. Try to avoid making temporary strings, as they gobble up memory like nobody’s business in some browsers

The third point I’ve been tackling (through premature micro-optimisation, probably) by changing the string-building functions in JSZip to use joins on arrays of characters rather than string concatenation. It has proven to be a better option for older browsers (IE) though; Chrome seems smarter, and uses views on existing strings if asked to substring portions of a larger string.
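The two string-building styles in question, side by side (a comparison sketch, not JSZip’s actual code):

```javascript
// Naive concatenation: every += may allocate a brand new string.
function buildByConcat(chars) {
  let s = "";
  for (let i = 0; i < chars.length; i++) s += chars[i];
  return s;
}

// Array join: collect the pieces, then one allocation for the result.
function buildByJoin(chars) {
  return chars.join("");
}
```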

The first point involves changing JSZip so it doesn’t just blindly unzip everything it finds. I now make it read the directory of what’s in the zip file, and then I can pick and choose what to unzip afterwards. In practice for the DWCA data, this means unzipping the meta.xml, and then generating a UI that lets users pick and choose which further extension files and fields they wish to look at and unzip.

The second point follows on from the first. If I let users choose which fields in a file they’re interested in, I can scan for tab delimiters as the unzipping occurs and discard characters: i.e. if I don’t want the Nth column, I discard all characters between the (N-1)th and Nth field delimiters in a line (end of line detected by waiting for the line delimiter to turn up). I do this right in the inflate method, so the data never gets a chance to form any large temporary strings.
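Simplified to a whole line at a time (the real version works as bytes stream out of inflate), the column-discarding pass looks something like this sketch:

```javascript
const TAB = 9; // the field delimiter's char code

// Keep only the wanted columns of one delimited line, held as char
// codes; unwanted columns never get turned into strings at all.
function keepColumns(lineCodes, wanted) {
  const fields = [];
  let fieldStart = 0, col = 0;
  for (let i = 0; i <= lineCodes.length; i++) {
    if (i === lineCodes.length || lineCodes[i] === TAB) {
      if (wanted[col]) {
        fields.push(String.fromCharCode.apply(null, lineCodes.slice(fieldStart, i)));
      }
      fieldStart = i + 1;
      col++;
    }
  }
  return fields;
}

// char codes for "a\tbb\tc", keeping columns 0 and 2
keepColumns([97, 9, 98, 98, 9, 99], [true, false, true]); // ["a", "c"]
```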

So far it does the job for Chrome, I just have to see how the other browsers buy it.

There is also the issue that javascript internalises strings with 2 bytes per character, so an ASCII-only UTF-8 file gets doubled in size in memory. I don’t know if there’s much to be done about that.
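To illustrate the doubling (using Node’s Buffer here purely for measurement; browsers don’t expose their string internals this way):

```javascript
// The same ASCII text: byte count in the UTF-8 file versus byte count
// in UTF-16, which is roughly what a javascript engine holds internally.
const text = "Falco peregrinus";
const utf8Bytes = Buffer.byteLength(text, "utf8");     // 16
const utf16Bytes = Buffer.byteLength(text, "utf16le"); // 32
```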



geo vis

One of the components in Vesper should be a map component showing plots of specimen occurrences.

My thinking is that there must be a map visualisation out there we can reuse rather than roll our own, even if we have the caveat it must be javascript-based.

I have got Google Visualization’s map component working, but it’s slow, and lacks both scalability and interaction. I have a feeling anything that runs fast may have to blast data onto an HTML5 canvas object; thousands of SVG point plots are going to go bad.

Found a link here that lists current web-based geo-visualisation toolkits. Taking the naive view that anything involving PHP involves access to servers, and given that I know nothing about Python, I’ve whittled it down to javascript/HTML-only solutions.

Package      Map tech     Licence
dbox         MapServer    OS
geoExt       OpenLayers   BSD
leaflet      WMS          BSD
mapbuilder   WMS/WFS      LGPL
mapQuery     OpenLayers   MIT
msCross      MapServer    GPL
OpenLayers   (many)       BSD
PolyMaps     (many)       OS
tile5        (many)       MIT

Another approach is to look on StackOverflow to see what has been recommended by others, but the site doesn’t like “what is the best…?” style questions as they are opinion-based.

Working from what answers are there, leaflet (leafletjs) is recommended, and is still being actively developed, so will have a look at that next.


Leaflet-js seems to work great: it integrates with OpenStreetMap data, and has a marker-clustering plugin that stops the interface slowing down trying to draw thousands of markers at once. Works much quicker than Google’s Map visualisation class.
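For reference, getting the basics going is only a few lines; a sketch assuming the Leaflet.markercluster plugin is loaded alongside leaflet.js, with `occurrences` standing in for a hypothetical array of {lat, lng} specimen records (the tile URL is OpenStreetMap’s standard one):

```javascript
// Map with clustered occurrence markers: clusters collapse thousands of
// points into a handful of drawn elements at low zoom levels.
var map = L.map("map").setView([56.46, -2.97], 6);
L.tileLayer("https://tile.openstreetmap.org/{z}/{x}/{y}.png", {
  attribution: "© OpenStreetMap contributors"
}).addTo(map);

var clusters = L.markerClusterGroup(); // from Leaflet.markercluster
occurrences.forEach(function (occ) {
  clusters.addLayer(L.marker([occ.lat, occ.lng]));
});
map.addLayer(clusters);
```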