In order to better integrate my blog with my website, better manage comment spam, and reduce my dependence on Google, this blog has moved to In order to avoid broken links I won't be deleting content from here, but no new content will be added, so please update your bookmarks and feeds.

Saturday 5 October 2013


For a while I've wanted to transfer my blog to a WordPress platform on my own domain, for a few reasons:
  • it's nice owning one's own house;
  • it keeps all my stuff together in my own control;
  • it reduces my dependence on Google; and
  • the comment spam on Blogger is driving me up the wall - Google is startlingly bad at managing it, so I get email notifications for all of it and haven't worked out how to filter in my inbox either.
So after a certain amount of procrastination, it's now done. Tweaking the theme took a while, but the import process went pretty smoothly with just a couple of things I had to fix by hand. So now all posts and comments have been duplicated at and the new RSS feed is Please update any bookmarks or feeds accordingly!

(In order to avoid broken links I won't be deleting content from here, so it should remain as long as Google allows; however no new content will be added and in due course I'll disable commenting.)

Friday 4 October 2013

Open access and peer review

We’re likely to be hearing about John Bohannon's new article in Science, "Who's afraid of peer review?" Essentially the author created 304 fake papers with bad science and submitted one each to an 'author-pays' open access journal to test their peer review. 157 of the journals accepted it, 98 rejected it; other journals were abandoned websites or still have/had the paper under review at time of analysis. (Some details are interesting. PLOS ONE provided some of the most rigorous peer review and rejected it; OA titles from Sage and Elsevier and some scholarly societies accepted it.)

Sounds pretty damning, except...

Peter Suber and Martin Eve each write a takedown of the study, both well worth reading. They list many problems with the methodology and conclusions. (For example, over two-thirds of open access journals listed on DOAJ aren't "author-pays" so it's odd to exclude them.)

But the key flaw is even more obvious than the flaws in the fake articles: his experiment was done without any kind of control. He only submitted to open access journals, not to traditionally-published journals, so we don’t know whether their peer review would have performed any better. As Mike Taylor and Michael Eisen point out, this isn't the first paper with egregiously bad science that's slipped through Science's peer review process either.

Tuesday 9 July 2013

Institutional repositories for data?

Via my Twitter feed:

(And discussion ensuing.)

I'm not an expert in data management. A year ago it was top of my list of Things That Are Clearly Very Important But Also Extremely Scary, Can Someone Else Please Handle It? But then I got a cool job which includes (among other things) investigating what this data management stuff is all about, so I set about investigating.

Sometime in the last half year I dropped the assumption that we needed to be working towards an institutional data repository. In fact, I now believe we need to be working away from that idea. Instead, I think we should be encouraging researchers to deposit their datasets in the discipline-specific (or generalist) data repositories that already exist.

I have a number of reasons for this:
  • My colleague and I, with a certain amount of outsourcing, already have to run a catalogue, the whole rickety edifice of databases and federated searching and link resolving and proxy authentication, library website and social media account, institutional repository, community archive, open journal system, etc etc. Do we look like we need another system to maintain?
  • An institutional archive is kind of serviceable for pdfs. But datasets come in xls, csv, txt, doc, html, xml, mp3, mp4, and a thousand more formats, no, I'm not exaggerating. They can be maps, interviews, 3D models, spectral images, anything. They can be a few kilobytes or a few petabytes. Yeah, you can throw this stuff into DSpace, but that doesn't mean you should. That's like throwing your textbooks, volumes of abstracts, Kindles, Betamax, newspapers, murals, jigsaw puzzles, mustard seeds, and Broadway musicals (not a recording, the actual theatre performance) onto a single shelf in a locked glass display cabinet and making people browse by the spine labels.
  • If you want a system that can do justice to the variety of datasets out there, you'd better have the resources of UC3 or DCC or Australia or PRISM. Because you're either going to have to build it or you're going to have to pay someone to build it, and then you're going to have to maintain it. And you're going to have to pay for storage and you're going to have to run checksums for data integrity and you're going to have to think about migrating the datasets as time marches on and people forget what the current shiny formats are. And you're going to have to wonder if and how Google Scholar indexes it (and hope Google Scholar lasts longer than Google Reader did) or no-one will ever find it. And a whole lot more else.
  • If anything's in it. Do you know how hard it is to get researchers to put their conference papers into institutional repositories? My own brother flatly refuses. He points out that his papers are already available via his discipline's open access repository. That's where people in his discipline will look for it. It's indexed by Google. Why put it anywhere else? I conceded the point for the sake of our family dinner, and I haven't brought it up again because on reflection he's right. (He's ten years younger than me; he has no business being right, dammit.) And because it's hard enough to get researchers to put their conference papers into institutional repositories even when their copy is the only one in existence.
  • Do you know how hard it is to convince most researchers that they should put their datasets anywhere online other than a private Dropbox account? (Shameless plug: Last week another colleague and I did a talk responding to 8 'myths' or reasons why many researchers hesitate - slides and semi-transcript here. That's summarised from a list we made of 23 reasons, and other people have come up with more objections and necessary counters.) The lack of an institutional repository for data doesn't even rate.
No, forget creating institutional data repositories. What we need to be doing is getting familiar with the discipline data repositories and data practices that already exist, so when we talk to a researcher we can say "Look at what other researchers in your discipline are doing!"

This makes it way easier to prove that this data publishing thing isn't just for Those Other Disciplines, and that there are ways for them to deal with [confidentiality|IP issues|credibility|credit]. And it makes sure the dataset is where other researchers in that discipline are searching for it. And it makes sure the datasets are deposited according to that discipline's standards and that discipline's needs, not according to the standards and needs of whoever was foremost in mind of the developer who created the generic institutional data repository - so the search interface will be more likely to work reasonably for that discipline. And it means the types of data will be at least a little more homogeneous (in some cases a lot more) so there's more potential for someone to do cool stuff with linked open data.

And it means we can focus on what we do best, which is helping people find and search and understand and use and cite and publish these resources. Trust me, there is plenty more to do in data management than just setting up an institutional data repository.

Thursday 4 July 2013

NeSI; publishing data; open licenses #nzes

Connecting Genetics Researchers to NeSI
James Boocock & David Eyers, University of Otago
Phil Wilcox, Tony Merriman & Mik Black, Virtual Institute of Statistical Genetics (VISG) & University of Otago

Theme of conference "eResearch as an enabler" - show researchers that eresearch can benefit them and enable them.
There's been a genomic data explosion - genomic, microarray, sequencing data. Genetics researchers need to use computers more and more. Computational cost increasing, need to use shared resources. "Compute first, ask questions later".

Galaxy aims to be web-based platform for computational biomedical research - accessible, reproducible, transparent. Has a bunch of interfaces. Recommends shared file system and splitting jobs into smaller tasks to take advantage of HPC.

Goal to create an interface between NeSI and Galaxy. Galaxy job > a job splitter > subtasks performed at NeSI then 'zipped up' and returned to Galaxy. Not just file splitting by lines, but by genetic distance. Gives different sized files.
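To make "splitting by genetic distance" concrete for myself, here's a minimal sketch. It is not the presenters' code: the function name, the (marker id, map position in centimorgans) layout and the 1.0 cM window are all my own assumptions. The point is only that cutting on map distance rather than line count naturally gives chunks of different sizes.

```python
from typing import List, Tuple

# Hypothetical record layout: (marker id, genetic map position in centimorgans).
Marker = Tuple[str, float]


def split_by_genetic_distance(markers: List[Marker], max_span_cm: float) -> List[List[Marker]]:
    """Group markers (already sorted by map position) into chunks that each
    span at most max_span_cm, instead of a fixed number of lines."""
    chunks: List[List[Marker]] = []
    current: List[Marker] = []
    for marker in markers:
        # Start a new chunk once this marker is too far from the chunk's first marker.
        if current and marker[1] - current[0][1] > max_span_cm:
            chunks.append(current)
            current = []
        current.append(marker)
    if current:
        chunks.append(current)
    return chunks


# Made-up positions, purely for illustration.
markers = [("m1", 0.0), ("m2", 0.2), ("m3", 0.4), ("m4", 1.5), ("m5", 3.1), ("m6", 3.2)]
chunks = split_by_genetic_distance(markers, max_span_cm=1.0)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Each chunk could then go off as its own subtask, with the results stitched back together in the original order.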

Used git/github to track changes, and Sphinx for python documentation. Investigating Shibboleth for authentication. Some bugs they're working on. Further looking at efficiency measures for parallelization, building machine-learning approach to doing this.

Myths vs Realities: the truth about open data
Deborah Fitchett & Erin-Talia Skinner, Lincoln University
Our slides and notes available at the Lincoln University Research Archive

Some rights reserved: Copyright Licensing on our Scholarly record
Richard Hosking & Mark Gahegan, The University of Auckland

Copyright law has effect on reuse of data. Copyright = bundle of exclusive rights you get for creating work, to prevent others using it. Licensing is legal tool to transfer rights. Variety of licensing approaches, not created equal.

Linked data, combining sources with different licenses, makes licensing unclear - interoperability challenges.

* Lack of license - obvious problem
* Copyleft clauses (sharealike) - makes interoperability hard
* Proliferation of semi-custom terms - difficulties of interpretation
* Non-open public licenses (eg noncommercial) - more difficulties of interpretation

Technical, semantic, and legal challenges.
Research aims to capture semantics of licenses in a machine-readable format to align with, and interpret in context of, research practice. Need to go beyond natural language legal text. License metadata: RDF is a useful tool - allows sharing and reasoning over implications. Lets us work out whether you can combine sources.

Mapping terminology in licenses to research jargon.
Eg "reproduce" <-> "making an exact Copy"
"collaborators" <-> "other Parties"

This won't help if there's no license, or legally vague, or for novel use cases where we're waiting for precedent (eg text mining over large corpuses)

Compatibility chart of Creative Commons licenses - some very restricted. "Pathological combinations of licenses". Computing this can help measure combinability of data, degree of openness. Help understanding of propagation of rights and obligations.
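
A rough sketch of what "computing combinability" could look like, assuming a deliberately simplified model of the six Creative Commons licenses as sets of conditions. The talk described an RDF encoding; this uses plain Python sets instead, and the function name and the crude "openness" count are my own inventions, not the presenters' method.

```python
# Simplified model: each license is just its set of conditions.
LICENSES = {
    "CC BY": {"BY"},
    "CC BY-SA": {"BY", "SA"},
    "CC BY-NC": {"BY", "NC"},
    "CC BY-ND": {"BY", "ND"},
    "CC BY-NC-SA": {"BY", "NC", "SA"},
    "CC BY-NC-ND": {"BY", "NC", "ND"},
}


def can_combine(a: str, b: str) -> bool:
    """Can works under licenses a and b be remixed into one new work?"""
    terms_a, terms_b = LICENSES[a], LICENSES[b]
    # NoDerivatives forbids remixing at all.
    if "ND" in terms_a or "ND" in terms_b:
        return False
    # Two ShareAlike licenses both insist on being the outgoing license.
    if "SA" in terms_a and "SA" in terms_b:
        return terms_a == terms_b
    # One ShareAlike license: it must be able to carry the other's conditions.
    if "SA" in terms_a:
        return terms_b <= terms_a
    if "SA" in terms_b:
        return terms_a <= terms_b
    return True


# Crude "degree of openness": how many licenses (itself included) each one combines with.
openness = {name: sum(can_combine(name, other) for other in LICENSES) for name in LICENSES}

for name, score in openness.items():
    print(f"{name}: combinable with {score} of {len(LICENSES)}")
```

The "pathological" cases show up as the copyleft clash (CC BY-SA with anything NonCommercial) and the NoDerivatives licenses, which combine with nothing.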

Discussion of licensing choices should go beyond personal/institutional policies.

Comment: PhD student writing thesis and reusing figures from publications. For anything published by IEEE legally had to ask for permission to reuse figures he'd created himself. Not just about datasets but anything you put out.

Comment: "Best way to hide data is to publish a PhD thesis".

Q: Have you started implementing?
A: Yes but still early on coding as RDF structure and asking simple questions. Want to dig deeper.

Q: Get in trouble with practicing law - always told by institution to send questions to IP lawyers etc. Has anyone got mad at you yet?
A: I do want to talk to a lawyer at some point. Can get complex fast especially pulling in cross-jurisdiction.
Comment: This will save time (=$$$) when talking to lawyer.
A: There's a lot of situations where you don't need a lawyer - that's more for fringe cases.

U of Washington eScience Institute #nzes

eScience and Data Science at the University of Washington eScience Institute
"Hangover" Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science & Engineering, University of Washington

Scientific process getting reduced to database problem - instead of querying the world we download the world and query the database...

UoW eScience Inst to get in the forefront of research in eScience techniques/technology, and in fields that depend on them.

3Vs of big data:
volume - this gets lots of attention but
variety - this is the bigger challenge

Sources a longtail image from Carol Goble showing that lots of data in Excel spreadsheets, lab books, etc, is just lost.
Types of data stored - especially data data and some text. 87% of time is on "my computer"; 66% a hard drive...
Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes).
No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.

Problem - how much time do you spend handling data as opposed to doing science? General answer is 90%.
May be spending a week doing manual copy-paste to match data because not familiar with tools that would allow a simple SQL JOIN query in seconds.
Sloan Digital Sky Survey incredibly productive because they put the data online in database format and thousands of other people could run queries against it.
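
For anyone who hasn't met a JOIN, a minimal sketch of the kind of query meant here, using Python's built-in sqlite3 module. The tables, columns and values are made up for illustration; matching the two tables on sample_id is the step that would otherwise be done by copy-paste.

```python
import sqlite3

# In-memory database with two hypothetical tables that share a sample_id column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (sample_id TEXT, site TEXT);
CREATE TABLE readings (sample_id TEXT, nitrate_mg_l REAL);
INSERT INTO samples VALUES ('s1', 'river'), ('s2', 'lake'), ('s3', 'estuary');
INSERT INTO readings VALUES ('s1', 1.2), ('s2', 0.4), ('s1', 1.6);
""")

# One JOIN pairs every reading with the site it was taken at.
rows = conn.execute("""
SELECT s.site, r.nitrate_mg_l
FROM samples AS s
JOIN readings AS r ON r.sample_id = s.sample_id
ORDER BY s.site, r.nitrate_mg_l
""").fetchall()

for site, value in rows:
    print(site, value)
```

Sample s3 has no readings, so an inner join like this simply leaves it out.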

SQLShare: Query as a service
Want people to upload data "as is". Cloud-hosted. Immediately start writing queries, share results, others write their queries on top of your queries. Various access methods - REST API -> R, Python, Excel Addin, Spreadsheet crawler, VizDeck, App on EC2.

Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world but at the frontier of research this shared consensus by definition doesn't exist, or will change frequently, and data found in the wild will typically not conform to standards. So modifies Maslow's Needs Hierarchy:
Usually storage > sharing > curation > query > analytics
Recommends: storage > sharing > query > analytics > curation
Everything can be done in views - cleaning, renaming columns, integrating data from different sources while retaining provenance.
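
A small sketch of the "do it in views" idea, again with sqlite3 rather than SQLShare itself (whose interface I'm not reproducing here). The table, the column names and the 'n/a' convention are assumptions. The messy upload is never edited; cleaning and renaming live in one view, and analysis sits in a second view built on top of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Data uploaded "as is": unhelpful column names, stray spaces, mixed case, 'n/a'.
CREATE TABLE raw_upload (col1 TEXT, col2 TEXT);
INSERT INTO raw_upload VALUES (' River ', '1.2'), ('lake', 'n/a'), ('LAKE', '0.4');

-- Cleaning and renaming happen in a view; the raw table stays untouched.
CREATE VIEW cleaned AS
SELECT lower(trim(col1)) AS site,
       CAST(col2 AS REAL) AS nitrate_mg_l
FROM raw_upload
WHERE col2 <> 'n/a';

-- Analysis is another view layered on the cleaned one.
CREATE VIEW site_means AS
SELECT site, avg(nitrate_mg_l) AS mean_nitrate
FROM cleaned
GROUP BY site;
""")

rows = conn.execute("SELECT site, mean_nitrate FROM site_means ORDER BY site").fetchall()
raw_count = conn.execute("SELECT count(*) FROM raw_upload").fetchone()[0]
print(rows, raw_count)
```

Because each view is just a stored query, anyone can read back exactly how a cleaned value was derived from the upload.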

Bring the computation to the data. Don't want just fetch-and-retrieve - need a rich query service, not a data cemetery. "Share the soup and curate incrementally as a side-effect of using the data".

Convert scripts to SQL and lots of problems go away. Tested this by sending postdoc to a meeting and doing "SQL stenography" - real-time analytics as discussion went on. Not a controlled study - didn't have someone trying to do it in Python or R at same time - but would challenge someone to do it as quickly! Quotes (a student?) "Now we can accomplish a 10minute 100line script in 1 line of SQL." Non-programmers can write very complex queries rather than relying on staff programmers and feeling 'locked out'.

Data science
Taught an intro to data science MOOC with tens of thousands of students. (Power of discussion forum to fix sloppy assignment!)

Lots of students more interested in building things than publishing, and are lost to industry. So working on 'incubator' projects, reverse internships pulling people back in from industry.

Q: Have you experimented with auto-generating views to cleanup?
A: Yes, but less with cleaning and more deriving schemas and recommending likely queries people will want. Google tool "Data wrangler".

Q: Once again people using this will think of themselves as 'not programmers' - isn't this actually a downside?
A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there's no good support for development in SQL. Risk that giving people power but not teaching programming. But mostly trying to get people more productive right now.

Wednesday 3 July 2013

HuNI; NZ humanities eResearch; flux in scientific knowledge #nzes

Humanities Networked Infrastructure (HuNI) Virtual Laboratory: Discover | Analyse | Share
Deb Verhoeven, Deakin University
Conal Tuohy and Richard Rothwell, VeRSI
Ingrid Mason, Intersect Australia

Richard Rothwell presenting. I've previously heard Ingrid Mason talk about HuNI at NDF2012.

Idea of a virtual laboratory as a container for data (from variety of disciplines) and a number of tools. But many existing tools are like virtual laboratories themselves, often specific to disciplines.

Have a .9EFTS ontologist. Also project manager, technical coordinator, web page designer, tools coordinator and software developer.

Defined project as linked open data project. Humanities data into HuNI triple store (using RDF), embedded in HuNI virtual lab to create user interface. Embellishments include to provide linked open data in SPARQL, and publish via OAI-PMH; and to use AAF (Shibboleth) authentication; to use SOLR search server for virtual lab.

Have ideas of research use-cases (basic and advanced eg SPARQL queries) and desired features, eg custom analysis tools. The challenge is to get internal bridging relationships between datasets and global interoperability. Aggregating doesn't solve siloisation.
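
To illustrate the difference between aggregating and bridging, a toy sketch: a few subject-predicate-object triples held in a plain Python set, not real RDF or SPARQL. Every identifier, predicate and name below is hypothetical. Two datasets sit side by side in one store, but only the explicit "sameAs" link lets a query cross from one to the other.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Two made-up source datasets aggregated into one store, plus one bridging link.
triples: Set[Triple] = {
    ("filmdb:person/42", "name", "Jane Example"),
    ("filmdb:person/42", "directed", "filmdb:film/7"),
    ("theatredb:artist/9", "name", "J. Example"),
    ("theatredb:artist/9", "performedIn", "theatredb:show/3"),
    ("filmdb:person/42", "sameAs", "theatredb:artist/9"),
}


def match(s: Optional[str] = None, p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o)
    )


def everything_about(entity: str) -> List[Triple]:
    """Collect statements about an entity and anything directly linked to it by sameAs."""
    aliases = {entity}
    aliases |= {obj for _, _, obj in match(s=entity, p="sameAs")}
    aliases |= {subj for subj, _, _ in match(p="sameAs", o=entity)}
    return sorted(t for alias in aliases for t in match(s=alias) if t[1] != "sameAs")


facts = everything_about("theatredb:artist/9")
for fact in facts:
    print(fact)
```

Take the sameAs triple out and the same query only returns the theatre records, which is the siloisation problem in miniature.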

"Technology-driven projects don't make for good client outcomes."

Q: What response from broader humanities community?
A: Did some user research, not as much as wanted. Impediment is that when building database tend to have more contact with people creating collections than people using them. Trying to build framework/container first and idea is that researchers will come to them and say "We want this tool" and they'll build it. Funding set aside for further development.

Q: You compared this to Galaxy, but you've built from ground-up where Galaxy is more fluid. A person with command-line can create tools in Galaxy but with HuNI you'd have to do it yourself.
A: Bioinformatics folk tend to be competent with Python - but we're not sure what competencies our researchers will have, less likely to be able to develop for themselves.

Requirements for a New Zealand Humanities eResearch Infrastructure
James Smithies, University of Canterbury
Vast amounts of cultural heritage being digitised or being born online. Humanities researchers will never be engineers but need to work through the issues.

International context:
Humanities computing's been around for decades but still in its infancy. US, UK, even Aus have ongoing strategic conversations, which helps build roadmaps. NZ is quite far behind these (though have used punchcards where necessary). "Digging into Data Challenge" overseas but we're missing out because of lack of infrastructure and lack of awareness.

Fundamentals of humanities eresearch:
HuNI provides a good model. Need a shift from thinking of sources as objects to viewing them as data. Big paradigm shift. Not all will work like this. But programmatic access will become more important.

National context:
19th century ship's logs, medical records from leper colonies. Hard to read, incomplete, possibly accurate. Have traditional methods to deal with these but problems multiply when ported into digital formats. Big problem is lack of awareness of what opportunities exist. So capabilities and infrastructure are low. Decisions often outsourced to social sciences.
At the same time, DigitalNZ, National Digital Heritage Archive, Timeframes archive, AJHR, PapersPast, etc are fantastic resources that could be leveraged if we come up with a central strategy.

  • Need to develop training schemes
  • Capability building. Lots of ideas out there but people don't know where to start. Need to look at peer review, PBRF - how to measure quality and reward it.
  • International collaboration
  • Requirements elicitation and definition
  • Funding for all of the above including experimentation

Q: Data isn't just data, it's situated in a context. Being technology-led and using RDF is one thing. But how do we give richness to a collection?
A: Classic example would be researcher wanting access to object properly marked up and contribute to the conversation by adding scholarly comments, engage with other marginalia. Eg ancient Greek text corpus (is I think describing the Perseus Digital Library). Want both a simple interface and programmatic access.

Q: Need to make explicit the value of an NZ corpus. Have some pieces but need to join up. Need to work with DigitalNZ. Once we have corpus can look at tools.
A: Yes, need to get key stakeholders around table and talk about what we need.

Capturing the flux in Scientific Knowledge
Prashant Gupta & Mark Gahegan, The University of Auckland
Everything changes - whether the physical world itself or our understanding of the world:
* new observation or data
* new understanding
* societal drivers
How can we deal with change and make our tools and systems more dynamic to deal with change?

Ontology evolution - have done lots of work on this. Researchers have updated knowledge structure and incorporated in forms of provenance or change logs. Tells us "knowledge that" eg What is the change, when it happened, who did it, to what, etc. But we still don't capture "knowledge how" or "knowledge why".

Life cycle of a category:
Processes, context, researchers' knowledge are involved in birth of a category - but these tend to be lost when the category's formed. We're left with the category's intension, extension, and place in the conceptual hierarchy. Lots of information not captured.

"We focus on products of science and ignore process of science".

Proposes connecting static categories and the process of science to get a better understanding. Could act as a fourth facet to a category's representation. Can help address interoperability problem and help track evolution of categories.

Process model:
Process of science > gives birth to conceptual change > modifies scientific artifacts > connected as linked science > improves process of science.

If change not captured, network of artifacts will become inconsistent and linked science will fail.

Proposes building a computational framework that captures and analyses changes, creating a category-versioning system.
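
A minimal sketch of what a category-versioning record might hold, assuming nothing about the authors' actual framework: the class and field names are mine. The idea is that each revision keeps the usual intension, extension and place in the hierarchy, plus the who/when/what of a change log, plus the "how" and "why" that normally get lost. The example uses the 2006 IAU redefinition of "planet".

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class CategoryVersion:
    version: int
    intension: str              # the definition
    extension: FrozenSet[str]   # the members
    parent: Optional[str]       # place in the conceptual hierarchy
    changed_by: str             # "knowledge that": who
    when: str                   # "knowledge that": when
    what: str                   # "knowledge that": what changed
    how: str                    # "knowledge how": the process behind the change
    why: str                    # "knowledge why": the reasoning


@dataclass
class Category:
    name: str
    history: List[CategoryVersion] = field(default_factory=list)

    def revise(self, *, intension: str, extension: Iterable[str], parent: Optional[str],
               changed_by: str, when: str, what: str, how: str, why: str) -> CategoryVersion:
        """Append a new version rather than overwriting the old one."""
        version = CategoryVersion(len(self.history) + 1, intension, frozenset(extension),
                                  parent, changed_by, when, what, how, why)
        self.history.append(version)
        return version

    def diff(self, old: int, new: int) -> Dict[str, List[str]]:
        """Which members were added or removed between two versions."""
        a, b = self.history[old - 1], self.history[new - 1]
        return {"added": sorted(b.extension - a.extension),
                "removed": sorted(a.extension - b.extension)}


eight = ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"]
planet = Category("planet")
planet.revise(intension="major body orbiting the Sun", extension=eight + ["Pluto"],
              parent="solar system body", changed_by="(historical usage)", when="pre-2006",
              what="initial version", how="accumulated convention", why="no formal definition")
planet.revise(intension="orbits the Sun, is nearly round, has cleared its orbital neighbourhood",
              extension=eight, parent="solar system body", changed_by="IAU", when="2006",
              what="formal definition adopted; Pluto no longer a member",
              how="resolution voted at the IAU General Assembly",
              why="discovery of comparable trans-Neptunian objects")

change = planet.diff(1, 2)
print(change, "-", planet.history[-1].why)
```

Keeping every version lets linked artifacts say which version of "planet" they were built against, in the same way a changeset pins a software dependency.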

Comment from James Smithies: would fit well in humanities context.
Comment: drawing parallel with software development changeset management.

NZ e-Infrastructures Panel #nzes

NZ e-Infrastructures Panel
Nick Jones, New Zealand eScience Infrastructure
Steve Cotter, REANNZ
Andrew Rohl, Curtin University, ex ED iVEC
Tony Lough, NZ Genomics Ltd
Don Smith, NZ Synchrotron Group Ltd
Rhys Francis, eResearch Coordination Project

How are we doing and how can we work better with Australia?
* NJ: Have been working closer recently, but big gaps in data especially, and unevenness in various disciplines.
* SC: Working to identify gaps and work across organisations. REANNZ working closer with AARnet than have in the past which is bearing fruit re bandwidth.
* Political overlay - need to be able to say we've got the scientific partnership working.
* RF: Fair amount of partnership. But have found that governance separates things. "I don't believe in uninterpreted data." Need to figure out combo of data and tools to get results.
* Plenty of opportunity to work with Australia. Useful to look at infrastructures and what they've done right and haven't done right - lessons to be learnt.
* AR: Problems faced here are not unique so you can avoid our mistakes and make your own instead. :-)

National Science Challenge signals government would like to roll framework out further. How do researchers engage with this?
* NJ: At many workshops people already know what they want to work on; at others there's range of possibilities. Need to build networks so not everyone has to be at table.
* RF: eResearch and IT isn't mentioned in challenges - but these are embedded in everything. If you want to be world-class at X, you need to be good at computer science.

How would you benchmark and measure return on investment?
* AR: Instance where in early days govt felt that if people wanted to keep investing, it must be valuable. This is changing now that investments are bigger. Hesitant about benchmarking because don't really want to be doing the same as anyone else.
* RF: How do you go from 0 to world's best supercomputer overnight? No idea how to measure that. It's a commitment to the advancement of knowledge but the govt doesn't have a KPI about that...

NZ had to set up Tuakiri because differences in law meant we couldn't use Australia's system. What other things might the two countries have to do to overcome differences in legislation?
* (Other audience member) - Yes there are differences so have needed to build systems that deal with both privacy acts and have been successful.
* (Anne Berryman) - Have started conversation with counterparts overseas and chief science advisors in Aus/NZ have a line of communication. There are platforms and issues we can deal with.

One goal is to achieve self-sustainability, eg user charging, member contributions. What's the Australian experience in user-pays and sustainability?
* RF: Financial benefits are overwhelming. If went to commercial provider it'd cost more and do less. Sustainability needs constant flow of funds to keep supercomputing running. There is a sustainability cliff. Govt keeps putting money in.
* SC: MBIE have removed self-sustainability requirement. Charging to make sure researchers have skin in the game does prove that service is needed; but not everyone can participate who should be.