Connecting Genetics Researchers to NeSI
James Boocock & David Eyers, University of Otago
Phil Wilcox, Tony Merriman & Mik Black, Virtual Institute of Statistical Genetics (VISG) & University of Otago
Theme of conference: "eResearch as an enabler" - show researchers that eResearch can benefit and enable them.
There's been a genomic data explosion - genomic, microarray, sequencing data. Genetics researchers need to use computers more and more. Computational cost increasing, need to use shared resources. "Compute first, ask questions later".
Galaxy aims to be a web-based platform for computational biomedical research - accessible, reproducible, transparent. Has a bunch of interfaces. Recommends a shared file system and splitting jobs into smaller tasks to take advantage of HPC.
Goal: create an interface between NeSI and Galaxy. Galaxy job > job splitter > subtasks run at NeSI, then results 'zipped up' and returned to Galaxy. Splits files not just by line count but by genetic distance, which gives different-sized files.
Used git/GitHub to track changes, and Sphinx for Python documentation. Investigating Shibboleth for authentication. Some bugs they're working on. Also looking at efficiency measures for parallelization, and building a machine-learning approach to doing this.
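To make the splitting idea concrete, here is a minimal Python sketch of splitting by genetic distance rather than by line count. It is an illustration only, not the presenters' splitter: the marker format, the function names `split_by_genetic_distance`, `run_subtask` and `merge_results`, and the 1 cM threshold are all assumptions.

```python
from typing import Iterable, List, Tuple

# Hypothetical record: (marker id, genetic map position in centimorgans).
Marker = Tuple[str, float]


def split_by_genetic_distance(markers: List[Marker], max_span_cm: float) -> List[List[Marker]]:
    """Group markers into chunks that each span at most max_span_cm centimorgans."""
    chunks: List[List[Marker]] = []
    current: List[Marker] = []
    for marker in sorted(markers, key=lambda m: m[1]):
        # Start a new chunk once this marker is too far from the chunk's first marker.
        if current and marker[1] - current[0][1] > max_span_cm:
            chunks.append(current)
            current = []
        current.append(marker)
    if current:
        chunks.append(current)
    return chunks


def run_subtask(chunk: List[Marker]) -> List[str]:
    """Placeholder for the work a cluster job would do on one chunk (assumption)."""
    return [marker_id for marker_id, _ in chunk]


def merge_results(results: Iterable[List[str]]) -> List[str]:
    """'Zip up' the per-chunk results into one list, preserving chunk order."""
    return [item for chunk_result in results for item in chunk_result]


markers = [("rs1", 0.0), ("rs2", 0.2), ("rs3", 0.9), ("rs4", 1.5), ("rs5", 4.0)]
chunks = split_by_genetic_distance(markers, max_span_cm=1.0)
merged = merge_results(run_subtask(chunk) for chunk in chunks)
```

Because chunk boundaries follow map distance, densely genotyped regions produce larger chunks than sparse ones, so subtasks will not be equal in size.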
Myths vs Realities: the truth about open data
Deborah Fitchett & Erin-Talia Skinner, Lincoln University
Our slides and notes are available at the Lincoln University Research Archive.
Some rights reserved: Copyright Licensing on our Scholarly record
Richard Hosking & Mark Gahegan, The University of Auckland
Copyright law has effect on reuse of data. Copyright = bundle of exclusive rights you get for creating work, to prevent others using it. Licensing is legal tool to transfer rights. Variety of licensing approaches, not created equal.
Linked data, combining sources with different licenses, makes licensing unclear - interoperability challenges.
* Lack of license - obvious problem
* Copyleft clauses (sharealike) - makes interoperability hard
* Proliferation of semi-custom terms - difficulties of interpretation
* Non-open public licenses (eg noncommercial) - more difficulties of interpretation
Technical, semantic, and legal challenges.
Research aims to capture semantics of licenses in a machine-readable format to align with, and interpret in context of, research practice. Need to go beyond natural language legal text. License metadata: RDF is a useful tool - allows sharing and reasoning over implications. Lets us work out whether you can combine sources.
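A rough Python sketch of what machine-readable license metadata could look like, using plain tuples as RDF-style triples. The `cc:` terms are borrowed from the Creative Commons ccREL vocabulary; the dataset names, the helper functions, and the choice of plain tuples rather than an RDF library are assumptions for illustration, not the speakers' schema.

```python
# Each triple is (subject, predicate, object), mirroring RDF statements.
TRIPLES = {
    ("CC-BY", "cc:permits", "cc:Reproduction"),
    ("CC-BY", "cc:permits", "cc:Distribution"),
    ("CC-BY", "cc:permits", "cc:DerivativeWorks"),
    ("CC-BY", "cc:requires", "cc:Attribution"),
    ("CC-BY-NC-ND", "cc:permits", "cc:Reproduction"),
    ("CC-BY-NC-ND", "cc:permits", "cc:Distribution"),
    ("CC-BY-NC-ND", "cc:requires", "cc:Attribution"),
    ("CC-BY-NC-ND", "cc:prohibits", "cc:CommercialUse"),
    # Hypothetical datasets and the licenses attached to them.
    ("dataset:A", "cc:license", "CC-BY"),
    ("dataset:B", "cc:license", "CC-BY-NC-ND"),
}


def objects(subject: str, predicate: str) -> set:
    """Return every object o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def dataset_permits(dataset: str, action: str) -> bool:
    """True if any license attached to the dataset permits the action."""
    return any(action in objects(lic, "cc:permits") for lic in objects(dataset, "cc:license"))


def can_derive_from_all(datasets: list) -> bool:
    """A combined work is a derivative, so every source must permit derivatives."""
    return all(dataset_permits(d, "cc:DerivativeWorks") for d in datasets)


a_allows_remix = dataset_permits("dataset:A", "cc:DerivativeWorks")
both_allow_remix = can_derive_from_all(["dataset:A", "dataset:B"])
```

The point of the triple form is that the same small query works for any license once its terms are written down, which natural-language legal text does not allow.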
Mapping terminology in licenses to research jargon.
Eg "reproduce" <-> "making an exact Copy"
"collaborators" <-> "other Parties"
This won't help if there's no license, or the license is legally vague, or for novel use cases where we're waiting for precedent (eg text mining over large corpora).
Compatibility chart of Creative Commons licenses - some very restricted. "Pathological combinations of licenses". Computing this can help measure combinability of data, degree of openness. Help understanding of propagation of rights and obligations.
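A small Python sketch of computing such a chart. The rules below are a simplified model of combining two works into an adaptation (NoDerivatives blocks it; ShareAlike forces the result onto that same license), and the pair-counting "openness" figure is a hypothetical measure, not one given in the talk. None of this is legal advice.

```python
from itertools import combinations_with_replacement

# License name -> set of license elements (BY, SA, NC, ND).
LICENSES = {
    "CC BY": frozenset({"BY"}),
    "CC BY-SA": frozenset({"BY", "SA"}),
    "CC BY-NC": frozenset({"BY", "NC"}),
    "CC BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "CC BY-ND": frozenset({"BY", "ND"}),
    "CC BY-NC-ND": frozenset({"BY", "NC", "ND"}),
}


def can_combine(a: str, b: str) -> bool:
    """Simplified check: can works under licenses a and b be remixed together?"""
    ea, eb = LICENSES[a], LICENSES[b]
    # NoDerivatives forbids adaptations, so nothing can be remixed with it.
    if "ND" in ea or "ND" in eb:
        return False
    if "SA" in ea and "SA" in eb:
        # Two ShareAlike licenses each demand the result use *their* terms.
        return ea == eb
    if "SA" in ea or "SA" in eb:
        sa, other = (ea, eb) if "SA" in ea else (eb, ea)
        # The result must carry the ShareAlike license, so every restriction
        # on the other source must also be present in that license.
        return other <= sa
    return True


pairs = list(combinations_with_replacement(LICENSES, 2))
compatible = [(a, b) for a, b in pairs if can_combine(a, b)]
# Hypothetical openness measure: how many licenses each one can be mixed with.
mixability = {name: sum(can_combine(name, other) for other in LICENSES) for name in LICENSES}
```

Under this model `CC BY-SA` with `CC BY-NC` is one of the "pathological" pairs: the ShareAlike result could not keep the NonCommercial restriction.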
Discussion of licensing choices should go beyond personal/institutional policies.
Comment: PhD student writing thesis and reusing figures from publications. For anything published by IEEE legally had to ask for permission to reuse figures he'd created himself. Not just about datasets but anything you put out.
Comment: "Best way to hide data is to publish a PhD thesis".
Q: Have you started implementing?
A: Yes, but still early on: coding licenses as an RDF structure and asking simple questions. Want to dig deeper.
Q: Get in trouble with practicing law - always told by institution to send questions to IP lawyers etc. Has anyone got mad at you yet?
A: I do want to talk to a lawyer at some point. Can get complex fast especially pulling in cross-jurisdiction.
Comment: This will save time (=$$$) when talking to lawyer.
A: There are a lot of situations where you don't need a lawyer - that's more for fringe cases.