In order to better integrate my blog with my website, better manage comment spam, and reduce my dependence on Google, this blog has moved to In order to avoid broken links I won't be deleting content from here, but no new content will be added, so please update your bookmarks and feeds.

Sunday, 13 February 2011

Converting a plaintext bibliography to Endnote/RIS format with help from Linux/Terminal

[Update 16/7/2011: See my more recent post on the topic, Launching Ref2RIS - convert your typed bibliography to Endnote format, which makes things even easier.]

You won't want to do this unless you've got literally hundreds of references. Any less, and these suggestions are way easier.

1. Format references so they're each on their own line - no blank lines.

2. Use Word's "Find Special" capabilities to replace a phrase in italics with {it}a phrase in italics{endit} and a phrase in bold with {b}a phrase in bold{endb}.  (Similarly if the citations contain underlines.)

3. Save as plaintext - say, source.txt.  Now the fun begins...  My own source text contains 600-odd lines in ACS style, like this:
Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}. 
House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185

4. Open up Terminal or some other Linux command line.

5. Endnote records are separated by a line
ER  - 
- that's two spaces before the hyphen and one after.  (All these details come from Endnote's help pages.) This is the easy part: type in
sed -e 's/^\(.*\)/\1ER  - /' source.txt > source1.txt

6. The start of each Endnote record tells you what kind of citation it is - eg a book, journal etc.  To find every line that includes a colon (ie separating the publisher from the city published in) type in
sed -e 's/^\(.*:\)/TY  - BOOK@@\1/' source1.txt > source2.txt
Note 1: The "@@" is in there as a sign that you'll need to replace this with a new line later; but we want to keep everything on one line for now.
Note 2: This is a good example of why this whole method is highly suspect, because it'll also catch citations which have a colon in the article title or in a typo or whatever.  So if you can think of a better sign that a citation is a book then use that instead of the colon.

Alternatively, you could type in
sed -e 's/^\(.*{it}[0-9]*{endit}\)/TY  - JOUR@@\1/' source1.txt > source2.txt
to find every line that contains {it}[some number]{endit} which, in my source, is the best indicator that I'm dealing with a journal.  The same caveats apply - you'll get both false positives and false negatives.

Anyway, keep doing what seems best given your source, and fix up the inevitable mistakes by hand until each line starts with TY  - something.  If you want to give up and just assume that everything that isn't already assigned as something must be a journal then try
sed -e 's/^\([^(TY  - )].*$\)/TY  - JOUR@@\1/' source2.txt > source3.txt

I now have source looking like:
TY  - BOOK@@Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}. 
ER  -
TY  - JOUR@@House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185
ER  -

7. Now we keep playing with patterns.  (You may be able to do large chunks of this with regular find/replace, but for illustrative purposes I'll keep using Terminal.)

For example, in my source the authors are nicely set off: they come after "@@" and before the first "{it}" (or "in {it}"), and if there's more than one of them they're separated by ";".  So a few commands:
sed -e 's/@@\(.* in {it}\)/@@A1  - \1/' source3.txt > source4.txt
sed -e 's/@@\(.* {it}\)/@@A1  - \1/' source3.txt > source4.txt
sed -e 's/;\(.*;\)/@@A1  - \1/' source5.txt > source6.txt (This one I had to repeat a few times depending how many authors could be cited in one reference; there's supposed to be a way to do it globally but my unix fu is not strong.)
sed -e 's/;\(.*{it}\)/@@A1  - \1/' source8.txt > source9.txt

Journal titles:
sed -e 's/^\(TY  - JOUR.*\)\({it}.*{endit} {b}\)/\1@@JO  - \2/' source9.txt > source10.txt

sed -e 's/\({b}[0-9]*{endb}\)/@@Y1  - \1/' source10.txt > source11.txt

And so forth.  You pretty soon start to see why the first suggestion on most lists of ways to convert plaintext citations into RIS format is always "Just type it in / search for it again by hand".  The method above is really only suitable if you've got literally hundreds of citations. (I have 639, plus or minus.)

8. Eventually you'll be at a point where you can do a simple find/replace to change @@ to a new line and nuke all the {it} and so forth.  This will be a great relief.

9. Rename your final saved file from source12.txt to source12.ris and open with Endnote.

10. Bonus material:  if this was a bibliography to a paper using numbered citations in order using eg [1], then in that paper you can do a find/replace on [ -> { and ] -> }, then tell the Endnote plugin to format citations, and voila, the best magic ever.  (If the paper uses author/date citations then you'll have to link them by hand, sorry.)