Friday, 1 November 2013

Bibpup: a Perl script for updating BibTeX files using Inspire

BibTeX is a nice way to manage bibliographies: one collects the bibliographical data of many articles in a single BibTeX file, and then cites only some of these articles in a given document. However, the bibliographical data of an article can change when the article gets published in a journal. If the article was initially entered in the BibTeX file as a preprint, the file must then be updated.
Here I propose a Perl script that does this automatically using Inspire, the standard search engine for high-energy physics and related fields.

Of course, I do not believe that it is very relevant or useful to know whether an article is published and in which journal, but journals typically require such data to be displayed in bibliographies. In any case, the decision whether to display publication data is made at the level of the bibliography style file; the BibTeX file itself is only a database, which should be as complete as possible.

So what does the script do? It takes a BibTeX file and, for each entry that has no publication data (no journal field) but has an eprint field with an arXiv number, sends a request to Inspire. If publication data are found, the entry is modified to include them. (The rest of the entry, and in particular the key, is not changed.)
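
The selection rule described above can be sketched in isolation. This is a minimal Perl illustration, not the code of bibpup.pl itself: the helper name eprint_to_look_up is hypothetical, and, unlike the script, it anchors the field names at the start of a line and ignores case, which assumes one field per line.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of bibpup.pl: returns the eprint value of an
# entry that should be looked up, or undef if the entry already has a
# journal field or has no eprint field.
sub eprint_to_look_up {
    my ($entry_text) = @_;
    return if $entry_text =~ /^\s*journal\s*=/mi;                 # already published
    return $1 if $entry_text =~ /^\s*eprint\s*=\s*"([^"]+)"/mi;   # preprint with eprint
    return;                                                       # nothing to search with
}

# Shortened versions of two entries from the example file test1.bib.
my $preprint = <<'BIB';
@article{cer12,
      title          = "{Seiberg-Witten equations and non-commutative spectral curves in Liouville theory}",
      year           = "2012",
      eprint         = "1209.3984",
}
BIB

my $published = <<'BIB';
@Article{fr10,
     title     = "{Conformal Toda theory with a boundary}",
     journal   = "JHEP",
     eprint    = "1007.1293",
}
BIB

for my $entry ($preprint, $published) {
    my $eprint = eprint_to_look_up($entry);
    print defined $eprint ? "look up $eprint\n" : "nothing to do\n";
}
```

With these two entries, the first one is selected for a lookup with the eprint number 1209.3984, and the second one is skipped because it already has a journal field.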

For example, with the following BibTeX file called test1.bib,
@article{cer12,
      author         = "Chekhov, Leonid and Eynard, Bertrand and Ribault,
                        Sylvain",
      title          = "{Seiberg-Witten equations and non-commutative spectral curves in Liouville theory}",
      year           = "2012",
      eprint         = "1209.3984",
      archivePrefix  = "arXiv",
      primaryClass   = "hep-th",
      reportNumber   = "IPHT-T12-075",
      SLACcitation   = "%%CITATION = ARXIV:1209.3984;%%",
}

@Article{fr10,
     author    = "Fateev, Vladimir and Ribault, Sylvain",
     title     = "{Conformal Toda theory with a boundary}",
     journal   = "JHEP",
     volume    = "12",
     year      = "2010",
     pages     = "089",
     eprint    = "1007.1293",
     archivePrefix = "arXiv",
     primaryClass  =  "hep-th",
     doi       = "10.1007/JHEP12(2010)089",
     SLACcitation  = "%%CITATION = 1007.1293;%%"
}
@Article{rib03b,
     author    = "Ribault, Sylvain",
     title     = "Strings and D-branes in curved space-times. (In French)",
     year      = "2003",
     eprint    = "hep-th/0309272",
     SLACcitation  = "%%CITATION = HEP-TH 0309272;%%"
}
the script runs as follows:
bibpup.pl -e test1.bib
Using Inspire for the entry cer12,
Using Inspire for the entry rib03b,
No valid update found in Inspire for the labels: rib03b,
Labels whose entries were updated: cer12,
After that, the BibTeX file looks like this:
@article{cer12,
      author         = "Chekhov, Leonid and Eynard, Bertrand and Ribault,
                        Sylvain",
      title          = "{Seiberg-Witten equations and non-commutative spectral curves in Liouville theory}",
      journal        = "J.Math.Phys.",
      volume         = "54",
      pages          = "022306",
      doi            = "10.1063/1.4792241",
      year           = "2013",
      eprint         = "1209.3984",
      archivePrefix  = "arXiv",
      primaryClass   = "hep-th",
      reportNumber   = "IPHT-T12-075",
      SLACcitation   = "%%CITATION = ARXIV:1209.3984;%%",
}
@Article{fr10,
     author    = "Fateev, Vladimir and Ribault, Sylvain",
     title     = "{Conformal Toda theory with a boundary}",
     journal   = "JHEP",
     volume    = "12",
     year      = "2010",
     pages     = "089",
     eprint    = "1007.1293",
     archivePrefix = "arXiv",
     primaryClass  =  "hep-th",
     doi       = "10.1007/JHEP12(2010)089",
     SLACcitation  = "%%CITATION = 1007.1293;%%"
}
@Article{rib03b,
     author    = "Ribault, Sylvain",
     title     = "Strings and D-branes in curved space-times. (In French)",
     year      = "2003",
     eprint    = "hep-th/0309272",
     SLACcitation  = "%%CITATION = HEP-TH 0309272;%%"
}
The script is given at the end of this post. To install it, save it in your bin directory and make sure it is executable:
chmod +x bibpup.pl
It is possible to write a default name for the BibTeX file into the script (the variable $bibdefault), in which case the command for executing it is simply
bibpup.pl
So here is the script:
eval '(exit $?0)' && eval 'exec perl -x -S $0 ${1+"$@"}' &&
eval 'exec perl -x -S  $0 $argv:q'
if 0;
#!/usr/local/bin/perl -w

use strict;
use File::Copy;

#Note: using & at the end of unix commands called with 'system' sometimes produces
#strange results, namely later commands being ignored!

#Settings
my $myname = "bibpup.pl"; #name of this program
my $version_num = '1.01';
my $version_details = "$myname $version_num, by Sylvain Ribault";

my $tmpfile = "tmp.bib"; #name of the file where Inspire data are written
my $thecmd = "wget -q -O $tmpfile"; #command used to download the Inspire search result
my $theurl = "http://inspirehep.net/search?ln=en&p="; #relevant URL
my $beginsearch = "find+bb+"; #search command, as suggested by the source of the Inspire webpage
my $endsearch = "&of=hx&action_search=Search"; #end of the search command
my $outfield = "journal"; #BIBTEX field we want to update
my $infield = "eprint"; #BIBTEX field used for finding articles
my $entryend = "^\}"; #string used to detect the end of a BIBTEX entry
my @replacefield = ("year", "eprint", "SLACcitation"); #BIBTEX fields where updating can start

my $bibdefault = "test1.bib"; #default BIB file to be updated


#Global variables
my $bibfile = ""; #BIB file to be updated
my $readfile = ""; #Same file, but now read-only
my $bakfile = ""; #name of the backup copy
my $do_help = 0;
my $do_version = 0;
my $do_enter = 0;


#========================================================================

$bibfile = $bibdefault;
while ($_ = $ARGV[0]) {
    if ( /^-/ ){
        if ( /h/ ){
            $do_help = 1;
        }
        if ( /e/ ){
            $do_enter = 1;
        }
        if ( /v/ ){
            $do_version = 1;
        }
    }
    elsif ( $do_enter == 0 ){
        if ( /\S/ ){
            print "$myname: \"Superfluous arguments. I ignore them and proceed.\"\n";
        }
        last;
    }
    else {
        $bibfile = $ARGV[0];
        last;
    }
    shift;
}
if ( $do_version == 1 ){
    print "Running $version_details\n";
}
if ( $do_help == 1 ){
    print "$myname: \"Just printing help:\"\n";
    print_help();
    exit 0;
}
$bakfile = "$bibfile.bak";
rename($bibfile,$bakfile)
    or die "Cannot create backup file \"$bakfile\": file \"$bibfile\" may not exist\n";
$readfile = "$bibfile.read";
copy($bakfile,$readfile)
    or die "Cannot create reading file \"$readfile\": file \"$bakfile\" may not exist\n";

updatebib();
unlink $readfile;
#unlink $bakfile;
#unlink $tmpfile;
exit 0;

#=========================================================================
sub updatebib{
    my $entry = 0; #will become 1 if we find a BIB entry, 2 after a field in @replacefield, 3 after $outfield
    my $replace = 0; #will become 1 if entry must be modified
    my $searchitem = ""; #value of $infield, used for finding articles
    my $label = ""; #label of articles
    my $spirestalk = "";
    my @labels = ();
    my @cleaned_labels =();
    my @dirty_labels =();
    my $begin_entry; #text of an entry up to $replacefield not included
    my $full_entry; #text of an entry
    local *IN;
    local *OUT;
    open( IN, "<$readfile" )
        or die "Cannot read \"$readfile\"\n";
    open( OUT, ">$bibfile" )
        or die "Cannot write on \"$bibfile\"\n";
    while (<IN>) {
        my $line = $_;
        if ( /^@\w+\{(\S+)/ ){
#            print "Entry found! \"$1\"\n";    # own label for the article
            $label = $1;
            push @labels , $label;
            $searchitem = "";
            $begin_entry = "";
            $full_entry = "";
            $entry = 1;
            $replace = 0;
        }
        $full_entry = $full_entry.$line;
        if ( $entry != 0 ){
            if ( /$outfield/ ){
                $entry = 3;
            }
            if ( /$infield[^\"]+\"([^\"]+)\"/ ){
                $searchitem = $1;
#                print "$searchitem\n";   # eprint number
            }
            for (my $i = 0; $i <= $#replacefield; $i++ ) {
                if ( /$replacefield[$i]/ && $entry == 1 ){
                    $entry = 2;
                }
            }
        }
        if ( $line =~ /$entryend/ && $entry != 0 && $entry != 3 && $searchitem ne "" ){
            print "Using Inspire for the entry $label\n";
#            print "$searchitem\n";   # eprint number
            $spirestalk = getoutfield($searchitem);
#            print "What we found: $spirestalk\n";   # Journal data from Inspire
            if ( $spirestalk eq "" ){
                push @dirty_labels , $label;
            }
            else {
                push @cleaned_labels , $label;
                $replace = 1;
            }
            $searchitem = "";
        }
        if ( $entry == 0 ){
            print OUT $line;
        }
        if ( $line =~ /$entryend/ && $entry != 0 ){
            $entry = 0;
            if ( $replace == 1 ){
                $replace = 0;
                print OUT $begin_entry;
                print OUT $spirestalk;
            }
            else {
                print OUT $full_entry;
            }
        }
        if ( $entry == 1 ){
            $begin_entry = $begin_entry.$line;
        }
    }
    print "No valid update found in Inspire for the labels: ";
    for (my $i = 0; $i <= $#dirty_labels; $i++ ) {
        print "$dirty_labels[$i] ";
    }
    print "\n";
    print "Labels whose entries were updated: ";
    for (my $i = 0; $i <= $#cleaned_labels; $i++ ) {
        print "$cleaned_labels[$i] ";
    }
    print "\n";
    close IN;
    close OUT;
}

#====================================================================
sub getoutfield{
#    Given a $lookitem (arXiv number), gets the $outfield (journal ref) from Inspire.
    my $lookitem = $_[0];
    my $outresult = "";
    my $fullcmd = $thecmd." \"".$theurl.$beginsearch.$lookitem.$endsearch."\"";
    my $inentry = 0;
 #   print "$fullcmd\n";   # search command given to Inspire
    system($fullcmd) == 0
        or die "\"system $fullcmd failed: $?\"";
    local *IN;
    open (IN, "<$tmpfile" )
        or die "Cannot read \"$tmpfile\"\n";
    while (<IN>) {
        my $theline = $_;
        if ( /^@\w+\{(\S+)/ ){
            $inentry = 1;
#            print "Inspire label: $1\n";   # Inspire label for that article
        }
        if ( $inentry == 1 && /$outfield/ ){
            $inentry = 2;
        }
        if ( $inentry == 2 ){
            $outresult = $outresult.$theline;
        }
        if ( /$entryend/ ){
            $inentry = 0;
        }
    }
    close IN;
    unlink $tmpfile;
#    print "Just found with Inspire: $outresult\n";
    return $outresult;
}

#==================================================================
sub print_help{
    #quoted, comma-separated list of the fields in @replacefield (no trailing comma)
    my $replacelist = join(", ", map { "'$_'" } @replacefield);
    print <<HELP;
Usage:

    $myname [options] [file]

where a file name needs to be given only if option -e is present;
otherwise the default file name is used.

Browses the BIBTEX file, looking for entries with no '$outfield'.
If an '$infield' is nevertheless given, uses it to search Inspire for
possible '$outfield' data. If data are found, replaces the part of the
BIBTEX entry subsequent to a field
$replacelist
with the new entry from Inspire subsequent to '$outfield'. In particular
this preserves the label. The original file is saved as a '.bak' file.

Options:
    -v  prints the -version number.
    -h  prints the present -help and exits.
    -e  expects the name of the file to be -entered explicitly.
HELP
}
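
The extraction step performed by getoutfield can be tried without any network access. The sketch below is a stand-alone illustration under stated assumptions, not a replacement for the function above: the name journal_tail is hypothetical, the record is read from a string rather than from tmp.bib, and the sample record only imitates what Inspire might return (its key placeholder_key is invented; the field values are those of the updated cer12 entry shown earlier).

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the loop in getoutfield: keep every
# line of a BibTeX record from the first line containing 'journal' down to
# the closing brace of the entry.
sub journal_tail {
    my ($text) = @_;
    my $state  = 0;     # 0: outside an entry, 1: inside, 2: 'journal' seen
    my $tail   = "";
    for my $line (split /^/m, $text) {
        $state = 1 if $line =~ /^@\w+\{/;
        $state = 2 if $state == 1 && $line =~ /journal/;
        $tail .= $line if $state == 2;
        $state = 0 if $line =~ /^\}/;
    }
    return $tail;
}

# Assumed shape of a record returned by Inspire (placeholder key).
my $record = <<'BIB';
@article{placeholder_key,
      author         = "Chekhov, Leonid and Eynard, Bertrand and Ribault,
                        Sylvain",
      journal        = "J.Math.Phys.",
      volume         = "54",
      pages          = "022306",
      doi            = "10.1063/1.4792241",
      year           = "2013",
      eprint         = "1209.3984",
}
BIB

print journal_tail($record);
```

The printed text starts at the journal line and ends with the closing brace, which is the part that updatebib appends to the beginning of the original entry. A record without a journal line gives an empty string, which the script treats as "no valid update found".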