Base R Scripting Exercise: Converting a RIS file to a well-formed XML document.

Published in Towards Data Science, Jun 12, 2019

Introduction

The goal of this post is to share a scripting problem with which I was challenged. Not having much experience with these types of challenges, I thought it would be a great opportunity to share my solution and look for feedback.

The Challenge

Write a script that will convert any .RIS file into a well-formed XML document.

Include assumptions and detailed notes.
The script must not have any dependencies outside the base language.
The script should be robust against accidental misuse.
Each record should be placed within an article tag.

Although I know that other languages are probably better suited to these types of processes, I chose to write the script in R to explore the process.

Part -I- Understanding the .RIS and .XML Formats

RIS (Research Information Systems)

The RIS file format is a tagged format that was created to standardize the exchange of bibliographic information. For example, MEDLINE, the free bibliographic database of life sciences and biomedical information, exports citations in a closely related tagged format.

The basic structure is simple: each line contains a tag and a value separated by “  - ” (two spaces, a hyphen, and a space). Each RIS record starts with a “TY” tag (Type of Reference) and ends with an “ER” tag (End of Reference). More on the RIS format here.
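To make the structure concrete, here is a minimal sketch in base R that splits a small record into tags and values. The record itself is made up for illustration, and the variable names (ris_lines, sep, tags, values) are my own rather than anything taken from a RIS specification.

```r
# A made-up RIS record, one "TAG  - value" pair per line (illustrative data only)
ris_lines <- c(
  "TY  - JOUR",
  "AU  - Doe, Jane",
  "TI  - An example article title",
  "PY  - 2019",
  "ER  - "
)

# Position of the first "  -" separator on every line
sep <- as.integer(regexpr("  -", ris_lines, fixed = TRUE))

# Everything before the separator is the tag, everything after it is the value
tags   <- trimws(substr(ris_lines, 1, sep - 1))
values <- trimws(substr(ris_lines, sep + 3, nchar(ris_lines)))

print(data.frame(tag = tags, value = values, stringsAsFactors = FALSE))
```

Note that the closing ER line carries a tag but an empty value, which is why it only serves as a record boundary.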

XML (Extensible Markup Language)

XML is a markup language that, like HTML, annotates a document with tags such as <tag> </tag>. XML, however, is extensible, meaning that users can define their own tags (e.g., instead of <h1> you could use <blog_title>). This is the main advantage of XML: the tags can do a better job of describing the information they hold, which helps when you want to do anything other than display that information in a web browser. Due to its simplicity and standardization, XML has also been popular for data exchange and storage. More on XML here.

Let’s take a quick look at an example:
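As a small illustration, the following base R sketch builds and prints a tiny XML document. The field names (title, author, year) and their values are hypothetical and chosen only to show the shape of the markup.

```r
# Hypothetical field names and values, chosen only for illustration
fields <- c(title = "An example article title", author = "Doe, Jane", year = "2019")

# Wrap each value in a start-tag and an end-tag named after its field
elements <- paste0("  <", names(fields), ">", fields, "</", names(fields), ">")

# A single root element (<article>) contains all the other elements
xml_lines <- c('<?xml version="1.0" encoding="UTF-8"?>', "<article>", elements, "</article>")
cat(xml_lines, sep = "\n")
```

Each value sits between a matching start-tag and end-tag, and everything is nested inside one root element.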

Now let’s take a look at the format constraints, so that I can ensure a properly formatted file. These are the primary constraints:

An XML document may begin with an XML declaration (i.e. <?xml version="1.0" encoding="UTF-8"?>).
None of the special characters (", <, >, &, ') may appear except when performing their markup roles.
Start-tags and end-tags are correctly nested.
Tag names are case-sensitive; the start-tag and end-tag must match exactly.
Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot begin with "-", ".", or a numeric digit.
A single root element contains all the other elements.

More on a well-formed XML document and constraints.
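Two of these constraints translate directly into small helper functions. The sketch below is one possible base R formulation; the names escape_xml and is_invalid_tag are hypothetical helpers of my own. The detail worth noticing is the order of the replacements: the ampersand has to be escaped first.

```r
# Escape the five special characters in a value.
# "&" has to be handled first; otherwise the "&" of an already inserted
# "&lt;" would be escaped a second time and become "&amp;lt;".
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub("'", "&apos;", x, fixed = TRUE)
  x <- gsub('"', "&quot;", x, fixed = TRUE)
  x
}

# TRUE when a tag name breaks one of the naming constraints listed above:
# it contains a forbidden character or whitespace, or it starts with
# "-", "." or a digit.
is_invalid_tag <- function(tag) {
  forbidden <- "[\\]\\[!\"#$%&'()*+,/;<=>?@\\\\^`{|}~\\s]"
  grepl(forbidden, tag, perl = TRUE) | grepl("^[-.0-9]", tag, perl = TRUE)
}

escape_xml('Tom & "Jerry" <1>')
# returns "Tom &amp; &quot;Jerry&quot; &lt;1&gt;"

is_invalid_tag(c("TY", "1A", "blog title"))
# returns FALSE TRUE TRUE
```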

Part -II- The Script

Assumptions Made:
The RIS file follows the standard RIS tag format, including:

Each RIS record ends with an ‘end record’ tag: “ER  -”
Each RIS tag and value are separated by “  -”
Each RIS tag and value are on their own line

# naming function and function arguments
ris2xml <- function(x = "", root = "root_element", save = FALSE, file_name = "new_file"){
  # read lines
  ris2 <- readLines(x, encoding = "UTF-8")
  # separating articles
  df <- data.frame(article = NA, ris = ris2, stringsAsFactors = FALSE)
  end_of_records <- rev(grep(pattern = '^ER  -', ris2))
  article = length(end_of_records)
  for(i in end_of_records){
    df$article[1:i] <- article
    article = article - 1
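  }
  # stop early if no end-of-record tag was found
  if(length(end_of_records) == 0){
    stop("no 'ER  -' end-of-record tag found; is this a RIS file?")
  }
  # removing ER lines and empty lines
  df <- df[-end_of_records,]
  df <- df[grep(pattern = '\\S', df$ris ),]
  # splitting tags and values at the first "  -" on each line, so that
  # values containing "  -" stay intact and empty values do not cause an error
  sep <- as.integer(regexpr("  -", df$ris, fixed = TRUE))
  if(any(sep == -1)){
    stop("every non-empty line must contain a tag and a value separated by '  -'")
  }
  df$tags <- substr(df$ris, 1, sep - 1)
  df$values <- substr(df$ris, sep + 3, nchar(df$ris))
  # trim any extra white space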
  df$tags <- trimws(df$tags, which = "both")
  df$values <- trimws(df$values, which = "both")
  
  # xml special character constraints: '&' must be escaped first,
  # otherwise the '&' in '&lt;' and '&gt;' would itself be escaped
  df$values <- gsub('&', '&amp;', df$values)
  df$values <- gsub('<', '&lt;', df$values)
  df$values <- gsub('>', '&gt;', df$values)
  df$values <- gsub("'", '&apos;', df$values)
  df$values <- gsub('"', '&quot;', df$values)

  ######## Function for finishing steps after tag constraints #########
  finish <- function(){
    # putting values in tags
    # paste0 is vectorized, so no loop over the rows is needed
    df$tagged <- paste0('<', df$tags, '>', df$values, '</', df$tags, '>')
    # adding article tags and root tag
    document <- data.frame(article = unique(df$article), body = NA)
    # article tags
    for(i in document$article){
      vect <- unlist(df[df$article == i, "tagged"])
      temp <-  paste0(vect, collapse = "")
      document[document$article == i, "body"] <- paste0("<Article>", temp, '</Article>', collapse = "")
    }
    # adding root tag & xml tag
    body <- paste0(document$body, collapse = "")
    document_final <- paste0('<?xml version="1.0"?>','<', root, '>',body,'</', root ,'>', collapse = "")

    # save file if user chose to; this has to happen before return(),
    # because nothing after return() is executed
    if(isTRUE(save)){
      write(document_final, file = paste0(file_name, ".xml"))
    }
    # return xml document
    return(document_final)
  }
######################################################################
  # Enforcing XML tag constraints
  # finding invalid tags
  invalid <- grep("[\\[\\!\"#\\$%&'\\(\\)\\*\\+,/;<=>\\?@\\[\\\\\\]\\^`\\{\\|\\}~]|(^\\d)|(^-)|(^\\.)", df$tags, perl = TRUE)
  # if there are invalid tags: print warning and print the invalid tags
  
if(length(invalid) > 0){
    print('WARNING: there may be invalid tag names. please check to make sure the tags follow the xml constraints')
    print("The following tags are invalid: ")
    print(df$tags[invalid])
    # give the user the option to re-write the tags
    re_write <- readline(prompt = "to re-write these tags please do so in order and separated by a comma: ")
    re_write <- unlist(strsplit(x = re_write, split = ","))
    re_write <- trimws(re_write, which = "both")
    # check user inputs for validity
    re_invalid <- grep("[\\[\\!\"#\\$%&'\\(\\)\\*\\+,/;<=>\\?@\\[\\\\\\]\\^`\\{\\|\\}~]|(^\\d)|(^-)|(^\\.)", re_write, perl = TRUE)
    # make sure the number of inputs match & inputs meet constraints before finishing script
    if((length(re_write) == length(invalid)) && (length(re_invalid) == 0)){
      df[invalid, "tags"] <- re_write
    # constraints are met so finish script
    finish()
    # if constraints not met, print error and exit.
    }else{
      print("Error: re-writing tags not valid. Please try function again")
    }
  }else{
    # constraints are met so finish script
    finish()
  }
}

Arguments:

x = "" : path to the RIS file to be converted
root = "root_element" : the root tag which will hold the rest of the document
save = FALSE : TRUE/FALSE, whether to save the converted file to disk
file_name = "new_file" : the name, without the .xml extension, under which the converted file is saved
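As a side note on the record-numbering step of the script, the same numbering can probably be obtained without a loop. The sketch below is an alternative formulation of my own, not what the script does, and it runs on made-up input lines.

```r
# Made-up RIS lines containing two records (illustrative data only)
ris_lines <- c(
  "TY  - JOUR", "TI  - First title", "ER  - ",
  "",
  "TY  - BOOK", "TI  - Second title", "ER  - "
)

# TRUE on every end-of-record line
is_end <- grepl("^ER  -", ris_lines)

# cumsum(is_end) counts the ER lines seen so far, including the current one.
# Subtracting is_end keeps each ER line in the record it closes, and adding 1
# makes the numbering start at 1.
article <- cumsum(is_end) - is_end + 1

print(data.frame(article = article, ris = ris_lines, stringsAsFactors = FALSE))
```

One difference from the loop in the script: lines that follow the last ER tag receive the next record number here instead of NA, so they would still need to be handled separately.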

Conclusion

And that’s it. As for testing, I ran the output files through some XML validators and also used some XML parsers in R to make sure the formatting was correct. I can already think of a few more steps I could have added, such as checking that the RIS file meets the assumptions before continuing. I know this script is far from perfect, but it was a fun way to work with R in a way I never had before. I’d love to hear your thoughts! Thanks for reading.
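The assumption check mentioned above could look roughly like the following sketch. The function check_ris is a hypothetical helper that is not part of the script; it only tests the assumptions listed in Part II and returns a description of every violation it finds.

```r
# Hypothetical helper (not part of the original script): returns a character
# vector describing every violated assumption, or character(0) if none.
check_ris <- function(lines) {
  problems <- character(0)

  # ignore empty lines, as the script does
  content <- lines[grepl("\\S", lines)]
  if (length(content) == 0) {
    return("the file contains no non-empty lines")
  }

  # every non-empty line should contain the "  -" separator
  no_sep <- !grepl("  -", content, fixed = TRUE)
  if (any(no_sep)) {
    problems <- c(problems, paste("line without a tag/value separator:", content[no_sep]))
  }

  # there should be an end-of-record tag, and the last line should be one
  has_end <- grepl("^ER  -", content)
  if (!any(has_end)) {
    problems <- c(problems, "no 'ER  -' end-of-record tag found")
  } else if (!has_end[length(content)]) {
    problems <- c(problems, "the last record is not closed by 'ER  -'")
  }

  problems
}

# A well-formed record produces no messages
check_ris(c("TY  - JOUR", "TI  - A title", "ER  - "))

# A stray line and a missing ER tag are both reported
check_ris(c("TY  - JOUR", "a stray line", ""))
```

A function like this could be called on the result of readLines() at the top of ris2xml, stopping with the collected messages if any are returned.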