Thursday, April 14, 2011

Scripting Groovy to process csv files.

I have a lot of experience with um, "manhandling" flat file data into usable formats. Usually this "data" comes to me in the form of excel spreadsheets, crafted with presentation in mind. They look great but, have little flexibility when it comes to importing the data into relational databases. There are hundreds of solutions for doing this sort of manipulation in bulk and if that's your lot in life by all means beg the IT department to buy you a good data transformation tool!

For some reason my situations are frequently one offs of user data that is begging to get into a database but, as of yet is still running in the wild. My solution has come down to quick Groovy scripts and splicing spreed sheets into multiple CSV files Groovy to chew through. This can be done with Python and many other script languages but, I'm fully fluent in Java so Groovy is my fastest path to a database at this point in time. This is a great place to learn a new scripting language if you have the time but, I usually want this over in a flash. On that note, if you would like to just grab the code and run jump to the bottom to grab My CsvUtils code which takes care of this whole process in a more formal utility object.

Reading a file and working with it is pretty darn simple in Groovy, the following code will open up a CSV and move all the data into nicely structured arrays. (This provides the same results as calling CSVUtils.csvToList())

 //load room types from file
def f= new File('roomtypemap-min.csv');

//set up a List of 'Rows' which will contain a list of 'columns' or you can think of it as array[][]
roomTypes = [];
def lines = 0;
f.splitEachLine(',') {
    def row = [];
    row = it;
    roomTypes << row;
    lines++
}
println "Processed $lines Lines"

The above chewed through a CSV file with name,value pairs of data and placed them in a multi-dimensional array types as a Java list. If your not up on Java collections, it's a whole lot like an array but, better.

Now lets say we did the above code with a few files then process all that data into a new CSV file ready for your code. The code below shows the building of the final results. You'll notice that along with the roomTypes list imported from a CSV file there are now a few others resortList and amenities. All these were pulled from random table dumps and such and I turned them in to usable data.


def aId = 0;    
roomTypes.each{
  def indx = roomTypes.indexOf(it);
  def resortList = roomResortMap[indx];
  def roomType  = it[0];
  
  resortList.each() {
    def facIdx = it.toInteger();
    def facilityId = resorts[it.toInteger()];
    
    amenities.each() {
      def ai = amenities.indexOf(it);
      def avail = values[ai];
      if(avail[facIdx] == true) {
        def description = it;
        println "\"$facilityId\",$roomType,$aId,\"$description\"";
      aId++;
      }
    }
  }

There are a couple things to point out if your new to Groovy. While there is a whole lot of Java in Groovy it also looks a lot like JavaScript making Groovy pretty easy for JavaScript developers to pick up despite deep Java knowledge.

The first thing to note is the iteration I am performing with the .each callback. Personally, I think it looks a good bit like a JavaScript closure. Your simply passing in code to the .each(){} iterator to execute on each item in the List. OH the () is optional when creating this closure so add if you wish.

Second, in the version of Groovy I am using, 1.7.10, you have to do a little bit of work to get the current index position of the for loop. To do this simply call yourListName.indexOf(it). It looks like the upcoming 1.8 is going to have an additional closure called eachWithIndex() which will provide you it and i variables where the i will be the current index. I'm looking forward to that feature!

Finally I just want to point out one of my favorite things for keeping life simple, string replacement. Notice on the next to list line before the braces cascade I'm calling println but, the cool part is the replacements in the string using $fieldName. When your trying to print out complicated strings nothing is worse for readability and bugs then concatenating or calling tons of append() methods. It just leads to errors. You can do simple string replacement with variables using the $fieldName inside your string or for more complicated inserts ${} will work for things like ${Object.field}.

Hope this will help you to simplify the tedious task of data mining text files!

My CSVUtils Object:

/**
 * CSVUtils is released into the public domain, do what ya want with it.
 * By Lee Clarke
 */
public class CSVUtils {  
 
  /**   
   * Removed quotes from around imported csv values if present.
   */
  public static String stripQuotes(String strIn) {  
    def valOut = strIn;  
    if(valOut==null)  
      valOut == "";  
    if(valOut.startsWith("\""))   
      valOut = valOut.substring(1);  
    if(valOut.endsWith("\""))   
      valOut = valOut.substring(0,valOut.length()-1);  
    return valOut;  
  }
    
  /**   
   * Load the csv into a Map using indexed position of values for key and value. 
   * @param keyPos - col index of the key value in a row
   * @param valPos - pull value data from index/col position in row
   * @param csvFilePath - full path to file
   */  
  public static Map csvToMap(int keyPos, int valPos, String csvFilePath) {  
    def fp= new File(csvFilePath);  
    def rtnMap = [:];  
  
    def palines = 0;  
    fp.splitEachLine(',') {  
      if(palines > 0)//skip col header line  
      {  
        def row = [];  
                row = it;  
        def key = CSVUtils.stripQuotes( row[keyPos]);  
        def val = CSVUtils.stripQuotes( (valPos < 0)?row.last():row[valPos]);  
        rtnMap.putAt(key, val);  
      }  
      palines++  
    }  
    return rtnMap;  
  }  
  
  /**
   * loads values in csv file into multi-dimentional like List of rows and cols.
   * @param csvFilePath - full file path.
   * @param skipFirstRow - skip first row if it contains column names.
   */
  public static List csvToList( String csvFilePath, boolean skipFirstRow) {
    def fr= new File(csvFilePath);
    def rtnList = [];
    def rlines = 0;
    fr.splitEachLine(',') {
      if(!(skipFirstRow && rlines == 0)) {
        println "row= $it"
        def row = [];
        row = it;
        rtnList << row;
      }
      rlines++
    }
    return rtnList;
  }
  
  /**
   * loads values in csv file into multi-dimentional like List of rows and cols.
   * @param csvFilePath - full file path.
   */
  public static List csvToList( String csvFilePath) { 
    return csvToList(csvFilePath,true);
  }
}  

5 comments:

  1. Rock on, Groovy! It really grows on you.

    For your first example, try this:

    def roomTypes = new File("roomtypemap-min.csv").readLines()*.split(",")
    println "Processed ${roomTypes.size()} Lines"

    (It does use more memory than your original.)

    Regards

    John Hurst
    Wellington, New Zealand

    ReplyDelete
  2. Always nice to see more people using Groovy :) It's a great language for both one-offs and whole applications.

    Had to do do some csv-parsing myself last year and created a small library for it. Thought it might be interesting to you. Check out GroovyCSV http://xlson.com/groovycsv/

    Regards,
    Leo

    ReplyDelete
  3. @John your right it does grow on ya fast.. I'm always trying to come up with excuses for writing scripts in Groovy now. I would love to get on a project using Play Framework or Grails where I could go crazy with Groovy.

    Thanks for the shortcut example btw, so your saying that the more compacted line reading takes up more memory? Interesting.

    ReplyDelete
  4. @Leonard Hey nice! I love how easy you can grab the code and pull it onto Groovy, I need to read up on how that's working. Nice work on the groovycsv, I'll have to pull it and see what kind of cool things it can do, I'm sure there will be more csv files in my future.. It makes you appreciate databases at least! :)

    ReplyDelete
  5. @Grab is a great enabler, making scripting with Groovy a lot easier. It just basically goes out to Maven Central (or other repos of your choice) and fetches the dependency during compile-time of the script (afaik).

    Glad you like it. Take a look at the source on Github if you're curious, it's pretty basic. Just a thin wrapper around opencsv. Hoping to get some more functionality in there at one time or other but so far no use case has presented itself.

    ReplyDelete