
Getting Groovy With The Maine Public Hearing Schedule

March 23rd, 2008

I wrote a Groovy script that grabs the Maine Legislature's public hearing schedule and prints out every item on it. The output kinda looks like this:



committee       : LVA
document number : LD-2261
date            : 2008-04-02 13:00
room            : Room 437 State House
bill title      : I.B. 3, An Act To Allow a Casino in Oxford County

committee       : ACF
document number : LD-2262
date            : 2008-03-26 14:00
room            : Room 206, Cross Building
bill title      : H.P. 1626, An Act Pertaining to the Definition of "Milk"

committee       : BEC
document number : LD-2257
date            : 2008-03-25 13:00
room            : Room 208 Cross Office Building
bill title      : H.P. 1619, An Act To Establish a Uniform Building and Energy Code



I'm going to push this stuff into a database or maybe directly into an RSS feed once I grab a few more bits of data to go with it.
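
For the RSS option, here is a rough sketch of what that step might look like using Groovy's `MarkupBuilder`. Nothing here is from the finished script: the `hearings` list, the channel title and description, and the layout of each item are all assumptions made for illustration, with the sample records copied from the output above.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical records, shaped like the fields the script prints.
def hearings = [
    [committee: 'LVA', ld: '2261', date: '2008-04-02 13:00',
     room: 'Room 437 State House',
     title: 'I.B. 3, An Act To Allow a Casino in Oxford County'],
    [committee: 'BEC', ld: '2257', date: '2008-03-25 13:00',
     room: 'Room 208 Cross Office Building',
     title: 'H.P. 1619, An Act To Establish a Uniform Building and Energy Code']
]

def writer = new StringWriter()
def xml = new MarkupBuilder(writer)

// Build a minimal RSS 2.0 document: one channel, one item per hearing.
xml.rss(version: '2.0') {
    channel {
        title('Maine Public Hearing Schedule')   // assumed feed title
        link('http://www.mainelegislature.org')
        description('Upcoming public hearings')  // assumed feed description
        hearings.each { h ->
            item {
                title("LD-${h.ld}: ${h.title}")
                description("${h.committee} committee, ${h.date}, ${h.room}")
            }
        }
    }
}

println writer
```

A real feed would probably also want a `link` and `pubDate` on each item, which is part of the extra data still to be collected.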



The following is the entire script I used to produce the listing at the top of this post. It's nothing too exciting, but it shows how I used Groovy and Tidy (the JTidy library) to grab some data from several entirely unstyled documents, as a first step toward giving that data some structure for future applications.

import org.w3c.tidy.Tidy
import java.text.SimpleDateFormat

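// some date formats we'll use to validate and reformat dates later.
// the input format is pinned to Locale.US so the English day and month
// names still parse on a machine with a different default locale.
SimpleDateFormat dateInHtml = new SimpleDateFormat('EEE MMM dd, yyyy, h:mm a', Locale.US)
SimpleDateFormat dateWeLike = new SimpleDateFormat('yyyy-MM-dd HH:mm')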

String base = 'http://www.mainelegislature.org'

// this url is for the index page for all public hearings..
// right now it's hard coded to look 180 days from today
String url = base + '/legis/lio/phSched.asp?DAYS=180'

// get the "index" file and tidy it.
download(url, 'out.xml')

// parse the index page so we can rip through it looking for links we care about.
def indexPage = new XmlParser().parse(new File('out.xml'))

// find all the links in the main hearing schedule document that
// link to specific committee schedules.
// TODO - once I find all the committee codes I can just go after them without this loop
indexPage.depthFirst().grep{ it.'@href'?.contains('phSched.asp') }.each {flag ->
    // find the committee code from the url... we will do something with this later
    String cmty = (flag.'@href' =~ /.*?CODE=(.*?)&.*?/)[0][1]

    // download and Tidy the HTML schedule for each committee
    // this url looks just like the index url with the addition of a
    // "CODE" param which contains a unique code for the committee
    download(base + flag.'@href', 'schedule.xml')
    // parse this sucker
    def scheduleNodes = new XmlParser().parse(new File('schedule.xml'))

    // there are no classes or ids to use to navigate this document..
    // so we have to go by structure...
    // we are looking for all table rows with 4 columns
    // where the first column does NOT contain the string "LD"
    scheduleNodes.breadthFirst().findAll {it.name().localPart == 'tr' &&
            it.children().size() == 4 &&
            !it.children()[0].value().contains('LD')
    }.each { row ->
        // each of the rows we found has the following 4 columns

        // the LD number (legislative document number)
        String ld = findText(row.children()[0])
        // the partial title of the bill... not much use as it is truncated.
        String title = findText(row.children()[1])
        // the date and time of the hearing
        String date = findText(row.children()[2])
        // the room and building in which the meeting will take place.
        String room = findText(row.children()[3])

        // for some reason the white space character right before the am/pm in
        // the html doesn't appear to be a space... so we'll replace it with one.
        // then parse the date into something we like more
        date = date.replaceAll('.pm$', ' PM').replaceAll('.am$', ' AM')

        // print out some details for now
        println "committee       : $cmty"
        println "document number : LD-$ld"
        println "date            : ${dateWeLike.format(dateInHtml.parse(date))}"
        println "room            : $room"
        println "bill title      : $title"
        println ""
    }
}
new File('schedule.xml').delete()
new File('out.xml').delete()

// pass this a node and it takes the first child until it finds a text node and returns that.
// also replaces line breaks with spaces... so... watch that.
// returns an empty string if it hits an empty element before finding any text.
private String findText(node) {
    def current = node
    while (!(current instanceof String) && current.children().size() > 0) {
        current = current.children()[0]
    }
    return current instanceof String ? current.replaceAll('\n', ' ') : ''
}

// this will grab an html page (url)
// then run it through Tidy to clean it up and save it to outFile.
def download(String url, String outFile) {
    // temporary file which will contain the html in need of a good tidy
    File tmpOutFile = File.createTempFile('out', '.html')

    // write the url to the tmp file (both streams are closed for us)
    tmpOutFile.withOutputStream { out ->
        new URL(url).withInputStream { input -> out << input }
    }

    // run tmp file through tidy
    Tidy tidy = new Tidy()
    tidy.setQuiet(true)
    tidy.setShowWarnings(false)
    tidy.setMakeClean(true)
    tidy.setXHTML(true)
    // close both files when tidy is done so the output is flushed
    // and the temp file can actually be deleted
    tmpOutFile.withInputStream { input ->
        new File(outFile).withOutputStream { output ->
            tidy.parseDOM(input, output)
        }
    }

    // delete the temp file.
    tmpOutFile.delete()
}
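
The date handling is the fiddliest part of the script, so here is a small standalone sketch of just that step. The sample cell text is made up to match the `dateInHtml` pattern, and the non-breaking space (`\u00A0`) before "pm" is only a guess at what the odd whitespace character in the real page is.

```groovy
import java.text.SimpleDateFormat

// Same two patterns as the script; the input locale is pinned to US English.
SimpleDateFormat dateInHtml = new SimpleDateFormat('EEE MMM dd, yyyy, h:mm a', Locale.US)
SimpleDateFormat dateWeLike = new SimpleDateFormat('yyyy-MM-dd HH:mm')

// Hypothetical cell text with a non-breaking space before "pm" (an assumption).
String raw = 'Wed Apr 02, 2008, 1:00\u00A0pm'

// The '.' matches whatever single character sits in front of am/pm,
// so the odd whitespace is swapped for a plain space plus an upper-case marker.
String cleaned = raw.replaceAll('.pm$', ' PM').replaceAll('.am$', ' AM')

String formatted = dateWeLike.format(dateInHtml.parse(cleaned))
println formatted   // prints 2008-04-02 13:00
```

Because the pattern replaces any one character before the am/pm marker, it works without knowing exactly which whitespace character the page uses.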




This is one of several irons I have in the fire, all aimed at increasing transparency in my local government and bringing a modicum of usability to Maine's amazingly scattered, 1996-looking state web services.

