Getting Groovy With The Maine Public Hearing Schedule
I wrote a Groovy script that grabs the Maine Legislature's public hearing schedule and prints every item. The output looks something like this:


committee       : LVA
document number : LD-2261
date            : 2008-04-02 13:00
room            : Room 437 State House
bill title      : I.B. 3, An Act To Allow a Casino in Oxford County

committee       : ACF
document number : LD-2262
date            : 2008-03-26 14:00
room            : Room 206, Cross Building
bill title      : H.P. 1626, An Act Pertaining to the Definition of "Milk"

committee       : BEC
document number : LD-2257
date            : 2008-03-25 13:00
room            : Room 208 Cross Office Building
bill title      : H.P. 1619, An Act To Establish a Uniform Building and Energy Code

I'm going to push this stuff into a database or maybe directly into an RSS feed once I grab a few more bits of data to go with it.

The following is the entire script I used to produce the output above. It's nothing too exciting, but it shows how I used Groovy and JTidy to pull data out of several entirely unstyled documents, as a first step toward giving that data some structure for future applications.
import org.w3c.tidy.Tidy
import java.text.SimpleDateFormat

// some date formats we'll use to validate and reformat dates later
// (Locale.US so the English day and month names parse under any default locale)
SimpleDateFormat dateInHtml = new SimpleDateFormat('EEE MMM dd, yyyy, h:mm a', Locale.US)
SimpleDateFormat dateWeLike = new SimpleDateFormat('yyyy-MM-dd HH:mm', Locale.US)

String base = 'http://www.mainelegislature.org'

// this url is for the index page for all public hearings..
// right now it's hard coded to look 180 days from today
String url = base + '/legis/lio/phSched.asp?DAYS=180'

// get the "index" file and tidy it.
download(url, 'out.xml')

// parse the index page so we can rip through it looking for links we care about.
def indexPage = new XmlParser().parse(new File('out.xml'))

// find all the links in the main hearing schedule document that
// link to specific committee schedules.
// TODO - once I find all the committee codes I can just go after them without this loop
indexPage.depthFirst().grep{ it.'@href'?.contains('phSched.asp') }.each {flag ->
    // find the committee code from the url... we will do something with this later
    // (match up to the next "&" or the end of the url, whichever comes first)
    String cmty = (flag.'@href' =~ /CODE=([^&]+)/)[0][1]

    // download and Tidy the HTML schedule for each committee
    // this url looks just like the index url with the addition of a
    // "CODE" param which contains a unique code for the committee
    download(base + flag.'@href', 'schedule.xml')
    // parse this sucker
    def scheduleNodes = new XmlParser().parse(new File('schedule.xml'))

    // there are no classes or ids to use to navigate this document..
    // so we have to go by structure...
    // we are looking for all table rows with 4 columns
    // where the first column does NOT contain the string "LD"
    scheduleNodes.breadthFirst().findAll {it.name().localPart == 'tr' &&
            it.children().size() == 4 &&
            !it.children()[0].value().contains('LD')
    }.each { row ->
        // each of the rows we found has the following 4 columns

        // the LD number (legislative document number)
        String ld = findText(row.children()[0])
        // the partial title of the bill... not much use as it is truncated.
        String title = findText(row.children()[1])
        // the date and time of the hearing
        String date = findText(row.children()[2])
        // the room and building in which the meeting will take place.
        String room = findText(row.children()[3])

        // for some reason the white space character right before the am/pm in
        // the html doesn't appear to be a space... so we'll replace it with one.
        // then parse the date into something we like more
        date = date.replaceAll('.pm$', ' PM').replaceAll('.am$', ' AM')

        // print out some details for now
        println "committee       : $cmty"
        println "document number : LD-$ld"
        println "date            : ${dateWeLike.format(dateInHtml.parse(date))}"
        println "room            : $room"
        println "bill title      : $title"
        println ""
    }
}
new File('schedule.xml').delete()
new File('out.xml').delete()

// pass this a node and it takes the first child until it finds a text node and returns that.
// also replaces line breaks with spaces... so... watch that.
private def findText(node) {
    def var = node
    while (!(var instanceof String) && var.children().size() > 0) {
        var = var.children()[0]
    }
    // an empty cell has no text node at all, so fall back to an empty string
    return (var instanceof String) ? var.replaceAll('\n', ' ') : ''
}

// this will grab an html page (url)
// then run it through Tidy to clean it up and save it to outFile.
def download(String url, String outFile) {
    // temporary file which will contain the html in need of a good tidy
    File tmpOutFile = File.createTempFile('out', '.html');

    // write the url to the tmp file; the with* helpers close both streams for us
    tmpOutFile.withOutputStream { out ->
        new URL(url).withInputStream { input -> out << input }
    }

    // run tmp file through tidy
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    tidy.setMakeClean(true);
    tidy.setXHTML(true);
    tmpOutFile.withInputStream { input ->
        new File(outFile).withOutputStream { output ->
            tidy.parseDOM(input, output)
        }
    }

    // delete the temp file.
    tmpOutFile.delete();
}
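
The "navigate by structure" step is the part most likely to break when the page changes, so here is a small sketch of it on its own. The markup is a made-up stand-in for a tidied schedule page, and hearingRows is a hypothetical helper, not something from the script above. It keeps only the table rows with exactly four cells and drops the header row whose first cell reads "LD". A real tidied XHTML page would carry a namespace, in which case name() returns a QName and you would compare its localPart instead, as the script does.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; older versions find XmlParser in groovy.util without an import

// Made-up stand-in for a tidied schedule page (no namespace, so name() is a plain String).
String xhtml = '''<html><body><table>
<tr><td>LD</td><td>Title</td><td>Date</td><td>Room</td></tr>
<tr><td>2261</td><td>An Act To Allow a Casino</td><td>Wed Apr 02, 2008, 1:00 pm</td><td>Room 437 State House</td></tr>
<tr><td>not a hearing row</td></tr>
</table></body></html>'''

// Hypothetical helper: return the cell text of every 4-column row except the header.
List<List<String>> hearingRows(String xhtml) {
    def page = new XmlParser().parseText(xhtml)
    // newer Groovy versions can hand back text children too, hence the instanceof check
    return page.depthFirst().findAll {
        it instanceof Node && it.name() == 'tr' && it.children().size() == 4
    }.collect { tr ->
        tr.children().collect { td -> td.text().trim() }
    }.findAll { cells ->
        cells[0] != 'LD'
    }
}

def rows = hearingRows(xhtml)
rows.each { println it.join(' | ') }
```

Collecting each row into a plain list of strings first makes the header check a simple string comparison, which is a little easier to reason about than poking at node values.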
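
The date cleanup is easier to see in isolation too. This is a rough sketch of that one step, assuming the odd character in front of am/pm is a non-breaking space (U+00A0), which is my guess and not something I have confirmed against the page. normalizeHearingDate is a hypothetical helper name, and the Locale.US argument is an added assumption so the English day and month names parse on any machine.

```groovy
import java.text.SimpleDateFormat

// Hypothetical helper: swap whatever character sits in front of am/pm for a
// plain space, then reformat the date the same way the script does.
String normalizeHearingDate(String raw) {
    def dateInHtml = new SimpleDateFormat('EEE MMM dd, yyyy, h:mm a', Locale.US)
    def dateWeLike = new SimpleDateFormat('yyyy-MM-dd HH:mm', Locale.US)
    String cleaned = raw.replaceAll('.pm$', ' PM').replaceAll('.am$', ' AM')
    return dateWeLike.format(dateInHtml.parse(cleaned))
}

// \u00A0 is a non-breaking space, assumed here to be what the page contains
println normalizeHearingDate("Wed Apr 02, 2008, 1:00\u00A0pm")
```

Because the pattern replaces any single character before am/pm, it does not matter much which whitespace character the page really uses.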

This is one of several irons I have in the fire related to increasing transparency in my local government and providing a modicum of usability to our amazingly scattered and 1996-looking Maine state web services.

 
 
 
 
© dan


