I wrote a Groovy script that grabs the Maine Legislature's public hearing schedule and prints out all the items. The output kinda looks like this:
committee : LVA
document number : LD-2261
date : 2008-04-02 13:00
room : Room 437 State House
bill title : I.B. 3, An Act To Allow a Casino in Oxford County

committee : ACF
document number : LD-2262
date : 2008-03-26 14:00
room : Room 206, Cross Building
bill title : H.P. 1626, An Act Pertaining to the Definition of "Milk"

committee : BEC
document number : LD-2257
date : 2008-03-25 13:00
room : Room 208 Cross Office Building
bill title : H.P. 1619, An Act To Establish a Uniform Building and Energy Code
I'm going to push this stuff into a database or maybe directly into an RSS feed once I grab a few more bits of data to go with it.
The following is the entire script I used to produce the above output. It's nothing too exciting, but it shows how I used Groovy and JTidy to grab some data from several entirely unstyled documents, as a first step toward giving that data some structure for future applications.
import org.w3c.tidy.Tidy
import java.text.SimpleDateFormat
// some date formats we'll use to validate and reformat dates later
SimpleDateFormat dateInHtml = new SimpleDateFormat('EEE MMM dd, yyyy, h:mm a')
SimpleDateFormat dateWeLike = new SimpleDateFormat('yyyy-MM-dd HH:mm')
String base = 'http://www.mainelegislature.org'
// this url is for the index page for all public hearings..
// right now it's hard coded to look 180 days from today
String url = base + '/legis/lio/phSched.asp?DAYS=180'
// get the "index" file and tidy it.
download(url, 'out.xml')
// parse the index page so we can rip through it looking for links we care about.
def indexPage = new XmlParser().parse(new File('out.xml'))
// find all the links in the main hearing schedule document that
// link to specific committee schedules.
// TODO - once I find all the committee codes I can just go after them without this loop
indexPage.depthFirst().grep{ it.'@href'?.contains('phSched.asp') }.each {flag ->
    // find the committee code from the url... we will do something with this later
    String cmty = (flag.'@href' =~ /.*?CODE=(.*?)&.*?/)[0][1]
    // download and Tidy the HTML schedule for each committee
    // this url looks just like the index url with the addition of a
    // "CODE" param which contains a unique code for the committee
    download(base + flag.'@href', 'schedule.xml')
    // parse this sucker
    def scheduleNodes = new XmlParser().parse(new File('schedule.xml'))
    // there are no classes or ids to use to navigate this document..
    // so we have to go by structure...
    // we are looking for all table rows with 4 columns
    // where the first column does NOT contain the string "LD"
    scheduleNodes.breadthFirst().findAll {it.name().localPart == 'tr' &&
                                          it.children().size() == 4 &&
                                          !it.children()[0].value().contains('LD')
    }.each { row ->
        // each of the rows we found has the following 4 columns
        // the LD number (legislative document number)
        String ld = findText(row.children()[0])
        // the partial title of the bill... not much use as it is truncated.
        String title = findText(row.children()[1])
        // the date and time of the hearing
        String date = findText(row.children()[2])
        // the room and building in which the meeting will take place.
        String room = findText(row.children()[3])
        // for some reason the white space character right before the am/pm in
        // the html doesn't appear to be a space... so we'll replace it with one.
        // then parse the date into something we like more
        date = date.replaceAll('.pm$', ' PM').replaceAll('.am$', ' AM')
        // print out some details for now
        println "committee : $cmty"
        println "document number : LD-$ld"
        println "date : ${dateWeLike.format(dateInHtml.parse(date))}"
        println "room : $room"
        println "bill title : $title"
        println ""
    }
}
new File('schedule.xml').delete()
new File('out.xml').delete()
// pass this a node and it takes the first child until it finds a text node and returns that.
// also replaces line breaks with spaces... so... watch that.
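private def findText(node) {
    def current = node
    // keep taking the first child until we hit a text node (a String) or run out of children
    while (!(current instanceof String) && current.children().size() > 0) {
        current = current.children()[0]
    }
    // an element with no text at all (e.g. an empty cell) yields an empty string
    return current instanceof String ? current.replaceAll('\n', ' ') : ''
}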
// this will grab an html page (url)
// then run it through Tidy to clean it up and save it to outFile.
def download(String url, String outFile) {
    // temporary file which will contain the html in need of a good tidy
    File tmpOutFile = File.createTempFile('out', '.html')
    // write the url to the tmp file; the with*Stream methods close the streams for us
    tmpOutFile.withOutputStream { out ->
        new URL(url).withInputStream { input -> out << input }
    }
    // run tmp file through tidy
    Tidy tidy = new Tidy()
    tidy.setQuiet(true)
    tidy.setShowWarnings(false)
    tidy.setMakeClean(true)
    tidy.setXHTML(true)
    tmpOutFile.withInputStream { input ->
        new File(outFile).withOutputStream { output -> tidy.parseDOM(input, output) }
    }
    // delete the temp file.
    tmpOutFile.delete()
}
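
The date handling is the fiddliest part of the script, so here is a minimal standalone sketch of just that step. The sample string is made up: I'm assuming the odd character before the am/pm is a non-breaking space (\u00A0), which I haven't confirmed against the real pages. The Locale.ENGLISH argument is also my addition, so the day and month names parse the same way on any machine.

```groovy
import java.text.SimpleDateFormat

// Same two patterns as the script. Locale.ENGLISH is an addition (not in the
// original) so "Wed", "Apr" and "PM" parse regardless of the default locale.
def dateInHtml = new SimpleDateFormat('EEE MMM dd, yyyy, h:mm a', Locale.ENGLISH)
def dateWeLike = new SimpleDateFormat('yyyy-MM-dd HH:mm', Locale.ENGLISH)

// Hypothetical cell text: \u00A0 stands in for whatever non-space character
// actually sits before the am/pm in the schedule pages.
String raw = 'Wed Apr 02, 2008, 1:00\u00A0pm'

// Swap the single character before a trailing am/pm for a plain space,
// exactly as the script does.
String cleaned = raw.replaceAll('.pm$', ' PM').replaceAll('.am$', ' AM')

// Parse the cleaned-up text, then print it in the format we like more.
String formatted = dateWeLike.format(dateInHtml.parse(cleaned))
println formatted
```

With that input it should print 2008-04-02 13:00, the same shape as the dates in the output above.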
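
One thing to watch in the script: the pattern that pulls out the committee code expects an "&" after the CODE value, so it only works when CODE isn't the last parameter in the link. Below is a rough sketch of a more forgiving version. The hrefs are invented examples shaped like the committee links, and committeeCode is just a name I made up for the helper.

```groovy
// Hypothetical hrefs; the real pages may order their parameters differently.
def hrefs = [
    '/legis/lio/phSched.asp?CODE=LVA&DAYS=180',
    '/legis/lio/phSched.asp?DAYS=180&CODE=ACF',
    '/legis/lio/phSched.asp?DAYS=180'
]

// Hypothetical helper: returns the value of the CODE query parameter,
// or null when the link has no CODE parameter at all.
def committeeCode = { String href ->
    def m = href =~ /[?&]CODE=([^&]+)/
    m.find() ? m.group(1) : null
}

// Show what the helper extracts from each sample link.
hrefs.each { println "${it} -> ${committeeCode(it)}" }
```

Returning null for a link without a code means the caller can skip it instead of blowing up on a missing match.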
This is one of several irons I have in the fire related to increasing transparency in my local government and bringing a modicum of usability to our amazingly scattered and 1996-looking Maine state web services.