Delurk, on hacking Parliamentary information

It was great to hear about Rewired State: Parliament. I have worked in this space in an unusual way for several years and it is tremendous to see it being embraced. I’m sorry I couldn’t be there myself. I’ve called this post delurk because I should have started playing well with others in the field long ago, and this is my attempt to make up for that. All code is released under GPL v3, FWIW.

What this post covers:

  1. How do you make parliamentary information accessible for blind members? (Demo vid half way down).
  2. The XSLT stylesheet I wrote for bills to get semantic XML from mush. (My other motivation in writing is to push accessibility up the agenda a bit.) The code is at the bottom, if you want to skip to the good bit!

One final point: if you produce something this weekend you think should keep going and it needs a home, Full Fact may well be able to help and I’d love to hear from you by email: will@fullfact.org or on twitter @puzzlesthewill. If you’d like to help Full Fact with some of the techie projects we have coming up, I’d love to hear from you too.

Accessibility

I worked for three years for Colin Low, a completely blind independent crossbench member of the House of Lords.

There are basically two things that become much harder if you’re blind in the corridors of power. One is making use of those corridors: seeing who is passing and taking the opportunity to have a word, or navigating a crowded reception and making sure you meet who you want to meet. That was part of my job: being a human guide dog.

The other is accessing the formal information of the House. Lord Low can access information in three ways: braille, audio, or simple electronic files including word documents via a BrailleNote, which is a kind of laptop with braille and audio output.

It all boils down to a need to take various complex parliamentary documents and get them into various simpler or more structured formats, such as plain text, a simple word document, or a DAISY Digital Talking Book.

The main advantage of DAISY is that it supports hierarchical structure, which helps to solve the problem of not being readily able to skim-read, or glance at a page and see where you are in a document. It is perfect for bills, for example.

There were eight main things I worried about making accessible. This is how we solved each problem (to the extent that we did).

  1. Annunciators. TV screens that show who is speaking, what about, and how long they have been speaking for—and also what is being voted on when the division bell goes. Thankfully the Lords ones (not the Commons’) had an intranet version, so it was a question of getting wifi access for Lord Low’s BrailleNote, which Parly ICT were very helpful about.
  2. House of Lords Business. The official order paper, which includes the list of questions tabled. Ignored when possible, and copy-pasted into a word document when not.
  3. Forthcoming Business. A word doc with dreadful formatting. Save as -> plain text.
  4. Speakers Lists, which show who will speak (and, on the day, in what order) in set-piece debates, which is almost everything except for debates on bills after the first debate (known as the second reading). Copy-paste into a text file.
  5. Hansard (or differently here). XSLT from the official XML into DAISY.
  6. Bills, i.e. draft laws. Round 1, I wrote a scraper in Python to take the bill off the Parliamentary website and convert it to DAISY. Ugh. Then they changed the website, as I knew was inevitable. Round 2, I got access to the XML dumps of bills and wrote an XSLT to convert them to a DAISY format, with the lowest individually navigable hierarchy elements being clauses and schedules.
  7. Amendments. Again, XSLT from the XML formats, for the four different amendment formats (sheets and marshalled lists in each House).
  8. Groups. Amendments are not usually debated individually. They are grouped together by topic. The groups are determined in slightly different ways at different stages in the Commons and differently again in the Lords. Groups appear as lists of amendment numbers, one group per line. “1, 2, 208ZBA” might be a group. Technically, that’s a copy-paste job. But if you can’t access and inter-relate bills and amendments, the lists are meaningless. So I would manually add a précis of what each group was about and who was involved. All of which is by way of saying that the most interesting problems still require humans to solve (cf. publicwhip.org.uk).
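As an illustration of the groupings format described in item 8, here is a minimal Python sketch that reads one group per line and looks up which group an amendment belongs to. It is not the process used at the time (that was copy-paste plus a hand-written précis). The function names and the assumed shape of an amendment number (digits followed by optional capital letters, as in "208ZBA") are illustrative assumptions, and real groupings lists may contain other entries that this sketch would reject.

```python
import re

# Assumption: an amendment number is digits plus optional capital letters, e.g. "208ZBA".
AMENDMENT_RE = re.compile(r"^\d+[A-Z]*$")


def parse_groupings(text):
    """Return a list of groups, each a list of amendment numbers (as strings)."""
    groups = []
    for line in text.splitlines():
        numbers = [part.strip() for part in line.split(",") if part.strip()]
        if not numbers:
            continue  # skip blank lines between groups
        unrecognised = [n for n in numbers if not AMENDMENT_RE.match(n)]
        if unrecognised:
            raise ValueError(f"unrecognised amendment number(s): {unrecognised}")
        groups.append(numbers)
    return groups


def group_of(groups, amendment):
    """Return the index of the group containing the amendment, or None if absent."""
    for index, group in enumerate(groups):
        if amendment in group:
            return index
    return None
```

Calling `parse_groupings("1, 2, 208ZBA")` gives one group of three amendment numbers. The précis of what each group is about still has to come from a human.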

Here is a little demo of the last Health and Social Care Bill and a House of Lords Marshalled List of Amendments being played through a PC’s DAISY reader. Nowadays you can even read DAISY books on your phone, or have it read them to you. Thanks to archive.org for hosting.

As well as being brilliantly accessible with the right software, DAISY books can be output in braille, large print, audio etc. without further work. This is why we all love XML.

Note the braille title after the printed one

Parsing Bills

At the time, the easiest way to create a DAISY book was to transform the raw XML that Parliament (actually TSO) puts out into a structured XML format, and then run that through special software developed by the RNIB.

The XML that TSO provides is unbelievably hideous, which is the only reason this was a hard job. It is simply the internal format of Adobe FrameMaker (which gets output as something called MIF) translated directly into XML. The elements are all to do with pages, positioning, formatting and the like, and are very hard to relate to the semantic elements of a bill.

The input format I needed to create DAISY books looked a lot like this (not my fault):

<book> <body> <section> <subsect1> <subsect2> <subsect3> ...

So basically what the XSLT does is spit out the different components of a bill, marshalled list etc. as well-formed XML with this kind of structure. It is trivially adaptable to spit out more meaningfully named elements. So trivially that, I see now, I added it as an option.
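To make the target structure concrete, here is a small Python sketch that builds the same kind of nesting from a flat list of headings. It is not the stylesheet itself, only an illustration of the output shape. The element names `book`, `body`, `section` and `subsect1` to `subsect3` come from the format shown above. The `title` child element, the `build_book` function and the choice of which bill component sits at which depth are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Element names from the structure shown above; depth 0 maps to <section>.
LEVEL_TAGS = ["section", "subsect1", "subsect2", "subsect3"]


def build_book(headings):
    """Build a <book><body>... tree from (depth, title) pairs in document order."""
    book = ET.Element("book")
    body = ET.SubElement(book, "body")
    # stack[0] is <body>; stack[d + 1] is the currently open element at depth d.
    stack = [body]
    for depth, title in headings:
        if not 0 <= depth < len(LEVEL_TAGS):
            raise ValueError(f"depth {depth} is outside the supported levels")
        if depth > len(stack) - 1:
            raise ValueError(f"heading {title!r} skips a level")
        del stack[depth + 1:]  # close anything at this depth or deeper
        element = ET.SubElement(stack[depth], LEVEL_TAGS[depth])
        # Hypothetical: a <title> child holding the heading text.
        ET.SubElement(element, "title").text = title
        stack.append(element)
    return book
```

For example, `build_book([(0, "Part 1"), (1, "Clause 1"), (1, "Clause 2"), (0, "Schedule 1")])` yields two `section` elements under `body`, the first containing two `subsect1` elements. That nesting is what lets a DAISY reader jump between parts, clauses and schedules.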

Now for an example.

I use Saxon Home Edition to do the transformation, with Java. Put the first two in the same directory, unzip the raw XML, and run (on Windows):

java -Xmx1000000000 -jar c:\saxonb\saxon9he.jar -t -s:bill.xml -xsl:bills.xsl -o:dtb.xml styles=true

and you should get what I got. If you leave out styles=true it’s smaller and cleaner but less semantically useful.
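If you would rather drive the transformation from a script, for example to convert several bills in one go, the command can be wrapped in a few lines of Python. This is only a sketch: the flags and the `styles=true` parameter are copied from the command line in this post, while the function names and the `styles` keyword argument are hypothetical.

```python
import subprocess


def build_saxon_command(jar, source, stylesheet, output, styles=True):
    """Return the java/Saxon argument list for a single transformation."""
    command = [
        "java",
        "-Xmx1000000000",        # same heap limit as the command in this post
        "-jar", jar,
        "-t",                    # ask Saxon to print timing information
        f"-s:{source}",          # raw bill XML
        f"-xsl:{stylesheet}",    # the stylesheet
        f"-o:{output}",          # structured XML output
    ]
    if styles:
        # Stylesheet parameter from this post: larger but more semantically useful output.
        command.append("styles=true")
    return command


def run_transform(jar, source, stylesheet, output, styles=True):
    """Run Saxon and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_saxon_command(jar, source, stylesheet, output, styles), check=True)
```

Passing the arguments as a list rather than one shell string means paths containing spaces or backslashes need no extra quoting.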

If you do do something with this, I’d love to hear from you. And if you need a hand, just give me a shout (contacts at the top of this post).