PoliGrok: A Lens on Climate Policy
Laws are the software of society; they enable it to function. Laws are code, written by programmers (a.k.a. lawyers and legislators).
On a global level there’s a lot of duplication in legal texts across countries. There’s also a lot of embedded knowledge. That can be useful, especially in emerging legal branches like climate policy.
We want citizens to be able to reuse work that has already been created and proven. To do that, we need to extract the knowledge from legal policies and render it in a human-readable form. As our challenge mentor puts it, we want to provide the evidence "for evidence-based policy making".
We believe this project, in a future form, could fight greenwashing not only at the communication level, but all the way down to the implementation of laws.
What it does
PoliGrok is a tool to analyze a collection of policy documents. As material we were provided with a couple of thousand policies from countries around the world, in various languages. A team of editors had laboriously created summaries and added metadata. Our task was to dig deeper into the full text of the documents and build a tool that helps researchers find better answers to their questions, faster and with less hassle.
How we built it
PoliGrok is built on Ambar by RD17, an open-source document search engine that combines a document crawler, a database and message queue, powerful text search, and a frontend. Although configuring it and getting it working took time, it saved us far more. We deployed the app in Docker containers on an Azure VM.
With a quick Python script we downloaded ~1,000 PDF documents, corresponding to half of the laws in the Climate Laws database. These were then synced to the VM so the app could crawl through them, parsing text and using optical character recognition (OCR) to digitise the full documents. Out of the box, Ambar's crawler handles a wide range of source document formats.
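The download script was nothing fancy; a minimal sketch of the idea is below. The CSV column name `document_url` is our illustrative assumption, not the actual layout of the provided file:

```python
import csv
import os
import urllib.request

def download_pdfs(csv_path, out_dir):
    """Fetch every direct PDF link listed in a CSV of policy documents.

    Assumes (hypothetically) one row per law with the link in a
    'document_url' column; the real column name may differ.
    Returns the list of files actually downloaded.
    """
    os.makedirs(out_dir, exist_ok=True)
    downloaded = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = row.get("document_url", "").strip()
            if not url.lower().endswith(".pdf"):
                continue  # skip rows without a direct PDF link
            dest = os.path.join(out_dir, os.path.basename(url))
            try:
                urllib.request.urlretrieve(url, dest)
                downloaded.append(dest)
            except OSError:
                continue  # dead links get skipped, not fatal
    return downloaded
```

In practice a script like this, pointed at the provided CSV, is enough to populate the directory that the Ambar crawler watches.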
Challenges we ran into
Getting Ambar up and running on Azure was a sub-challenge in itself: it required a lot of puzzling and trial and error with the config.
One of the provided CSV files, the one with the links to the full-text documents, didn't include 'entity ids', which made it impossible to correlate it with the other Events, Legislations, and Policies CSV files.
Non-English documents held up the parser, so more effort was needed to categorise documents.
Some PDFs were very large, so they were also filtered out – in future these could be reduced in size before processing.
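Filtering oversized PDFs is a one-liner in spirit; here is a minimal sketch. The 20 MB threshold is an illustrative choice, not the exact cutoff we used:

```python
import os

MAX_BYTES = 20 * 1024 * 1024  # illustrative threshold, not our actual cutoff

def partition_by_size(paths, max_bytes=MAX_BYTES):
    """Split PDF paths into those small enough to process now
    and those deferred until they can be shrunk."""
    ok, too_big = [], []
    for p in paths:
        (ok if os.path.getsize(p) <= max_bytes else too_big).append(p)
    return ok, too_big
```

Keeping the deferred list around (rather than silently dropping files) is what would later feed the "actionable errors" idea in the roadmap.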
Accomplishments that we’re proud of
Getting Ambar running on the VM – this was a labour of love.
We retrieved all the PDF policies from the provided list that we could.
Fun atmosphere and great collaboration within our team :)
What’s next for PoliGrok
Add additional metadata and tags for each document
Additional UI / timeline feature
An actionable feed of processing errors for the end user, so that problem documents can be addressed
PoliGrok forms the bedrock for analyzing these documents. By combining complex PDF extraction with OCR, it has made these PDFs and images easily searchable, machine-readable, and ready for annotation with tags. This throws the doors open for further analysis, by both humans and machines, using advanced ML and NLP techniques.
Additionally, these documents can be logically represented as a graph (a Directed Acyclic Graph, to be specific), either by time (as shown in our demo) or by other contextual relationships. This logical representation can also be stored in a graph database: file references become the nodes, relationships the edges, and dates the attributes. Plus, most graph databases come with decent visualisation tools these days.
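The timeline variant of that DAG can be sketched without any graph database at all. Below is a minimal pure-Python sketch, with made-up document names: nodes map file references to their date attribute, and edges chain documents in date order, which is acyclic by construction:

```python
from datetime import date

def build_timeline_dag(docs):
    """Build a timeline DAG from (file_ref, date) tuples.

    Returns (nodes, edges): nodes maps each file reference to its
    date attribute; edges are (earlier, later) pairs forming a
    date-ordered chain, so the graph cannot contain a cycle.
    """
    ordered = sorted(docs, key=lambda d: d[1])
    nodes = {ref: when for ref, when in ordered}
    edges = [(a[0], b[0]) for a, b in zip(ordered, ordered[1:])]
    return nodes, edges
```

Loading the same nodes and edges into a graph database would then be a straightforward import, with the visualisation coming for free.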