Urbanixm Content Classification

At Urbanixm we harvest documents from the web and extract information about the implementation and impact of urban policy and interventions. At an early stage of the content processing pipeline we judge whether a document is relevant to urbanism. At a later stage we curate relevant document paragraphs with information such as which urbanism topics are discussed, whether the text is about a policy implementation or the impact of an intervention, etc. We train a series of machine learning content classifiers to automate this content curation so that it works at scale.

Urbanixm

Urbanixm is an urban knowledge discovery engine whose goal is to organize the global information about the impact of urban policy and interventions.

We use a web-crawler to harvest content about urbanism. The web-pages are passed through an automated content curation pipeline where we extract relevant text snippets from the pages and annotate them with relevant metadata, such as which urbanism topics are discussed, whether the text describes a policy implementation or the impact of infrastructure, etc. Finally, a front-end system lets users browse the annotated content.

Urbanixm Content Classification

We develop a series of content classifiers for both entire documents (web-pages) and quotes (text snippets extracted from the documents).

The classifiers estimate, among other things (an illustrative output record follows the list),

  • The probability that a web-page is relevant to urbanism.
  • The probability that a web-page is quotable, i.e. that it contains relevant informative content as opposed to being a content archive or a search-results page.
  • The probability that a web-page or text snippet is relevant to a particular urban topic, such as segregated cycle lanes or environmental gentrification.
  • The probability that a text snippet describes a policy or infrastructure implementation.
  • The probability that a text snippet describes the impact of a policy or infrastructure implementation.
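
For concreteness, a curated snippet could end up carrying one such score per classifier. A minimal sketch of what that record might look like, with purely illustrative field names and values rather than Urbanixm's actual schema:

    # Illustrative scores attached to one curated text snippet.
    # Field names and values are hypothetical, not the production schema.
    snippet_scores = {
        "relevant_to_urbanism": 0.94,
        "quotable": 0.81,
        "topic:segregated_cycle_lanes": 0.72,
        "describes_implementation": 0.15,
        "describes_impact": 0.88,
    }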

The document classifiers were implemented as multiple binary classifiers using scikit-learn.
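
As a rough illustration of that setup, the sketch below trains one binary classifier (is this page relevant to urbanism?) with scikit-learn. The TF-IDF features, logistic regression model, and toy training data are assumptions made for the example; the production features and data are proprietary.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # One binary classifier per question, trained on labelled page texts.
    # The two examples below stand in for the proprietary training data.
    texts = [
        "New segregated cycle lanes opened along the city's main corridor.",
        "Compare cheap flights and hotel deals for your next holiday.",
    ]
    labels = [1, 0]  # 1 = relevant to urbanism, 0 = not relevant

    relevance_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    relevance_clf.fit(texts, labels)

    # predict_proba returns [P(irrelevant), P(relevant)] for each document.
    p_relevant = relevance_clf.predict_proba(["The city widened its bike network."])[0, 1]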

The quote classifiers were implemented as multi-label classifiers using the Hugging Face pipeline and a RoBERTa base model.
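
A minimal sketch of what that can look like with the Hugging Face transformers library; the label set and inference call here are assumptions for the example, and the fine-tuning step on labelled quotes is elided since that data is proprietary.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

    LABELS = ["implementation", "impact", "segregated_cycle_lanes"]  # illustrative subset

    # Multi-label head on roberta-base: one sigmoid score per label,
    # so a quote can match several labels at once.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=len(LABELS),
        problem_type="multi_label_classification",
        id2label=dict(enumerate(LABELS)),
        label2id={label: i for i, label in enumerate(LABELS)},
    )
    # ... fine-tune on labelled quotes (proprietary data) before real use ...

    quote_clf = pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        top_k=None,                   # return a score for every label
        function_to_apply="sigmoid",  # independent per-label probabilities
    )
    scores = quote_clf("The new lanes cut cycling injuries by a third.")
    # e.g. [{"label": "impact", "score": 0.91}, {"label": "implementation", "score": 0.12}, ...]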

The code can be found in the GitHub repository linked below, but the actual data used to build our production model is proprietary.

Technology

Python, PyTorch, scikit-learn, Machine Learning, Hugging Face

Links