Lessons Learnt in Developing and Supporting Infrastructures for Australian Urban and Built Environment Research

The Australian Urban Research Infrastructure Network (AURIN – www.aurin.org.au) commenced in 2010. The primary resource offered by AURIN from 2010 to 2022 was the AURIN portal, a single sign-on web application offering seamless and secure access to over 5,500 data sets from over 150 major organisations, together with over 100 analytical tools. These data sets typically came from the definitive providers of urban and built environment data for Australia, e.g. official Census data from the Australian Bureau of Statistics. The AURIN portal was designed, developed, and supported by the Melbourne eResearch Group (MeG - www.eresearch.unimelb.edu.au) at The University of Melbourne. The portal was ultimately decommissioned in 2022, having been used by over 20,000 users across more than 300,000 user access sessions (data from the Australian Access Federation, www.aaf.edu.au). This paper describes the technical evolution of the AURIN platform over a 12-year period. The paper also introduces the Spatial Urban Data Observatory (SUDO – https://sudo.eresearch.unimelb.edu.au), the free urban research platform for Australia that builds upon MeG's many years of experience in hosting and running the AURIN platform. We outline case studies in applying SUDO with a focus on air quality and bushfires.


INTRODUCTION
The Australian population primarily lives in major cities. In 2023, the population reached 26.6 million people [1]. This is expected to grow further, with an annual increase in 2023 of 2.4%. The vast majority of the Australian population live in cities, and especially the capital cities (Adelaide, Brisbane, Canberra, Darwin, Hobart, Melbourne, Perth, Sydney). Given this growth, it is unsurprising that these cities face significant challenges, since they were not designed from the outset for such large and growing populations, e.g. the road networks were designed long before the population grew to its present size. This gives rise to traffic and congestion issues. The same pressures arise for air quality, housing, education, health etc. The recent challenges associated with Covid-19 have also demonstrated the dangers of living in close proximity and the potential for infectious disease outbreaks.
At the same time, more and more digital data is being created all around us, from official data, e.g. statistics on the population of Australia from organisations such as the Australian Bureau of Statistics (ABS - www.abs.gov.au), through to ad hoc data, e.g. public opinion on social media from sources such as Twitter/X. The term big data has now entered the common vernacular [2]. Big data is typically defined by its key characteristics: its size (volume), its speed of production (velocity), its variety, and its veracity, i.e. whether the data can be trusted. Urban data is typically captured, processed and stored by many organisations, including Government agencies (local, State and Federal), industry, e.g. housing data, and research and academic organisations. The data itself typically exists in multiple forms: point-based data, e.g. locations of public toilets; graph-based data, e.g. the road network or rivers; polygons, such as the boundaries of statistical areas or suburbs; through to data cubes that have temporal dimensions. It is important to note that there is no agreed schema or ontology that has been defined and accepted for the vast majority of urban data sets. Such data is typically longitudinal: cities evolve over time, so analyses should bear this in mind. Thus, whilst more crime may happen in major cities, researchers should take into account population growth (and many other factors that evolve over time).
Urban and built environment research is hugely dependent on data. Ideally the data should have a high degree of veracity and hence be trustworthy. The official data for urban environments comes from many Government agencies. These have their own silos of data stored in different formats and with different access protocols. In this context, AURIN was tasked with developing and supporting an infrastructure providing access to as much definitive data from as many authoritative data providers across Australia as possible. Importantly, rather than obtaining and making available an extract of the data from any given official organization, as typified by many open data initiatives such as https://www.data.vic.gov.au/, AURIN was tasked with providing live programmatic access to the definitive data sets wherever possible. This required support for federated data access, often exploiting bespoke solutions depending on the demands of the data provider. For example, AURIN could not mandate that the ABS install a particular software stack to allow access to their data. Rather, the team had to develop solutions aligned with the technologies and requirements of the ABS.
This paper describes the history of the AURIN infrastructure development. It covers the development of, and lessons learnt from, the early prototypes; the development and release of a more robust platform hosted on dedicated hardware running VMware; and the Cloud-based solution that was eventually rolled out on the National eResearch Collaboration Tools and Resources (NeCTAR - www.nectar.org.au) Research Cloud, including the demands for large scale data back-up across multiple availability zones.
The rest of the paper is structured as follows. Section 2 provides an overview of the challenges facing urban research and the original architecture of the AURIN infrastructure, and describes the key components. Section 3 introduces the next iteration of the infrastructure and how it leveraged container (Docker-based) technologies. Section 4 introduces the SUDO platform and shows how it can be used to tackle challenges such as national bushfires and their impact on air quality. Section 5 discusses related work. Finally, in Section 6 we conclude the paper and identify future areas of work.

BACKGROUND TO URBAN RESEARCH
Urban research is data-driven. As in many countries, multiple organisations across Australia hold data that is fundamental to urban research. Whilst several data providers offer some of their data for download from their websites, e.g. as PDF documents or CSV files, this is challenging from many perspectives. Firstly, researchers need to find, access, download and then analyse a myriad of differently formatted data. In particular, accessing data in a given situational context, e.g. data for the suburbs of Melbourne, requires users to locate, access and subsequently download the relevant data. Furthermore, urban research is typically based on combining different data sets to answer particular urban research questions, e.g. does gambling mostly impact richer or poorer suburbs? Answering such questions requires knowledge of where gambling takes place, e.g. the location of the pokies (using data from the Victorian Gambling and Casino Control Commission, https://www.vgccc.vic.gov.au/), and the average income of people in those areas (using data from the ABS).
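The gambling example above amounts to joining two data sets on a shared spatial key and then measuring their association. The following is a minimal sketch of that idea only. The suburb names and all values are invented for illustration and are not VGCCC or ABS figures, and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical illustrative values only (not real VGCCC or ABS data).
pokies_per_1000_adults = {"Suburb A": 9.1, "Suburb B": 6.4, "Suburb C": 2.2, "Suburb D": 1.0}
median_weekly_income = {"Suburb A": 620, "Suburb B": 750, "Suburb C": 1100,
                        "Suburb D": 1400, "Suburb E": 900}


def join_on_suburb(left, right):
    """Inner join of two {suburb: value} tables on the suburb name."""
    shared = sorted(set(left) & set(right))
    return [(name, left[name], right[name]) for name in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Only suburbs present in both tables survive the join ("Suburb E" is dropped).
rows = join_on_suburb(pokies_per_1000_adults, median_weekly_income)
r = pearson([row[1] for row in rows], [row[2] for row in rows])
print(f"{len(rows)} suburbs joined, r = {r:.2f}")
```

In practice the join key would be an official area code rather than a name, and a correlation alone would not account for confounders such as population size.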
This typifies many of the immediate challenges facing urban researchers. Ideally such researchers should be able to discover, access, use and analyse many official data sets as simply and as intuitively as possible, without needing to know the lower-level data formats of any given data set. Such data has a spatial and temporal extent, e.g. the boundaries for suburbs, local government areas or even the States and Territories. These change over time, e.g. the boundary of Greater Melbourne has changed drastically in the last decade with increased growth and new developments on the city boundaries. Factoring in such spatial and temporal extent is important to understand the research challenges facing cities. Such analysis requires many tools and capabilities, which typically include geographical information systems (GIS). However, many social scientists and urban researchers are unfamiliar with tools such as ArcGIS, hence there was a need for simplified web-based versions of many of these key tools.
This was the context of AURIN in 2010. After a year of requirements gathering, involving visits to many major urban and built environment research groups across Australia and multiple meetings with key data providers and associated stakeholders, it was identified that there was a need for a common unifying e-Infrastructure. This e-Infrastructure should unlock the data from key data providers and offer a range of visualization and analytical capabilities in a user friendly and intuitive environment. Wherever possible, there should be no requirement for users to have to read any manuals or deal with intricacies of the platform.
A core technical team within the Melbourne eResearch Group (MeG - www.eresearch.unimelb.edu.au) at the University of Melbourne was tasked with implementing the e-Infrastructure to be used for unlocking the data and providing the web-based tool collections.

AURIN Architecture (2010)
The original architecture of the AURIN prototype platform is depicted in Figure 1 (this was originally described in [3]). The model was based around the establishment of a web-based platform that could be used for a spectrum of urban and built environment research questions and scenarios. As originally outlined in [3], the architecture was intended for many use cases: examples include energy, water, transport, housing etc. It was originally envisaged that each community (energy, water, transport researchers) would have their own user interface(s) realized as portlets, together with their own back-end services and databases providing access to the key data of interest to that community. Importantly, it was expected that the collection of these interfaces and data solutions would ultimately be used to tackle multi-faceted and inter-disciplinary research questions.
The original AURIN e-Infrastructure front-end was based upon the Liferay (www.liferay.com) portal-based technology. Liferay offered a portal-based solution that could be used for developing diverse applications and services. The core user interface feature of Liferay was based around portlet technology. It also offered support for authentication and authorization. This front-end was deployed within the Australian Access Federation (AAF - www.aaf.edu.au). This allowed any researcher at an Australian university to log in using their institutional credentials. Over time the AAF has allowed non-University collaborators to access services deployed within the federation. This includes support for the virtual home organization (VHO) to allow non-academics to access and use resources in the federation such as AURIN.
The AURIN project sought to leverage open standards wherever possible. The Open Geospatial Consortium (OGC - www.opengeospatial.org) has established a range of interoperable services supporting access to and use of spatial data. These include Web Feature Services (WFS) for data querying/manipulation; Web Map Services (WMS) for creation of spatial images/maps; and Web Processing Services (WPS) for analytics and data processing scenarios. A GeoServer instance was deployed in the original prototype. This supported a variety of spatial capabilities that were used to integrate and display spatial data. The back-end was largely based on relational database technology (PostGIS - https://postgis.net). This system was deployed on the NeCTAR Research Cloud.
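To illustrate what a WFS query looks like in practice, the sketch below builds a WFS 2.0.0 GetFeature request using the standard key-value parameters. The endpoint URL and layer name are hypothetical, and the `application/json` output format is an assumption that holds for GeoServer but is not guaranteed by every WFS server.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def wfs_getfeature_url(base_url, type_name, bbox=None, max_features=None):
    """Build a WFS 2.0.0 GetFeature URL from key-value parameters."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        # Assumption: the server offers GeoJSON output (GeoServer does).
        "outputFormat": "application/json",
    }
    if bbox is not None:
        # Axis order and CRS handling are server dependent; this assumes
        # (min_lon, min_lat, max_lon, max_lat).
        params["bbox"] = ",".join(str(v) for v in bbox)
    if max_features is not None:
        params["count"] = str(max_features)
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, used purely for illustration.
url = wfs_getfeature_url(
    "https://example.org/geoserver/wfs",
    "demo:unemployment_sa2",
    bbox=(144.5, -38.2, 145.5, -37.5),
    max_features=10,
)
query = parse_qs(urlparse(url).query)
print(query["typeNames"], query["count"])
```

Fetching the URL with any HTTP client would then return the matching features, subject to the access controls of the data provider.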
This version of the platform allowed users to navigate around Australian boundaries to discover, access and analyse urban data in a variety of simple user-driven ways. Figure 2 shows a simple example of the AURIN portal at that time. This shows the correlation between unemployment and fluency in the English language.

Figure 2: Scenario showing Unemployment vs English Ability for Melbourne [3]

AURIN Architecture (2012-2020)
At the start of AURIN, it was planned that the e-Infrastructure would be hosted on the NeCTAR Research Cloud. However, at the time, the NeCTAR Cloud was considered too unstable for a major national platform for urban research. As a result, dedicated hardware (servers/storage) was purchased, and VMware deployed as the virtualization technology. This was hosted in the University of Melbourne data centre. This VMware-based solution ran for 8 years with less than a single week of outage. The infrastructure comprised four servers with Intel Xeon processors and 256 GB RAM, and one server with 384 GB RAM. The storage was delivered through EqualLogic and offered 15 TB of storage realised through 24 x 600 GB disks.
Building on the lessons learned in the first AURIN prototype, an enhanced architecture of the AURIN e-Infrastructure was designed and supported. This offered a range of core components as shown in Figure 3. These are summarized here, together with how they differed from the initial 2010-2012 solution.
• AURIN User Interface: this was delivered through a targeted research front-end offering seamless and secure access to all of the data sets and tools of AURIN. This user interface utilized a range of JavaScript libraries including Ext.js, Processing.js, Node.js and Ajax to provide a rich and interactive user experience. Importantly, the way users interacted with the front end was assessed. This included inviting users with no background or experience with AURIN to explore the front end and to try to answer some basic urban questions/scenarios. The front end was deployed as a service in the AAF.
• AURIN Edge Service: was used to provide finer grained authentication and authorisation capabilities. This was essential since many of the data sets that the e-Infrastructure was able to unlock had licensing restrictions imposed, e.g. only those that had been approved for access should be able to access/use such data. Similarly, many data sets were available for academic research only (and hence should be unavailable to anyone not from a university).
• AURIN Public API: was used to deliver a dedicated front-end for the internal services and components realised within the e-Infrastructure.
• AURIN Middleware: was used to support control flow (message bus) between the AURIN e-Infrastructure components. It supported targeted business logic for managing the user-specific interactions that took place when accessing and using data.
• AURIN Message Queue: was used for asynchronous communications within the platform. This allowed multiple requests from multiple (concurrent) users to be buffered when the system was under load.
• AURIN Data Registry: was used to provide detailed information on the data accessible through the system. This included both the data, e.g. variable names, and the associated metadata.
• AURIN Reporting Service: was used for tracking the usage of the system. This included capturing basic user statistics as well as information on the data sets and tools accessed and used by the urban research community.
• AURIN MapPrint: was used for capturing screen information. Typically, this involved capturing high resolution maps showing data displayed as choropleths.
• AURIN Geoinfo: this service was used to capture geographical boundaries at multiple resolutions. It was important that this component was performant, as the spatial information was voluminous and would majorly impact on the user experience when accessing and using the system.
• AURIN Geoclassification: was used for spatial aggregation. This included supporting the different graphs of spatial information and how they might change over time.
It is important to note that many of these tools were developed by AURIN-funded collaborators. There was a need to re-engineer them so that they could work within the platform. In many cases the tools had to be re-engineered completely, e.g. where they were delivered as R or Python scripts.
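The buffering role described for the AURIN Message Queue can be sketched with a bounded in-process queue and a small worker pool. This uses only the Python standard library as a stand-in. The source does not name the message broker that AURIN used, and the function and parameter names here are hypothetical.

```python
import queue
import threading


def run_buffered(requests, handler, workers=2, capacity=100):
    """Buffer (request_id, payload) pairs in a bounded queue and process
    them asynchronously with a fixed number of worker threads."""
    pending = queue.Queue(maxsize=capacity)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            try:
                if item is None:  # sentinel: no more work for this worker
                    return
                request_id, payload = item
                try:
                    outcome = handler(payload)
                except Exception as exc:  # keep the worker alive on failure
                    outcome = exc
                with lock:
                    results[request_id] = outcome
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for item in requests:
        pending.put(item)  # blocks while the buffer is full (back-pressure)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    pending.join()
    for thread in threads:
        thread.join()
    return results


# Five concurrent "data requests" handled by two workers.
results = run_buffered([(i, f"dataset-{i}") for i in range(5)], handler=str.upper)
print(results[0], len(results))
```

A production deployment would use a dedicated broker so that buffered requests survive a service restart. The bounded queue above only demonstrates the back-pressure behaviour.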
Figure 5 shows a representative scenario in the use of the AURIN platform. This shows data from the Victorian lung cancer registry (choropleth map). The darker polygons indicate areas (postcodes) with higher numbers of lung cancer patients; the centroids and bar chart show polluting companies and traffic volume respectively, as listed in the figure caption. Creating web-based visualisations and delivering the outputs to web browsers can have performance consequences, since the spatial geometry information (the polygons in Figure 5) can be voluminous. To deal with this, the platform supported spatial generalizations. That is, the polygons representing the spatial areas were reduced in their accuracy when displayed in the browser, since highly accurate boundaries are often not required. The back-end database supports the most accurate spatial representations of the Australian regions, however.
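The source does not state which generalization algorithm the platform used. As one common choice, the sketch below applies Douglas-Peucker simplification, which drops vertices that lie within a given tolerance of the simplified boundary. The coordinates are invented for illustration.

```python
from math import hypot


def _point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # degenerate segment, e.g. a closed ring
        return hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker simplification of a list of (x, y) vertices."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        return [first, last]  # every interior vertex is close enough to drop
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # avoid repeating the split vertex


# Hypothetical boundary: small wiggles along one edge, then a sharp corner.
boundary = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.02), (4, 0), (4, 3)]
print(simplify(boundary, tolerance=0.1))
# -> [(0, 0), (4, 0), (4, 3)]
```

A larger tolerance gives fewer vertices and smaller payloads for the browser, at the cost of boundary accuracy. The full-resolution geometry can stay in the back-end database.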
The AURIN e-Infrastructure was funded through the federal Government's National Collaborative Research Infrastructure Strategy (NCRIS - https://www.education.gov.au/ncris). This was targeted at Australian researchers, i.e. those at universities; however, the AURIN portal was increasingly adopted by Government-based researchers across Australia (at the local, State and Federal levels). Researchers from industry also increasingly utilised the portal. Approximately 15% of the AURIN portal usage stemmed from non-University collaborators.
The VMware based production system was developed and tested on the NeCTAR Research Cloud through a staged software development and delivery process based on continuous integration and continuous delivery.The team established three environments: a development (prototyping) environment where software development primarily took place, a pre-production environment for testing by end user clients, and a final production environment accessible to the broader research community.
The technologies and processes used to support the CI/CD requirements for AURIN are described in more detail in [4].

AURIN ARCHITECTURE (2020-2022)
The VMware-based platform was a very reliable resource that ran without major incident for 8 years (24/7). Eventually the servers reached their end of life. By then the NeCTAR Cloud had become far more stable and had rolled out container technologies (Docker). Public/commercial Clouds such as AWS and Azure were considered, but ultimately it was identified that these would be overly expensive. As such, it was decided to update and migrate the AURIN platform to the NeCTAR Research Cloud. Due to the volume of users and community expectations on robustness, there was a need to refactor the AURIN e-Infrastructure. This included three key aspects: firstly, supporting a horizontally scalable solution utilizing container technologies and their orchestration and management tools (Docker and Kubernetes) to scale the platform depending on system (user) load; secondly, supporting clustering of the CouchDB back-end database; and thirdly, supporting a failover deployment of the e-Infrastructure. Two NeCTAR Research Cloud availability zones were used for this purpose: Melbourne and Tasmania. The allocation made available by NeCTAR to AURIN comprised 510 vCPUs with a total of 2 TB RAM and 100 TB of disk volume store.
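The source does not describe the scaling rule that was configured for AURIN. As an illustration of load-based horizontal scaling, the sketch below follows the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current x currentMetric / targetMetric). The replica limits and metric values are hypothetical.

```python
from math import ceil


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA-style ratio rule.

    current_metric and target_metric are in the same unit, e.g. average
    CPU utilisation (%) across the running containers.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not rescale
    wanted = ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical load figures, for illustration only.
print(desired_replicas(3, current_metric=90, target_metric=60))   # -> 5
print(desired_replicas(4, current_metric=30, target_metric=60))   # -> 2
print(desired_replicas(8, current_metric=120, target_metric=60))  # -> 10 (capped)
```

The tolerance band avoids repeated scaling up and down when the load hovers around the target.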
From 2012 to 2020 a single instance of PostGIS and a single instance of CouchDB were used for the AURIN production environment. It was identified that a clustered solution would offer greater resilience, i.e. a given server might fail, but the platform would continue to function and service requests whilst a new server and database node was created and added to the cluster. PostGIS, CouchDB and Elasticsearch were clustered and used attached (persistent) storage offered through NeCTAR (volume store).
Figure 6 shows the architecture of the AURIN platform that was deployed in 2020. As seen, the platform was deployed across multiple nodes (servers) of the NeCTAR Research Cloud. To support the Docker deployment, a Weave network was established to create a virtual network of Docker containers across multiple hosts. The 2020+ version of the platform also included a range of targeted services to support scaling, including for example load balancers.

INTRODUCING THE SPATIAL URBAN DATA OBSERVATORY
Unfortunately, despite the widespread adoption and use of the AURIN portal, the platform was decommissioned in 2022. Given this adoption, the MeG team have since established a new version of the platform: the Spatial Urban Data Observatory (SUDO - https://sudo.eresearch.unimelb.edu.au). This platform now comprises over 8500 data sets from over 150 major organisations. More recent additions to SUDO include the most recent Census data from the ABS (2021), air quality data from national air quality data providers, e.g. the Environment Protection Authority in Victoria, and bushfire related data sets from official agencies such as Geoscience Australia and the Australian Institute of Health and Welfare. Because the code base continues to have legacy issues, the platform is undergoing major redevelopment. The MeG team are in the process of migrating the solution to GeoNode (https://geonode.org) as the core technology for the SUDO platform. GeoNode provides a GIS-based solution that tackles many of the discovery and visualization demands facing urban researchers, including support for an OGC-compliant API. Notebooks, e.g. using Jupyter technology and the Cloud-based Binder service, are being rolled out to allow researchers to develop their own solutions with SUDO data, rather than relying on the pre-canned set of capabilities offered through the AURIN portal. This will provide much richer analytical possibilities.
The SUDO system has again been deployed within the Australian Access Federation so that it is accessible to any Australian academic. It is also deployed on the Melbourne Research Cloud (https://dashboard.cloud.unimelb.edu.au), which is available free of charge to University of Melbourne academics. Figure 7 shows an example of the GeoNode user interface and the kinds of data sets that are made available. Specifically, Figure 7 shows data from the National Air Quality system, where data from a multitude of State-based air quality agencies is aggregated and made available for research communities. These agencies include the ACT Government, the NSW Dept of Planning and Environment, the Queensland Government Department of Environment and Science, and the VIC Environment Protection Authority. Figure 7 shows the levels of pollution around the Black Summer Bushfire (from end December 2019 to end February 2021). This data is made accessible through SUDO. Figure 8 shows an example of the kind of analytics that are possible using data made available through the GeoNode API. In this case the focus is on analysis of the national bushfire data sets, where the size of bushfires (given in square kilometres) is shown alongside the prescribed burns across Australia. As can be seen, the number of prescribed burns is increasing; however, the severity of the bushfires is also increasing. This data is also made available through the SUDO platform using an OGC-compliant API.
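A notebook analysis of this kind typically reduces the features returned by the API to a per-year summary. The sketch below assumes GeoJSON-style features with hypothetical property names (`year`, `fire_type`, `area_sqkm`). The actual schema of the national bushfire data sets is not given in this paper, and the sample values are invented.

```python
from collections import defaultdict


def summarise_by_year(features):
    """Total bushfire area and count of prescribed burns for each year."""
    summary = defaultdict(lambda: {"bushfire_sqkm": 0.0, "prescribed_burns": 0})
    for feature in features:
        props = feature["properties"]
        year = props["year"]
        if props["fire_type"] == "prescribed":
            summary[year]["prescribed_burns"] += 1
        else:
            summary[year]["bushfire_sqkm"] += props["area_sqkm"]
    return dict(sorted(summary.items()))


# Invented sample features, standing in for an API response.
sample = [
    {"properties": {"year": 2018, "fire_type": "bushfire", "area_sqkm": 120.0}},
    {"properties": {"year": 2018, "fire_type": "prescribed", "area_sqkm": 3.5}},
    {"properties": {"year": 2019, "fire_type": "bushfire", "area_sqkm": 900.0}},
    {"properties": {"year": 2019, "fire_type": "bushfire", "area_sqkm": 450.0}},
    {"properties": {"year": 2019, "fire_type": "prescribed", "area_sqkm": 2.0}},
    {"properties": {"year": 2019, "fire_type": "prescribed", "area_sqkm": 4.0}},
]
for year, totals in summarise_by_year(sample).items():
    print(year, totals)
```

The resulting table can be plotted directly, for example as bars for bushfire area with a line for the number of prescribed burns.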

RELATED WORK
The AURIN e-Infrastructure was somewhat unique in its scope. Whilst there are many GIS based tools available, e.g. ArcGIS and QGIS, live programmatic access to official/definitive data sets from the multitude of agencies and stakeholders across Australia was not supported at that time.
Several urban analytics platforms have been developed and supported by others internationally; however, few have gained sustained traction or provided access to as much online definitive data in a seamless and secure manner. Examples of such international platforms include the Spatial Urban Data System (SUDS) [5]. The SUDS platform focused on data generated from traditional and emerging sources of urban data, and especially the generation and use of regularly updated spatially-activated urban area metrics using real or near-real time data sources.
Live Singapore [6] provides an open platform supporting collection and use of diverse real-time data originating in Singapore. This includes social media data. [7] provides a web-based platform to access data generated from diverse sensors installed in European cities. These are used to give a real time understanding of city activities. [8] describes smart cities and attempts to classify what they are and what data and tools can be used to support them.
Social media is used extensively in the urban context. This is used to understand many diverse human and city behaviours, e.g. public sentiment and movement patterns [9].
The Australian Data Observatory (ADO) platform (https://ado.eresearch.unimelb.edu.au) provides the social media aggregation and analytics platform for Australia. The ADO platform supports several key capabilities. These include social media aggregation at diverse spatial aggregation levels, and topic modelling leveraging technologies such as the Bidirectional Encoder Representations from Transformers (BERT) [10]. ADO also offers term search and analysis, and support for subsequent data download and raw data access. The ADO platform focuses specifically upon Twitter (now X), Reddit, Flickr, Foursquare, YouTube comments and Mastodon. It also targeted Instagram posts; however, these APIs have since been removed by Facebook/Meta. Similarly, the Twitter (now X) APIs are no longer directly accessible. Nevertheless, such social media data can provide a real time pulse of what is happening in cities, compared for example to more traditional Government data sets such as the national Census conducted by the ABS every 5 years. The volume and velocity of such data demands unique Cloud capabilities that stretch typical infrastructures. Licensing is also challenging in this context, since ADO cannot provide direct programmatic access to the social media data. Instead, access to handles (identifiers) to the original raw data is provided, along with tools and guides to programmatically access the associated data from the social media platforms.
[11] provides an overview of urban data analytics platforms. An overview of mechanisms by which data reuse was supported through AURIN is presented in [12,13].
In terms of benchmarking, the AURIN workflow tools were benchmarked in [14]. Here it was shown how virtual machine scaling could be supported based on the needs and demands of users. [15] demonstrated how one key component of the AURIN platform could scale horizontally under user load. Again, this was based on creation and deployment of virtual machines.
More recently, it has become widely accepted that container technologies are the de facto choice for auto-scaling, due to speed and performance considerations. [16] compared Docker Swarm and Kubernetes for auto-scaling the AURIN walkability tool. Importantly, it was demonstrated that the Docker-based walkability solution could cope under load and support processing of larger data sets, which would cause out-of-memory errors with other vertical scaling approaches.

CONCLUSIONS AND FUTURE WORK
This paper has presented an overview of the architectural evolution of a national urban and built environment research platform. This platform had a large user base with demands for continued robustness and delivery. To tackle the next generation of problems, the project moved to a container-based solution and adopted container management and orchestration tools.
The Cloud-based system was finalized and delivered in June 2020 but ultimately was decommissioned in 2022. It is noted that there were eight different Directors of AURIN over the time that the platform was running. The AURIN project now has its tenth Director in place and is currently deciding on the future technologies and data sets needed by urban researchers.
The MeG team continue to host and run the SUDO platform (https://sudo.eresearch.unimelb.edu.au). This platform now provides access to over 8500 unique data sets from over 150 major organisations, including the most recent ABS 2021 Census data. This system is used by many academics at The University of Melbourne for teaching and by many external academics as a source of data.
One major challenge facing the AURIN portal, and hence SUDO, is the code base. This comprises over 3 million lines of code. Given this, the MeG team are moving to GeoNode as the core platform. This provides key capabilities including data search and discovery, metadata management, and visualisation, but only basic support for data analytics. To tackle this, the team are developing targeted notebooks delivered through the Binder service on the Melbourne Research Cloud. The GeoNode solution offers an OGC-compliant API that allows for rich and diverse analytics. An example of this is shown in Figure 8.

Figure 4 shows an example depicting navigation of the Victorian Statistical Areas (SA4…SA1) for Melbourne.
• AURIN Datastore: building on the experience of the early prototype and the limitations of PostGIS, the project moved to a NoSQL store based on CouchDB. Instead of dealing with relational data (as per PostGIS), the CouchDB solution was able to deal with far more heterogeneous data sets.
• AURIN Data Provider Service: was used to interface to external data providers. It acted as a client on behalf of user requests when requesting access to (shopping for) remote data. This component included support for ReST-based services, WFS-based services, and statistical services such as the Statistical Data and Metadata eXchange (SDMX) services offered by the ABS.
• AURIN Workflow Engine: provided homogenised support for an extensive range of analytical, statistical and visualization tools.

Figure 5: Lung Cancer Data, Polluting Companies and Traffic Volume visualized as a Choropleth Map, Centroids and Bar Chart (2012-2020)

Figure 7: National Air Quality Data Sets Accessed and Visualised through GeoNode

Figure 8: Analysis of Prescribed Burns and Bushfires based on National Bushfire Data using Jupyter notebooks deployed to the MRC Cloud