Open data journalism: a domain mapping review

Journalism is highly important in modern societies, both for shaping public opinion and as a gatekeeper of transparency and accountability, and is often regarded as ‘the fourth constitutional power’. The increasing dissemination of digital technologies and data has had a profound impact on journalism, giving rise to ‘data journalism’. The advent of open data offers journalism great opportunities to gain better insight into important social problems and government activities and, on that basis, to produce high-quality ‘evidence-based’ journalistic articles and work in general, giving rise to ‘open data journalism’. Even though this is of great importance for our societies, and journalists are a very special and important user group in open data research, there is not much research literature about it. This paper conducts a systematic literature review of scientific research on open data journalism and provides a ‘map’ of this important domain, which includes the main research themes that have been investigated; based on them, three main future research directions are formulated, and appropriate theoretical foundations are proposed for each. The findings reveal two main issues: the need for journalists to develop new competencies in digital technologies and data analysis, and the necessity of co-operating with other actors for greater exploitation and more sophisticated analysis of open data, which might result in the development of open data journalism ecosystems.


INTRODUCTION
Journalism is highly important in modern societies, both for shaping public opinion and as a gatekeeper of transparency and accountability, and is often regarded as 'the fourth constitutional power' [1]. The increasing dissemination of digital technologies and data has had a profound impact on journalism, giving rise to 'data journalism' [2]. Data are pivotal to the journalistic process because they enable journalists to discover unshakable evidence and support their publications with it. With solid evidence in their hands, journalists can monitor the government and other institutions more effectively and strengthen their role as guardians of transparency and accountability in democratic societies. Data journalism has emerged as a new evolution of journalism in recent years; Veglis and Bratsas (2017) [1] define it as the use of data in all the stages of the journalistic process: the extraction of information, its compilation, and its visualisation in a comprehensive way. As experts have reported, this new form of journalism has grown in popularity in recent years [2].
The first known use of data for journalistic purposes dates back to the 19th century, when a leaked data set, in the form of a table on the state of schools in Manchester, was published in the Manchester Guardian newspaper in 1821 [3]. In 1859 came the first use of infographics in a news article, which concerned the mortality rate from infectious diseases in the British army during the Crimean War [4]. The use of computers for journalistic purposes would take another century to emerge: computers were first used in 1952 to forecast the results of the USA presidential election [5]. This use of computers in the journalistic process was named computer-assisted reporting (CAR). Although this innovative attempt was successful, the new technology would take one more decade to be widely adopted. The next innovative step is attributed to Philip Meyer, a Detroit Free Press journalist who used computers to analyse data from the riots in Detroit during the late 1960s [6]. It was during the 2000s, however, that computers became integrated into working environments, and this led to the emergence of a new term: data journalism. The term was first used by Simon Rogers in 2008 in the Guardian Insider Blog [7]. While CAR is viewed as a technique for gathering and analysing data to enhance investigative journalism, data journalism refers to using data throughout the journalistic process [6]. Another difference between them is the availability of data: in the past, journalists focused more on collecting data and creating their own data sources, whereas in the age of data journalism the high availability of data has shifted the focus from collection to analysis and presentation.
The abundance of data we experience today, which contributed to the emergence of data journalism, can be attributed in part to the rise of open data initiatives [8]. Open data is defined as "Data that is available free of charge, openly licensed, and in an open, machine-readable format" [9], and the term was used for the first time in the publication "On the Full and Open Exchange of Scientific Data" in 1995 [10]. The open data movement needed almost another decade after this first mention to gain wider recognition, and two milestones were important for its wide adoption: the launch of the Open Knowledge Foundation (https://okfn.org) in 2004 and the launch of Data.gov by the USA government in 2009. After that, other countries joined the open data movement and started to release their data to increase innovation and transparency. This progression significantly increased the data sources available to journalists [11].
Open data and journalism have each been well explored in academia, but there is a research gap concerning their combination: 'open data journalism', understood as data journalism that uses open data.
This lack of research on the intersection of these two topics is surprising, since journalism and open data have several aims in common, such as making information more accessible to the public and enforcing transparency. The increasing availability of data in the past few decades, and their practical use by several big news organisations, make the absence of research on open data journalism even more peculiar. As open data policies continue to be adopted by governments worldwide [12], journalism will have the opportunity to benefit greatly from them and to become an even more critical pillar of transparency and accountability in the years to come. Studying the joint research on open data and journalism is therefore essential, since these two topics create a synergy that can improve democratic function. This paper aims to support and facilitate this research by conducting a systematic literature review (SLR) [13], [14] of scientific research on open data journalism and by providing a 'map' of this important domain, which includes the main research themes that have been investigated. Our research question is: RQ: "What are the main themes under examination in the published papers on open data journalism?" Based on the identified research themes, existing research gaps are determined, and future research directions are proposed together with appropriate theoretical foundations for each of them.
Our paper consists of five sections. Section 2 describes the methodology of our literature review. Section 3 presents the results, which are discussed in section 4, while section 5 summarises the conclusions and proposes future research directions.

METHODOLOGY
The literature review method used for the analysis of the previous literature on open data journalism and the identification of the main themes is a systematic literature review (SLR) [13], [14]; so our review process includes formulating the research question, defining the inclusion and exclusion criteria for the literature search, and then searching for publications with a formal process with the use of strictly defined keywords and screening the papers according to the defined inclusion and exclusion criteria.

Table 1: Inclusion and exclusion criteria
Peer review: The documents must be peer-reviewed.
Language: The publications must be in the English language.
Time frame: The publications must be published between 2010 and 2022.

Defining the Inclusion and Exclusion Criteria
The inclusion and exclusion criteria aim to ensure that only publications relevant to the research topic are found and analysed; they are listed in Table 1. Each publication must focus on open data as well as on journalism. The publications must be peer-reviewed to ensure the quality of our findings, and they must be in English so that we can understand their content. Finally, we selected a time frame for the publications (2010-2022).

Discovering the Literature and Applying the Screening Process
The selected keywords for our literature search correspond to our two main focuses, journalism and open data, and are shown in Table 2. For the former, "journalism", "journalists", and "journalist" were used (other keywords, such as "media" or "reporting", were deemed unsuitable since their results were not related only to journalism). For the latter, the term "open government data" was used along with "open data": the scientific paper databases we used returned only exact matches, so a search on "open data" alone did not return publications containing the term "open government data".
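The keyword combination described above can be sketched as a small query builder. This is only an illustration: the boolean syntax (quoted phrases joined with OR and AND) and the helper names are assumptions, since the actual queries are those given in Table 2, and `exact_phrase_match` is a hypothetical stand-in for the exact-match behaviour of the databases.

```python
# Keyword groups named in the text.
JOURNALISM_TERMS = ["journalism", "journalists", "journalist"]
OPEN_DATA_TERMS = ["open data", "open government data"]


def or_group(terms):
    """Join quoted phrases with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(first_group, second_group):
    """Require at least one term from each group (assumed AND semantics)."""
    return f"{or_group(first_group)} AND {or_group(second_group)}"


def exact_phrase_match(phrase, text):
    """Hypothetical exact-phrase matching: the phrase must appear contiguously."""
    return phrase.lower() in text.lower()


QUERY = build_query(JOURNALISM_TERMS, OPEN_DATA_TERMS)
print(QUERY)
```

Under exact-phrase matching, a title containing only "open government data" does not match the phrase "open data", which is why both terms appear in the second group.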
Applying the above queries to the four databases yielded 131 publications. The next step was to remove the duplicates and filter out the publications that were not in English, which left 82 publications (excluding 34 duplicates, as well as 11 publications in Spanish, 2 in Portuguese, 1 in Turkish and 1 in French). We then screened these 82 documents on the basis of their titles and abstracts, applying the rest of the exclusion criteria, and ended up with 45. The details of the whole process are presented in Figure 1.
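The counts reported for the screening process can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the function and key names are illustrative assumptions.

```python
def screening_funnel(retrieved, duplicates, non_english, included):
    """Return the number of records remaining or removed at each screening step."""
    after_language = retrieved - duplicates - sum(non_english.values())
    if included > after_language:
        raise ValueError("more papers included than were screened")
    return {
        "retrieved": retrieved,
        "after_duplicates_and_language": after_language,
        "excluded_on_title_and_abstract": after_language - included,
        "included": included,
    }


# Numbers as reported in the text.
FUNNEL = screening_funnel(
    retrieved=131,
    duplicates=34,
    non_english={"Spanish": 11, "Portuguese": 2, "Turkish": 1, "French": 1},
    included=45,
)
print(FUNNEL)
```

The intermediate value reproduces the 82 publications mentioned above, and the difference between 82 and 45 gives the number of papers removed during title and abstract screening.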
From the papers that met the inclusion criteria, which are shown in the second column of Table 3, we extracted the research themes they investigated using an open-coding approach [15]. In particular, for each paper we read the title and the abstract (and, if required, some parts of its text) to understand the central theme it investigates, and then generalised it. We next compared it with the previously identified themes: if it belonged to one of them, we associated the paper with that theme, while if it was new, we expressed it concisely and abstractly in 3-5 words and added it to the list of identified themes.
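The bookkeeping behind this open-coding loop can be sketched as follows. The judgement of whether a paper belongs to an existing theme is made by the reviewer and cannot be automated; here it is replaced, purely as an assumption, by comparing normalised theme labels. The paper identifiers and the function names are hypothetical.

```python
def normalise(label):
    """Lower-case a label and collapse whitespace so equivalent labels compare equal."""
    return " ".join(label.lower().split())


def assign_themes(coded_papers):
    """Group papers by theme, creating a new theme when a label is seen for the first time.

    coded_papers: iterable of (paper_id, label) pairs, where label is the short
    theme label the reviewer wrote after reading the title and abstract.
    """
    themes = {}  # normalised label -> list of paper ids, in order of first appearance
    for paper_id, label in coded_papers:
        themes.setdefault(normalise(label), []).append(paper_id)
    return themes


# Hypothetical paper identifiers, for illustration only.
EXAMPLE = assign_themes([
    ("P1", "Tools for Journalists"),
    ("P2", "Collaboration"),
    ("P3", "tools for  journalists"),
])
print(EXAMPLE)
```

In practice the comparison step relies on the reviewer's reading of each paper, so the string comparison above should be read as a placeholder for that decision.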

RESULTS
Using the methodology described in the previous section, nine research themes were identified; they are shown in Table 3 and analysed in sub-sections 3.1-3.9.

Tools for Journalists
The publications of this theme describe software tools developed to help journalists enhance their work using open data. Two main categories were detected: tools that focus on data analysis and support the journalistic process in its entirety, and tools that provide the ability to discover and evaluate news instantly. Most of the publications belong to the first category, describing software tools that cover all parts of the journalistic process, such as data collection, data analysis, visualisation and presentation.
An interesting study that focuses on visualisation concerns the case of the Indian elections [31]. The goal of the tool developed and presented there was to display voting in India in a way that the public could easily understand. The main problems the authors faced were the immense volume of votes to be displayed and the need to provide a way to navigate it. Concerning another aspect of the journalistic process, some publications focus on tools designed to make data analysis easier. An example is The Gamma [29], a low-code tool that journalists can use to analyse public data without prior programming knowledge. Concerning discoverability, some publications present tools such as GovWild [26], which can integrate various data sources and provide the functionality of searching all of them. In particular, GovWild can combine and clean open government data sets and then produce one linked open dataset. In the other category of tools, those designed to discover news, Gottron et al. (2015) [24] present a system that can help journalists find geospatial-related data in real time.

Impact of Open Data on Journalism
Only three papers focused on the impact of digital technologies on journalism. One paper focuses on the use of big data [34] and how this technology will change journalism, since it will provide the ability to access and process great volumes of information and therefore offer new, deep insights into important social problems and government activities. Another paper focuses on interactive maps [35] and their potential to visualise complicated data topics comprehensively. Finally, one publication addresses the impact of artificial intelligence on journalism [36], presenting a variety of potential uses such as fake news detection and voice-to-text software.

Open Data Journalism Practices
The Open Data Journalism Practices theme contains publications focusing on practices of open data use in the day-to-day work of journalists. Almost all the papers on this theme mention journalists' lack of technical skills. An interesting finding related to that problem is that this lack of skills acts as a barrier to the exploitation of the open data that governments publish and discourages journalists from getting involved in data-related activities [37]. Another widely mentioned problem is the availability of open data sources. An examination of the state of data journalism in Italy reports that journalists struggle to find data and that, in many cases, the data they obtain are of poor quality [38]. Finally, another interesting finding came from the examination of articles published during Women's Day in Brazil in 2017 [39]. This publication analysed all the articles of that year published by three major Brazilian newspapers to evaluate the use of data in journalism. The findings indicated that in most of the publications the journalists did not provide data sources or mention where they found their infographics.

Journalists' Data Literacy
Four papers focused on journalists' skills and abilities, as well as their education, concerning the use of data, especially open data. In this theme, two papers focus on formal education. The first compares educational programmes on data journalism in European countries [42], and the second focuses on the reforms that must be made to data analysis curricula to accommodate the current increase in data availability [43]. An interesting publication is "On Some Russian Educational Projects in Open Data and Data Journalism" [44], which describes a series of data journalism workshops conducted in Russia. It notes that the workshops covered only introductory material and argues that more specialised workshops must be organised. Finally, one publication does not deal with formal education; its main argument is that technical skills are not enough and that experience in statistics is also required for someone to be open data literate [41].

Collaboration
The papers categorised in the collaboration theme can be divided into two categories: collaboration with the public and collaboration with other professionals. In the former category, the main argument of the publications is that such collaboration can promote accountability and transparency, since citizens are directly involved in the data analysis and can have more experience than journalists in specialised topics. For example, [47] analyses an investigative journalism story run by The Guardian that combined open data, crowdsourcing and game mechanics to engage readers. In the latter category, the main focus is on addressing journalists' data literacy problems. The papers analyse several cases of collaboration between journalists and other professions, as well as ways to find experts outside journalism and interest them in data analysis and visualisation for journalistic purposes. Interesting cases are the use of hackathons to find experts in the technology sector [49] and the involvement of civil activists and open data hackers [50].

Legislation and Ethics
Two of the publications categorised in this theme focus on comparing legislation. The publication "The Impact of Public Transparency Infrastructure on Data Journalism" [52] presents a comparative study between countries with respect to the transparency level they achieve through different legislation for data accessibility; it also includes a comparison between the Right for Information Act and the Open Data government. The research examines to what extent these acts are applied and their impact on the data available to journalists. The publication "The Right to Know Through the Freedom of Information and Open Data" [53] is a comparative analysis of the Freedom of Information Act and the Open Data regulations in five countries. Finally, one publication focuses on the ethical issues that can arise when using open data [51], especially the potential harm from de-anonymising data and exposing people's identities.

Cases of Open Data Use for Journalism
The four papers of this theme examine cases of journalistic articles that use open data in several countries, but the focus of each is different. In [56], the focus is on how the news media used data in their reporting during the NHS (National Health Service, UK) winter crisis of 2016-2017. Another interesting case is that of open parliamentary data in Norway, Sweden and Denmark [55], where journalists and other actors using open data struggle to find appropriate, valuable datasets; the paper also mentions the problem of data fragmentation across various governmental agencies and the need for national open data repositories.

Communication Methods
Two papers were categorised in this theme. The first paper [59] uses the 'mediated data model of communication flow' to shed light on the current communication process between journalists/media and their initial sources of digital information, using big data as a case study. The second paper focuses on the different visualisation techniques that can be applied to present and communicate complex and linked datasets [58]. Unlike the other publications examining visualisation techniques, it focuses on methodologies rather than technical specifications.

Software Tools for the Public
Only one paper was categorised in this theme [60]. It presents a software tool that citizens can use to access open data about legislation in Brazil according to their political preferences and interests, avoiding the filtering that journalists quite often perform based on the media agenda. It argues that journalists cannot be the gatekeepers of information, as they can be prone to biases, and that the public therefore has to bypass them and have direct access to data.

DISCUSSION
By examining the nine research themes we identified in the open data journalism domain, and the number of papers in each theme, which are shown in Table 3, the first conclusion that can be drawn is that this research domain is still in its infancy: more than half of the 45 publications we found are descriptions of software tools to be used by journalists (18 publications in the largest thematic category, 'Software Tools for Journalists') or by ordinary citizens for similar purposes (1 publication in the 'Software Tools for the Public' thematic category), or are relevant case studies (4 publications in the 'Cases of Open Data Use for Journalism' thematic category). Much smaller is the number of papers concerning the 'core' of this domain: the ways and practices of open data use by journalists (4 publications in the 'Open Data Journalism Practices' thematic category) and the impact of open data on journalism (3 publications in the 'Impact of Open Data on Journalism' thematic category). Furthermore, most of the publications we found are based on case studies or interviews with small numbers of journalists (or other stakeholders); larger-scale surveys with many participants, which would provide more generalisable conclusions, are missing. Most of the publications also lack sound theoretical foundations.
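The "more than half" statement can be verified from the per-theme counts. The tally below is a sketch: seven of the counts are stated in the Discussion or Results, while the values for 'Legislation and Ethics' (3) and 'Communication Methods' (2) are read off the number of papers described in the corresponding Results sub-sections.

```python
# Publications per theme; see the note above on which values are inferred.
THEME_COUNTS = {
    "Software Tools for Journalists": 18,
    "Collaboration": 6,
    "Open Data Journalism Practices": 4,
    "Journalists' Data Literacy": 4,
    "Cases of Open Data Use for Journalism": 4,
    "Impact of Open Data on Journalism": 3,
    "Legislation and Ethics": 3,   # inferred from the three papers described
    "Communication Methods": 2,    # "Two papers were categorised in this theme"
    "Software Tools for the Public": 1,
}

# Themes consisting of tool descriptions or case studies.
DESCRIPTIVE_THEMES = [
    "Software Tools for Journalists",
    "Software Tools for the Public",
    "Cases of Open Data Use for Journalism",
]

TOTAL = sum(THEME_COUNTS.values())
DESCRIPTIVE = sum(THEME_COUNTS[name] for name in DESCRIPTIVE_THEMES)
print(f"{DESCRIPTIVE} of {TOTAL} publications ({DESCRIPTIVE / TOTAL:.0%})")
```

With these counts the nine themes sum to the 45 selected publications, and the tool and case-study themes together account for 23 of them.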
An important share of the open data journalism publications we found (the 4 publications of the 'Journalists' Data Literacy' thematic category and the 6 publications of the 'Collaboration' one) deal with an inherent 'structural' problem of open data journalism: in order to find and use open data effectively, and to extract deep insights and knowledge from them, journalists need substantial technological and statistical data processing skills, which the vast majority of them do not have, and this constitutes a big barrier to open data exploitation by journalists. On the one hand, relevant education of journalists is required; on the other hand, journalists have to co-operate extensively with technological and statistical experts, which might result in the development of open data journalism ecosystems. It should be noted that many publications of the other thematic categories also mention the technical skills that journalists need in order to find and use open data properly; this was more prominent in the 'Open Data Journalism Practices' and the 'Cases of Open Data Use for Journalism' thematic categories. Many publications also propose possible solutions to this problem. In publications of the 'Software Tools for Journalists' and 'Collaboration' thematic categories, we mostly encountered ways in which journalists can use open data without acquiring deep technical knowledge. This can be achieved with easy-to-use tools that help with open data search, collection, analysis and visualisation; furthermore, in the 'Collaboration' thematic category, the overwhelmingly mentioned solution was to work closely with other experts who can cover journalists' skill gaps.
From a more detailed examination of the publications we found, it can also be concluded that the themes 'Open Data Journalism Practices' and 'Cases of Open Data Use for Journalism' can be considered 'different sides of the same coin', since both focus on the adoption of open data in journalism, although they also differ significantly. Most publications in the 'Open Data Journalism Practices' theme focus on journalists as individuals, and their research methodologies are primarily interviews and secondarily surveys. The publications in the 'Cases of Open Data Use for Journalism' theme, on the other hand, are mainly case studies that focus not on journalists but on the results of their work with open data. However, the number of publications in this category is quite limited (only 4). More case studies are needed, since they give better insight not only into the adoption of open data in the journalistic profession but also into the impact that open data can have on the profession and on society in general (to what extent the use of open data by journalists can lead to higher-quality 'evidence-based' journalism, and to what extent the use of open data, or conclusions drawn from them, in journalistic articles can make these articles more influential with the public).
Another interesting finding from the exclusion process was that 17% of the papers we discovered were not in English; we consider this quite remarkable, since English is the predominant language in academic publications. 12% of the papers were in Spanish and 2% in Portuguese, published mostly in countries in South America. We do not have proficient knowledge of Spanish, so we could not read these publications, but their volume led us to assume that there is a separate ecosystem in South America researching the topic of open data journalism, which can be another source of research inspiration.
Finally, based on the above review and mapping of open data journalism scientific research, and the research gaps we have identified, we can formulate three main directions of future research in this domain, together with appropriate theoretical foundations for each of them, which can contribute to increasing its maturity. They include qualitative studies (based on in-depth interviews and focus groups) as well as quantitative studies (based on large-scale surveys) on:
RD1: Perceptions of journalists and other stakeholders about various aspects of open data, such as ease of use and usefulness, based on the 'Technology Acceptance Model' (TAM) [61][62][63]; or relative advantage, complexity, compatibility, trialability and observability, based on the 'Diffusion of Innovation' (DOI) theory [64][65][66]; or performance expectancy, effort expectancy, facilitating conditions, social influence and hedonic motivation, based on the Unified Theory of Acceptance and Use of Technology (UTAUT) [67]; and also attitudes and intentions towards the use of open data for journalistic purposes, as well as real use (if it exists).
RD2: Ways and forms of open data use for various kinds of journalistic work, including investigative journalism, as well as barriers and difficulties; more generally, investigation of the 'positive affordances' and the possible 'negative affordances' of the use of open data in journalism, both 'perceived' and 'actualized', through the lens of 'affordances theory' [68][69][70], exploiting the methods and knowledge developed in this area.
RD3: Benefits and impact of open data use for various kinds of journalistic work, concerning its efficiency, effectiveness and innovation, at various levels: individual, organizational (examining various types of journalism organizations, such as newspapers, news portals, etc.) and sectoral (i.e. for journalism in general); for these purposes, a useful theoretical foundation can be DeLone and McLean's Information Systems Success Model [71][72][73].

CONCLUSIONS
In this paper, we used the systematic literature review methodology [13][14] to review the academic literature on the intersection of journalism with open data, or open data journalism, and to develop a thematic map of this domain. We searched four scientific databases and, after excluding some papers based on pre-defined criteria, ended up with 45 publications, which were analysed and grouped into thematic categories in order to determine the main research themes of the open data journalism domain. Nine research themes were identified, the largest of them (i.e. those with the highest numbers of publications) being 'Software Tools for Journalists' and 'Collaboration' (between journalists and technological and statistical experts). By examining these publications and the research themes we identified, it can be concluded that this research domain is still in its infancy, as more than half of the 45 publications we found are either descriptions of relevant software tools to be used by journalists or ordinary citizens, or relevant case studies. Furthermore, research gaps have been identified; based on them, three main future research directions have been formulated, and appropriate theoretical foundations have been proposed for each research direction. The thematic map of the open data journalism domain developed in this study can be very useful for gaining an overall picture of this domain, as well as for supporting and facilitating the extensive future research required on this topic, while the identified research gaps and the proposed future research directions can be very useful for orienting this research; furthermore, the theoretical foundations we have proposed can enhance the quality, completeness, reliability and, ultimately, usefulness of this research. As mentioned in previous sections, the first step in our research endeavour was to discover whether there are other literature reviews on open data journalism and, to the best of our knowledge, there are none; therefore, our work can be used as a guide for other researchers and professionals involved in open data journalism.
Further research is required to gain a better understanding of the evolution of open data journalism research over time, the countries in which it has been conducted, the research approaches and methods it has used, and its theoretical foundations (for the limited number of publications that have some theoretical foundation). Finally, it is necessary to investigate further the barriers and benefits that journalists encounter when using open data, as identified in the literature, as well as the proposed ways and actions for addressing the former and increasing the latter.

Table 2: Search queries

Table 3: Thematic classification of the selected publications