Cultivating Spoken Language Technologies for Unwritten Languages

We report on community-centered, collaborative research that weaves together HCI, natural language processing, linguistics, and design insights to develop spoken language technologies for unwritten languages. Across three visits to a Banjara farming community in India, we use participatory, technical, and creative methods to engage community members, collect spoken language photo annotations, and develop an information retrieval (IR) system. Drawing on orality theory, we interrogate assumptions and biases of current speech interfaces and create a simple application that leverages our IR system to match fluidly spoken queries with recorded annotations and surface corresponding photos. In-situ evaluations show how our novel approach returns reliable results and inspired the co-creation of media retrieval use-cases that are more appropriate in oral contexts. The very low (< 4h) spoken data requirements make our approach adaptable to other contexts where languages are unwritten or have no digital language resources available.


INTRODUCTION
In this paper we report and reflect on three phases of a project to cultivate speech and language technologies in collaboration with a traditional farming community of Banjaras in Western India, who speak Gormati, a language without a native script.
Linguistic research involving minoritised language communities is often targeted at documenting or preserving a language in the face of worrying statistics that as many as 40% of the 7,000+ languages spoken today are endangered and likely to become extinct by 2050. While such efforts to document and preserve are laudable, they miss out on lines of research that reinvigorate and carry forward a minoritised language through digital media [74] and the possibilities now afforded by Artificial Intelligence (AI) in general and speech and language technologies in particular.
In this contribution we weave together ethnographic, creative, and technical methods in partnership with a Banjara community. Inspired by the oral culture and agrarian lifestyle of the community, we showcase how we cultivated an AI model from seed (that is, without drawing on any existing digital language resources in the target language) to drive a simple, mobile information retrieval interface to surface co-produced media related to their farming practices in response to spoken-language queries.
The development methodology at the heart of this contribution initially requires very little data and can be iteratively improved, allowing for tighter feedback loops between community data contributions and system improvements. This approach therefore resonates with participatory and action research methodologies that emphasise partnership and reciprocity. In India alone there are 424 living indigenous languages, of which 304 have no digital language support available and therefore no access to, nor pathways to the development of, spoken language technologies [22]. Our data collection approach and the IR interface we designed and tested in-situ demonstrate the feasibility of our novel approach to speech and language technology development for communities living in oral cultures and speaking languages that are not (commonly) written and have few digital language resources available.
Figure 1 shows a schematic outline of our research and how it is split across three phases, each of which is further subdivided between field and lab research and development. This multi-phased approach afforded us opportunities to build trust and to identify, develop, and leverage community assets, which together highlight the value of incremental research approaches [85]. While each phase lists particular methods and probes developed and utilised, the schematic suggests a linear ordering that distorts the 'messy' realities in which methods were adapted, assembled, and performed in response to the particular context at hand (see [37]). This also resonates with ICT for Development (ICT4D) research involving people living in oral cultures, which emphasises modified, flexible, and opportunistic methods in order to deliver valuable and actionable results [28].

BACKGROUND
To contextualise our research contributions, we outline related speech-driven systems and efforts to develop digital language resources. We also engage with orality theory to show how the epistemology of writing affects (speech) user interface design. Finally, we consider ethnographic and linguistic accounts of Banjara culture and language, as well as community-centred research on digital repositories and information retrieval that is tailored to the needs and functions of oral, rural, or indigenous communities.

Orality & Written Representations
Ong's seminal work on Orality and Literacy [48] interrogates the technology of writing and unpacks the ways in which taken-for-granted aspects of experience are deeply affected by writing, and how these do not generalise to primarily oral, rather than literate, contexts and cultures. The subtleties of Ong's theory are not without critique, especially because of the way it frames signed languages and primarily oral cultures as dependent on or inferior to spoken languages and literate cultures, respectively [8,15,26]. However, ICT4D researchers have engaged with the broader arguments of Ong's work to surface a range of pertinent and practical design implications: for instance, by drawing attention to the ways in which oral thought relies on repetition; how information is structured through additive narratives; and how abstract categories and complex information hierarchies should consequently be avoided when designing user interfaces in such contexts [67]. Goody's argument that written language is not merely speech transcription but a mode of thought reproduction [27] is pertinent to our research too. Consider this paper as an example, which not only contains tabular structures (for instance, the author table at the beginning of this paper, which lists names above affiliations and contact details) and hierarchical ordering (sections, subsections, lists, etc.), but also includes subordinate clauses (such as this one) that exemplify this mode of thinking. According to Goody, knowledge and representation are two sides of the same epistemological coin. This has far-reaching implications, as what we might call the epistemology of the written word also influences the metaphors that structure our experience and thinking (see [35]). Consider how Bidwell juxtaposes the literal and figurative translation of "are we walking together?" with "are we on the same page?" in reporting her insights from deep design research with rural, oral communities in the Eastern Cape of South Africa [4]. These consequences are not limited to conversation and writing. User interfaces extend and re-produce written thought too [82], just as databases, dominant cultural forms [40] of literate societies across the 'hyperdeveloped' world [72], are driven by relational, tabular, and hierarchical data structures [20].

Speech Interfaces
Against this backdrop, speech interfaces in particular have been identified as promising technologies to broaden digital participation of illiterate and semi-literate populations across the Global South [83], especially for speakers of languages that, like Gormati, do not have a native script. Here, Vashistha and Raza [77] give a useful overview of almost 15 years of research, innovation, and impact of Interactive Voice Response (IVR) and Interactive Voice Forum (IVF) applications that, through the widespread adoption of mobile phones, "have found applications in diverse domains and have profoundly impacted underserved communities in low-resource environments" [77, p. 570]. Both IVR and IVF systems are accessed by calling a (typically subsidised) phone number and then navigating a menu of options using speech commands or keypad numbers. An agricultural extension IVR system, for instance, might allow callers to listen to announcements by pressing 1 or saying 'announcements'. Rather than purely disseminating information, IVF applications provide callers with the ability to record messages and, in the case of Patel et al.'s Avaaj Otalo [50], to post questions and listen to the questions and responses of other farmers and agricultural extension workers. This social element of IVF platforms has driven the success of related systems such as CGNet Swara (for citizen journalism) [45], Polly (for entertainment and jobs) [58], and Sangeet Swara (for social media) [75].
IVR/IVF research has investigated the efficacy of and caller preference for key-press vs speech-command input modalities (e.g., [50]), but generally does not engage with the nuances of Ong's orality theory, nor recognise the ways in which IVR/IVF systems impose a hierarchical ordering system. Vashistha and Raza mention that for caller-generated content it is "difficult to automate categorisation" [77, p. 596], and that such categorisation is necessary for users and IVF operators/moderators to find and access content.
IVR systems in particular [57], as well as smart-speaker installations [60], have been leveraged to develop the digital language resources that many minoritised languages lack. Involving and empowering minoritised language speakers in the crowdsourcing of transcriptions has been a further area of HCI research and user interface innovation [60,76,78,79]. However, these systems have to date targeted regional languages with more established writing systems.

Community & Farming Practices
Our approach of documenting and engaging community members around their farming practices resonates with the Digital Green initiative, which leverages peer learning and participatory video to record small-scale farmers across India demonstrating effective practices such as vermicomposting or fertiliser application [25]. However, content that is shared through the organisation's agricultural extension platform, while popular and impactful, is stored in a database and accessed through a webpage with a hierarchical navigation system (e.g., by language; by category; by sub-category; etc.) that favours larger, regional languages that are more commonly written. Particularly when working in indigenous, marginalised, and minoritised contexts, science and technology studies scholars Verran and Christie alert us to the misplaced dichotomy between traditional (i.e., oral) and modern (i.e., literate) cultures: "Traditional cultures are contemporary forms of life just as modern cultures are. They are rich in modes of innovation [...] We can understand traditional cultures as involving nonmodern forms of identity. They have ontologies that make modern assumptions about knowledge and knowing look strange." [80, p. 73] The key challenge, then, is to devise appropriate and flexible ways of arranging, storing, and finding digital content that are usable for those working within nonmodern cultures, and where this "becomes a site, a time and place where young and old, with their varying competencies, work together [...] in ways that can empower and educate the young while recognizing older people as knowledge authorities" [80, p. 74].

Community Repositories & Information Retrieval
HCI researchers have partnered with remote, marginalised, or indigenous communities across the globe to design digital technologies that "enact culture in the digital age" [74, p. 16] and address specific problems communities face [5], while cultivating cultural sensibilities and addressing sticky representational issues [44,70]. Designing with an Aboriginal community,

Ethics, Consent, & Compensation
Those working in the fields of language documentation [7,21] and ICT4D [16] have highlighted the pertinence of research ethics. The studies presented in this paper were approved by an institutional review board at Swansea University. We also follow best practices within ICT4D research that emphasise long-term engagement and reciprocity, and working with local researchers and organisations [16]; and the practices of linguistic research into unwritten languages [7,21], which propose establishing informed consent orally [21] and placing linguists in control of operationalising storage of, and access to, collected data [7]. In following Brereton et al.'s advice and configuring our research approach for reciprocity and engagement, we supported community members when they visited us in the city or when they asked for support or information not directly related to our research [9].

Ethnographic & Linguistic Resources
We also consulted ethnographic accounts of Banjara culture and language elsewhere in India [11,46] and report on these in the next section. We identified linguistic resources such as word lists [43] and multi-lingual dictionaries [33]. However, we found that community practices, particularly surrounding writing and transliterating, diverged from ethnographic descriptions, and so we did not draw on written resources in developing our system. This also means that our development approach is more likely to be applicable to other contexts where language is spoken, not written, and where knowledge practices are not, or less, influenced by writing systems.

PHASE 1: LEARNING, PARTICIPATING, & FRAMING
The first phase of this project was not directed towards a particular technological purpose. Rather, it was set up as an opportunity to observe, participate, and learn from an agrarian Banjara community in Jalgaon District, Maharashtra State, India. We report on these activities extensively here to introduce what we learned about the community and its everyday practices, and how we drew on these early experiences to situate and frame the subsequent design methods and approaches of later phases, including identifying spoken language technologies as a nascent design opportunity. These activities and experiences also provoked interdisciplinary discussions across HCI, design, linguistics, and NLP, through which we decided to focus on use-cases and data collection around more specific topics (e.g., farming) as opposed to open-ended, unconstrained speech.

Objectives & Methods
The objectives of this phase of research were to lay the groundwork for community partnership, to learn from the community, and to inform and frame subsequent methods and approaches. Previous research involving oral cultures and communities has relied extensively on ethnographic methods for these purposes [6]. The ethnographic methods we utilised are inspired by Lee and Ingold's observation that walking, especially when done alongside others, is a powerful but often underappreciated form of anthropological engagement: that places are made and best understood through the journeys that people make within and between them; that walking attunes us to multi-sensory and embodied experiences; that walking is fundamental to social life; and that walking together is a particularly sociable type of movement that affords opportunities for shared understanding [38]. We complemented this style of "fieldwork on foot" [4,38] with audio-visual media (photos, videos, and voice recordings) recorded along the way. Here, Pink suggests that mobilising audio-visual media in this manner in general, and involving local people in the co-production of videos in particular, are effective methods for uncovering and simultaneously documenting insights [52].
We are also mindful of Brereton et al.'s critique that obtaining the privileged position of ethnographer and observer is difficult, particularly in remote or Indigenous settings and for projects seeking to drive (digital) innovation [9]. In moving "beyond ethnography", processes of engagement, (mutual) learning, and reciprocity should be primary considerations, as these underpin valid and sustainable research partnerships [9]. It is in this spirit, rather than for the purposes of ethnographic analysis, that we utilise our methods.

Community Background
Settlement & Infrastructure. We (one of the authors) were hosted by a family in the community for five days. We slept on the rooftop of a two-room pucca house, owned by one of four brothers. The houses of the brothers (and their families) are clustered around a shared courtyard. All but two of the houses in the cluster are single-roomed, tin-roofed, and constructed of wattle-and-daub. The community has a mains electricity supply, although intermittent power cuts are common. Clean water is only available for 30 minutes every few days, leading to a well-rehearsed choreography of filling every available container. There are no sanitation facilities. On our drive into the community we could see a new telephone mast carrying 4G radio units and antennas, boasting actual 4G speeds (~50Mbit/s) that surpass those of many urban areas by a factor of ten. We observed only a small number of (younger) people with smartphones; most elders either do not own a phone or share a featurephone. The cost of mobile data in India is amongst the lowest globally (~$0.10 per GB). The smartphone usage we increasingly observed in the community largely revolved around popular culture on YouTube and YouTube Shorts.
Walking through the community, we learned that the sum of all the surrounding courtyards and hamlets constitutes the bounds of the thanda, or Banjara settlement, of about 6,000 people, where many members are related to other members of the thanda [11, p. 45]. Agricultural plots adjoining each hamlet within the thanda are similarly owned by close kin and are generally inherited patrilineally. The thanda is adjacent to, but also distinct from, a Marathi village, a typical settlement pattern. According to our hosts, Banjaras mostly keep to themselves, although there is some trading between the thanda and the village, and Banjara children attend a Marathi-language school.

History & Language.
Our hosts were able to retrace the history of their thanda back four generations. Historically, Banjaras led a nomadic life, but were forced to settle under colonial British rule. With the passage of the Criminal Tribes Act of 1871, Banjaras were branded as criminal, as their "nomadic ways of life [...] was regarded as suspicious and [...] difficult to be controlled" [11, p. 8]. After independence, Banjaras were declared a 'Denotified Tribe' and are currently classified as a 'Vimukta Jati and Nomadic Tribe' to recognise their historical marginalisation and make them eligible for special considerations in the state of Maharashtra.
There are 30 million Banjaras in India, but due to their nomadic history and contemporary settlements scattered throughout the country, Banjara culture resists neat classification. Depending on the region in which they settled, they are known by at least 26 names (e.g., Banjara, Banjari, Vanajara, Lamban, Lambadi, etc.). Their language is similarly polyonymous, including (but not limited to) Gormati, Gor, Banjari, Lamni, Lambadi, etc. For consistency, we use the term "Banjara" to refer to the Banjara people/community and "Gormati" as the language spoken by Banjara community members. This follows the conventions of the particular community we partnered with, but we note that we can only speak for the conventions and practices of that particular community and place.
Gormati belongs to the Indo-Aryan language family, but it has many dialects that can vary even from thanda to thanda within the same region. While kith and kin speak Gormati to each other within and across thandas, outside of their community Banjara people speak the local language of their region, in this case Marathi, and usually also Hindi [11, p. 53]. Gormati, however, does not have an indigenous script [11, p. 57], although it can be written (transliterated using the closest corresponding letters) through Telugu, Kannada, or Devanagari script, depending on the state in which the thanda is located [10, p. 41].
Our engagements within the community nuance such general findings. While we found evidence that transliteration of Gormati using Devanagari script is possible, only a minority of people are able to do this, and it is not a common practice. Furthermore, there is a linguistic difference between younger and older generations. Younger generations are generally fluent in and can mostly read and write Hindi and Marathi: the national and regional languages, respectively. Older generations often cannot read or write these languages and may not be fluent in them either.
A major barrier to digital participation is therefore text input (see [18]). However, those who are not fluent or literate in Hindi make do and seek assistance from younger family members (e.g., to contact a community member or obtain information) [61].

Community & Farming Practices
Every morning, and at different times throughout the day, we went for walks to the surrounding fields to observe, learn, and participate. By walking with our hosts into the fields, we also followed in the metaphorical footsteps of Gupta, who has been walking with rural communities in India to uncover and share grassroots innovations on topics that range from farming, sustainable conservation, and animal husbandry to cooking and recipes [29]. Along the way we stopped to greet and chat with other community members. We asked what they were doing and observed them carry out their work in the field: ploughing, weeding, watering, planting. We captured glimpses of such encounters through photos, and videoed community members demonstrating certain aspects of their work to serve as aide-mémoires and, later, as a means of communicating these experiences to the wider research team, distilled and presented through slide decks. These activities were not directed towards a particular purpose, but were invaluable later in situating design activities. For instance, we developed an intuition of when people were communicating freely and when the language barrier was creating confusion. We mostly chatted in Hindi, but when we sensed confusion, our hosts would translate. Farming is as intrinsic to Banjara culture as the Gormati language, and our hosts were eager teachers of both. They said that if we stayed a few more weeks we would be able to speak Gormati, as we were already picking up certain greetings, phrases, and names of crops.
Both men and women, old and young, work in the fields. Cotton is the primary cash crop; millet, corn, lentils, chillies, and onions are grown for subsistence. Branches cut from trees planted around field boundaries are harvested as tree-hay and fed to cattle and goats along with stalks from corn or millet. Oxen are used to pull carts and for ploughing. Cow's milk is usually sold, but goat milk is served with chai. We mostly ate chapatis made of millet with lentils, prepared by women over wood fires.
When it got too hot (~33 °C), we returned to the courtyard and rested under a neem tree. The weather, the cycles of day and night, and the needs of animals and crops, rather than the clock, created the rhythm of quotidian activities. Both in the fields and in the courtyard, Banjara women often chant poems and sing songs. And as more community members learned that we were interested in Banjara culture and the Gormati language, they would approach us to capture video of them singing a song. When it got dark, and during periods of downtime, we composed some of our own songs inspired by the sonority of the voices we heard throughout the day.

Findings & Implications
Corroborating our direct experiences with ethnographic [11,46] and linguistic [10] accounts of Banjara culture and language elsewhere in India, we were struck by how frequently these mention transliteration through regional scripts (e.g., using the widespread Devanagari script in Maharashtra). This contrasts with a key finding of our engagements with the community: that Gormati was spoken, rather than transliterated. In fact, the linguistic landscape of the community contained very little writing. Through discussions with the wider research team across the Design, HCI, Linguistics, and NLP disciplines, we decided not to pursue lines of inquiry that utilised or surfaced transcriptions or transliterations, mostly because of limited transliteration practices in the community, but also because we wanted our approach to be adaptable to other contexts. Respecting oral practices, rather than imposing transliteration or writing systems, was a key implication of this phase of research.
These discussions were scaffolded around slide deck presentations containing images and videos we recorded and co-produced while in the community. These slide decks showed the thandas we visited and the farming practices we observed, and expressed and communicated glimpses of what we learned about everyday life and the use of Gormati in the community (see [52]).
Our discussions also focused on what Harper [30] refers to as a "marriage of purpose" between users and machines that is sensitive to community needs and context, but is also anchored in an understanding of how technology works and what it might realistically deliver, an equally important consideration given the opportunities and hype that currently surround AI [23].
Given current barriers to digital participation, especially for older generations, and the increasing adoption of smartphones by younger generations (after the installation of a 4G telephone mast), the key finding of our ethnographic engagements throughout phase 1 was that speech and language technologies could have a role to play in the digital expression and sharing of Banjara insight and culture. Through our engagements we also established trusting relationships with community members. Not only had community members proven to be willing (and patient) Gormati teachers, but upon our departure they also expressed a wish for us to return and establish a nascent partnership. From the technology side, we had to be realistic about what we could deliver, but we also needed data to develop Gormati language technologies. Through discussions across the research team, we agreed on a series of high-level guiding principles. To alleviate some of the technical complexities of the project and to reduce the amount of required data, given also that there are so few digital language resources in Gormati, we decided to focus data collection on a couple of specific domains or topics (e.g., farming). Rather than collecting open-ended, unconstrained speech of the kind one might use when chatting with a friend, we posited that focusing on specific domains would involve a smaller subset of potential words (~100), and that gathering 30 hours of speech data within these domain constraints would provide enough word repetition to develop a basic language model. To implement these guiding principles and to tighten feedback loops, we also decided that members of the broader research team should participate in future community visits.
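The word-repetition reasoning behind these targets can be made concrete with a rough back-of-envelope estimate. The speaking rate below is an assumed illustrative figure (not one we measured), and word frequencies are averaged uniformly, whereas real speech is Zipfian, so rarer words would recur far less often:

```python
# Back-of-envelope estimate of per-word repetition in a
# domain-constrained speech corpus. Figures marked 'assumed'
# are illustrative, not measured.
hours_of_speech = 30      # data-collection target from phase 1
words_per_minute = 100    # assumed conversational speaking rate
vocabulary_size = 100     # target domain vocabulary (~100 words)

total_tokens = hours_of_speech * 60 * words_per_minute
repetitions_per_word = total_tokens / vocabulary_size  # uniform average

print(total_tokens, repetitions_per_word)  # 180000 1800.0
```

Even allowing for a heavily skewed frequency distribution, repetition counts in the hundreds for common domain terms are plausible under these assumptions, which is the intuition behind constraining topics rather than collecting unconstrained speech.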

PHASE 2: ENGAGEMENT, DATA COLLECTION, SYSTEM DEVELOPMENT
In phase 2 we returned to the community to deepen our partnership and to focus our engagements around more topic-specific use-cases for speech and language technologies. We also experimented with different data-collection methods while in-situ, and used the insights we gained to train community members in how to continue contributing data after we left the community. It is this data that we then utilised to train a novel spoken language information retrieval system for Gormati.

Objectives & Methods
The objectives of this phase of research were to engage community members to contribute spoken-language data that match and drive use-cases for spoken language technologies that support everyday practices. As nearly three months had passed since our phase 1 visit, and since another member of the research team was visiting the community for the first time, we began phase 2 by leaning on the more ethnographic and audio-visual documentation methods of phase 1 (see Section 3). We did this to tune our senses and sensibilities from urban and research lab environments to those found in rural, oral, and agrarian communities.
Gradually, we transitioned our methods to focus also on spoken-language data collection. We took photos and videos (with consent and permission) of farmers conducting their activities as we passed by, and opportunistically recorded videos of people demonstrating other techniques and practices. For instance, upon hearing bees buzzing in a hedge, we recorded a video of the person accompanying us harvesting honey. To trial different approaches to collecting speech data, we also asked people to narrate (in Gormati) as we were filming.
A central tenet of our design research is to engage and reciprocate, rather than to solely document and collect (see [9]). That is, we also wanted to teach community members how spoken language technologies are able to 'learn' from repeated exposure to particular utterances and how, with time, such a system could identify and match similar phrases. We did this through workshops, during which we also recorded further audio data.
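The matching behaviour we explained in these workshops can be illustrated with a minimal, hypothetical sketch; this is not our actual system, whose details follow later. It assumes queries and annotations are represented as acoustic feature sequences (e.g., MFCC frames; the arrays below are toy stand-ins) and uses dynamic time warping (DTW), a classic transcription-free technique, to tolerate differences in speaking rate when ranking recorded annotations against a spoken query:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature
    sequences (frames x dims), tolerant of tempo differences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch query
                                 cost[i, j - 1],      # stretch annotation
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]

def retrieve(query, annotations):
    """Index of the recorded annotation closest to the query under DTW."""
    distances = [dtw_distance(query, ann) for ann in annotations]
    return int(np.argmin(distances))

# Toy stand-in feature sequences for two photos' narrations, and a
# query resembling the first narration but spoken more slowly.
ann_a = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
ann_b = np.array([[5.0], [5.0], [4.0], [5.0]])
query = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [1.0], [0.0]])

print(retrieve(query, [ann_a, ann_b]))  # 0 (the first narration's photo)
```

The point of the sketch is that no written form of the language is needed anywhere in the loop: repeated exposure to similar utterances for the same photo simply gives the query more chances to land close to one of them.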

Engagement
Returning to the community for five days as a team of two researchers (one Hindi-speaking; one non-Hindi-speaking), we settled into the familiar rhythm of accompanying our hosts into the fields and engaging with people as we went along. We again slept on the rooftop of the main house in the courtyard and benefited from being immersed in context and surrounded by community members. We consulted and clarified with our hosts whenever questions emerged. During downtime we wrote up notes, which we also shared with the wider research team, and continually discussed, refined, and reflected on our plans and methods.
We also ran workshops to engage community members on how speech and language technologies are developed or 'trained', and to experiment with different data-collection methods.
Here, we found the metaphor of how young children pick up phrases through repeated exposure useful, and would draw on this metaphor to explain how spoken language technologies can make mistakes that can seem childish. We also set up a voicemail box that community members could call to contribute recordings of the different things they did to care for their plots, animals, plants, and equipment. We initially thought that it could serve as a spoken diary that fulfils the domain-constrained vocabulary requirement, but that it could later be queried by community members, for instance, if they needed to remember when something happened. However, when we presented this idea to workshop participants, they did not express interest in keeping, and being able to query, such a diary and, to our initial surprise, reported no difficulty in remembering things. On post-hoc reflection, our surprise likely says more about how we equated memory with written or calendar records, which again shows the deep level at which writing restructures consciousness (see [48, p. 95]).
In preparing for the workshops, we worked with one of our hosts to create a consent recording in Gormati that explained that we would be using the data to create an interactive, Gormati-language, speech-driven application for them. We also created slide decks of the photos we took of different crops and tools, but discussed how we wanted to avoid common approaches to voice user interface development that seek to detect the presence of a predefined keyword (e.g., 'cotton') or keyphrase (e.g., 'watering cotton') [e.g., 56,66]. We were wary of reifying photographs into objects and keywords or phrases. On this topic specifically, Ong reminds us that "an entirely oral language which has a term for speech in general [...] may have no ready term for a 'word' as an isolated item, a 'bit' of speech" [48, p. 60]. During the workshops, we instead asked community members to play an adapted form of a Wittgensteinian language game [84] and narrate the doings associated with the things (see [32]) pictured.
So we recorded participants narrating those 'doings': the steps involved in growing the crop or operating the tool pictured in each slide. Here we found a generational divide, whereby older men and women felt confident narrating at length, but younger generations were far briefer. Later on, we used these narrated slide decks to showcase how an information retrieval system might work. This was done, for instance, by asking participants to say something similar to what they had just recorded for the photo of an ox-drawn wagon, and then accessing the slide containing that photo through a keyboard shortcut and playing back their recorded narrative. It was difficult to create a 'clean' recording environment, as the courtyard represents a nexus of both (noisy) activities and (inter)family relations. People had to come and go, so we worked with people when they had time and showed interest, but did not keep them longer if they had other things to attend to. Participants in the workshops found it difficult to imagine use-cases or domains other than farming. They did, however, mention finding songs and accessing religious ceremonies, which typically also include singing.

Data Collection
We discussed and reflected on our in-situ activities with the wider team and decided to steer away from sung content because this is an application area that would likely be too complex for the capabilities of a system in such a resource-constrained context. Given that community members had observed us recording community narrations in the fields and workshop participants had practised creating photo narrations, we decided to utilise mobile digital storytelling software⁹ as a data-collection probe, replicating the process we had started on our laptops. We trained four younger community members on how to use the data-collection probe (see Fig. 3), following the process we trialled and refined during earlier workshops. We created a slideshow template on each phone, which contained 30 photos of crops, animals, and equipment.
We encouraged younger generations, who were more adept at operating their Android smartphones, to help older people to record narrations, a practice which ICT4D researchers refer to as 'intermediation' [61] and which Verran and Christie identify as a site of inter-generational collaboration [80]. To make the process easier for data-contributors, we also explained that it is helpful to record similar content for the same photos; again using the metaphor of a child learning words and phrases through repeated exposure. We loaned a phone to a young lady, to ensure that female voices were represented, as these are often missing from IVR datasets [77]. The three young men used their own Android devices. We paid data-contributors and erred on the side of generosity to ensure that the amount was appropriate and commensurate, and also covered airtime expenses: 4,000 rupees (~$50) per contributor. We showed data-contributors how to export narratives from the digital storytelling software and how to share these with the research team. Making this step explicit ensured that we did not accidentally collect data that was not intended for us (e.g., if they were using the software for other purposes). We shared the consent recording from the earlier workshop with the data-contributors and explained how they needed to obtain consent if they recorded someone from whom we had not already obtained consent earlier.

⁹ https://play.google.com/store/apps/details?id=ac.robinson.mediaphone
We stayed in contact with the data-contributors and host family after leaving the community and collated data in batches. While the digital storytelling app exports made it easier to link recorded audio to specific photos across multiple phones and data-contributors, the order of photos in our presentation template meant that data-contributors were creating more narrations for photos that appeared towards the start of the slide deck and fewer for photos that appeared towards the end. In total we collected 3h43m of spoken-language annotations of 30 photos (see Fig. 4).

Findings & Implications
Through our community engagements and workshops, community members were starting to understand how speech technologies learn to pick up and match Gormati words and phrases through repeated exposure to them in the form of community-contributed recordings. However, talking to data-contributors after we left the community, we got the sense that the task of annotating photos with recordings, while clearly specified, was still somewhat abstract: it was unclear to them how the recordings they were contributing related to actual Gormati speech technologies and how such technologies might actually work in context. The difficulty in sustaining their engagement and contributions further evidences this, and led us to substantially scale back our data collection ambitions from 30h to 3h43m.
We posited that an interactive demonstrator system would help to motivate and inspire community members to take on the labour of contributing data if they could see how these contributions translated into a working system.

Information Retrieval System Development
Meanwhile, in the lab, we (NLP researchers & linguists) reviewed technical literature about spoken content retrieval to find a solution that would allow us to respond to this challenge and build an interactive demonstrator system using only 3h43m of training data. The most straightforward method for spoken content retrieval is keyword search. In this method an automatic speech recognition (ASR) system, trained on manually transcribed data, is used to generate several alternative transcripts of each recording. The transcriptions of all recordings are then used to build a search index, which can subsequently be used to search for a given keyword or keyphrase [2]. The small amount of training data poses a central challenge to building a high-quality ASR system. Applying a multilingual phone recogniser trained on data from well-resourced languages [24, 39] can remedy this problem.
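The keyword-search baseline described above can be sketched as an inverted index over the alternative ASR transcripts of each recording. This is a minimal illustration, not the system we built; the recordings and transcripts below are hypothetical stand-ins for real ASR output (indexing every alternative hypothesis makes search more robust to recognition errors in any single transcript):

```python
from collections import defaultdict

def build_index(recordings):
    """Build an inverted index over n-best ASR transcripts.

    `recordings` maps a recording id to a list of alternative
    transcripts. Every alternative is indexed, so a keyword is
    findable as long as at least one hypothesis contains it.
    """
    index = defaultdict(set)
    for rec_id, transcripts in recordings.items():
        for transcript in transcripts:
            for token in transcript.lower().split():
                index[token].add(rec_id)
    return index

def search(index, keyphrase):
    """Return recordings whose indexed transcripts contain every query token."""
    hits = [index.get(tok, set()) for tok in keyphrase.lower().split()]
    return set.intersection(*hits) if hits else set()

# Hypothetical n-best transcripts (note the recognition error 'felt').
recordings = {
    "rec1": ["watering the cotton field", "watering the cotton felt"],
    "rec2": ["harvesting the millet", "harvesting the mallet"],
}
index = build_index(recordings)
assert search(index, "cotton") == {"rec1"}
assert search(index, "watering cotton") == {"rec1"}
```

The weakness for our setting is visible in the sketch: it presupposes transcribed training data for a high-quality ASR system, and it only retrieves recordings that literally contain the query keyword.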
We decided to split our information retrieval system into two components: a phone recogniser and a ranker.¹⁰ We used a multilingual phone recogniser trained on transcribed speech data from well-resourced languages to transcribe queries and captions into phone sequences. We then trained the ranker to predict how these phone sequences correspond to each photo. Through this decoupled architecture, we were able to rapidly prototype our system and bootstrap the voice user interface with almost zero hours of spoken content in the target language. Furthermore, it allowed us to quickly and iteratively update the system, in anticipation of community members contributing more data during phase 3 of the work.
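The decoupled design can be illustrated with a minimal sketch. The phone recogniser is stubbed out here (in our pipeline it is a multilingual model; see Appendix A for the actual architecture), and the ranker is replaced by a simple phone-n-gram cosine similarity purely for illustration; the phone sequences and photo labels are hypothetical:

```python
from collections import Counter
import math

def phone_ngrams(phones, n=3):
    """Bag of phone n-grams from a phone sequence (a list of phone labels)."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse bags (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_photos(query_phones, photo_annotations):
    """Rank photos by similarity between the query's phone n-grams and the
    pooled n-grams of each photo's spoken annotations."""
    q = phone_ngrams(query_phones)
    pooled = {photo: sum((phone_ngrams(a) for a in annots), Counter())
              for photo, annots in photo_annotations.items()}
    return sorted(pooled, key=lambda p: cosine(q, pooled[p]), reverse=True)

# Hypothetical phone sequences standing in for recogniser output.
annotations = {
    "cotton": [list("kapashiwaterfield"), list("kapashipick")],
    "millet": [list("bajrigrind"), list("bajrisow")],
}
assert rank_photos(list("kapashi"), annotations)[0] == "cotton"
```

Because nothing in the ranker depends on a written form of the language, either component can be swapped out or retrained independently as more community-contributed audio arrives, which is what made rapid iteration possible.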

Probe Development
When we were satisfied with the performance of the IR system in lab conditions, we deployed it as an API so that it could be used and evaluated in the Banjara community. For this purpose, we also created a probe to interact with the IR system through the API, implementing this as an Android app. The interface (see Fig. 5) was kept deliberately simple: after speaking a query, the user is shown the top-ranking photo result as generated by our ranker. The user can then rate the result using either the green checkmark or red X-mark buttons, after which the next photo in the ranked list is displayed. After rating all results the user can create a new query.

PHASE 3: EVALUATION & CO-DESIGN
We returned to the community as a team of four researchers (one Hindi-speaking; three non-Hindi-speaking) for five days. This time we decided to stay in a nearby town (a ~30-minute drive away), so as not to impose on our hosts, because there were more of us and the monsoon season made it impossible to sleep outdoors.

Objectives & Methods
The objectives of this phase of research were to evaluate the prototype system we had developed, which we did quantitatively to avoid participant response bias caused by social and demographic factors [17]. We also wanted to leverage the prototype to engage community members on future use-cases of Gormati speech technologies through a series of design workshops and data collection exercises.
A key challenge here is that the methods that underpin more mainstream forms of user-centred or co-design fall short when working with oral communities [28], and in those that are (often also) digitally marginalised [41]. Traditionally, sketches and paper prototypes are used as shared artifacts that facilitate communication between designers and users. However many, especially older, community members cannot read or write, and even those that are textually literate tend to treat writing as a more formal, special activity (e.g., letter writing): the opposite of the informal style of writing through which designers involve users in sketching or paper prototyping (see also [6]). Furthermore, digitally marginalised users often do not have a strong sense of digital technology and the types of things computers, or in our case speech and language technologies, can do well. They also have little experience of how digital technology is malleable and can be (re)programmed to look or function differently [41]. Technology probes, simple prototypes left strategically incomplete and flexible, offer a way of engaging user groups on new (unaccustomed) technological possibilities [31]. Such probes have been successfully used in other, oral contexts as a central component of design workshops to co-design digital storytelling software [6].
Adapting to these constraints and building on successful uses of technology probes in oral contexts, we also leveraged our prototype as a 'dialogical probe' in the design workshops to facilitate a future-oriented design dialogue: "As such, the concrete prototype works as [. . .] a dialogical probe, that supports increasing cross-cultural understanding through dialogue [. . .] but it will only work because there are people who accompany it and engage in dialogue around it" [69, p. 115].

Evaluation Trial
Before arriving in the community, we ran an evaluation trial with a community member who had recently migrated to the nearest major city (Mumbai). We brought a phone with the probe app installed as well as printouts of the 30 photos listed in Fig. 4. We asked the participant, who was previously a data-contributor, to complete two tasks:

5.2.1 Task 1: Querying Individual Photos. For this task, we kept the full deck of photos hidden from the participant. We disclosed the photos one at a time and, after showing a photo, asked them to use the app to record a query that they would expect to bring up the photo.

5.2.2 Task 2: Querying the Collection. For the second task, we spread out all of the photos on the floor and pointed to an individual photo. Similar to the first task, the participant was asked to record a query on the app that should bring up that photo as a result. The aim here was to support the query task with contextual knowledge about the full corpus of images.

5.2.3 Results & Reflections. Between the two tasks, we found (via the participant's feedback) that knowledge of the corpus did not make a difference to how the participant recorded queries. We also found that photos with only minimal audio annotations were not reliably being returned as results. While the participant represented a best-case scenario, being digitally savvy and part of the training dataset, we learned that we would need to focus more on communicating the capabilities of the system before recording user queries, even if this meant giving community members an overview of the corpus of photos. We were, however, cautious about affecting how community members articulated their queries.
Trialling the app ourselves during its development, we used Gormati keywords and keyphrases we had picked up from the community, such as 'kapashi' (cotton) and 'bajri' (millet), to test whether its bi-directional streaming of audio queries and photo results was working. So, we planned to avoid demonstrating the app ourselves in the community and would instead encourage participants to speak naturally. To accommodate longer queries, we configured the app to only cut off a query after 10 seconds had elapsed. We also adapted the prototype with functions to replay a query and to record audio comments while viewing photo results, to allow us to contextualise interactions; for instance, to mark queries we made to test whether the system was working, or to indicate why a query failed for external reasons (e.g., loud noises or concurrent speech).

Community Evaluation
On the second day of our visit we recruited eight community members (4M, 4F; aged 20-50), who had not been part of data contribution during phase 2, to experiment with and evaluate the system. We decided to conduct evaluation sessions inside one of the homes, back-to-back, and all on the same day to minimise community members consulting with one another about the task and how they created queries. We went through ethics and consent with participants, as well as introducing the system, what it does, and how it operates. We kept the corpus of photos out of view from participants and then showed an individual photo from the corpus, asking participants, as in our earlier evaluation trial, to say a query that would bring up that photo in the app. We then rated the returned photo results using the rating buttons on the photo results screen (see Fig. 5). After completing the rating step, we showed participants the next photo and asked them to record a query for it.
In some cases we had to retry a photo query: because participants were still thinking about what to say after we had already started recording; because someone had entered the room and was talking while we were recording; because the goat tied to the front of the house was bleating to be fed; or because host family members brought in chai. We noted these interruptions by recording comments on the results screen.
In articulating her alternative account of the relations between plans and situated action in the context of scientific research, Suchman found that the experimenters' expertise lay not in strict adherence to plans and protocols but in being able to continually adapt by drawing on plans as a resource for action [73, p. 185]. During the evaluation sessions, we too found it necessary to adapt. Drawing on the trust and intuitions we had formed over the course of three visits, we could sense that early participants were getting tired during evaluation sessions, especially since we had to ask them to repeat queries to compensate for interruptions by people or animals. So, we decided to accommodate participants by excluding photos with less than two minutes of audio moving forward, hence limiting our evaluation to 19 photos. However, in the moment we mixed up two of the photos and excluded the photos of 'Chillies' (437s) and 'Wild Berries' (282s) by accident, including the photos of 'Stove' (60s) and 'Dried Dal' (18s) instead. The green columns of Fig. 4 show the 19 photos that formed part of the evaluation dataset, while the red columns indicate the long tail of photos we decided not to trial because they had so little data associated with them. A further accommodation we made was to show the photos on a laptop, because older participants had difficulty seeing the printed photos inside the dimly lit homes. Changing this on the fly, we could no longer rely on simply shuffling the photos in the deck to randomise the order, nor rely on removing cards from the deck to keep track of which ones we had shown to participants. While most photos were shown to between five and eight participants, three photos ('Dried Dal', 'Corn', & 'Stove') were only shown to one or two participants. Despite occasional interruptions and unanticipated accommodations, however, we overall settled into a steady rhythm of querying and rating the returned photo results.

Results
After returning from the community, we iteratively cleaned the data generated during the system evaluations by first removing those queries which contained an audio comment marking them as excluded (e.g., for testing, needing to be retried, etc.) or which were inaudible because the microphone was blocked. Next we marked the remaining 99 queries, which contained audible speech, for inclusion. Figure 6 shows the distribution of results of the 99 queries that participant-evaluators created on the app in response to being shown one of 19 different photos. The dotted blue line indicates our target of returning the corresponding photo as a top-5 result. Across the 19 photos there were 73 queries that returned the corresponding photo on the app as a top-5 result. The remaining 26 queries did not meet our target threshold.
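The top-5 figure is simply the fraction of included queries whose target photo appeared at rank 5 or better. As a sanity check on the reported totals, this can be computed from per-query ranks; the rank list below is illustrative, constructed only so that 73 of its 99 entries meet the threshold, and does not reproduce the actual per-photo distribution in Fig. 6:

```python
def top_k_rate(ranks, k=5):
    """Fraction of queries whose target photo appeared at rank <= k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Illustrative ranks: 73 of 99 queries meet the top-5 target.
ranks = [1] * 40 + [2] * 20 + [4] * 13 + [9] * 26
assert len(ranks) == 99
assert sum(r <= 5 for r in ranks) == 73
assert round(top_k_rate(ranks), 2) == 0.74
```

This matches the 74% figure reported in the Discussion.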
Subsequently, we recruited a community member to assist with translating a random sample of 40 queries into Hindi using voice recordings. We further transcribed and translated the Hindi audio into English.
Consider our worst results for 'Pomegranate' (19th) and 'Onion' (18th). In the 'Onion' example, the participant formulated a query surrounding the replanting of a field, as the photo showed a freshly cultivated field that had recently been planted out to onions (see Fig. 5). It is likely that the IR model picked up the words surrounding fields and planting, which would overlap with the annotations of the millet photo, which was incorrectly returned as the top result for that particular query. The 'Pomegranate' example also featured a descriptive query about how the fruit on the bush in the photo looked ready for harvest and selling at market. However, the photo of 'Wheat' was returned as the top result. Although these queries produced outlying results, other queries for pomegranate and onions produced top-5 results on five and three occasions, respectively. This style of fluid and descriptive querying is furthermore representative of the larger query dataset, which did not contain keyword or keyphrase queries (e.g., for 'the pomegranate photo').

Community-Based Co-Creation
We leveraged the interactive demonstrations that our probe afforded to engage and involve eleven community members (7M, 4F; aged 18-68) in the co-creation of media retrieval use-cases that are more appropriate in oral contexts. These occurred during two workshops across two days: one focused on current information seeking practices and the other on uncovering use-cases that support and extend these practices with more useful content than the photos of the current probe. Before, between, and after workshops we experimented with different content generation approaches, as well as refining ways of collecting audio annotation data to drive the IR system with the same participants.

5.5.1 Evolving the IR System. After the evaluation sessions, we asked two younger community members, who had been data-contributors in phase 2, to take three additional photos. They sent us photos of a goat, sorghum, and (a different type of) corn. On two phones we created three slideshows, one for each photo, on the digital storytelling data-collection app (see Fig. 3) used during phase 2. Between the two participants, we asked them to create and send us ten spoken annotations for each photo. We used these photos and annotations to retrain the ranker model to showcase how the current IR system can evolve, and we utilised the evolved system during workshops.

5.5.2 Current Information Practices & Languages. We split this workshop across two groups to fit more comfortably inside the house: the first was with four younger participants (aged 18-22) and the second with seven older participants (aged 30-68). Younger and older generations had different responsibilities in the fields and in the homes, and so were available to participate at different times. The generational divide across groups also expressed itself in terms of smart-/feature-phone usage and non-usage, as well as their fluency and literacy in Hindi and Marathi. We asked polylingual participants to translate for one participant in the second group who did not speak Hindi. We structured what ended up being lively discussions around five scenarios/topics, designed to cover a broad range of everyday experiences. To arrive at these, we utilised interpretative research strategies [68] and drew on our observations and lived experiences of previous research phases, documented through field notes and research diaries, to come up with 18 potential scenarios/topics, captured these on post-it notes, and, following discussion between research team members, distilled these down to five.

[Figure 6 caption: One bar plot per evaluated photo, labelled with the photo name and the number of participants shown it, e.g., 'Onion (5)', 'Papaya (6)', 'Parsley (5)', 'Pomegranate (8)', 'Dried Dal (1)', 'Eggplant (6)', 'Groundnut (7)', 'Guava (4)'. For example, the bar plot on the bottom left is captioned 'Radish (7)' to indicate that the photo was of a radish plant and was shown to seven participants who each formulated their own query for that photo. This particular plot shows that across the seven queries the photo of the radish plant was returned as the 8th-ranked result once, the 2nd-ranked result three times, and as the top result a further three times. The 2nd- and 1st-ranked results are shown below the dotted blue line to indicate that they met our top-5 result target. Finally, a cumulative distribution combining results from all 19 photos is shown in the bottom right corner and highlighted in bold (though note that the x-axis there is scaled 0-40 instead of 0-6).]
In the selling cotton scenario we asked participants to walk us through their line of reasoning for deciding when to sell their cotton harvest. In phase 2 we had observed that one family stored many bags of cotton in the back room of their house, hoping for higher prices later in the year. Some participants were not involved in this process and deferred to and trusted other family members in their decisions. Those involved in the process said that the internet was not a helpful resource: internet prices were characterised as 'fake', more than what is actually offered by buyers in the area. Instead they call buyers and middlemen in the area, if possible pooling together, so the buyer will come to collect the cotton harvest from many families in a single vehicle. But this arrangement often falls through on either side, and then the buyer will not come to collect the crop. Other families drop the cotton off themselves at a depot in the nearby town to command a higher price, but need to cover the cost of transport.
On the topic of seeking health information, community members mentioned that they visit physicians in a nearby town and generally do not consult online information. They also reported on their experiences of Covid-19 and how healthcare workers came to administer vaccinations. They received a digital copy of an English vaccination certificate, and those with smartphones would facilitate receiving certificates for those without access.
On the topic of seeking help and information to manage crop diseases and pests, participants consulted agricultural shops in the nearby town and preferred to go in person rather than call. Sometimes they will show a photo they have taken of the issue, and usually the people working in the shop will make appropriate recommendations.
Across these three scenarios, and especially when information was sought outside the Banjara thanda, participants reported using Marathi, which led us to the topic of language preferences and perceptions. Participants preferred, and indeed cared deeply about, their language, but were also pragmatic about speaking different languages, and saw it as necessary to live in a society with lots of different people. However, within the community they always speak Gormati with each other. Elders mentioned that Gormati is a strong and stable language, but also acknowledged that their children learn more and more Marathi in order to speak with outsiders.
In their view, they should stick to Gormati. Younger participants prefer Ahirani and Marathi songs, and claimed that songs in those languages are more melodic than Gormati ones. However, older participants did not share this view. This led us to the topic of multimedia. Here older participants generally relied on those with smartphones to facilitate access, for instance by asking children to play a song from YouTube. Young people demonstrated how they use voice recognition, keyword search, and code-switching on YouTube to search for "Gor Banjara Song".¹¹ They explained that you need to use the English alphabet to find content on the internet, and also adjusted their querying style, from the fluid Gormati queries we observed while evaluating the IR system, to using keywords. Top results¹² are of high production value, with well-designed title cards that help identify and differentiate songs.
Participants mentioned that they would like to see more Gormati videos on topics in the following areas-farming, recipes, songs, comedy, and religion-and to record and share their own songs as well as videos with recipes or showing efective farming practices.Younger participants already create video content, but often delete it from their phones to conserve space and choose not to upload it as it does not match the production value of the videos they like to look at online.

5.5.3 Community-Generated Media Content. Between the workshops we experimented with generating the type of media content participants had mentioned earlier: filming community members making chapati, cooking lentils, weeding, and ploughing. Participants in the videos narrated what they were doing and, at our encouragement, repeated their demonstrations and narrations. For instance, when demonstrating how to make chapati, the person in the video made multiple chapatis and demonstrated and narrated each step multiple times: dosing and shaping the dough, cooking and flipping the chapati on the stove, and manipulating the cooked chapati to make it more pliable. We also encouraged participants, mothers and older farmers as well as their younger adult children, to make their own videos on project phones. We also found that women in particular wanted to record and share songs. However, unlike phase 2, where we had precluded sung content because it would be technically too difficult for our recogniser to cope with, we encouraged these, and later explored ways for community members to contribute spoken, rather than sung, annotation data for the IR system.

5.5.4 Design Workshop. We met with the same participants the next day to think about use-cases for spoken language technologies, as embodied and exemplified by the current probe, that support and extend their current practices. This time we had arranged to meet with the older group first, so that we could feed back their insights to the more digitally savvy, younger group.
With the first group, we started by showing some of the videos they had recorded earlier and discussed that these would be of interest both within and outside of the community. We asked them to imagine how they would find those videos if they were living in a different Banjara community hundreds of kilometres away. They mentioned they would ask their children to help, and that songs would be of particular interest to them. In the media gallery of one of the two phones we had lent, we tried to locate one of the songs that participants had recorded, but initially could not find it among similar-looking thumbnails. We also checked on the second phone, until we finally found the video after a more systematic check on the first device. We used this difficulty as an opportunity [28] to show our IVR probe again, demonstrate how it can make it easier to find content from spoken descriptions, and showcase how it had been extended with new photos since the previous day. We also explained that the IVR probe can be changed in future to include videos and song content, but that it can only understand spoken Gormati. We then asked participants to imagine that many songs were on the IVR probe and to tell us how they would find a particular song. After some discussion in the group they said that they could either explain the song in words or say the first line of its lyrics.
With the second group we began by discussing how the elder group had shown an interest in making videos, of their farming, their cooking, and their songs, and considering whether this would be of any value to them. The group said that if they knew the people in the videos they would look at them, and suggested they might laugh initially, but if the content is useful they could see others looking at them. We asked about another nearby community creating such videos and to imagine what these videos would be of. They expressed interest in seeing how different communities create fertiliser from cow manure, or stock ponds with fish. They also mentioned how their fathers are very skilled at particular aspects of farming, such as inter-cropping and keeping an ox-drawn plough straight; videos of these could be shared within the community and with other communities. They mentioned that videos could also be shared via WhatsApp, which led us to enquire how the messaging app is used in the community. Within their group, participants tended to use WhatsApp to forward images and videos and to send very short messages (e.g., 'hi', 'what's up', etc.) transliterated using an English keyboard. This was the only evidence we saw of transliteration practices.

5.5.5 Revised Data Collection Methodology. On our last day in the community we worked with data-collectors from phase 2 as well as community members who had shown interest in creating video content. We loaned phones to an older farmer and to two sisters-in-law who wanted to perform and share their songs. Other people either had their own phones, or could borrow one from a family member or use one of the phones we had lent. The community-generated video content from earlier in our visits already contained some spoken narrations, but not enough for the ranker model of our IR system, and in the case of sung content, we would need to rely entirely on spoken annotation. Trialling this with participants, we learned that young people appreciated being able to
listen to the original narrations of the content videos, as they found it harder to record annotations if they lacked the knowledge and/or confidence to describe what was being demonstrated in the video. They found it easier to start by 'repeating' what was already said, but also found ways of integrating their own knowledge and experience of the topic once they started speaking. These recordings were therefore not the verbatim repetitions used by systems such as 'ReSpeak' [78] to develop transcriptions for written languages. However, for our use-case, annotations featuring repetitions with variations are a more useful training resource for the IR system ranker than verbatim ones.
We asked phase 2 data-collectors about their experiences using the digital storytelling software, and they mentioned that it was frustrating to only be able to export the entire digital story slideshow, even if they only wanted to share one new audio annotation. We also wanted to explore a different data collection methodology, given that phase 2 audio annotations were unevenly distributed across photos (see Fig. 4). We suggested that they could also try using WhatsApp voice messaging for this purpose, and set up a group between devices to demonstrate this. We shared a farming video to the group, and participants found it easier to respond to that video with a voice message containing their spoken annotation. This refined method also leverages participants' familiarity with the platform (see [36]). Following this, we demonstrated, settled upon, and agreed on the following (ongoing) data collection process:

• A video (e.g., on farming, cooking, songs, etc.) is shared to the WhatsApp group;
• Participants record and send audio annotations for that video to the same WhatsApp group;
• All audio received is assumed to be related to that video;
• After enough (10-20) annotations have been received, a new video is shared, and the process repeats; and
• Researchers are included in the WhatsApp group to collect video and audio annotation data and to encourage use.

Finally, we established formal participant consent to being included in the WhatsApp group, to participating, and for us to use the video content for a community repository and the spoken annotations to improve the IR system. To date, community members have contributed ten further videos with 48 minutes of audio annotations.
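The agreed process implies a simple collation rule: every incoming audio message is attributed to the most recently shared video. A minimal sketch of that rule, with (kind, payload) tuples standing in for a real chronological WhatsApp chat export (the filenames are hypothetical):

```python
def collate_annotations(messages):
    """Group audio annotations under the most recently shared video.

    `messages` is a chronological list of (kind, payload) tuples,
    where kind is 'video' or 'audio'. Audio arriving before any
    video has been shared is ignored, mirroring the agreed process.
    """
    corpus = {}
    current_video = None
    for kind, payload in messages:
        if kind == "video":
            current_video = payload
            corpus[current_video] = []
        elif kind == "audio" and current_video is not None:
            corpus[current_video].append(payload)
    return corpus

messages = [
    ("video", "ploughing.mp4"),
    ("audio", "annotation1.ogg"),
    ("audio", "annotation2.ogg"),
    ("video", "chapati.mp4"),
    ("audio", "annotation3.ogg"),
]
assert collate_annotations(messages) == {
    "ploughing.mp4": ["annotation1.ogg", "annotation2.ogg"],
    "chapati.mp4": ["annotation3.ogg"],
}
```

Waiting until a video has accumulated 10-20 annotations before sharing the next one also addresses the uneven distribution we saw in phase 2, where photos early in the slide deck attracted far more narrations than later ones.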

DISCUSSION & FUTURE RESEARCH
While the installation of new 4G phone masts in the community at the start of our research shows how digital divides surrounding internet access are increasingly being addressed, written epistemological assumptions still pervade UI paradigms (e.g., information hierarchies) [82] and are also prevalent in the dominant keyword query paradigm (see [47]) of speech-driven user interfaces and IVR systems. In short, there are still barriers to digital participation that, as our research demonstrates, could be alleviated through speech and language technologies. Here, our research clearly shows how participants in our study, and presumably this extends to similar oral communities, used fluid and descriptive queries rather than keywords and keyphrases. Supporting this distinct interaction style, especially in oral contexts and for unwritten languages, is crucial to unlocking laudable efforts to create content for and with minority language communities, such as the Spoken Web [1] or Digital Green [25].
Current digital inequalities are furthermore splitting open new divides between the Global North and South in terms of access to AI technologies. These technologies are often trained on datasets generated by digital platforms [14], which are inaccessible to many minoritised communities in the Global South [81]. In the case of speech and language technologies, this means that the language communities who would stand to benefit the most from this interaction paradigm are simply being cut out of the conversation. Responsible and human-centred [12] forms of AI innovation, as our research shows, have a tremendous role to play in closing this gap, and this is a critical area for HCI research to contribute to [30].
The farming practices we observed while in-situ demonstrate creativity, innovation, and resilience in the face of a changing planet characterised by less predictable and more extreme weather patterns. Not only are these practices never recorded in any datasets generated, for instance, by discussions on online platforms, but we also miss out on engaging with both traditional and contemporary forms of knowledge and practice [80] in the design process of AI. More research, collaboration, engagement, and partnerships are required to bridge these gaps and to ensure more equal representation, so that the tremendous opportunities of AI benefit and address the needs of diverse communities across the world, and not just those in the Global North.
Our contribution also speaks to, and is shaped by, NLP research. We have outlined the pipeline and decoupled architecture we used to develop the IR system (see Appendix A) so that more technical researchers might reproduce and build on our results. Technical advances within NLP research, particularly to support so-called 'low-resource' or 'zero-resource' languages, are often organised and structured through competitions associated with major conferences (e.g., [19]) using existing datasets that are far removed from everyday experience and therefore unlikely to benefit those language communities directly. Here our research contributes an adaptable blueprint for NLP researchers to engage with communities from 'day zero'. This blueprint is paired with a development method that supports, and is supported by, these engagements to build interactive systems from scratch, without any existing digital language resources in the target language. The evaluation results of our IR system show that we met our top-5 target metric 74% of the time, demonstrating the feasibility of this approach. Critically, this iterative and incremental engagement and development approach not only facilitates collaboration across HCI and NLP, but also affords tighter feedback loops for communities: loops that build momentum, that motivate and sustain data contributions, and that ultimately help co-create more meaningful systems matched with appropriate data.
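The top-5 metric reported above can be computed directly from the rank each query's target photo received in the result list. A minimal sketch follows; the rank values used here are invented for illustration, not the study's actual per-query results:

```python
# Top-k success rate: the fraction of queries whose target item appeared
# within the top k ranked results (rank 1 = best).
def top_k_rate(ranks, k=5):
    """Return the proportion of ranks that are <= k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical per-query ranks of the target photo; the paper reports a
# 74% top-5 rate over 99 real queries (see Figure 6).
example_ranks = [1, 3, 2, 8, 1, 5, 12, 4]
print(top_k_rate(example_ranks, k=5))  # 6 of 8 queries in the top 5 -> 0.75
```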
A key finding of our research is just how important demonstrator systems are for translating the abstract and somewhat ineffable concept of 'speech and language technologies' into something practical and concrete, especially in oral contexts and in communities with less technological familiarity. In our design workshops, the IR system also functioned as an engagement probe to demonstrate the interactive capabilities of spoken language technologies, to show how these can be iteratively improved, and, critically, to engage community members in uncovering use-cases that suit their community and practices.
We are currently building a communal tablet-based system, driven by the IR system, to store and access videos. In order to include a video (e.g., of a song, or showing a farming technique) on the tablet, it needs to be supported by ten audio annotations explaining the content of the video (e.g., the lyrics or meaning of the song, or an explanation of the technique shown). These ten recordings are a compromise between what community members can deliver before the task becomes too tedious and what the ranker needs to operate effectively. The supporting audio annotations are then used to train the ranker, so that community members can access the videos through spoken queries on the tablet. The IR system combined with a community-operated tablet then functions as a novel form of storing and accessing information: videos are stored and accessed through oral descriptions, rather than the tags, categories, hierarchies, or metadata typically used in database systems (see [20]).

CONCLUSION
We began our research by accompanying our hosts in an agrarian Banjara community in Jalgaon District, Maharashtra, India, on a walk [38] into the fields. Along the way we learned about Banjara culture, farming, and cooking practices, and picked up some Gormati phrases too. We also challenged community members who engaged with us to learn about spoken language technologies, to contribute data, and to experiment with and feed back on systems, so that together we could cultivate speech and language technology from seed to support their language and oral practices.
Orality theory [48,67] afforded us a critical lens that brings into focus the written assumptions, epistemology, and representational practices [27] of user interfaces and content repositories in general, and of speech interfaces, such as IVR [77], specifically. Developing oral alternatives to these is a substantial technical undertaking and a contribution of our work, especially for unwritten languages without digital language resources. Here we had to balance a sensitivity to community context and unfamiliar oral practices with a firm understanding of how spoken language technology works, what it might realistically deliver, and the data that is required for its development [30].
Time spent in the community was essential to mediate between these demands, to experiment with different approaches, to adapt and act opportunistically [28], and to deal with the vicissitudes [71] of such 'data work' [62] in general. These vicissitudes required us to scale back our ambitions, as we only had access to very limited amounts of data (< 4h). We therefore tailored the information retrieval system to utilise a multilingual phone recogniser and a ranker that can be trained independently. This decoupled architecture supports the development of interactive information retrieval systems from scratch, seeded with as little data as is available. As more media content and annotation data become available, only the ranker needs to be retrained. Compared to the multilingual phone recogniser, the ranker can be retrained quickly and without much computational resource, supporting both iterative and more sustainable practices.
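The ranker half of this decoupled architecture is, per Appendix A, an SVM over tri-gram phone sequences with TF-IDF weights. The sketch below illustrates that idea using scikit-learn, treating the recogniser's output phone strings as text and character tri-grams as a stand-in for tri-gram phone sequences. The phone strings and photo labels are toy data invented for illustration; the real system trains on recognised phone sequences from the community's spoken annotations:

```python
# Decoupled ranker sketch: the (fixed, expensive-to-train) multilingual
# phone recogniser produces phone strings; only this cheap TF-IDF + SVM
# ranker is retrained as new annotations arrive.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy "phone sequences" (space-separated phone symbols) and the photo
# each spoken annotation describes -- illustrative only.
annotations = ["k a p a s", "k a p a s i", "m u l a", "m u l a k",
               "t a m a t a r", "t a m a t a r o"]
photos = ["cotton", "cotton", "radish", "radish", "tomato", "tomato"]

# Character tri-grams approximate tri-gram phone sequences; the paper kept
# default SVM parameters and chose the n-gram range by cross-validation.
ranker = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    SVC(),  # default parameters, as in Appendix A
)
ranker.fit(annotations, photos)

# Rank photos for a spoken query by one-vs-rest decision scores.
query = "k a p a s"  # recognised phones of a fluidly spoken query
scores = ranker.decision_function([query])[0]
ranked = sorted(zip(ranker.classes_, scores), key=lambda p: -p[1])
print([photo for photo, _ in ranked])
```

Because fitting this pipeline takes fractions of a second even on modest hardware, retraining whenever a batch of new annotations arrives is practical, which is precisely what makes the tight community feedback loops described above sustainable.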
A trouble that we identify with NLP research is that it falls short on engagement methods, especially when developing for minoritised language communities where the very concept of spoken language technologies is abstract and ineffable. Taken together, our research contributions create an adaptable blueprint for engaging with communities from 'day zero', paired with a development method that supports, and is supported by, these engagements to build interactive systems from scratch, without any existing digital language resources in the target language.

[Fragment from Appendix A] ... tri-gram phone sequences with term-frequency inverse-document-frequency weights [65] as features to predict corresponding photos. We used the default SVM parameters and selected the n-gram range using cross-validation.

Figure 1: Schematic outline and timeline of community engagement, design, development, and evaluation activities.

Figure 2: Engaging community members on how speech recognition systems are trained using the metaphor of a learning child.

Figure 3: Training community members to use the datacollection probe.

Figure 4: Distribution of 3h43m of spoken language annotations across 30 photos.

Figure 5: Media retrieval probe screens for querying (left) and viewing/rating query results (right).

Figure 6: Distribution of result rankings for 99 queries across 19 photos, with top-5 results shown below the dotted blue line. For example, the bar plot on the bottom left is captioned 'Radish (7)' to indicate that the photo was of a radish plant and was shown to seven participants, who each formulated their own query for that photo. This particular plot shows that across the seven queries the photo of the radish plant was returned as the 8th-ranked result once, as the 2nd-ranked result three times, and as the top result a further three times. The 2nd- and 1st-ranked results are shown below the dotted blue line to indicate that they met our top-5 result target. Finally, a cumulative distribution combining results from all 19 photos is shown in the bottom right corner and highlighted in bold (though note that the x-axis is scaled 0-40 instead of 0-6 here).
The authors of [70] developed a community notice board with support for both oral and written storytelling, bi-lingual content, and different representations of time [70]. Working with Sámi people of the circumpolar north, Moradi et al. designed a web-based digital archive of cultural heritage materials, where a tension emerged around border(less) maps [44]. And in South Africa, Bidwell et al. designed two iterations of a community audio repository, to create and share recordings with access to shared tablets, to address "the difficulty local Xhosa people have in communicating between villages" [5, p.227]. While the interface to record audio remained unchanged across the iterations, community members found it difficult to find specific recordings, so the revised interface allowed users to tag a recording with photos and to record short, annotating abstracts about the recording. The interface to find a recording displays photo tags alongside the recording and autoplays annotations as the user scrolls through their list of recordings. Difficulty in finding voice recordings, in the form of WhatsApp voice messages, was experienced by both Xhosa participants in South Africa and Marathi participants in Maharashtra, India, in Reitmaier et al.'s study; supporting textual search of voice messages was identified as a key opportunity for Automatic Speech Recognition (ASR) systems [59].