Extracting Cultural Commonsense Knowledge at Scale

Structured knowledge is important for many AI applications. Commonsense knowledge, which is crucial for robust human-centric AI, is covered by a small number of structured knowledge projects. However, they lack knowledge about human traits and behaviors conditioned on socio-cultural contexts, which is crucial for situative AI. This paper presents Candle, an end-to-end methodology for extracting high-quality cultural commonsense knowledge (CCSK) at scale. Candle extracts CCSK assertions from a huge web corpus and organizes them into coherent clusters, for 3 domains of subjects (geography, religion, occupation) and several cultural facets (food, drinks, clothing, traditions, rituals, behaviors). Candle includes judicious techniques for classification-based filtering and scoring of interestingness. Experimental evaluations show the superiority of the Candle CCSK collection over prior works, and an extrinsic use case demonstrates the benefits of CCSK for the GPT-3 language model. Code and data can be accessed at https://candle.mpi-inf.mpg.de/.


INTRODUCTION
Motivation. Structured knowledge, often stored in knowledge graphs (KGs) [12,39], is a key asset for many AI applications, including search, question answering, and conversational bots. KGs cover factual knowledge about notable entities such as singers, songs, cities, sports teams, etc. However, even large-scale KGs deployed in practice hardly touch on the dimension of commonsense knowledge (CSK): properties of everyday objects, behaviors of humans, and more. Some projects, such as ConceptNet [36], Atomic [32], and Ascent++ [21], have compiled large sets of CSK assertions, but are solely focused on "universal CSK": assertions that are agreed upon by almost all people and are thus viewed as "globally true". What is missing, though, is that CSK must often be viewed in the context of specific social or cultural groups: the world view of a European teenager does not necessarily agree with that of an American business person or a Far-East-Asian middle-aged factory worker. This paper addresses this gap by automatically compiling CSK that is conditioned on socio-cultural contexts. We refer to this as cultural CSK, or CCSK for short. For example, our CCSK collection contains assertions such as:
• subject:East Asia, facet:food, Tofu is a major ingredient in many East Asian cuisines, or
• subject:firefighter, facet:behavior, Firefighters use ladders to reach fires.
[Fig. 1, text recovered from the figure: "The Asian dragon is a symbol of power and good luck." "Green tea is popular in Asian countries." "I am sitting in a cafe in Beijing and saw a dragon in the street. Should I be worried?" "You should be worried if the dragon is not in a cage."]
The value of having a KG with this information lies in making AI applications more situative and more robust.
Consider the conversation between a human and the GPT-3 chatbot 1 shown in Fig. 1. The GPT-3-based bot, leveraging its huge language model, performs eloquently in this conversation, but completely misses the point that the user is in China, where dragons are viewed positively and espresso is difficult to get. If we prime the bot with CCSK about Far-East-Asian culture, then GPT-3 is enabled to provide culturally situative replies. If primed with CCSK about European views (not shown in Fig. 1), the bot points out that dragons are portrayed as evil monsters but do not exist in reality, and recommends a strong cup of coffee.
State of the art. Mainstream KGs do not cover CCSK at all, and major CSK collections like ConceptNet contain only very few culturally contextualized assertions. To the best of our knowledge, the only prior works with data that have specifically addressed the socio-cultural dimension are the projects Quasimodo [30] and StereoKG [7], and the work of Acharya et al. [1]. The work of Acharya et al. merely contains a few hundred assertions from crowdsourcing; StereoKG uses a specialized way of automatically extracting stereotypes from QA forums and is still small in size; and Quasimodo covers a wide mix of general CSK and a small fraction of culturally relevant assertions. These are the three baselines to which we compare our results.
Language models (LMs) such as BERT [8] or GPT-3 [5] are another form of machine-based CSK, including CCSK, in principle. However, all LM knowledge is in latent form, captured in learned values of billions of parameters. Knowledge cannot be made explicit; we observe it only implicitly through the LM-based outputs in applications. The example of Fig. 1 demonstrates that even large LMs like GPT-3 do not perform well when cultural context matters.
Approach. CCSK is expressed in text form on web pages and social media, but this is often very noisy and difficult to extract. We devised an end-to-end methodology and system, called Candle (Extracting Cultural Commonsense Knowledge at Scale), to automatically extract and systematically organize a large collection of CCSK assertions. For scale, we tap into the C4 web crawl [27], a huge collection of web pages. This provides an opportunity to construct a sizable CCSK collection, but also a challenge in terms of scale and noise.
The output of Candle is a set of 1.1M CCSK assertions, organized into 60K coherent clusters. The set is organized by 3 domains of interest (geography, religion, occupation) with a total of 386 instances, referred to as subjects (or cultural groups). Per subject, the assertions cover 5 facets of culture: food, drinks, clothing, rituals, traditions (for geography and religion) or behaviors (for occupations). In addition, we annotate each assertion with its salient concepts. Examples for the computed CCSK are shown in Fig. 2.
Candle operates in 6 steps. First and second, we identify candidate assertions using simple techniques for subject detection (named entity recognition, NER, and string matching) and generic rule-based filtering. Third, we classify assertions into specific cultural facets, which is challenging because we have several combinations of cultural groups and cultural facets, making it very expensive to create specialized training data. Instead, we leverage LMs pre-trained on the Natural Language Inference (NLI) task to perform zero-shot classification on our data, with judicious techniques to enhance the accuracy. Fourth, we use state-of-the-art techniques for assertion clustering, and fifth, a simple but effective method to extract concepts in assertions. Lastly, we combine several features to score the interestingness of assertions, such as frequency, specificity, and distinctiveness. This way, we steer away from overly generic assertions (which LMs like GPT-3 tend to generate) and favor assertions that set their subjects apart from others.
Contributions. The main contributions of this work are:
(1) An end-to-end methodology to extract high-quality CCSK from very large text corpora.
(2) New techniques for judiciously classifying and filtering CCSK-relevant text snippets, and for scoring assertions by their interestingness.
(3) A large collection of CCSK assertions for 386 subjects covering 3 domains (geography, religion, occupation) and several facets (food, drinks, clothing, traditions, rituals, behaviors).
Experimental evaluations show that the assertions in Candle are of significantly higher quality than those from prior works. An extrinsic use case demonstrates that our CCSK can improve performance of GPT-3 in question answering. Code and data can be accessed at https://candle.mpi-inf.mpg.de/.
Cultural commonsense knowledge. A few works have focused specifically on CCSK. An early approach by Anacleto et al. [2] gathers CSK from users from different cultures, entered via the Open Mind Common Sense portal. However, the work is limited to a few eating habits (time for meals, what do people eat in each meal?, food for party/Christmas) in 3 countries (Brazil, Mexico, USA), and comes without published data. Acharya et al. [1] embark on a similar manual effort towards building a cultural CSKG, limited to a few predefined predicates and answers from Amazon MTurk workers from the USA and India. Shwartz [33] maps time expressions in 27 different languages to specific hours in the day, also using MTurk annotations. StereoKG [7] mines cultural stereotypes of 5 nationalities and 5 religious groups from Twitter and Reddit questions posted by their users; however, lacking proper filtering, the method yields many noisy and inappropriate assertions. GeoMLAMA [42] defines 16 geo-diverse commonsense concepts (e.g., traffic rules, date formats, shower time) and uses crowdsourcing to collect knowledge for 5 different countries in 5 corresponding languages. The dataset was used to probe multilingual pre-trained language models, but it is not shared. Moving to computer vision, Liu et al. [18] and Yin et al. [43] expand existing visual question answering datasets with images from different cultures rather than the Western world. Models trained on images from the old datasets (mostly images from Western cultures) perform poorly on the newly added images. Our methodology is the first to utilize large text corpora, and it can extract CCSK in the form of natural-language sentences, for a wide range of cultural groups and facets.
Pre-trained language models and commonsense knowledge. Remarkable advances in NLP have been achieved with pre-trained language models (LMs) such as BERT [8] and GPT variants [5,26]. LAMA [25] designs methodology and datasets to probe masked LMs in order to acquire CSK that the models implicitly store. COMET [4] is a method that finetunes autoregressive LMs on CSK triples, and it can generate possible objects for a given subject-predicate pair. However, the quality of the generated assertions is often considerably lower than that of the training data [20]. More recently, West et al. [40] introduce a prompting technique to collect CSK by feeding GPT-3 [5] with a few human-verified CSK triples and asking it to generate new assertions. Although it was shown that the generated resource is of encouraging quality, knowledge bases from LMs are inherently problematic, because there is no apparent way to trace assertions to specific sources, e.g., to understand assertion context, or to apply filters at document level.
In this work, we leverage pre-trained LMs as sub-modules in our system to help with cultural facet classification and assertion clustering. We also show that our method can produce more distinctive CCSK assertions than querying GPT-3 with prompts.

CCSK REPRESENTATION
Our representation of CCSK is based on the notions of subjects (from 3 major domains: geography, religion and occupation) and facets. These are the key labels for CCSK assertions, which are informative sentences with salient concepts marked up.
We assume two sets to be given: a set of subjects (cultural groups) and a set of facets of culture. Note that the cultural facets need not be mutually exclusive, e.g., food assertions sometimes overlap with traditions.
Our objective is to collect a set of CCSK assertions for a given subject and facet. Existing commonsense resources store assertions in triple format (e.g., ConceptNet [36], Quasimodo [30]), semantic frames (Ascent [22]) or generic sentences (GenericsKB [3]). Although the traditional triple-based and frame-based data models are convenient for structured querying, and well suited for regular assertions like birth dates, citizenships, etc., they often fall short of capturing nuanced natural-language assertions, which are essential for CSK. Moreover, recent advances in pre-trained LMs have made it easier to feed downstream tasks with less structured knowledge.
With Candle, we thus follow the approach of GenericsKB [3], and use natural-language sentences to represent assertions.
In principle, an assertion could comprise even several sentences. The longer the assertions are, however, the harder it is to discern their core. In this work, for higher precision and simplicity of computations, we only consider single sentences.
Definition 1 (Cultural commonsense knowledge assertion). Given a subject $s$ and a facet $f$, a CCSK assertion is a triple $(s, f, \mathit{sent})$ where $\mathit{sent}$ is a natural-language sentence about facet $f$ of subject $s$.
Since natural language allows similar assertions to be expressed in many different ways, and web harvesting naturally leads to discovering similar assertions multiple times, we employ clustering as an essential component in our approach.
A cluster (cls) of CCSK assertions for one subject and cultural facet contains assertions with similar meaning, and for presentation purposes, is summarized by a single summary sentence.Each cluster also comes with a score denoting its interestingness.
To further organize assertions, we also identify salient concepts, i.e., important terms inside assertions, that can be used for concept-centric browsing of assertion sets.
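The representation described in this section can be sketched as two small record types. This is a minimal illustration, not the released Candle data format: the class names `Assertion` and `Cluster` and all field names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Assertion:
    """A CCSK assertion: a (subject, facet, sentence) triple."""
    subject: str  # cultural group, e.g. "East Asia"
    facet: str    # cultural facet, e.g. "food"
    sent: str     # a single natural-language sentence


@dataclass
class Cluster:
    """Assertions with similar meaning for one subject-facet pair."""
    subject: str
    facet: str
    members: List[Assertion]
    summary: str                 # single summary sentence for presentation
    score: float = 0.0           # interestingness score
    concepts: List[str] = field(default_factory=list)  # salient concepts


tofu = Assertion(
    "East Asia", "food",
    "Tofu is a major ingredient in many East Asian cuisines.",
)
cluster = Cluster("East Asia", "food", [tofu],
                  summary=tofu.sent, concepts=["tofu"])
```

Keeping the sentence as plain text, rather than a parsed triple, mirrors the choice to follow GenericsKB-style natural-language assertions.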
Several examples of CCSK assertions produced by Candle are shown in Fig. 2.

METHODOLOGY
We propose an end-to-end system, called Candle, to extract and organize CCSK assertions based on the proposed CCSK representation. Notably, our system does not require annotating new training data, but only leverages pre-trained models with judicious techniques to enhance the accuracy. The system takes in three inputs:
• an English text corpus (e.g., a large web crawl);
• a set of subjects (cultural groups);
• a set of facets of culture.
Candle consists of 6 modules (see Fig. 3). Throughout the system, step by step, we reduce a large input corpus (which could contain billions of documents, mostly noisy) into high-quality clusters of CCSK assertions for the given subjects and facets. Each cluster in the output is also accompanied by a representative sentence and an interestingness score. We next elaborate on each module.

Subject detection
We start the extraction by searching for sentences that contain mentions of the given subjects. These will be the candidate sentences used in the subsequent modules. To achieve high recall, we utilize generous approaches such as string matching and named entity recognition (NER), and use more advanced filtering techniques in later modules to ensure high precision (see Table 8 for the list of techniques and models used in each module).
For the geography and religion domains, in which subjects are named entities, we use spaCy's NER module to detect subjects. Specifically, geo-locations are detected with the GPE tag (geopolitical entities), and religions are detected with the NORP tag (nationalities or religious or political groups). For each subject, we also utilize a list of aliases for string matching, which can be the location's alternate names (e.g., United States, the U.S., the States), demonyms (e.g., Colombians, Chinese, New Yorker), or names for religious adherents (e.g., Christians, Buddhists, Muslims), which can be detected with the NORP tag as well.
For the occupation domain, we simply use exact-phrase matching to detect candidates. Each occupation subject is enriched with its alternate names and its plural form to enhance coverage.
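The exact-phrase matching for occupations can be illustrated with a short sketch. It assumes a hand-written alias table (the entries in `ALIASES` are hypothetical, not Candle's actual lists) and uses word-boundary regular expressions from the standard library; the NER-based detection for geography and religion is not covered here.

```python
import re
from typing import Dict, List

# Hypothetical alias table: each subject with alternate names and plurals.
ALIASES: Dict[str, List[str]] = {
    "firefighter": ["firefighter", "firefighters",
                    "fire fighter", "fire fighters"],
    "nurse": ["nurse", "nurses"],
}


def build_patterns(aliases: Dict[str, List[str]]) -> Dict[str, re.Pattern]:
    """Compile one case-insensitive whole-phrase pattern per subject."""
    patterns = {}
    for subject, names in aliases.items():
        # Longest alias first, so longer phrases win over their prefixes.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        patterns[subject] = re.compile(rf"\b(?:{alternatives})\b",
                                       re.IGNORECASE)
    return patterns


def detect_subjects(sentence: str,
                    patterns: Dict[str, re.Pattern]) -> List[str]:
    """Return all subjects whose aliases occur in the sentence."""
    return [s for s, p in patterns.items() if p.search(sentence)]


PATTERNS = build_patterns(ALIASES)
hits = detect_subjects("Firefighters use ladders to reach fires.", PATTERNS)
```

The word boundaries keep "nurse" from matching inside "nursery"; precision beyond that is left to the later filtering modules, in line with the recall-first design of this step.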

Generic assertion filtering
CSK aims at covering generic assertions, not episodic or personal experiences. For example, Germans like their currywurst is a generic assertion, but I visited Germany to eat currywurst or This restaurant serves German currywurst are not.
GenericsKB [3] is arguably the most popular work on automatically identifying generic sentences in texts, and it uses a set of 27 hand-crafted lexico-syntactic rules. Candle adopts those rules in this module. However, for each domain and facet, we adaptively drop some of the rules if they would reject valuable assertions. More details on the adaptations can be found in Appx. B.

Cultural facet classification
To organize CCSK and filter out irrelevant assertions, we classify candidate sentences into several facets of culture. Traditional methods for this classification task would require a substantial amount of annotated data to train a supervised model. The costs of data annotation are often a critical bottleneck in large-scale settings. In Candle, we aim to minimize the degree of human supervision by leveraging pre-trained models for zero-shot classification.
A family of pre-trained models that is suitable for our setting is textual entailment (a.k.a. natural language inference, NLI): given two sentences, does one entail the other (or are they contradictory or unrelated)? Our approach to adopting such a model for cultural facet classification is inspired by the zero-shot inference method of Yin et al. [44]. Given a sentence $\mathit{sent}$ and a facet $f$, we construct the NLI test as follows: the premise is $\mathit{sent}$, and the hypothesis $h$ is "This text is about $f$". The probability of $\mathit{sent}$ entailing $h$ will be taken as the probability of $\mathit{sent}$ being labeled as $f$, denoted as $P[\mathit{sent} \in f]$. For example, with the sentence "German October festivals are a celebration of beer and fun", the candidate entailments will be "This text is about drinks", "... about food", "... about traditions", and so on. Several of these facets may yield high scores in these NLI tests.
To enhance precision, we introduce a set of counter-labels for topics that are completely outside the scope of CCSK, for example, politics or business. A sentence $\mathit{sent}$ will be accepted as a good candidate for facet $f$ if $P[\mathit{sent} \in f] \geq \rho^{+}$ and $P[\mathit{sent} \in f'] \leq \rho^{-}$ for every counter-label $f'$, where $\rho^{+}$ and $\rho^{-}$ are hyperparameters in the range [0, 1], giving us the flexibility to tune for either precision or recall.
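The acceptance test with counter-labels can be sketched as a small decision function. This is one reading of the rule, stated here as an assumption: a sentence is accepted for a facet if its entailment probability for that facet reaches an upper threshold and its probability for every counter-label stays at or below a lower threshold. The names `rho_pos` and `rho_neg` and their default values are hypothetical, and the probabilities are assumed to come from independent NLI entailment tests, one per label.

```python
from typing import Dict, Iterable


def accept_for_facet(probs: Dict[str, float],
                     facet: str,
                     counter_labels: Iterable[str],
                     rho_pos: float = 0.5,
                     rho_neg: float = 0.5) -> bool:
    """Accept a sentence for `facet` given per-label entailment scores.

    probs maps each label (facet or counter-label) to the entailment
    probability of the hypothesis "This text is about <label>".
    """
    if probs.get(facet, 0.0) < rho_pos:
        return False  # not confidently about the facet
    # Reject if any out-of-scope topic scores above the lower threshold.
    return all(probs.get(label, 0.0) <= rho_neg for label in counter_labels)


# Toy scores for "German October festivals are a celebration of beer and fun".
scores = {"drinks": 0.93, "traditions": 0.88, "food": 0.41,
          "politics": 0.03, "business": 0.10}
COUNTER_LABELS = ["politics", "business"]

is_drinks = accept_for_facet(scores, "drinks", COUNTER_LABELS)
is_food = accept_for_facet(scores, "food", COUNTER_LABELS)
```

Raising `rho_pos` or lowering `rho_neg` trades recall for precision, which matches the stated purpose of the two hyperparameters.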
In our experiments, we use the BART model [16] finetuned on the MultiNLI dataset [41] for NLI tests 2. Our crowdsourcing evaluations show that the zero-shot classifiers with the enhanced techniques achieved high precision (see Appx. D.2).

Assertion clustering
The same assertion can be expressed in many ways in natural language. For example, Fried rice is a popular Chinese dish can also be written as Fried rice is a famous dish from China or One of the most popular Chinese food is fried rice. Clustering is used to group such assertions, which reduces redundancies and makes it possible to obtain frequency signals on assertions.
We leverage a state-of-the-art sentence embedding method, SentenceBert [29], to compute vector representations for all assertions and use the Hierarchical Agglomerative Clustering (HAC) algorithm for clustering. Clustering is performed on assertions of each subject-facet pair.
Cluster summarization. Since each cluster can have from a few to hundreds of sentences, it is important to identify what those sentences convey, in a concise way.
One way to compute a representative assertion for a cluster is to compute the centroid of the cluster, then take its closest assertion as the representative. Yet for natural-language data, this does not work particularly well.
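For reference, the centroid-based baseline mentioned above can be sketched as follows. The embedding vectors are toy values chosen for illustration; in practice they would come from a sentence encoder. This is the baseline, not the generative summarization that Candle uses.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_representative(sentences: List[str],
                            vectors: List[Sequence[float]]) -> str:
    """Return the member sentence closest to the cluster centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors)
                for i in range(dim)]
    best = max(range(len(sentences)),
               key=lambda i: cosine(vectors[i], centroid))
    return sentences[best]


members = [
    "Fried rice is a popular Chinese dish.",
    "Fried rice is a famous dish from China.",
    "One of the most popular Chinese food is fried rice.",
]
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # illustrative only
representative = centroid_representative(members, toy_vectors)
```

Note that this baseline can only return an existing member sentence, so it inherits whatever noise or awkward phrasing that sentence contains.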
In Candle, we therefore approach cluster summarization as a generative task, based on a state-of-the-art LM, GPT-3 [5] (see Appx. F for the prompt template). Annotator-based evaluations show that GPT-generated representatives received significantly better scores than the base sentences in the clusters (see Sec. 6.1).

Concept extraction
While the cultural groups are regarded as subjects, concepts are akin to objects of the assertions. Identifying these concepts enables concept-focused browsing (e.g., browsing Japan assertions only about the Miso soup, etc.).
We postulate that the main concepts of an assertion cluster are terms shared by many members: we extract all n-grams (n = 1..3) of all assertions in a cluster (excluding the subjects themselves and stop words), and retain the ones that occur in more than 60% of the assertions. If both a phrase and its sub-phrase appear, we only keep the longer phrase in the final output. Noun-phrase concepts are normalized by singularization.
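A minimal sketch of this concept extraction follows, under several simplifying assumptions: whitespace tokenization, a tiny illustrative stop-word list, exclusion of any n-gram that contains a stop word or subject term, and no singularization. None of these choices are confirmed details of Candle.

```python
from collections import Counter
from typing import List, Set

# Tiny illustrative stop-word list (an assumption, not Candle's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and",
              "to", "for", "from", "one", "most"}


def ngrams(tokens: List[str], max_n: int = 3) -> Set[str]:
    """All contiguous n-grams for n = 1..max_n, as space-joined strings."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def extract_concepts(assertions: List[str], subject_terms: Set[str],
                     min_share: float = 0.6) -> List[str]:
    counts: Counter = Counter()
    for sent in assertions:
        tokens = [t.strip(".,;:!?").lower() for t in sent.split()]
        tokens = [t for t in tokens if t]
        # Keep n-grams free of stop words and subject terms; each n-gram
        # is counted at most once per assertion.
        counts.update(
            g for g in ngrams(tokens)
            if not any(w in STOP_WORDS or w in subject_terms
                       for w in g.split()))
    frequent = {g for g, c in counts.items()
                if c > min_share * len(assertions)}
    # Drop phrases that are contained in a longer retained phrase.
    return sorted(g for g in frequent
                  if not any(g != h and f" {g} " in f" {h} "
                             for h in frequent))


cluster_sentences = [
    "Fried rice is a popular Chinese dish.",
    "Fried rice is a famous dish from China.",
    "One of the most popular Chinese food is fried rice.",
]
concepts = extract_concepts(cluster_sentences, {"chinese", "china"})
```

On this toy cluster the sub-phrases "fried" and "rice" are absorbed by "fried rice", while "dish" and "popular" also pass the 60% threshold, since each occurs in two of the three sentences.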

Cluster ranking and post-filtering
Ranking commonsense assertions is a crucial task. Unlike encyclopedic knowledge, which is normally either true or false, precision of CSK is usually not a binary concept, as it generalizes over many groups. With Candle, we aim to pull out the most interesting assertions for each subject, and avoid overly generic assertions such as Chinese food is good or Firefighters work hard, which are very common in the texts.
Extracting and clustering assertions from large corpora gives us an important signal for an assertion: its frequency. However, ranking based on frequency alone may lead to reporting bias. Because we compile a CCSK collection at large scale, we can also compute the distinctiveness of an assertion against others in the collection. These 2 metrics can be thought of as term frequency and inverse document frequency in the established TF-IDF technique for IR document ranking [35]. Besides frequency and distinctiveness, we score the interestingness of assertion clusters based on 2 other custom metrics: specificity (how many objects are mentioned in the assertion?) and domain relevance (how relevant is the assertion to the cultural facet?).
Frequency. For each subject-facet pair, we normalize cluster sizes into the range [0, 1], using min-max normalization.
Distinctiveness. We compute the IDF of a cluster $cls$ from the number of clusters in $CLS$ that are semantically similar to it, where $CLS$ is the set of all clusters for a given facet (e.g., food) and domain (e.g., geography>country): the more similar clusters there are, the lower the IDF. Two clusters $cls$ and $cls'$ count as similar if their semantic similarity $sim(cls, cls')$ reaches a predefined threshold $\theta$. In Candle, to reduce computation, we approximate $sim(cls, cls')$ as the similarity between their summary sentences, which can be computed as the cosine similarity between their embedding vectors. When computing these embeddings, the subjects in the sentences are replaced with the same [MASK] tokens so that we only compare the expressed properties. Then, we normalize the logarithmic IDF values into the range [0, 1] to get the distinctiveness scores of clusters.
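The distinctiveness computation can be sketched as below. The exact IDF formula is an assumption made for this illustration: here the IDF of a cluster is the logarithm of the number of clusters divided by one plus the number of other clusters whose similarity reaches the threshold, followed by min-max normalization. The input vectors are assumed to be embeddings of summary sentences with subjects already masked; the toy values and the default threshold are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def distinctiveness(embeddings: List[Sequence[float]],
                    theta: float = 0.8) -> List[float]:
    """Normalized log-IDF per cluster (assumed formula, see lead-in)."""
    n = len(embeddings)
    idf = []
    for i, u in enumerate(embeddings):
        # Count other clusters that express a similar property.
        similar = sum(1 for j, v in enumerate(embeddings)
                      if j != i and cosine(u, v) >= theta)
        idf.append(math.log(n / (1 + similar)))
    lo, hi = min(idf), max(idf)
    if hi == lo:
        return [0.0] * n  # all clusters equally (in)distinctive
    return [(x - lo) / (hi - lo) for x in idf]


# Clusters 0 and 1 say nearly the same thing; cluster 2 stands apart.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
dis_scores = distinctiveness(toy_embeddings)
```

Under this reading, a property asserted for many subjects (for example, that the food is good) is similar to many other clusters and therefore receives a low score.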
Specificity. We compute the specificity of an assertion based on the fraction of nouns in it. Concretely, in Candle, the specificity of a cluster is computed as the specificity of its summary sentence.
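A possible implementation of the noun-fraction score is sketched below. It assumes the sentence has already been part-of-speech tagged with Universal POS labels (for instance by spaCy); counting proper nouns as nouns and ignoring punctuation are assumptions of this sketch.

```python
from typing import List, Tuple

NOUN_TAGS = {"NOUN", "PROPN"}  # assumption: proper nouns count as nouns


def specificity(tagged_tokens: List[Tuple[str, str]]) -> float:
    """Fraction of non-punctuation tokens that are nouns."""
    words = [(tok, pos) for tok, pos in tagged_tokens if pos != "PUNCT"]
    if not words:
        return 0.0
    nouns = sum(1 for _, pos in words if pos in NOUN_TAGS)
    return nouns / len(words)


# Hand-tagged example; a real pipeline would obtain tags from a tagger.
tagged = [("Firefighters", "NOUN"), ("use", "VERB"), ("ladders", "NOUN"),
          ("to", "PART"), ("reach", "VERB"), ("fires", "NOUN"),
          (".", "PUNCT")]
spec = specificity(tagged)
```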

Domain relevance. For each facet, we compute the domain relevance of a cluster by taking the average of the probability scores given to its members by the cultural facet classifier.
Combined score. The final score for a cluster is the average of the 4 feature scores. A higher score means higher interestingness.
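The frequency normalization and the final averaging can be written out directly. This sketch assumes the four feature scores are already in [0, 1]; the function names and the choice to map constant inputs to 0.0 are illustrative assumptions.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Min-max normalize values into [0, 1] (all 0.0 if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_score(frequency: float, distinctiveness: float,
                   specificity: float, relevance: float) -> float:
    """Unweighted average of the four feature scores."""
    return (frequency + distinctiveness + specificity + relevance) / 4.0


cluster_sizes = [2, 10, 50]          # sizes within one subject-facet pair
frequencies = min_max(cluster_sizes)
score = combined_score(1.0, 0.5, 0.5, 1.0)
```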
Post-filtering. Lastly, to eliminate redundancies and noise, and further improve the final output quality, we employ a few hand-crafted rules:
• At most 500 clusters per subject-facet pair are retained, as further clusters mostly represent redundancies or noise.
• We remove clusters that have no concepts extracted, or that are based on too few distinct sentences (more than 2/3 identical sentences) or web source domains.
• We remove any cluster if either its summary sentence or many of its member sentences match a bad pattern. We compile a set of about 200 regular expression patterns, which were written by a knowledge engineer in one day. For example, we reject assertions that contain "the menu" or "the restaurant" (likely advertisements for specific restaurants), or animal and plant breeds named after locations, such as "American bison", "German Shepherd", etc.
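Some of these rules can be sketched in a few lines. The four patterns below are illustrative stand-ins built from the examples in the text, not the actual set of about 200 patterns; interpreting "many" member sentences as "more than half" is an assumption, and the rules on duplicate sentences and source domains are omitted.

```python
import re
from typing import Dict, List

# Illustrative stand-ins for the hand-written bad patterns.
BAD_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\bthe menu\b",
    r"\bthe restaurant\b",
    r"\bAmerican bison\b",
    r"\bGerman Shepherds?\b",
)]
MAX_CLUSTERS = 500  # per subject-facet pair


def is_bad(sentence: str) -> bool:
    return any(p.search(sentence) for p in BAD_PATTERNS)


def post_filter(clusters: List[Dict]) -> List[Dict]:
    """Filter and truncate the clusters of one subject-facet pair.

    Each cluster is a dict with keys 'summary', 'members',
    'concepts' and 'score'.
    """
    kept = []
    for c in clusters:
        if not c["concepts"]:
            continue  # no concepts extracted
        if is_bad(c["summary"]):
            continue  # summary matches a bad pattern
        bad_members = sum(is_bad(m) for m in c["members"])
        if bad_members > len(c["members"]) / 2:  # "many": assumed > half
            continue
        kept.append(c)
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:MAX_CLUSTERS]


candidates = [
    {"summary": "Tofu is a major ingredient in many East Asian cuisines.",
     "members": ["Tofu is a staple in East Asia."],
     "concepts": ["tofu"], "score": 0.8},
    {"summary": "The restaurant serves authentic Chinese food.",
     "members": ["The restaurant serves Chinese food."],
     "concepts": ["food"], "score": 0.9},
    {"summary": "Chinese food is good.",
     "members": ["Chinese food is good."],
     "concepts": [], "score": 0.5},
]
filtered = post_filter(candidates)
```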

IMPLEMENTATION
Input corpus. In Candle, we use the broad web as knowledge source, because of its diversity and coverage, which are important for long-tail subjects. The most challenging problem when processing web content, however, is the tremendous amount of noise, offensive material, incorrect information, etc.; hence, choosing a corpus that has already been largely cleaned is beneficial. We choose the Colossal Clean Crawled Corpus (C4) [27] as our input, a cleaned version of the Common Crawl corpus, created by applying filters such as deduplication, English-language text detection, and removal of pages containing source code, offensive language, too little content, etc. We use the C4.En split, which contains 365M English articles, each with text content and source URL. Before passing it to our system, we preprocessed all C4 documents using spaCy, which took 2 days on our cluster of 6K CPU cores.
Subjects. We collect CCSK for subjects from 3 cultural domains: geography (272 subjects), religions (14 subjects) and occupations (100 subjects). For geography, we split into 4 sub-domains: countries, continents, geopolitical regions (e.g., Middle East, Southeast Asia, etc.) and US states, which were collected from the GeoNames database 3, which also provides alias names. We further enriched these aliases with demonyms from Wikipedia 4.
Facets of culture. We consider 5 facets: food, drinks, clothing, rituals, and traditions (for geography/religion) or behaviors (for occupation), selected based on an article on facets of culture [23].
Execution and result statistics. After tuning the system's hyperparameters on small withheld data (see Appx. C), we executed Candle on a cluster of 6K CPU cores (AMD EPYC 7702) and 40 GPUs (a mix of NVIDIA RTX 8000, Tesla A100 and A40 GPUs). Regarding processing time, for the domain country (196 subjects), it took a total of 12 hours to complete the extraction, resulting in 8.4K clusters for the facet food (cf. Table 2). Occupations and religions took 8 and 6 hours, respectively.
We provide statistics of the output in Table 1. In total, the resulting collection has 1.1M CCSK assertions (i.e., base sentences) which form 60K clusters for the given subjects and facets.

EVALUATION
We perform the following evaluations:
(1) A comparison of quality of Candle's output and existing socio-cultural CSK resources: This analysis will show that our CCSK collection is of significantly higher quality than existing resources (Sec. 6.1), and even outperforms GPT-3-generated assertions (Sec. 6.2).
(2) Two extrinsic use cases for CCSK: In this evaluation, we perform two downstream applications, question answering (QA) and a "guess the subject" game, showing that using CCSK assertions from Candle is beneficial for these tasks, and that Candle assertions outperform those generated by GPT-3 (Sec. 6.3).
In Appx. D, we also break down our CCSK collection into domains and facets, analyzing in detail the assertion quality for each subcollection.

Comparison with other resources
6.1.1 Evaluation metrics. Following previous works [7,30], we analyze assertion quality along several complementary metrics, annotated by Amazon MTurk (AMT) crowdsourcing.
(1) Plausibility (PLA). This dimension measures whether assertions are considered to be generally true, a CCSK-softened variant of correctness/precision.
(2) Commonality (COM). This dimension measures whether annotators have heard of the assertion before, as a signal for whether assertions cover mainstream or fringe knowledge (akin to salience).
(3) Distinctiveness (DIS). This dimension measures discriminative informativeness of assertions, i.e., whether the assertion differentiates the subject from others.
Each metric is evaluated on a 3-point Likert scale for negation (0), ambiguity (1) and affirmation (2). Distinctiveness (DIS) is only applicable if the answer to the plausibility (PLA) question is either 1 or 2. In case the annotators are not familiar with the assertion, we encourage them to perform a quick search on the web to find out the answers for the PLA and DIS questions. More details on the AMT setup can be found in Appx. E.
6.1.2 Compared resources. We compare Candle with 3 prominent CSK resources: Quasimodo [30], Acharya et al. [1], and StereoKG [7]. The first covers broad domains including assertions for countries and religions, while the others focus on cultural knowledge. Other popular resources such as ConceptNet [36], GenericsKB [3], Ascent/Ascent++ [21,22], ATOMIC [32], ASER [47] and TransOMCS [45] do not focus on cultural knowledge and contain few or no assertions for geography or religion subjects; hence, they do not qualify for this comparison.
We evaluate 2 versions of Candle, one where each base assertion is retained independently (Candle-base-sent), the other containing only the cluster representatives (Candle-cluster-reps).

Setup. For comparability, all resources are compared on 100 random assertions of the same 5 country subjects covered in StereoKG [7]: United States, China, India, Germany and France. We note that among all compared resources, Acharya et al. [1] only contain two subjects (United States and India), so for that resource, we only sample from those. For StereoKG, we use their natural-language assertions. For Quasimodo and Acharya et al., we verbalize their triples using crafted rules. Each assertion is evaluated by 3 MTurk annotators. Additionally, we ask whether the annotator would consider the assertion inappropriate or offensive. More details on the annotation task can be found in Appx. E.

Results. A summary of the comparison with other resources is shown in Table 3.
Resource size and assertion length. Candle outperforms all other resources on the number of base sentences. When turning to clusters, our resource still has significantly more assertions than Acharya et al. (which was constructed manually at small scale) and StereoKG (extracted from Reddit/Twitter questions). Quasimodo has comparable size with Candle-cluster-reps for the country and religion domains and has more for the occupation domain.
Assertion quality. In general, Candle-cluster-reps considerably outperforms all other baselines on 2 of the 3 metrics (plausibility and distinctiveness). Our resource only comes behind Acharya et al. on the commonality metric (1.15 and 1.22, respectively), which is expected because Acharya et al. only cover a few relations about common rituals (e.g., birthday, wedding) in 2 countries, USA and India, and their assertions are naturally known by many workers on Amazon MTurk, who are mostly from these 2 countries [31]. Importantly, the resource of Acharya et al. is based on crowdsourcing and only contains a small set of 225 assertions for a few rituals.
Candle-cluster-reps even outperforms the manually-constructed KG (Acharya et al.) on the plausibility metric. This could be caused by an annotation task design that is geared towards abnormalities, or lack of annotation quality assurance.
Candle also has the highest scores on the distinctiveness metric, while most of the assertions in other resources were marked as not distinguishing by the annotators.
Between the two versions of Candle, the cluster representatives consistently outperform the base sentences on all evaluated metrics. This indicates that some of the raw sentences in the collection are still noisy, whereas the computed cluster representatives are more coherent and generally of better quality.
We also measured the offensiveness (OFF) of each resource, i.e., the percentage of assertions that were marked as inappropriate or offensive by at least one of the human annotators. Quasimodo and StereoKG, extracted from raw social media contents, have the highest share of assertions considered offensive (18% and 13%). Meanwhile, Candle's judicious filters only miss a small fraction (1% of final assertions).
In summary, our Candle CCSK collection has the highest quality by a large margin compared to other resources. Our resource provides assertions of high plausibility and distinctiveness. The clustering and cluster summarization also help to improve the presentation quality of the CCSK.

Comparison with direct LM extraction
Knowledge extraction directly from pre-trained LMs has recently become popular, e.g., the LAMA probe [25] or AutoTOMIC [40]. There are major pragmatic challenges to this approach, in particular, that assertions cannot be contextualized with truly observed surrounding sentences, and that errors cannot be traced back to specific sources. Nonetheless, it is intrinsically interesting to compare assertion quality between extractive and generative approaches. In this section, we compare Candle with assertions generated by the state-of-the-art LM, GPT-3 [5].
Generating knowledge with GPT-3. We query the largest GPT-3 model (davinci-002) with the following prompt template: "Please write 20 short sentences about notable <facet> in <subject>." We run each prompt 10 times and set the randomness (temperature) to 0.7, so as to obtain a larger resource. We run the query for 5 facets and 210 subjects (196 countries and 14 religions), resulting in 188,061 unique sentences. Henceforth we call this dataset GPT-resource, and reuse it in the extrinsic use cases (Sec. 6.3).
Evaluation metrics and setup.For each resource, we sample 100 assertions for each facet (hence, 500 assertions in total) and perform human evaluation on the 3 metrics -commonality (COM), plausibility (PLA) and distinctiveness (DIS).
Results. The quality comparison between assertions of Candle and GPT-resource is shown in Table 4. While plausibility scores are the same, and Candle performs better in commonality, the difference that stands out is in distinctiveness: GPT-3 performs significantly worse, reconfirming a known problem of language models, evasiveness and over-generality [17]. We illustrate this with anecdotal evidence in Table 5, for subject:China and facet:clothing. None of the listed GPT-3 examples is specific for China.

Table 5: Example assertions of Candle and GPT-resource for subject:China, facet:clothing.

| # | Candle | GPT-resource |
|---|--------|--------------|
| 1 | The bride usually wears red in a traditional Chinese wedding. | Chinese people also like to wear modern clothes such as jeans and t-shirts. |
| 2 | The Chinese wear white at funerals because it is associated with mourning in Chinese culture. | Shoes are also very important in Chinese culture. |
| 3 | The Chinese wear new clothes for the New Year to symbolize new beginnings. | Chinese people also like to dress their children in very cute clothes. |
| 4 | The costumes in Chinese opera are very colorful and important. | In China, you will often see little girls wearing dresses and boys wearing shorts. |
| 5 | In ancient China, only the emperor was allowed to wear the color yellow. | In the winter, people in China wear coats and scarves to keep warm. |

Extrinsic evaluation
QA with context-augmented LMs. Augmenting LM input with additional context retrieved from knowledge bases has been a popular approach to question answering (QA) [11,24], which shows that although LMs store information in billions of parameters, they still lack knowledge to answer knowledge-intensive questions, e.g., "What is the appropriate color to wear at a Hindu funeral?" We use GPT-3 as the QA agent and compare its performance in 3 settings: (1) when only the questions are given, and when the questions are given together with related contexts retrieved from (2) Candle or (3) GPT-resource (cf. Sec. 6.2). For questions, we collect cultural knowledge quizzes from multiple websites, which results in 500 multiple-choice questions, each with 2-5 answer options (only one of which is correct). For context retrieval, we use the SentenceBert all-mpnet-base-v2 model, and for each question, retrieve the two most similar assertions from Candle-cluster-reps and GPT-resource. We use the GPT-3 davinci-002 model, with temperature=0 and max_length=16 (see Appx. F for prompt settings).
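The context retrieval step can be sketched as a nearest-neighbour lookup by cosine similarity. The following is a minimal, dependency-free Python illustration, assuming that question and assertion embeddings have already been computed (for instance with the SentenceBert model named above). The function names and the toy 2-dimensional vectors are ours for illustration only; this is not the Candle implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_assertions(question_vec, assertions, k=2):
    """Return the texts of the k assertions most similar to the question.

    `assertions` is a list of (text, embedding) pairs; k=2 mirrors the
    two retrieved assertions per question described in the text.
    """
    ranked = sorted(assertions, key=lambda item: cosine(question_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy example with made-up 2-d embeddings (assumption, for illustration only).
toy_assertions = [
    ("assertion A", [1.0, 0.0]),
    ("assertion B", [0.9, 0.1]),
    ("assertion C", [0.0, 1.0]),
]
context = top_k_assertions([1.0, 0.0], toy_assertions)
```

The retrieved assertions would then be prepended to the question as context; the exact prompt layout is given in Appx. F and is not reproduced here.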
We measure the precision of the answers and present the results in Table 6. With Candle context, the performance is consistently better than with no context on all facets of culture, and better than with GPT context on 3 out of 4 facets. This shows that GPT-3, despite having more than a hundred billion parameters, still lacks socio-cultural knowledge for question answering, and that external resources such as Candle CCSK can help to alleviate this problem.
"Guess the country" game. The rule of this game is as follows: given 5 CCSK assertions about a country, a player has to guess the name of the country. As input, we select a random set of 100 countries, and take assertions from either Candle or GPT-resource. The game has 5 rounds, each associated with a facet of culture. In each round, for each country, we draw the top-5 assertions from each resource (sorted by interestingness in Candle or by frequency in GPT-resource). All mentions of the countries in the input sentences are replaced with [...] before being revealed to the player. This game requires a player that possesses a wide range of knowledge across many cultures. Instead of human players, we choose GPT-3 as our player, which has been shown to be excellent at many QA tasks [5] (prompt settings are presented in Appx. F).
We measure the precision of the answers and present the results in Table 7. The player gave significantly more correct answers when given assertions from Candle than from GPT-resource (i.e., assertions written by the player itself!). This confirms that assertions in Candle are more informative.
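The masking step of the game, in which country mentions are replaced with [...], can be sketched with a regular expression. This is a minimal Python illustration under our own assumptions: the function name `mask_mentions` is hypothetical, and the list of surface forms per country (name, adjective, and so on) is assumed to be supplied by the caller, since the text does not specify how mentions are detected.

```python
import re


def mask_mentions(sentence, mentions):
    """Replace every whole-word occurrence of the given mentions with "[...]".

    Longer mentions are tried first so that "Vietnamese" is masked as one
    unit rather than being left partially visible after masking "Vietnam".
    """
    if not mentions:
        return sentence
    ordered = sorted(mentions, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(m) + r"\b" for m in ordered),
        re.IGNORECASE,
    )
    return pattern.sub("[...]", sentence)


masked = mask_mentions(
    "Vietnamese coffee is popular in Vietnam.",
    ["Vietnam", "Vietnamese"],
)
```

Matching on word boundaries avoids masking unrelated words that merely contain a country name as a substring.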

CONCLUSION
We presented Candle, an end-to-end methodology for automatically collecting cultural commonsense knowledge (CCSK) from broad web contents at scale. We executed Candle on several cultural subjects and facets of culture and produced CCSK of high quality. Our experiments showed the superiority of the resulting CCSK collection over existing resources, which have limited coverage for this kind of knowledge, and also over methods based on prompting LMs. Our work expands CSKG construction into a domain that has been largely ignored so far. Our data and code are accessible at https://candle.mpi-inf.mpg.de/.

Ethics statement
No personal data was processed and hence no IRB review was conducted. It is in the nature of this research, however, that some outputs reflect prejudices or are even offensive. We have implemented multiple filtering steps to mitigate this, and significantly reduced the percentage of offensive assertions, compared with prior work. Nonetheless, Candle represents a research prototype, and outputs should not be used in downstream tasks without further thorough review.

B GENERIC FILTERING RULES
GenericsKB [3] was built by using a set of 27 hand-crafted lexico-syntactic rules to extract high-quality generic sentences from different text corpora (the ARC corpus, SimpleWikipedia and the Waterloo crawl of education websites). For example, the lexical rules look for short sentences that start with a capitalized character, have no bad first words (e.g., determiners), end with a period, contain no URL-like snippets, etc. The syntactic rules only accept a sentence if its root is a verb and not the first word, if there is a noun before the root verb, etc.
Candle adopts the GenericsKB rules. However, as GenericsKB only deals with general concepts (e.g., "tree", "bird", etc.), some of the rules are not applicable to cultural subjects that can be named entities. Hence, depending on the subjects and facets, we adaptively modify the rules (by dropping some of them) so that we do not miss valuable assertions. For instance, for geography, the has-no-determiners-as-first-word rule would filter out valuable assertions such as "The Chinese use chopsticks to eat their food" or "The currywurst is a traditional German fast food dish", so it must be dropped. Similarly, when exploring the "traditions" facet, the remove-past-tense-verb-roots rule would be too aggressive, as it rejects assertions about past traditions. The rule that rejects sentences with PERSON entities can be used for the geography and occupation subjects, but must not be used for religions, because it would filter out sentences about Buddha or Jesus Christ. Full details are in the published code base.
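The adaptive rule selection described above can be pictured as a small configuration lookup. The Python sketch below is illustrative only: it covers just the three rules discussed in this section (out of the 27 GenericsKB rules), the identifier `no-person-entities` is a name we made up for the PERSON-entity rule, and the mapping encodes only the exceptions stated in the text. The actual configuration in the published code base may differ, for example in whether further rules are dropped for other domains.

```python
# Subset of filtering rules discussed in the text (the full rule set has 27 rules).
# "no-person-entities" is a hypothetical identifier for the PERSON-entity rule.
ALL_RULES = {
    "has-no-determiners-as-first-word",
    "remove-past-tense-verb-roots",
    "no-person-entities",
}

# Rules dropped per domain, as stated in the text.
DROPPED_BY_DOMAIN = {
    "geography": {"has-no-determiners-as-first-word"},
    "religion": {"no-person-entities"},
}

# Rules dropped per facet, as stated in the text.
DROPPED_BY_FACET = {
    "traditions": {"remove-past-tense-verb-roots"},
}


def active_rules(domain, facet):
    """Return the rules that remain active for a given domain and facet."""
    dropped = DROPPED_BY_DOMAIN.get(domain, set()) | DROPPED_BY_FACET.get(facet, set())
    return ALL_RULES - dropped
```

For example, `active_rules("religion", "traditions")` keeps only the determiner rule, while an occupation subject with the food facet keeps all three.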

C HYPERPARAMETER SETTINGS
Based on tuning on small withheld data, we select the following values for hyperparameters and run Candle on the C4 dataset with these settings. For cultural facet classification (cf. Sec. 4.3 and Eq. 1), we fix  + to 0.5 and  − to 0.3. For assertion clustering (cf. Sec. 4.4),

D INTRINSIC EVALUATION
We break down the Candle CCSK collection into domains and facets and evaluate the assertion quality for each of these subcollections to gain more insights into the produced data.

D.1 Per-domain quality
Candle contains 3 cultural domains: geography, religion and occupation. For each domain, we sample 100 assertions and perform crowdsourcing evaluation with the 3 metrics PLA, COM and DIS (cf. SubSec. 6.1.1). We present the evaluation results in Table 9.
Besides the raw scores (0, 1, 2), we also binarize and denote them as acceptance rates, i.e., a score greater than zero means "accept".
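The relation between raw scores and acceptance rates can be made concrete with a short Python sketch. It assumes that each assertion already has a single aggregated score in {0, 1, 2}; how the scores of the individual annotators are aggregated is not specified here, and the function name `summarize_scores` is ours.

```python
def summarize_scores(scores):
    """Return (mean raw score, acceptance rate) for a list of 0/1/2 scores.

    An assertion counts as accepted if its score is greater than zero.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    mean_score = sum(scores) / len(scores)
    acceptance_rate = sum(1 for s in scores if s > 0) / len(scores)
    return mean_score, acceptance_rate


mean_score, acceptance_rate = summarize_scores([0, 1, 2, 2])
```

In this toy input, three of the four assertions are accepted, so the acceptance rate is higher than the normalized mean score would suggest.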
Candle achieves a high plausibility (PLA) score of 1.54 on average. Performance on this metric is relatively consistent across all domains. Meanwhile, the commonality (COM) metric is highest for the occupation domain and lowest for the geography domain.
More than 80% of plausible assertions are annotated as distinctive (DIS). Religion and occupation assertions perform significantly better than geography assertions on this metric. This could be because several assertions for geography subjects are correct but too generic (e.g., "Japanese food is enjoyed by many people" or "German beer is good"). Moreover, religions and occupations are more clearly distinguished from one another, while countries or geo-regions usually have cultural overlaps.

D.2 Per-facet quality
We select the assertions for the domain geography>country, and for each facet (food, drinks, clothing, traditions, rituals) we sample 100 assertions for crowdsourcing evaluation. Besides commonality (COM), plausibility (PLA) and distinctiveness (DIS), here we introduce one more evaluation metric, domain relevance (DOM), to measure whether an assertion talks about the cultural facet of interest. The other metrics are evaluated only when the DOM score is greater than zero. We present the evaluation results in Table 10. Candle maintains good quality on all evaluation metrics. Notably, scores for the DOM metric are consistently high for all facets, suggesting that the enhanced techniques for zero-shot classification work well on our data. Interestingly, the facet drinks outperforms all other facets on 3 of the 4 metrics (DOM, PLA and COM); its PLA score in particular is significantly higher than the others. Assertions for drinks and rituals are also more distinctive than for other facets.

E DETAILS OF ANNOTATION TASK FOR ASSERTION EVALUATION
The evaluations of assertion quality (Tables 3, 4, 9 and 10) are conducted on Amazon MTurk (AMT). We present CCSK assertions to annotators in the form of natural-language sentences (triples from Quasimodo [30] and Acharya et al. [1] were verbalized using crafted rules). We evaluate each assertion along 3-4 dimensions on a 3-point Likert scale: negation (0), ambiguity (1) and affirmation (2). Each AMT task consists of 5 assertions evaluated by 3 different annotators. Workers are compensated $0.50 per task. We select Master workers with a lifetime acceptance rate of more than 99%. We obtain fair inter-annotator agreement as measured by Fleiss' kappa [9]: 25.0 for DOM, 25.7 for PLA and 25.4 for DIS. The value for COM (13.4) is lower than the others because it is a subjective question (has the annotator heard of the assertion?).
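For reference, Fleiss' kappa for a fixed number of annotators per item can be computed as in the following Python sketch. This is the standard textbook formula, not code from the Candle repository; the function name and the input layout (one row per assertion, one count per answer category) are our own choices. The agreement values reported above are presumably scaled by 100, which is our reading rather than something stated in the text.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    `ratings[i][j]` is the number of annotators who assigned item i to
    category j. Every row must sum to the same number of annotators n >= 2.
    Returns NaN if chance agreement is 1 (all ratings in one category).
    """
    num_items = len(ratings)
    n = sum(ratings[0])
    if num_items == 0 or n < 2 or any(sum(row) != n for row in ratings):
        raise ValueError("need at least one item and a constant number (>= 2) of raters")

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / num_items

    # Chance agreement: sum of squared overall category proportions.
    num_categories = len(ratings[0])
    p_e = sum(
        (sum(row[j] for row in ratings) / (num_items * n)) ** 2
        for j in range(num_categories)
    )

    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Three assertions, three annotators, categories (0, 1, 2) as in the Likert scale above.
example = [[3, 0, 0], [0, 3, 0], [1, 1, 1]]
kappa = fleiss_kappa(example)
```

In the toy example, two assertions receive unanimous scores and one receives three different scores, which yields a kappa of 0.4375.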
Cluster summarization. We query the curie-001 model, with zero temperature and maximum length of 50 tokens. We only take the first generated sentence as output. An example prompt is presented in Fig. 4.
Generating CCSK for GPT-resource. We use the davinci-002 model and set temperature to 0.7 and maximum length to 512 tokens. For each facet and subject, we run the following prompt template 10 times: "Please write 20 short sentences about notable <facet> in <subject>." We query for 5 facets (food culture, drinking culture, clothing habits, rituals, traditions), and 210 subjects (196 countries and 14 religions). Table 5 shows some example generations for the subject China and the facet "clothing habits".
"Guess the country" game. We use the davinci-002 model, with temperature=0 and max_length=8. Answers given by GPT-3 are checked manually for their correctness. Example prompts can be seen in Fig. 6.

Figure 2: Example assertions of Candle, with subjects (cultural groups) of cultural domains, facets and concepts.

Figure 3: Architecture of Candle (see Table 8 for the list of techniques and models used in each module).

Figure 6: Screenshots of GPT-3 output for the "guess the country" game, with assertions of GPT-resource and Candle for subject:Vietnam and facet:drinks.

Table 1: Statistics of the Candle CCSK collection (#A: number of assertions, #C: number of clusters).

Table 2: Processing time and output size of each step in Candle for the domain geography>country and facet food.

Table 6: Results of QA using context-augmented LMs.

Table 7: Precision (%) for the "guess the country" game.

Table 9: Quality of Candle assertions for each domain.

For the HAC algorithm, we measure the pointwise Euclidean distance of the normalized embeddings. Then, we use Ward's linkage [38], with the maximal distance threshold set to 1.5. In the few cases where input sets are larger, we truncate them at 50K sentences per subject-facet pair, since larger inputs only contain further redundancies that are not worth the cubic effort of clustering. This concerns only 15 out of 386 subjects. For cluster summarization, we consider the 500 most populated clusters for each subject-facet pair with a minimum size of 3 sentences. For cluster ranking (cf. Sec. 4.6), we fix  in Eq. 3 to 0.8.

Table 10: Quality of Candle assertions for each facet and the domain geography>country.