Exif2Vec: A Framework to Ascertain Untrustworthy Crowdsourced Images Using Metadata

In the context of social media, the integrity of images is often dubious. To tackle this challenge, we introduce Exif2Vec, a novel framework specifically designed to discover modifications in social media images. The proposed framework leverages an image's metadata to discover changes in the image. We use a service-oriented approach that considers the discovery of changes in images as a service. A novel word-embedding-based approach is proposed to discover semantic inconsistencies in image metadata that are reflective of changes in an image. These inconsistencies are used to measure the severity of the changes. The novelty of the approach resides in the fact that it does not require the image content to determine the underlying changes. We use a pretrained Word2Vec model to conduct experiments. The model is validated on two different fact-checked image datasets, i.e., a general-context image dataset and a context-specific image dataset. Our approach yields results of up to 80% accuracy, underscoring the potential of the framework.


INTRODUCTION
Social media has evolved into an important platform for disseminating news and images about public incidents [49,65]. There are more than 5 billion users on social media [18,29]. Social media users publish a large amount of data about public events; during the COVID-19 pandemic, social media served as a crucial tool for data collection, providing valuable insights into unfolding situations [44,70]. These crowdsourced images may contain critical information about public incidents, e.g., road accidents, crime scenes, and violent scenes [83]. Such images are increasingly used in different applications, e.g., scene analysis, scene reconstruction, and image forensics [81]. Utilizing these social media images can significantly help investigators explore the unfolding situations that might have led to an incident [2].
Trustworthiness of crowdsourced social media images has hitherto been a prime issue [50,74]. An image shared on social media may be modified to introduce misleading information, which could result in serious consequences, including social and political unrest [19,57]. The modifications in an image can be of various types and may be within the picture or in the information posted with the image. In the former, the picture itself is edited; in the latter, the image is original but its metadata or description is modified [46], e.g., the date or city when/where the image was taken has been tampered with. Figure 1 shows some examples of misleading social media images. These images are legitimate but described in a misleading way. In some cases, changes are only intended to improve the image's outlook, which does not categorize it as a fake [68]. Therefore, changes need to be extensively analyzed from multiple perspectives before categorizing an image as a fake. We define fake as a context-sensitive parameter: a fake image in one context may not be fake in other contexts.
Detecting untrustworthy images has traditionally been addressed using image processing and information retrieval techniques [15,31,62,78]. For instance, a neural-network-based approach has been proposed to detect fake images [33]. It utilizes residual signals from chrominance components, such as YCbCr and Lab, to acquire robust deep representations through a meticulously designed convolutional neural network (CNN). Some solutions are based on image forensics [15,78]. An image forensic model is proposed in [78] to learn intrinsic features of an image. Some state-of-the-art approaches also leverage metadata along with the image content to discover changes in an image [73]. These approaches are usually costly in terms of computational power. Some lightweight service-based approaches have recently been proposed to deal with crowdsourced images [23,48]. For instance, a preliminary service-based trust framework is proposed in [3], which uses users' comments on a social media post to assess the credibility of the image. Another work considers the credibility of a user's stance embedded in the comments on a social media post to determine the credibility of the image [1]. However, the credibility of images may not be completely assessed based only on users' stances. Fake posts on social media can get supportive comments from other users [20]. The stance of credible users may also be biased [84]. To address these limitations, we propose to assess trust in social media images using a lightweight objective framework that determines changes and updates in the images. In this respect, we leverage an image's meta-information to discover and characterize changes in the image. It is worth mentioning that some subtle changes in the colors and shades in an image are not always reflected in the image metadata. Therefore, we consider only those changes which may be reflected in the image metadata.
An image is a well-defined entity described by its visual components and metadata. We abstract an image as a service having functional and non-functional attributes. Functional attributes are related to the actions to capture an image, i.e., pressing the shutter, switching picture/video modes, etc. We represent an image's metadata as its non-functional attributes. An image's non-functional attributes are usually reflective of the content within the picture [21,38]. Editing an image may make it inconsistent with its non-functional attributes [8,28]. For instance, changing the background of a picture may make it inconsistent with the GPS attributes in the image metadata [24]. An attempt to hide the facts in the metadata may create discrepancies among non-functional attributes [27]. These discrepancies are not straightforward to identify, as they are embedded systematically in the non-functional attributes. The qualitative nature of numerous non-functional attributes presents another challenge, as the discrepancies among them are predominantly semantic in nature. These discrepancies can be identified by looking into the semantic meaning of attributes and the relationships among them. To address these challenges, we need a quantitative representation of non-functional attributes that is also reflective of their semantics. A framework is then required to compute semantic differences among non-functional attributes and to translate these discrepancies into modifications.
We propose Exif2Vec, a novel approach that leverages only the non-functional attributes of images to detect changes in social media images. It is worth mentioning that most social media platforms remove public access to an uploaded image's metadata. We propose this framework from the social media owners' perspective, which allows us to assume that the metadata is available with the image. An image's metadata comprises many useful attributes that inform about the image and its creation. We primarily focus on the spatio-temporal and contextual attributes. The non-functional attributes are usually correlated with each other. For instance, GPS location is a non-functional attribute that is correlated with the country name and city name. Correlated attributes can be changed collectively to introduce a systematic change. To capture these changes, we explore relationships among non-functional attributes. A pre-trained word embedding model is then employed to create a distributed representation of the attributes. The distributed representation places semantically similar attributes close to each other in a high-dimensional space [7]. Afterwards, we isolate consistent and inconsistent pairs of attributes based on a similarity distance between two attributes. For instance, 'Sydney-NSW' is a consistent pair of attributes because the distance between Sydney and NSW is similar to that of other city-state pairs. We further inspect the inconsistent pairs to discover whether multiple inconsistencies lead to the same misperception. For instance, the name of a country, city, and state and the GPS coordinates can be changed collectively such that all attributes remain consistent with each other. The identified inconsistencies are then leveraged to quantify the severity of changes in an image. Below, we summarize our main contributions:
• We propose Exif2Vec, a novel framework designed to detect changes in social media images relying solely on the metadata and the associated information posted with the image, whereas the state of the art utilizes metadata in conjunction with the image content.
• We employ a word-embeddings-based approach, specifically Word2Vec, to generate vector embeddings for the image's non-functional attributes. These embeddings are then used to discover discrepancies in the non-functional attributes.
• We propose a novel framework to translate discrepancies in the image's non-functional attributes into the severity of changes in the image.
• The proposed Exif2Vec is validated on two datasets, i.e., a context-based dataset and an image verification corpus. Results reflect up to 80% effectiveness of the proposed approach.
The proposed approach adeptly discovers a typical set of changes that manifest in the non-functional attributes. However, its coverage could be further enhanced by considering specific alterations intrinsic to the image itself, such as shifts in tones, color intensity, or distortions. It is important to note that the proposed approach is not intended to compete with image-processing-based methods but rather to offer a complementary perspective. The main rationale behind the proposed approach is rooted in the fact that certain manipulations or alterations may leave detectable traces in the metadata. Leveraging this information allows for a computationally less expensive analysis compared to traditional image-content-based methods. By focusing on metadata, we aim to provide an efficient and scalable solution that complements existing approaches. While we acknowledge that this approach may have limitations, particularly in cases where sophisticated image content manipulations are employed, we believe it offers unique advantages in scenarios where computational resources or time constraints are a significant factor. It is important to highlight that the effectiveness of the proposed framework might be influenced by the extent to which non-functional attributes are available. In scenarios where there is a scarcity of such attributes, an alternative method involves extracting meta-information from the social media post. While this alternative might exhibit reduced precision due to potential uncertainties regarding the accuracy of the meta-information, it still provides valuable insights.

MOTIVATING SCENARIO
Social media has become an important tool to share news and information related to public incidents. However, many viral social media posts have turned out to be fake in recent years [40]. A recent study by the Massachusetts Institute of Technology (MIT) reveals that a fake post spreads substantially faster than real ones [76]. More importantly, most fake images are accompanied by manipulated text and image metadata [16,46,60]. Indeed, an original image may also be accompanied by an incorrect description. Most of these posts contain deepfake text generated by deep learning models [53]. An image's metadata may reflect discrepancies with this deepfake text. Therefore, analyzing the text shared with an image and the image metadata may reveal changes in an image and its description.
Our motivating scenario involves a depiction of a plane crash that occurred in New York in 2009. Illustrated in Figure 2, the scene captures the evacuation of US Airways Flight 1549 as it rests on the surface of the Hudson River. Intriguingly, this particular image was erroneously attributed to the missing Malaysian aircraft MH370 by the Cable News Network (CNN) in 2014. In this misleading post, the image itself is original but the post contains a false claim. Numerous cutting-edge solutions depend on image processing techniques to detect dubious content in social media images. These approaches center their attention on the visual content within the image, which can potentially lead to overlooking alterations that extend beyond the image itself. In response to this limitation, we put forth an innovative approach for recognizing untrustworthy images. Our method relies exclusively on the utilization of an image's metadata along with the associated textual information. Exploring the image metadata may reveal inconsistencies between the metadata and the posted information. These inconsistencies may reflect the trust of an image. For instance, in Figure 2, the image shared by CNN is unedited and the metadata is reflective of the manipulations in the image description. The majority of social media platforms restrict public access to image metadata. We tackle this issue from the standpoint of social media owners, which permits us to operate under the assumption that metadata accompanies the image. The image metadata, as depicted in Figure 2, provides evidence that the image is linked to an entirely separate incident.
Figure 2 shows a very simple example of a cheap fake (a naive way to introduce changes), in which the inconsistencies between the metadata and the image description are straightforward to identify. However, in real-case scenarios, changes are embedded very systematically in images and their metadata. Image metadata is often changed to make it consistent with the changes in an image [75]. For instance, if the background in an image is modified, the GPS tags in the metadata can be changed to make them consistent with the manipulated background in the picture. These types of changes are relatively hard to detect. Therefore, a framework is required to find inconsistencies in an image's metadata that can reflect the trust of an image.

Fig. 2. Motivating scenario

RELATED WORK
Numerous models have been proposed to detect fake social media images. A deep learning approach is proposed in [35] for detecting fake images by using contrastive loss. In the study detailed in [58], neural networks are employed to detect counterfeit images that are disseminated through various social media platforms. The work described in [79] employs a combination of traditional digital forensics techniques and artificial intelligence methods to uncover instances of image manipulation. Additionally, a block-oriented methodology for detecting Copy-Move Forgery is put into operation, as elucidated in [42]. In the publication by Guo et al. [30], the proposal includes two methods for identifying counterfeit colorized images: one is based on histograms, and the other relies on feature encoding. In the work documented by Tanaka et al.
[61], a method utilizing robust hashing for the detection of counterfeit images is introduced. The proposed method is demonstrated to have a high fake-detection accuracy, even when multiple manipulation techniques are carried out. A convolutional-neural-network-based approach is proposed in [58] to spot fake images shared over social media platforms. Deep-learning methods for image forensics are surveyed in [17]. Another strategy involves utilizing image metadata to identify fake WhatsApp images [38]. However, this particular approach focuses solely on a limited set of spatio-temporal attributes from the image source. The image processing methods mentioned above exhibit impressive accuracy in identifying fraudulent elements within images; however, they demand substantial computational resources, as indicated by Dang et al. [22]. The importance of computationally less intensive solutions for different social media applications is highlighted in [4]. In this paper, we propose that a subset of trust in images can be derived by using only the image metadata. Some recent studies claim that a subset of trust in social media images can be derived using comments on a post [1,3,82]. A crowdsourced image service trust model is proposed in [3], in which the trustworthiness of an image service is measured based on the users' stance. Textual features of social media images, i.e., comments, and metadata, e.g., spatio-temporal information, are utilized to gather the trust rate of the image service. A users' stance- and credibility-based crowdsourced image service trust model is proposed in [1]. The proposed model considers various indicators such as the stance embedded in the images' comments and their metadata, e.g., time, along with the users' credibility. It models the interactions between commenters and sub-comments using different language-based models [71]. These approaches are unable to capture modifications in an image because misleading content on social media may receive positive comments from other users [20]. Moreover, comments from credible users can be biased [84]. We propose a more objective approach, based on modifications and updates in an image, to determine its likelihood of being fake.
Image metadata acts as an indicator of modifications made to an image. Our primary method for detecting forgery hinges solely upon the analysis of image metadata. In recent times, a multitude of approaches have arisen to uncover instances of image manipulation, drawing upon the insights provided by metadata information. A recent work utilizes image metadata and an ELA processor to detect forgery in images [73]. The proposed system relies on a neural network capable of identifying and processing image regions using a specific approach. Another work performs image provenance analysis based only on the metadata [10]. The image provenance tree implicitly informs about modifications in an image. These solutions are based on a limited use of metadata along with other computer-vision-based approaches. Our proposed approach differs in the sense that it is based entirely on image metadata to ascertain the trust of an image service.
Image non-functional attributes are predominantly qualitative in nature, encompassing features like city, country, state, and more. Detecting alterations within an image necessitates the identification of irregularities present in these non-functional attributes. These irregularities can be ascertained by exploring the semantic variations inherent in these attributes. Numerous cutting-edge studies have focused on extracting the semantic meanings behind such qualitative keywords. Latent Semantic Analysis (LSA) stands out as a widely employed technique for identifying dissimilarities between two documents. This method finds practical utility across a spectrum of domains, strategically addressing a variety of challenges. An illustrative example can be found in [14], where LSA is harnessed for text summarization. In this context, a summarization approach is introduced that leverages frequent item sets to encapsulate latent concepts inherent in the analyzed documents. Through the effective utilization of LSA, the authors condense the potentially redundant assemblage of item sets into a succinct compilation of uncorrelated concepts. The summarization process then selects sentences that encompass these latent concepts while minimizing redundancy. LSA relies on the preexisting vocabulary within the documents it analyzes. Out-of-vocabulary words, which are not present in the training data, pose a challenge for LSA: it cannot effectively represent or analyze such words without additional preprocessing or techniques.
We introduce a novel way of discovering changes in an image's non-functional attributes using a word-embeddings-based approach. In this respect, we use pretrained models to create vector embeddings of the non-functional attributes. The vector embeddings are then exploited to find inconsistencies among the attributes. Word embeddings are widely used in different domains [32,45]. For instance, a word-embedding-based solution to search similar records in databases is presented in [26]. Word embedding is applied to Twitter data in [34] to monitor natural disasters. A context-sensitive word embedding approach is presented in [52]. Dynamic embeddings are developed in [55] to capture how the meaning of words changes over time. Word-embeddings-based models create vectors for qualitative non-functional attributes. However, some non-functional attributes are quantitative in nature, e.g., GPS coordinates and GMT offset. It is a challenge to generate embeddings for quantitative attributes. A novel approach is proposed in [80] to transform GPS coordinates into real-valued vectors. Another study, Kazemi et al.'s work on time2vec embeddings [39], furnishes vectorized representations of temporal data. Building upon these methodologies, we employ analogous techniques to convert quantitative non-functional attributes into real-valued vectors.

SERVICE MODEL FOR SOCIAL MEDIA IMAGES
This section provides details of the proposed service model to abstract social media images as services. Image Service: We represent a social media image in terms of its functional and non-functional attributes. Functional attributes identify the function of capturing an image, i.e., switching picture/video modes, delayed/timed

Non-functional Attributes of an Image
Modifications in an image are usually reflected in its non-functional attributes, specifically the spatio-temporal and contextual attributes [28]. These non-functional attributes may help in detecting possible changes within an image. We group the non-functional attributes into the following categories:
• Spatial Features: Spatial attributes represent the location where the image was captured. Modified spatial tags may be an indication of a fake background in the image. For instance, footage of a car accident can be forged to display a fake background, misleading the viewer into believing that the car was at a different location.
• Temporal Features: Temporal attributes represent the date and time when the image was captured. Date and time have no direct relation with the credibility of an image. However, they may affect the reconstructed scene because of a modified timeline. Temporal metadata tags are forged to develop a fake storyline. For instance, in the case of a car accident, the timestamp of the captured scene can be tampered with to depict a wrong cause of the accident.
• Contextual Features: Contextual features are related to the context of an image. A fake context may support fake spatio-temporal tags of an image. For instance, a car accident can be described in a way that supports fake spatio-temporal tags.
There may be other categories of non-functional attributes that reflect changes embedded in the image, e.g., changes in colors, resolution, and objects. We primarily focus on the changes in spatial, temporal, and contextual attributes to quantify changes in an image.
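The three attribute categories above can be sketched as a simple container; the field names below are illustrative stand-ins, not the exact Exif tag names used by the framework.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class NonFunctionalAttributes:
    """Illustrative grouping of an image's non-functional attributes."""
    # Spatial features: where the image was captured
    gps: Optional[Tuple[float, float]] = None
    city: Optional[str] = None
    country: Optional[str] = None
    # Temporal features: when the image was captured
    timestamp: Optional[str] = None
    gmt_offset: Optional[str] = None
    # Contextual features: what the post says about the image
    description: Optional[str] = None

    def missing(self):
        """Names of attributes that could not be extracted."""
        return [k for k, v in self.__dict__.items() if v is None]

attrs = NonFunctionalAttributes(city="Sydney", country="Australia")
print(attrs.missing())  # the spatial, temporal, and contextual gaps
```

Unpopulated fields flag which attributes must later be filled from external sources such as Reverse Image Search.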

Potential Modifications in Non-functional Attributes
Modifications in different parts of an image are usually reflected in its non-functional attributes. We consider the following changes in an image's non-functional attributes that are introduced to hide the facts. For example, in the case of Figure 1c, a description is provided that falsely portrays a camel with its limbs amputated, used for begging. However, in reality, the camel is simply resting with its legs folded underneath its body. The image description has been manipulated to reinforce this deceptive interpretation.

PROPOSED FRAMEWORK
This section introduces our novel framework, Exif2Vec, designed to discover and quantify the severity of underlying modifications in social media images. Figure 3 illustrates the workflow of our proposed approach. It commences with the extraction of metadata from images, followed by an exploration of the relationships among these attributes. Utilizing Word2Vec, we generate vector embeddings for the metadata attributes. Subsequently, we employ a similar-distance analogy to identify inconsistencies, providing insight into the trustworthiness of an image.

Extracting Non-functional Attributes
The proposed Exif2Vec framework takes an image's non-functional attributes as input and returns the severity of changes. To initiate the process, we begin by extracting the non-functional attributes inherent in a social media image. The non-functional attributes can be fetched from the image metadata and the description posted with the image. An important assumption in this paper is that the provided image is not the only version uploaded on social media platforms. In this context, we systematically retrieve various versions of an image from social media. In this regard, we employ Reverse Image Search (RIS) to acquire additional attributes associated with the image. It is important to note that our intention is not to compete with RIS, but rather to utilize it as a tool within our framework. RIS serves as a valuable tool for online image search, but it does not offer a trust rating for images, which is a distinct focus of our research. It is worth mentioning that some social media platforms remove public access to image metadata while uploading an image [63]. We propose this framework from the social media owner's perspective, assuming that we have access to the metadata.
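The metadata-extraction step can be sketched with the Pillow library (assumed available; the paper does not prescribe a specific tool). Tag names come from Pillow's ExifTags registry, and the in-memory JPEG built below is a stand-in for a crowdsourced image.

```python
import io
from PIL import Image, ExifTags  # Pillow is an assumed dependency

def extract_metadata(image_bytes: bytes) -> dict:
    """Read Exif tags from an image, keyed by human-readable tag name.

    A minimal sketch of the extraction step only; the full framework
    would also merge attributes recovered via Reverse Image Search.
    """
    img = Image.open(io.BytesIO(image_bytes))
    exif = img.getexif()
    return {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}

# Build a small in-memory JPEG carrying a DateTime tag (Exif tag 306)
# purely for illustration.
exif = Image.Exif()
exif[306] = "2009:01:15 15:30:00"
buf = io.BytesIO()
Image.new("RGB", (8, 8)).save(buf, format="JPEG", exif=exif)

print(extract_metadata(buf.getvalue())["DateTime"])  # 2009:01:15 15:30:00
```

In practice the raw bytes would come from the uploaded image rather than a synthesized one.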

Exploring Relations among Attributes
Most of the non-functional attributes are correlated with each other. For instance, the name of a city is related to the name of a state and country, whereas digital zoom and resolution are independent of location. Connected attributes are usually changed collectively to embed systematic changes in the image's non-functional attributes. Hence, our initial focus involves delving into the interconnections among various non-functional attributes, a foundation upon which we can subsequently gauge the magnitude of alterations within an image. These attribute interrelationships can be categorized into two distinct types, shown in Figure 4 and described below:
(1) Intra-relationships. The relationships which exist between the non-functional attributes of an image are labeled as intra-relationships. For instance, the relations between GPS and City, and between GPS and Country, are examples of intra-relationships. We further classify intra-relationships into static and dynamic relationships, as described below:
Static Relationships: We define static relationships as one-to-one relationships among non-functional attributes. Static relationships are not impacted by changing other attributes. For instance, the relation between GPS coordinates and the name of a country is a static relationship; it is not affected by changing any other attribute.
Dynamic Relationships: We define a dynamic relationship as a relation among non-functional attributes which may be affected by changing the value of another attribute. For instance, there is a relation between the shutter speed and the timestamp of an image. A high shutter speed indicates that the photo was captured during the daytime, whereas a low shutter speed indicates that the photo was captured in the evening. Another attribute, named Exposure Time, has an impact on this relationship. Exposure time is the time taken by the camera to collect light: the higher the exposure time, the lower the shutter speed, which indicates a different timestamp [64].
The interplay between shutter speed and exposure time is further impacted when the camera's flash is activated. This relationship can also be affected by an indoor or outdoor setting.
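A minimal sketch of checking one such dynamic relationship follows. The 1/60 s shutter cut-off and the 7:00-19:00 daytime window are our illustrative assumptions, not values prescribed by the framework, and the flash handling is deliberately simplified.

```python
def daytime_consistent(shutter_speed_s: float, hour: int,
                       flash_fired: bool = False) -> bool:
    """Toy consistency rule for the shutter-speed/timestamp relationship.

    A fast shutter (short exposure) suggests daylight and a slow shutter
    suggests evening; an active flash weakens the inference, so we treat
    such cases as consistent.  Thresholds are illustrative assumptions.
    """
    if flash_fired:  # flash breaks the light/shutter-speed correlation
        return True
    is_fast_shutter = shutter_speed_s < 1 / 60   # assumed cut-off
    is_daytime = 7 <= hour <= 19                 # assumed daytime window
    return is_fast_shutter == is_daytime

# A 1/500 s shutter at 13:00 is plausible; at 23:00 it is suspicious,
# unless the flash fired.
print(daytime_consistent(1 / 500, hour=13))                    # True
print(daytime_consistent(1 / 500, hour=23))                    # False
print(daytime_consistent(1 / 500, hour=23, flash_fired=True))  # True
```

A real implementation would weight such rules rather than returning hard Booleans.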

(2) Inter-relationships.
Image metadata might lack complete attribute information. The spatio-temporal information of an image can be used to retrieve social media posts containing the same image. RIS is one of the commonly used tools to retrieve different versions of an image on social media platforms. The relationships which exist between the non-functional attributes and the meta-information acquired from an external source, e.g., RIS, are labeled as inter-relationships. For instance, the relationship between the GPS acquired from metadata and the name of the City acquired from RIS is regarded as an inter-relationship. It is worth mentioning that inter-relationships are not our primary source of data. We leverage data extracted from these inter-relationships to effectively populate missing values within the metadata.
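Populating missing metadata from inter-relationship data can be sketched as a simple merge in which Exif values take precedence and externally acquired values only fill holes. The keys below are illustrative.

```python
def fill_missing(exif_attrs: dict, ris_attrs: dict) -> dict:
    """Fill metadata gaps with attributes recovered via RIS.

    Exif values take precedence; RIS-derived values are used only where
    the metadata has no value (None).  A sketch, not the full merge logic.
    """
    merged = dict(ris_attrs)
    merged.update({k: v for k, v in exif_attrs.items() if v is not None})
    return merged

# Exif carries GPS but no city; RIS supplies the missing location names.
exif_attrs = {"gps": (-33.87, 151.21), "city": None}
ris_attrs = {"city": "Sydney", "country": "Australia"}
print(fill_missing(exif_attrs, ris_attrs))
```

Note that the metadata's own GPS value survives the merge even though RIS supplied the city and country.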

Representing Relationships in terms of an Attribute Graph
We represent relationships among non-functional attributes in terms of an attribute graph. Figure 5 shows an attribute graph for a typical set of non-functional attributes retrieved from an image. Nodes in the graph represent attributes, and the values of edges denote the presence or absence of a relationship. Ideally, the value of an edge should be a spectrum (a value ranging from 0 to 1) reflecting the strength of a relationship: a high value reflects a strong relationship and vice versa. However, we consider the value of a relationship to be Boolean for the sake of simplicity. Edges with value 1 indicate the presence of a relationship, whereas 0 means that the corresponding attributes are independent of each other. The dotted edges in the graph represent dynamic relationships, whereas the solid edges represent static relationships. The attribute graph can also be represented as a matrix, as shown in Figure 5. The relationship of a node with itself is labeled as -1 in the matrix to differentiate it from 0s. The matrix shown in Figure 5 consists of only a few attributes; however, in real-case scenarios, the matrix usually consists of all available spatio-temporal and contextual attributes. The size of the matrix depends on the available set of metadata.
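The Boolean matrix with a -1 diagonal can be built as follows; the attribute subset and edge list are an illustrative selection, not the full set used in the paper.

```python
# Build the attribute-relationship matrix: 1 marks a relationship,
# 0 marks independence, and -1 marks the diagonal (a node's relation
# with itself), mirroring the matrix form of the attribute graph.
attributes = ["GPS", "City", "Country", "Timestamp", "ShutterSpeed"]
edges = {("GPS", "City"), ("GPS", "Country"), ("City", "Country"),
         ("Timestamp", "ShutterSpeed")}  # last edge: a dynamic relationship

idx = {a: i for i, a in enumerate(attributes)}
n = len(attributes)
matrix = [[-1 if i == j else 0 for j in range(n)] for i in range(n)]
for a, b in edges:  # relationships are symmetric
    matrix[idx[a]][idx[b]] = matrix[idx[b]][idx[a]] = 1

print(matrix[idx["GPS"]][idx["City"]])          # 1: related attributes
print(matrix[idx["GPS"]][idx["ShutterSpeed"]])  # 0: independent attributes
```

The matrix grows with the available metadata, matching the observation that its size depends on the extracted attribute set.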

Generating Attribute Embeddings to Discover Modifications
Modifications in an image may have an impact on the semantics of the image [56]. Introducing a change can leave semantic discrepancies in the non-functional attributes. It is imperative to effectively quantify these semantic variations in order to accurately gauge the extent of changes within an image. In this respect, we employ word embedding models to explore the semantics of the changes in non-functional attributes. Word embedding techniques transform a given word into a high-dimensional vector, encapsulating the contextual essence of the word within the vector itself. We adopt a similar approach to generate attribute embeddings. Attributes with similar meanings exhibit analogous representations in the high-dimensional space.

Distributed Representation of Attribute Embeddings.
A variety of models is available for word embeddings, encompassing notable ones such as Word2Vec, GloVe, and BERT, as outlined by Karani et al. [37]. It is worth noting that Word2Vec is context-independent, while BERT operates as a context-based model. Ideally, we would apply BERT because it is able to find the contextual meaning of words. However, BERT treats a whole sentence (a sequence of words) as a singular input unit, constraining our control over the model's complexity: to use it, we would need to input a meaningful sequence of attributes, and it is very hard to find meaningful sequences of non-functional attributes, given that we are dealing with image metadata tags along with text. In contrast, Word2Vec affords us the capability to integrate contextual information by considering a context window, facilitating a more nuanced analysis of the relationships among attributes. Therefore, to limit the complexity of the model, we use pretrained models, specifically Word2Vec, to transform attributes into real-valued vectors. These vectors are then plotted in a high-dimensional distributed space, as shown in Figure 6. The dimensions in Figure 6 are derived using deep neural networks and are latent. The model is pretrained on a corpus of words obtained from Wikipedia. In this paper, we rely on a context-independent model trained on a context-independent corpus. Ideally, the model should be context-aware and trained on a context-aware corpus. For instance, if we are dealing with road accident images, the corpus should be built using only resources related to road accidents.

Similar Distance for Similar Relationships. The distributed representation of attribute embeddings reflects useful analogies among attributes. The similar-distance analogy can be viewed as a special type of similarity: two pairs of attributes that share similar relationships have a similar distance between their corresponding vectors [47]. In word embeddings, this analogy enables solving word-analogy problems by leveraging the semantic relationships encoded in the vector space, using vector arithmetic operations to find words that exhibit similar relationships to a given analogy. The process typically involves four words: A, B, C, and an unknown word X. For example, for the analogy "man is to woman as king is to X", we can calculate the analogy vector (Vector(woman) - Vector(man) + Vector(king)) and search for the word in the embedding space closest to this vector; the resulting word X is expected to be "queen". We leverage this similar-distance analogy in the distributed representation to discover discrepancies among attributes. Assume we are provided with an image having 'Sydney' as the city name and 'Australia' as the country name in its non-functional attributes. The distance between 'Sydney' and 'Australia' should be similar to that of other city-country pairs in the distributed representation. Similarly, the distance between a consistent state-country pair should be similar to that of other correct state-country pairs. Figure 6 shows the distributed representation of attributes and their relationships with other attributes. It can be noticed that similar relationships have the same distances. However, if a relationship is not well-defined, the distances may not be precisely equal. It can be observed from the 'National Sports-Country pairs' in Figure 6 that the distance is not exactly similar for some pairs. For instance, the distance between
Australia-Cricket and New Zealand-Rugby is not precisely equal. This may be due to insufficient content about these words in the corpus. The issue can be resolved by using a context-based corpus, which would also expose many useful relations among different words. For instance, the distance between 'helmet' and 'bike' may be similar to the distance between 'seat-belt' and 'car'. Similarly, the distance between 'paddle' and 'push bike' may be equal to that between 'accelerator' and 'motorbike'. However, the distances between these attributes may not be the same when the model is trained on a general corpus. As stated earlier, our model is trained on a general corpus. To circumvent this concern, we posit the existence of a relationship when the distance between attributes within a pair falls within a designated range.
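The offset-within-a-range check described above can be sketched as follows. The 2-d vectors below are hand-picked so the analogy holds and are purely illustrative; a real run would use the model's 300-d Word2Vec vectors, and the tolerance would be tuned empirically.

```python
import numpy as np

# Illustrative 2-d embeddings chosen so that city->country offsets agree.
E = {
    "sydney": np.array([1.0, 2.0]), "australia": np.array([3.0, 5.0]),
    "auckland": np.array([0.0, 1.0]), "new_zealand": np.array([2.1, 3.9]),
    "paris": np.array([4.0, 0.0]), "germany": np.array([9.0, 9.0]),
}

def pair_offset(a, b):
    """Vector from attribute a to attribute b."""
    return E[b] - E[a]

def consistent(pair, reference_pair, tol=0.5):
    """A pair is consistent when its offset falls within `tol` of a
    reference pair's offset (the 'designated range' in the text)."""
    d = np.linalg.norm(pair_offset(*pair) - pair_offset(*reference_pair))
    return bool(d <= tol)

print(consistent(("auckland", "new_zealand"), ("sydney", "australia")))  # True
print(consistent(("paris", "germany"), ("sydney", "australia")))         # False
```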

Discovering Discrepancies in Non-functional Attributes. Figure 6 shows that semantically similar concepts have approximately similar distances. For instance, when analyzing city-country relationships, the distance between all city-country pairs will be approximately the same, as shown in Figure 6. We leverage this concept of similar distances to separate matched pairs of attributes from mismatched pairs. If the distance differs from that of other similar pairs, the provided city-country pair is not consistent and may represent a modification. It is quite likely that consistent attributes have been modified collectively to create a systematic change. These inconsistencies cast doubt on the credibility of an image. At this step, we separate the consistent pairs from the inconsistent pairs using a state-of-the-art clustering approach [36,54]. These inconsistencies are further investigated in the next subsection to determine the impact of the underlying modifications.
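The clustering step can be illustrated with a minimal stand-in. The paper's actual clustering approach [36,54] is not specified in this excerpt; the sketch below uses a tiny hand-rolled 1-D 2-means over each pair's deviation from the typical offset, splitting pairs into a small-deviation (consistent) and large-deviation (inconsistent) group. The deviation values are fabricated for illustration.

```python
def two_means_1d(values, iters=20):
    """Minimal 1-D 2-means: split values into a low cluster and a high
    cluster.  Stand-in for the state-of-the-art clustering the paper
    cites; not the paper's actual method."""
    centers = [min(values), max(values)]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            # bool index: False (0) -> closer to low center, True (1) -> high
            groups[abs(v - centers[0]) > abs(v - centers[1])].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

# Illustrative deviations of each pair's offset from the median offset.
deviations = [0.10, 0.15, 0.05, 0.12, 2.4, 2.1]
consistent_grp, inconsistent_grp = two_means_1d(deviations)
print(sorted(inconsistent_grp))  # [2.1, 2.4]
```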

Discovering Consistent Changes.
Several modifications in an image may serve a common purpose. We use the term consistent changes for changes that support a similar misleading perception. These changes are relatively hard to discover because they are usually embedded systematically in the non-functional attributes, and their impact may be relatively more severe. It is therefore essential to discover consistent changes in order to correctly estimate the severity of the underlying changes in the non-functional attributes. Changes are considered consistent if both attributes in a consistent pair are inconsistent with another unmodified attribute. In the example shown in Figure 7, the modified state and country names are both inconsistent with the GMT offset; therefore, this set of attributes involves consistent changes. Consistent changes should be investigated carefully because a minimalistic change may be misinterpreted as a consistent change: a mistakenly modified attribute will also be inconsistent with other consistent pairs. This confusion can be avoided by leveraging the inter-relationships between attributes. We compare the attributes with the information acquired from other versions of the image (retrieved via reverse image search, RIS) to determine whether the changes are minimalistic or well-organized. Another way to confirm a consistent change is to investigate whether an attribute has a similar-distance analogy with multiple attributes of the same type.
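The "both members conflict with the same unmodified attribute" rule can be expressed directly. The data structure and attribute names below are assumptions for illustration; in the framework, the conflict sets would come from the similar-distance checks above.

```python
def is_consistent_change(pair, conflicts):
    """A pair of modified attributes is a 'consistent' (coordinated)
    change when both members are inconsistent with the same unmodified
    attribute.  `conflicts` maps each attribute to the set of attributes
    it was found inconsistent with (hypothetical structure)."""
    a, b = pair
    return bool(conflicts.get(a, set()) & conflicts.get(b, set()))

# Illustrative: a modified state and country both conflict with the
# unmodified GMT offset, so the change is coordinated; a lone typo in a
# city name shares no conflict and is not.
conflicts = {
    "modified_state":   {"gmt_offset"},
    "modified_country": {"gmt_offset"},
    "typo_city":        set(),
}
print(is_consistent_change(("modified_state", "modified_country"), conflicts))  # True
print(is_consistent_change(("modified_state", "typo_city"), conflicts))         # False
```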
It is crucial to recognize that not all inconsistencies hold equal severity. The significance of identified inconsistencies depends on how severely the meaning of the image is harmed. Consequently, establishing a fixed threshold for inconsistencies is not feasible, as it cannot definitively categorize an image as either genuine or fake.

Significance of Attributes
We define significance as an important parameter reflecting the impact of introducing a change in an attribute. The significance of a particular attribute may vary across contexts. For instance, the color of objects has high significance in scenes involving blood [43]. Similarly, the timestamp is an important attribute in forensic applications. We ascertain the significance of attributes through a comprehensive analysis based on three key criteria: domain experts' opinions, the relevance of an attribute to a specific context, and the strength of its relationships with other attributes. To enhance our understanding, we utilize insights from image forensic experts, leveraging their expertise to obtain informed opinions on the significance of particular attributes, as demonstrated by Bohme et al. [12]. For instance, the image forensics community has established that shutter speed and exposure time represent the light availability at the time a picture was taken [11]; these attributes may therefore reflect an image's temporal information and are considered significant in time-critical scenes. The second criterion, the relevance of an attribute to a context, can be determined using any semantic similarity measure. We employ Latent Semantic Analysis (LSA) to ascertain the contextual or topical relevance of an attribute.
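The LSA relevance step can be sketched with scikit-learn: TF-IDF vectors reduced by truncated SVD, with cosine similarity in the latent space as the relevance score. The context documents, attribute, and component count below are illustrative assumptions, not the paper's corpus or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative context documents; a real deployment would use documents
# describing the image's claimed context.
docs = [
    "road accident at night involving two cars and an injured cyclist",
    "camera exposure time shutter speed and light availability",
    "timestamp gps coordinates and location of the photograph",
    "football match in the city stadium with a large crowd",
]
attribute = "shutter speed"

# TF-IDF followed by truncated SVD = Latent Semantic Analysis.
X = TfidfVectorizer().fit_transform(docs + [attribute])
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cosine similarity in the latent space gives the attribute's topical
# relevance to each context document.
scores = cosine_similarity(lsa[-1:], lsa[:-1])[0]
print(scores.shape)  # (4,)
```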
We leverage the significance of attributes to determine relatively more severe changes. We also consider the strength of relationships among attributes to determine the severity of the underlying modifications. In this respect, we define three different intensities of significant changes: highly significant, medium significant, and less significant. Significant attributes having strong relationships contribute more towards highly severe changes, whereas changes involving less significant attributes and weak relationships are categorized as less severe changes.

Quantifying the Severity of Changes
We define severity as a measure to quantify the intensity of changes in the non-functional attributes, and we formalize severity in terms of entropy. Entropy is a measure of the uncertainty in a variable's outcomes; the entropy of an image represents the uncertainty of the image being real. The entropy increases with the number of discrepancies discovered in the non-functional attributes. However, the increase in entropy is not linear, because changes in significant attributes result in higher entropy. We formulate the entropy in terms of n, the total number of pairs, and the number of inconsistent pairs, which is calculated using the similar-distance analogy. Two attributes are considered consistent when the distance between them is approximately similar to that of other pairs with similar characteristics. We define relatively stricter criteria to compare the distance between significant attributes: the distance between significant attributes is labeled as consistent only if it is precisely equal to that of other pairs of a similar kind, whereas for other attributes this distance can be approximately similar. These criteria imply the non-linear behavior of the entropy of an image. The concept of dynamic criteria is taken from [66].
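The paper's exact entropy formula is not reproduced in this excerpt, so the sketch below is a hypothetical severity score in the spirit of the text: inconsistent pairs raise a pseudo-probability of modification, significant-attribute pairs are weighted more heavily (the weight 2.0 is an assumption), and severity is the Shannon entropy of that probability, mapped into [0, 0.5] so the score stays monotone in the number of discrepancies as the text requires.

```python
import math

def severity_entropy(pairs, w_sig=2.0):
    """Hypothetical severity sketch (not the paper's exact formula).
    pairs: list of (is_inconsistent, is_significant) flags, one per
    attribute pair; n = len(pairs) is the total number of pairs."""
    n = len(pairs)
    # Weighted count of inconsistent pairs; significant ones count more,
    # giving the non-linear behavior described in the text.
    weight = sum(w_sig if sig else 1.0 for inc, sig in pairs if inc)
    p = 0.5 * weight / (w_sig * n)   # uncertainty, kept in [0, 0.5]
    if p == 0.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Three pairs: one inconsistent significant, one inconsistent ordinary.
print(round(severity_entropy([(True, True), (True, False), (False, False)]), 3))  # 0.811
```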
5.6.1 Algorithm to Compute Severity of Changes. We provide an algorithm that enlists all the steps involved in our proposed Exif2Vec framework. The algorithm takes an image's non-functional attributes as input and returns the severity of the underlying changes. Initially, the metadata is extracted from the image and a reverse image search is performed.

Fig. 9. Changes injected by ChatGPT (original metadata vs. modified metadata)

crime scenes, violent scenes, natural disasters and public gatherings. We extract the spatio-temporal and contextual information available with these images. Hence, this dataset contains three features/columns for each image, i.e., spatial attributes, temporal attributes, and contextual attributes.
Injecting Changes. This dataset consists of original images. To test our framework on this dataset, we systematically introduce different types of changes in the non-functional attributes of these images. We leverage ChatGPT to inject these changes. ChatGPT has undergone comprehensive training on a wide range of data, encompassing real image metadata; this extensive training gives the model a deep understanding of the characteristics and information associated with image metadata. ChatGPT is invoked via the ChatGPT API. Instructions are provided to ChatGPT to create different types of variations between the metadata of different versions of an image; in this respect, different combinations of changes in the spatial, temporal, and contextual attributes are introduced. The alterations made by ChatGPT to image metadata cannot be classified as synthetic, as ChatGPT is trained on a vast dataset derived from real-world sources; its modifications are therefore grounded in genuine linguistic patterns and context rather than being artificially generated. Figure 9 shows one example of changes injected by ChatGPT in an image's metadata. Changes in spatio-temporal attributes are injected in a way that misleads the viewer towards a different incident, while contextual attributes are changed to create a misperception about the image. Moreover, different levels of complexity are considered while injecting these changes: in comparatively simple cases, a few attributes are changed, whereas in relatively complex cases, many attributes are changed collectively to create a consistent change in the image's non-functional attributes. Spatio-temporal changes constitute 50% of the overall modifications, whereas the rest involve contextual attributes along with modified spatio-temporal features. Moreover, the modifications are introduced so as to create both consistent and inconsistent changes; consistent changes are 25% of the total introduced changes.
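A sketch of the injection step is shown below. Only the prompt construction is executed here; the prompt wording, model name, and sample metadata are assumptions for illustration, not the paper's actual instructions, and the API call itself is shown as a comment.

```python
import json

def build_injection_prompt(metadata: dict, change_types: list) -> list:
    """Build the chat messages used to ask ChatGPT to inject changes
    into an image's non-functional attributes (wording is illustrative,
    not the paper's exact instructions)."""
    system = ("You edit image metadata. Return the metadata as JSON with "
              "plausible modifications of the requested types.")
    user = (f"Metadata: {json.dumps(metadata)}\n"
            f"Inject changes to these attribute groups: {', '.join(change_types)}")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

msgs = build_injection_prompt(
    {"City": "Sydney", "Country": "Australia", "DateTime": "2021:03:14 09:26:53"},
    ["spatial", "temporal"],
)
# The messages would then be sent through the ChatGPT API, e.g.:
#   client.chat.completions.create(model="gpt-3.5-turbo", messages=msgs)
print(msgs[1]["content"])
```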
Image Verification Corpus. The second dataset in consideration is a well-established image verification corpus [13]. This dataset is continually evolving and comprises both fake and authentic social media images. Notably, it contains more than 2500 fact-checked viral social media posts. The dataset is composed of tweets and provides tweet ids, image urls, and the information shared with each tweet. It is a generalized dataset containing images related to multiple contexts, i.e., the Nepal earthquake, the Boston Marathon bombings, and Hurricane Sandy, etc. The images in this dataset are labeled as either fake or real. We use the Python Pillow library to extract metadata from the images: we import the Python Imaging Library module from Pillow to work with images and their metadata, and we extract specific metadata fields by accessing the keys of the metadata dictionary. We use the tweet ids and the authentication information of a Twitter developer account to fetch the tweets' details. We fetch spatial and temporal keywords from the textual information using the spaCy ___ model with the timexy pipe. We further simplify the data, retaining only the image id, spatial, temporal, contextual, and label columns.
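The Pillow extraction step can be sketched as follows. To keep the example self-contained, a small JPEG carrying two EXIF tags is built in memory (the tag values are fabricated); the extraction itself reads the EXIF dictionary and resolves numeric tag ids to human-readable field names.

```python
from io import BytesIO
from PIL import Image
from PIL.ExifTags import TAGS

# Build a small in-memory JPEG with a couple of EXIF tags so the
# extraction can be demonstrated without a real photograph.
exif = Image.Exif()
exif[0x010F] = "ExampleCam"           # Make (fabricated value)
exif[0x0132] = "2021:03:14 09:26:53"  # DateTime (fabricated value)

buf = BytesIO()
Image.new("RGB", (8, 8)).save(buf, format="JPEG", exif=exif)
buf.seek(0)

# Extraction as described in the text: read the EXIF dictionary and
# access specific metadata fields by their keys.
img = Image.open(buf)
metadata = {TAGS.get(tag, tag): value for tag, value in img.getexif().items()}
print(metadata["DateTime"])  # 2021:03:14 09:26:53
```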
We combine this dataset with another dataset related to fake news, i.e., source-based fake news classification [9]. This dataset contains fact-checked news posted by politicians, news channels, newspaper websites, and common civilians. We consider only those samples which have an image associated with the news. A lot of text is available with each news item; therefore, to filter out the unnecessary text, we use the KeyBERT model to fetch the 20 most relevant keywords from the text of each news item. These keywords serve as the image's contextual attributes, and are also used afterwards to compile a context-based corpus. We use the spaCy library and the timexy tool to get "Date", "Time" and "GPE" information from the text. Afterwards, we classify the text into spatial, temporal, and contextual categories. This meticulous data acquisition methodology lays a robust foundation for the subsequent stages of analysis and interpretation in this research endeavor.

Dataset Characteristics. The three datasets (context-based dataset, image verification corpus, and fake news dataset) each include images that have been fact-checked and sorted into two categories: fake and real. Figure 10 shows the class distributions in the datasets.

Generating Attribute Embeddings
We use a pretrained Word2Vec model to generate a distributed representation of attribute embeddings. The model is pretrained using Continuous Bag of Words (CBOW) [77]. Attribute embeddings are generated for each individual image. The experiments reveal that similar-distance analogies exist between many attribute pairs. For instance, Figure 11a shows the attribute embeddings generated for a sample image selected from the dataset. 'Altona' and 'Victoria' are the spatial attributes of the image. It is evident from the figure that a similar distance relationship exists between 'Altona' and 'Victoria' as between 'Parramatta' and 'NSW'. Therefore, 'Altona-Victoria' is a consistent pair of attributes. The pairs for which these analogies do not exist are labeled as inconsistent attributes.
To justify the use of word embeddings for image metadata, it is important to first explore the relationships between different metadata tags. We achieved significant success in exploring similar-distance relationships among many attributes, including many new relationships among attributes related to road accidents and crime scenes. For instance, Figure 12a shows that pedal-bicycle is related to accelerator-motorbike. Similarly, Figure 12b shows that shooting-gun and stabbing-knife are related to each other. These relationships are useful in finding inconsistent keywords in an image description. For instance, the keyword "right-hand driving" is inconsistent with countries in which left-hand driving is followed. Such inconsistencies reflect on the trustworthiness of an image. Figure 11b shows more examples of similar-distance relationships identified by the pretrained model.

Effectiveness
We report the performance of the proposed approach in terms of accuracy, precision, recall and F-score. Accuracy illustrates the correct identification of modifications. Precision, recall and F-score indicate the correct identification of true positives and true negatives, i.e., the correct identification of consistent and inconsistent changes. Precision reflects the proportion of correctly identified consistent changes among all changes predicted as consistent. We test the proposed framework in terms of accuracy and run-time efficiency.

6.3.1 Accuracy. We report the accuracy as the percentage of correctly discovered changes, and report it separately for attributes having strong and weak relationships. Figure 13 shows the accuracy of the proposed framework for three different types of changes, i.e., inconsistent changes, consistent changes and significant changes. The accuracy in determining inconsistent changes is relatively higher than for consistent changes and significant changes, because inconsistent changes are injected naively, whereas consistent and significant changes are introduced more systematically and are harder to discover. We conduct our experiments a total of 100 times to ensure a thorough assessment of the consistency of our results. Figure 14 shows the standard deviation of the % accuracy, providing a visual representation of the variability observed across the trials. The incorporation of measures of variability, including error bars and standard deviation, allows for a more nuanced interpretation of our results.

Vector Size. Figure 13 shows the accuracy of the pretrained Word2Vec model on the context-sensitive dataset for a vector size of 300. We also illustrate the trend in accuracy for varying vector sizes in Figures 15a and 15b. Specifically, Figure 15a reports the accuracy on the context-based dataset, whereas Figure 15b shows the accuracy on the general image verification corpus. It is evident from the figures that the accuracy increases with vector size, because a larger vector size better embeds the context in the vectors.
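The metrics above follow their standard definitions; the snippet below computes them for toy consistent/inconsistent labels (the labels are fabricated for illustration, not results from the paper).

```python
def prf(y_true, y_pred, positive="consistent"):
    """Accuracy, precision, recall and F-score for consistent-vs-
    inconsistent change labels (standard definitions)."""
    tp = sum(t == positive == p for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp)                 # avoids false positives
    rec = tp / (tp + fn)                  # avoids false negatives
    f1 = 2 * prec * rec / (prec + rec)    # harmonic mean
    return acc, prec, rec, f1

y_true = ["consistent", "inconsistent", "consistent", "inconsistent", "consistent"]
y_pred = ["consistent", "inconsistent", "inconsistent", "inconsistent", "consistent"]
print(tuple(round(v, 3) for v in prf(y_true, y_pred)))  # (0.8, 1.0, 0.667, 0.8)
```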

Context Window Size
The context window size influences the resulting embeddings. In some cases, the context of a keyword is defined by multiple words. For instance, the term "traffic signal" contains two words, and its context cannot be completely defined using only one word. The context window size represents the number of neighboring words used collectively to generate an attribute embedding. For instance, for the statement "the quick brown fox", a context window of two means the samples for 'the' are (the, quick) and (the, brown); sliding one word, the samples for 'quick' become (quick, the), (quick, brown) and (quick, fox), and so on. A larger window size can give higher accuracy due to more available training examples, but also results in longer training time. We repeat the experiments for multiple context window sizes to see the effect on accuracy. Figures 16a and 16b show the change in accuracy with context window size: Figure 16a reports the accuracy on the context-based dataset, whereas Figure 16b shows the accuracy on the general image verification corpus. It is evident from the figures that the accuracy increases when the context window is increased by 1 or 2, but starts decreasing if the window is increased further. There is a trade-off between context and accuracy: increasing the context window better introduces the context of a term, but a large context window loses discrepancies among the different sub-words of a keyword. Due to this trade-off, the accuracy of determining consistent changes drops even when the context window size exceeds 1.
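The windowing described in the "the quick brown fox" example can be sketched as a sample generator: for each center word, the context is the `window` words on either side.

```python
def cbow_samples(tokens, window):
    """Enumerate (context, center) training samples for a given context
    window size, as in the 'the quick brown fox' example above."""
    samples = []
    for i, center in enumerate(tokens):
        # up to `window` words on each side of the center word
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        samples.append((ctx, center))
    return samples

for ctx, center in cbow_samples(["the", "quick", "brown", "fox"], 2):
    print(center, "<-", ctx)
```

Sliding the window as shown here is why a larger window yields more context per sample (and longer training) but blurs distinctions between the sub-words of a multi-word keyword.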
Comparison with State-of-the-Art. We evaluate the efficacy of our proposed word-embeddings-based framework by contrasting its performance against two established baselines: Latent Semantic Analysis (LSA) and Term Frequency-Inverse Document Frequency (TF-IDF). We chose these baselines because they share the goal of depicting semantic relationships among words, aiming to capture nuanced meanings and contextual nuances. Furthermore, we offer a comparative analysis with a state-of-the-art method for detecting fake images on WhatsApp [38].

Latent Semantic Analysis. Latent Semantic Analysis represents the contextual meaning of words through statistical computations applied to a large corpus of text. LSA is widely used to quantify the semantic similarity between two sets of sentences or documents [25,41]. The effectiveness of LSA is reported in terms of the identified % dissimilarity in the non-functional attributes. The proposed Exif2Vec performs better than LSA.

6.3.2 Precision, Recall and F-score. We report the accuracy of determining consistent and inconsistent changes in terms of precision, recall and F-score. Precision reflects the performance in avoiding false positives, whereas recall indicates the performance in avoiding false negatives. Precision is calculated by dividing the number of true consistent/inconsistent changes by everything that was predicted as a consistent/inconsistent change. The precision for consistent changes is relatively better, as shown in Table 2. Recall refers to the percentage of the total consistent/inconsistent changes correctly classified by the proposed approach. F-score is the harmonic mean of precision and recall. The results reveal a significant recall for both consistent and inconsistent changes.

Run-time Efficiency. We train Exif2Vec on a vocabulary of 50,000 words downloaded from Wikipedia using wikidump. The training is hosted on a server with 32 GB of RAM and a Core i9 4 GHz CPU with 2 GPUs of 25 GB memory each. It takes around 3 days to train Exif2Vec using 4 workers/threads. Figure 18b shows the variation in the training run-time with the number of threads.

CONCLUSION
A novel framework named 'Exif2Vec' is proposed in this paper to discover discrepancies in an image's non-functional attributes and thereby ascertain the likelihood of the image being fake. The non-functional attributes represent the image's metadata and the information posted with the image. The proposed approach is unique in the sense that it does not require the use of the actual images to determine their trustworthiness. The framework employs a pretrained word embedding model to generate vector embeddings of the non-functional attributes. A similar-distance analogy is then leveraged to discover discrepancies among the non-functional attributes. Afterwards, we quantify the severity of the underlying modifications considering the significance of attributes in a given context. Experimental results demonstrate up to 80% effectiveness of the proposed framework. Our approach primarily focuses on metadata analysis. Complementing it with image processing could provide a more comprehensive assessment of image trustworthiness by considering intrinsic alterations such as shifts in tones, color intensity, or distortions; we recognize the importance of capturing the full range of non-functional attributes and their variations in social media images, and are open to exploring such integration in future iterations of the framework. The pretrained model contains vector embeddings of the 50,000 most commonly used words, and this corpus of words is context-independent. Future research could explore the potential benefits of utilizing larger and more diverse vocabularies to further enhance the framework's ability to detect a broader range of image manipulations. The proposed framework could also be trained on a context-sensitive corpus in the future to enhance accuracy. Another promising future direction of this work is to create and include vector embeddings for the quantitative non-functional attributes in the proposed model. Moreover, there is potential for expanding this work to place a greater emphasis on metadata analysis, thereby reducing dependence on information derived from social media posts. Additionally, a prospective avenue for exploration involves image provenance analysis [67], which would aid in determining the origin of an image and offers a valuable means to identify changes even in scenarios where all non-functional attributes have been modified. Finally, the proposed framework could be used as a criterion to digitally sign credible images and store the certificates on a blockchain [6].

Figure 7 gives an example of these types of changes. The figure shows the set of original and modified attributes of an image. It can be noticed that the original country and state names are 'United States' and 'Florida'.

Fig. 13. Combined percentage accuracy of the proposed framework tested on 3 datasets

Fig. 14. Standard Deviation in % Accuracy

Fig. 15. Accuracy of the model with different vector sizes

Figure 18a reflects that the proposed Exif2Vec performs equally well without relying on image-content-based features.

Table 1. Description of non-functional attributes

The image metadata tags are represented as non-functional attributes. The non-functional part of an image usually consists of spatio-temporal and contextual attributes, as listed in Table 1.