TTC-QuAli: A Text-Table-Chart Dataset for Multimodal Quantity Alignment

In modern documents, numerical information is often presented in multimodal formats such as text, tables, and charts. However, the heterogeneity of these sources poses a challenge for machines attempting to jointly read and understand the numerical semantics conveyed through text, tables, and charts. In this paper, we introduce a multimodal dataset called Text-Table-Chart Quantity Alignment (TTC-QuAli). This dataset is designed to facilitate a new task: linking related quantities across text, tables, and charts. TTC-QuAli contains 4,498 quantities in text, aligned with 1,086 chart images and 1,503 tables from real-world statistical reports. It is the first dataset to provide high-quality annotations for linking quantities across multiple modalities, and it includes challenging composite (aggregated/calculated) quantity linking. To bridge the representation gaps between different modalities and capture their shared contextual semantics, we introduce ConTTC, a novel transformer-based cross-modal contrastive learning architecture. It is the first architecture to jointly model text, tables, and charts, and it employs contrastive learning for unified quantity representation learning across modalities. Our experiments show that TTC-QuAli poses a significant challenge for existing baselines, serving as a valuable benchmark for future research, and that ConTTC significantly outperforms all baseline methods.


INTRODUCTION
Modern documents, such as Wikipedia pages, statistical reports, and scientific papers, often contain a significant number of quantities [42]. However, quantities are distributed across multiple modalities - text, tables, and charts - and must be interpreted in conjunction with one another. For example, in Figure 1, the quantity "93.9%", representing the percentage of the black population aspiring to obtain a university degree, appears in the text, table, and chart image and has the same semantic meaning across all three modalities. Furthermore, the value of "34%" can be derived by subtracting "59.9%" from "93.9%", both represented by chart elements, as illustrated in Figure 1. Extracting quantity relationships between text, tables, and charts holds significant value in the field of document intelligence. For example, it enables effortless reading by facilitating navigation through linked quantities within a document. It also streamlines smart editing by automatically synchronizing linked quantities to proactively avoid inconsistencies and errors. Additionally, it facilitates numerical reasoning by harnessing data from various multimodal sources, such as updating structured information on Wikipedia using real-time textual news. These examples represent only a fraction of the advantages that such cross-modal alignment can bring. While there have been recent works on text-table joint quantity alignment [5,17] and efforts to highlight related parts in charts based on mentions in text [44], these works have tackled tables and charts separately. To the best of our knowledge, no research has aligned text, tables, and chart images together. To address this gap, we introduce the first multimodal dataset with a new task for text-table-chart (TTC) quantity alignment - TTC-QuAli - to link quantities among text, tables, and charts. Specifically, snippets in text, cells in tables, and visual elements in charts are interlinked when they either indicate the same quantity or possess calculation relationships.
To construct a large multimodal dataset with high-quality annotations, we leveraged publicly accessible statistical reports from various organizations [1-3, 19, 33, 41]. These reports contain rich tables, charts, and textual descriptions. We collected statistical reports from StatCan and conducted comprehensive human labeling of text-table-chart alignment in these reports. As a result, TTC-QuAli was established with 4,498 quantity alignment samples across 1,086 chart images and 1,503 tables. TTC-QuAli possesses three unique characteristics. Firstly, it is the first multimodal dataset to include three types of data formats: text, tables, and chart images. Secondly, it provides high-quality and fine-grained annotations for quantity alignment between different modalities. Thirdly, in addition to one-to-one mapping of quantities, we also provide annotations for composite quantity alignment [17] that involves aggregation and calculation.
Given that linked quantities share the same contextual semantic meaning across different modalities, we propose ConTTC, a cross-modal architecture that models text, tables, and charts together. This architecture uses contrastive learning to learn a joint semantic subspace for quantities from multiple modalities. We propose two model variants based on VL-T5 [7] and VisionTaPas [30], with carefully designed feature channels for multimodal encoding and shared attention layers for information exchange among modalities. Importantly, we are the first to introduce contrastive learning for learning a joint representation subspace for quantities in different modalities. Intuitively, under contrastive learning, if quantities from two different modalities are linked, the distance between their learned representations should be pulled closer to indicate shared semantic and numerical meaning. Otherwise, representations of different quantities should be pushed apart. We employed contrastive-learning-based pre-training by leveraging large existing corpora of paired text-table and table-chart data.
Our experiments demonstrate that TTC-QuAli presents a significant challenge for existing baselines and leaves considerable room for future work. However, joint text-table-chart encoding and contrastive-learning-based pre-training prove highly effective.

PRELIMINARY

Task Formulation
We formalize the task of bridging quantities in text, tables, and charts as a quantity alignment problem. We first introduce two subtasks, text-table quantity alignment and text-chart quantity alignment, before introducing the joint text-table-chart quantity alignment task. For a document with one or more tables and charts with textual descriptions, a textual sentence $s$ may contain a set of $n$ text mentions of quantities $Q^s = \{q^s_i : i = 1, \dots, n\}$. Quantity mentions can be numbers or phrases that refer to aggregated/calculated values of multiple quantities, such as a sum or difference. For example, "a total of thirteen patients" can refer to an aggregate value - the sum of the numbers of patients from multiple categories in a table. A table $T$ contains $r$ rows, $c$ columns, and a set of $m$ cells of quantities $Q^T = \{q^T_j : j = 1, \dots, m\}$. A chart $C$ has a set of $k$ visual elements of quantities $Q^C = \{q^C_l : l = 1, \dots, k\}$.
(1) Text-table quantity alignment: given a table $T$ and an associated textual description with a quantity mention $q^s_i$, find the subset of table cells $\{q^T_1, q^T_2, \dots\} \subseteq Q^T$ that aligns with (is referenced by) the quantity mention $q^s_i$.
(2) Text-chart quantity alignment: given a chart $C$ and its textual description with a quantity mention $q^s_i$, find the subset of chart elements $\{q^C_1, q^C_2, \dots\} \subseteq Q^C$ that aligns with the quantity mention $q^s_i$, e.g., find the two yellow quantity elements in Figure 1 given "34%" in the textual description.
(3) Text-table-chart quantity alignment: given a specified quantity mention $q^s_i$ in a textual description and one or more tables and charts, find the target table $T$ with a subset of quantity cells $\{q^T_1, q^T_2, \dots\} \subseteq Q^T$ that aligns with $q^s_i$, or the target chart $C$ with a subset of chart elements $\{q^C_1, q^C_2, \dots\} \subseteq Q^C$ that aligns with $q^s_i$.
We use accuracy as the evaluation metric, defined as the proportion of quantities in text that are correctly linked with table cells and chart elements. More details are presented in Sections 4.1 and 4.2.
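As a concrete illustration of this metric, the following minimal sketch (function and variable names are our own, not from the paper) scores a prediction as correct only when the predicted set of cells/elements exactly matches the annotated set:

```python
def alignment_accuracy(predictions, gold):
    """Exact-match accuracy: a quantity mention counts as correctly linked
    only if the predicted set of cells/elements equals the annotated set.

    predictions, gold: lists of sets of cell/element identifiers,
    one set per quantity mention.
    """
    correct = sum(1 for p, g in zip(predictions, gold) if set(p) == set(g))
    return correct / len(gold)

# a composite mention like "34%" must link to BOTH operand elements
print(alignment_accuracy([{"B3", "B5"}, {"C2"}],
                         [{"B3", "B5"}, {"C4"}]))  # -> 0.5
```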
Text-table-chart quantity alignment presents several challenges for existing methods. (1) It requires an understanding of contextual semantics across different modalities. Tables have two-dimensional structures with flat or even hierarchical headers [5], while charts come in various types with different layout structures, such as bar charts and pie charts. (2) Quantity mentions in text often differ in format from their counterparts in table cells, for example, "one-third" vs. "33.9", or "1.3 million" vs. "1,323,876". For charts, accurately estimating the numerical value of visual elements without explicit quantity mentions can be challenging; however, existing methods [29] that extract the underlying data from chart images can help mitigate this gap. (3) Composite quantity alignment [17] involves aggregated/calculated quantities. In such cases, the text mention should be aligned with all operand cells or elements, and the search space grows exponentially.

Dataset Construction and Analysis
Dataset Collection. Documents in the Text-Table-Chart (TTC) dataset are collected from Statistics Canada [41], a source first leveraged by HiTab. The website hosts thousands of reports authored by over 1,000 domain experts, encompassing textual descriptions, tables, and charts. As a result, the dataset exhibits high quality and reflects a genuine comprehension of the underlying data. The majority of these reports are available in HTML format, which facilitates automatic parsing.
We follow HiTab to collect 6,039 English statistical reports through web crawling from StatCan, covering domains such as agriculture, business, education, energy, and environment. We only keep reports that contain charts, tables, or both. First, for table-text alignment, we collect tables and their surrounding paragraphs and save them to spreadsheets for efficient labeling of cell references. Note that HiTab provides both entity and quantity alignment annotations for some of our collected reports, so we extract the existing quantity alignment annotations from HiTab as additional information for annotators to check and reference. Second, for chart-text alignment, we collect chart images and their surrounding paragraphs. For each chart, the surrounding paragraphs are split into sentences and stored in a spreadsheet. To facilitate the subsequent annotation process, we also store the chart image in the spreadsheet. As shown in Figure 2(b), each pixel of the image is stored in a cell of the spreadsheet in order: the row and column indexes of cells correspond to the indexes of pixels in the image, and the color of each pixel becomes the background color of the corresponding cell, so the spreadsheet looks like the image, making it easy to select and record regions, as shown in Figure 2(a). In this way, we can annotate aligned elements for charts within Excel, as introduced in the following paragraph. We use the ChartOCR method [29], enhanced by [30], to extract the corresponding data table from the chart. This allows us to establish explicit mapping relationships between elements in a chart and cells in the extracted data table, which are recorded during the data extraction process of ChartOCR. The extracted data table, although not always 100% correct, is also saved to the spreadsheet for labeling.

Text-Chart Quantity Alignment Annotation. The annotation process consists of several steps. (1) We checked for names, uniquely identifying information about individuals, or offensive content in the text or charts; no such sensitive information was found in our dataset. (2) We identified quantity mentions in the text, such as "94%" and "one-third". (3) We aligned corresponding elements in the chart by recording the bounding boxes of elements in the annotation spreadsheet, e.g., a vertical bar with bounding box "NW250:PP372" in Figure 2. The annotated regions vary by chart type: for a bar chart, we record the region formed by the upper-left and lower-right corners of the bar; in a line chart, we mark the location of the data point; in a pie chart, we label two vertices of a sector as well as the center of the circle. Multiple elements are annotated if there is a 1-to-many composite quantity link with calculation. (4) We record the corresponding cells in the extracted data table for the aligned elements, which enables easy evaluation of quantity alignment methods that build upon a preceding table extractor in Section 4. In this step, annotators skipped elements without correct corresponding cells extracted by ChartOCR. (5) The aggregation type is labeled, consisting of aggregation functions (sum, average) and other operators (difference, division, add, change ratio, max, min).
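To make the pixel-to-cell rendering concrete, here is a minimal sketch of the idea (our illustration, not the authors' tooling; openpyxl and Pillow are assumed, and we downsample so the sheet stays a manageable size):

```python
from PIL import Image
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

def chart_to_spreadsheet(image_path, xlsx_path, max_size=200):
    """Render a chart image as colored spreadsheet cells so annotators can
    select and record regions (e.g. "NW250:PP372") directly in Excel."""
    img = Image.open(image_path).convert("RGB")
    img.thumbnail((max_size, max_size))   # downsample: one cell per pixel
    wb = Workbook()
    ws = wb.active
    for x in range(img.width):            # narrow columns -> square-ish cells
        ws.column_dimensions[get_column_letter(x + 1)].width = 2
    for y in range(img.height):
        for x in range(img.width):
            r, g, b = img.getpixel((x, y))
            color = f"{r:02X}{g:02X}{b:02X}"
            ws.cell(row=y + 1, column=x + 1).fill = PatternFill(
                start_color=color, end_color=color, fill_type="solid")
    wb.save(xlsx_path)

# chart_to_spreadsheet("chart.png", "chart_annotation.xlsx")
```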
Since quantities in textual sentences are often approximate and sometimes involve composite cases, annotators are trained to precisely align their counterparts using structured contextual information such as axis ticks and legends in chart data.

Text-Table Quantity Alignment Annotation
The annotation of text-table quantity alignment is roughly similar to text-chart alignment, and annotators record linked table cells using indices like "A17". For those reports already annotated by HiTab, we extract the existing quantity alignment annotations for annotators to reference.

Inter-Annotator Agreement. 5% of samples are randomly selected from the whole dataset and labeled by all annotators to estimate annotation quality. For each sample with a specified quantity in text, the labels from two annotators are considered consistent if their linked elements in the corresponding chart are the same. The Fleiss Kappa metric is 0.83, which corresponds to "almost perfect agreement" [24].

Dataset Statistics and Comparison. As shown in Table 1, we have 1,503 tables with 3,245 linked quantities in textual descriptions, including 302 composite quantities and many complex single quantities involving approximation, unit conversion, opposites, max, min, etc. We also have 1,086 charts with 1,253 linked quantities in textual descriptions, including 137 composite quantities. 63.5% of reports contain both tables and charts. Most tables have a hierarchical structure and involve complex quantity indexing. Charts include 70.0% bar charts, 11.3% line charts, 9.2% pie charts, and 9.5% other types such as combinations of bar and line charts. The average number of table cells (102.6) is much larger than the average number of chart elements (18.8). The TTC dataset has distinct characteristics: (1) it is the first dataset to include the three modalities of text, table, and chart; (2) compared to other chart-related datasets that focus on QA, such as PlotQA [31], FigureQA [21], DVQA [20], and LEAF-QA [4], TTC is the first to propose the task of quantity alignment, and it includes diverse types of charts with real-world textual descriptions authored by professionals.

Pre-training Corpus
This section introduces large existing datasets of paired text-table and table-chart data that can be used for pre-training.

Chart-Table Corpus
Annotating charts in image format can be labor-intensive. As an alternative, PlotQA synthesized a total of 224k chart images using tabular data from World Bank Open Data, Open Government Data, and the Global Terrorism Database. The dataset provides fine-grained information for elements such as axis titles, axis ticks, legends, titles, bars, and lines. Thus, we can align table cells with their corresponding elements in charts directly through reverse engineering. This alignment information can be used for pre-training to align quantity representations between charts and tables.

Text-Table Corpus
ToTTo is a large dataset for the table-to-text task, consisting of 120k tables with human-labeled table-text alignments that highlight mentioned cells in tables. We filter candidate quantities from the highlighted cells and traverse the word snippets in the textual descriptions. We use strict perfect matching to derive highlighted quantities in tables that have a 1-to-1 mapping to textual snippets, which best ensures the linking quality for table-text alignment pre-training.
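A minimal sketch of this kind of strict-matching filter (our illustration, not ToTTo's or the authors' exact pipeline; the regex is an assumption):

```python
import re

NUM = re.compile(r"\d[\d,]*\.?\d*")

def strict_pairs(highlighted_cells, sentence):
    """Keep a highlighted quantity only if its value appears exactly once in
    the sentence and exactly once among the highlighted cells (1-to-1)."""
    text_nums = NUM.findall(sentence)
    values = [c.strip() for c in highlighted_cells]
    pairs = []
    for value in values:
        if (NUM.fullmatch(value)
                and text_nums.count(value) == 1
                and values.count(value) == 1):
            pairs.append((value, sentence))
    return pairs

print(strict_pairs(["1,323,876", "Canada"],
                   "The population reached 1,323,876 in 2016."))
# -> [('1,323,876', 'The population reached 1,323,876 in 2016.')]
```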

Text-Chart Corpus
There is still no dataset that can be adapted for text-chart pre-training. Fortunately, we can bridge text-table-chart representations through table-chart and text-table pairs.

METHOD
We introduce our method for TTC quantity alignment in the following sections: (1) a unified transformer-based framework that encodes the tri-modal information of text, tables, and charts together; (2) contrastive learning over large-scale datasets for multimodal pre-training to learn a joint representation space of quantities; (3) a decoding method based on T5 and a classification method based on TaPas for quantity alignment.

Cross-Modal Encoder
Transformer-based methods like TaPas [14] and TaBERT [45] are widely used to jointly encode tables and text [11]. [30] further proposed VisionTaPas and VL-T5 to encode text and charts together for question answering over charts. In this work, we explore two mainstream transformer designs, an encoder-only transformer (VisionTaPas) and an encoder-decoder transformer (VL-T5), for joint text-table-chart encoding. They take text, tables, and charts as input and employ attention layers for cross-modal information interaction, as shown in Figure 4. We use the enhanced ChartOCR introduced by ChartQA [30] to extract visual elements and data tables from charts.

TTC(VL-T5).
VL-T5 [7] is an extension of T5 [35] for vision-language (VL) tasks. ChartQA employs VL-T5 in the following manner. For the textual input, it adopts the method of VL-T5 by flattening the data table of the chart image and concatenating it with the textual question. For visual inputs, ChartQA extracts features of visual elements in the chart image (e.g., bars, lines) using Mask R-CNN [13]. Since the total number of elements varies across charts, ChartQA pads the features with zeros to a fixed length of 36 elements. In this paper, we refer to visual element features as special [VET] tokens.
To extend VL-T5 to joint text-table-chart encoding, we propose TTC(VL-T5) by changing the serialization pipeline of VL-T5 with the following adaptations. (1) Text, tables, and charts are serialized sequentially based on their order in the document, as illustrated in Figure 3. We iterate over table cells row by row and tokenize their strings in the same way as T5. (2) We insert special tokens: <text ID> at the beginning of each text snippet, <table ID> for each table, and <chart ID> for each chart. We add special tokens [head ID] and [row ID] for each header and data row, respectively, as illustrated in Figure 3. (3) We associate elements ([VET] tokens) with their corresponding data cells for charts and encode them sequentially, rather than putting all elements at the end of the sequence, which would lose the critical pairing information. As shown in Figure 4, we reuse VL-T5's embedding mechanism for box coordinates and RoI features (through Mask R-CNN), and we use the default position embedding of T5 rather than the image ID and region ID introduced by VL-T5.
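The following minimal sketch illustrates this serialization order; the exact token spellings and data-structure layout are our assumptions for illustration:

```python
def serialize_document(texts, tables, charts):
    """Flatten text, tables, and charts in document order with special
    tokens (token spellings and structures are illustrative)."""
    pieces = []
    for i, sentence in enumerate(texts, 1):
        pieces += [f"<text_{i}>", sentence]
    for i, table in enumerate(tables, 1):
        pieces.append(f"<table_{i}>")
        pieces += ["[head_id]"] + table["header"]   # header row
        for row in table["rows"]:                   # iterate cells row by row
            pieces += ["[row_id]"] + row
    for i, chart in enumerate(charts, 1):
        pieces.append(f"<chart_{i}>")
        # keep each [VET] token adjacent to its extracted cell value so the
        # element-cell pairing is preserved in the sequence
        for cell_value in chart["extracted_cells"]:
            pieces += ["[VET]", cell_value]
    return " ".join(pieces)

print(serialize_document(
    texts=["93.9% of Black adults aspired to a university degree."],
    tables=[{"header": ["Group", "Percent"],
             "rows": [["Black population", "93.9"]]}],
    charts=[{"extracted_cells": ["93.9", "59.9"]}]))
```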

VisionTaPas.
TaPas is a BERT-based [10] end-to-end joint table-text model. It is pre-trained on millions of tables, which enhances its ability to select cells and reason over tables. As shown in Figure 4, the input to TaPas has a leading [CLS] token in front of the whole text input and multiple [SEP] tokens to separate cells. Tokens in tables are encoded with column and row embeddings in addition to the segment and positional embeddings of BERT. VisionTaPas [30] is an adaptation of TaPas for encoding a chart and text. It consists of: a vision transformer (ViT) encoder for the chart image, a TaPas encoder for the question and the table, and a cross-modal encoder on top of them. The ViT encoder produces embeddings for image patches [30]. The cross-modal encoder consumes the output of the ViT and TaPas encoders and produces multimodal encodings with cross-attention layers, as shown in Figure 4.
We propose TTC(VisionTaPas) by extending VisionTaPas to joint text-table-chart encoding: (1) serializing text, tables, and charts sequentially and adding [CLS] to each text snippet, table, and chart; (2) extending segment embeddings to distinguish modality snippets, with identifiers {0, 1, 2} indicating the textual, tabular, and chart parts, respectively; the token embeddings are summed with positional and segment embeddings before being fed to the attention layers; (3) adding index embeddings to encode the IDs of text, tables, and charts based on their order in the document.
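A minimal PyTorch sketch of how these embedding types could be summed (dimensions, vocabulary sizes, and limits are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class TTCEmbeddings(nn.Module):
    """Sum of token, positional, segment (modality), and index embeddings,
    fed to the attention layers (sizes here are illustrative only)."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=2048,
                 num_modalities=3, max_index=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_pos, hidden)
        self.seg = nn.Embedding(num_modalities, hidden)  # 0=text, 1=table, 2=chart
        self.idx = nn.Embedding(max_index, hidden)       # snippet order in the document

    def forward(self, token_ids, segment_ids, index_ids):
        # token_ids, segment_ids, index_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids) + self.pos(positions)
                + self.seg(segment_ids) + self.idx(index_ids))

emb = TTCEmbeddings()
out = emb(torch.zeros(1, 5, dtype=torch.long),
          torch.tensor([[0, 1, 1, 2, 2]]),
          torch.tensor([[0, 0, 0, 1, 1]]))
print(out.shape)  # torch.Size([1, 5, 768])
```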

Cross-modal Contrastive Learning
Considering that a linked quantity shares the same contextual semantic meaning across different modalities, we introduce contrastive learning to learn a joint representation space for quantities in different modalities ($q^s$, $q^T$, and $q^C$) to facilitate quantity alignment. Intuitively, if two quantities from two different modalities are linked together ($q^s$-$q^T$, $q^s$-$q^C$, or $q^T$-$q^C$), the distance between their learned representations should be pulled closer. Otherwise, representations of different quantities should be pushed apart.
Given a positive example of a linked quantity pair $(x, y)$ from two distinct data modalities, we randomly pick a set of $N-1$ samples $\{y^-_i\}_{i=1}^{N-1}$ from the same batch as negative examples. We then apply the multi-class N-pair contrastive loss [40]:

$$\mathcal{L}(x, y) = -\log \frac{\exp(\mathrm{sim}(x, y)/\tau)}{\exp(\mathrm{sim}(x, y)/\tau) + \sum_{i=1}^{N-1} \exp(\mathrm{sim}(x, y^-_i)/\tau)}$$

where $\mathrm{sim}(x, y) = f(x)^\top f(y)/(\|f(x)\|\|f(y)\|)$ is the cosine similarity function and $\tau$ is the temperature hyper-parameter. Considering that a quantity usually spans multiple tokens in a text span or a table cell, we apply average pooling of the token embeddings on top of the transformer encoder to obtain the quantity embedding, as shown in Figure 3. Similarly, for elements in charts, we use the average token embedding of the corresponding data extracted from them, e.g., "59.9" in Figure 3. Due to corpus limitations, the contrastive-learning-based pre-training stage only covers single quantity alignment with a 1-to-1 mapping, not composite quantity alignment, which may have a 1-to-many mapping.
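A minimal PyTorch sketch of this loss under the definitions above (the pooling helper, shapes, and toy usage are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def pool_quantity(hidden_states, span):
    """Average-pool the token embeddings of a quantity span (start, end)."""
    return hidden_states[span[0]:span[1]].mean(dim=0)

def npair_contrastive_loss(anchor, positive, negatives, tau=1.0):
    """Multi-class N-pair loss over cosine similarities.

    anchor, positive: (d,) pooled embeddings of a linked quantity pair
    from two modalities; negatives: (N-1, d) other quantities in the batch.
    """
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)  # positive = class 0
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

# toy usage: hidden states for 10 tokens, quantity spanning tokens 2..4
h = torch.randn(10, 768)
q_text = pool_quantity(h, (2, 4))
q_cell = q_text + 0.01 * torch.randn(768)       # a linked counterpart
negs = torch.randn(8, 768)                      # 8 in-batch negatives
print(npair_contrastive_loss(q_text, q_cell, negs, tau=1.0))
```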
Contrastive learning with masked quantity values. In addition to the normal inputs, we explore a more challenging setting that encourages the model to aggregate contextual information about quantities rather than relying on individual numerical values. For example, in Figure 1, both "33.9" and "36.0" are approximately "one-third", so the contextual semantics carried by "Black adults" in the text and "Black population" in the table are key for the final decision. To force the model to find linked quantities using context information, we introduce an input setting where all quantities are replaced with [MASK] tokens to eliminate numerical values. 30% of sampled data is processed in this manner, and we find that this auxiliary task improves the effectiveness of contrastive learning for both single and composite linking.
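A minimal sketch of such quantity masking (the regex and the per-sentence probability are our simplification of the 30% setting):

```python
import random
import re

QUANTITY = re.compile(r"\d[\d,]*\.?\d*%?")

def mask_quantities(sentence, mask_token="[MASK]", p=0.3, rng=random):
    """With probability p, replace every quantity in the input with [MASK]
    so the model must rely on context, not the numeric value itself."""
    if rng.random() < p:
        return QUANTITY.sub(mask_token, sentence)
    return sentence

print(mask_quantities("93.9% of Black adults vs 59.9% overall.", p=1.0))
# -> "[MASK] of Black adults vs [MASK] overall."
```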
In the text-table-chart scenario, the availability of information from all three modalities simultaneously enables a combined contrastive learning loss over the pairwise terms:

$$\mathcal{L}_{\mathrm{con}} = \mathcal{L}(q^s, q^T) + \mathcal{L}(q^s, q^C) + \mathcal{L}(q^T, q^C)$$

We pre-train with linked quantity pairs from the large text-table and table-chart datasets introduced in Section 2.3. Additionally, we include the training set of TTC-QuAli in the pre-training phase to bridge the text-table-chart tri-modality. To this end, we introduce ConTTC(VL-T5) and ConTTC(VisionTaPas), which enhance TTC(VL-T5) and TTC(VisionTaPas) with contrastive learning and pre-training.

Fine-tuning for Quantity Alignment
In this section, we introduce the modules built upon the multimodal encoders for quantity alignment fine-tuning. Unlike the pre-training stage, our approach in the fine-tuning stage considers not only the one-to-one quantity mapping across modalities but also the one-to-many mapping, which may involve calculation over multiple table cells or chart elements. We introduce fine-tuning pipelines in two branches, the encoder-based TTC(VisionTaPas) and the encoder-decoder-based TTC(VL-T5). Since underlying tables are extracted from charts by ChartOCR with a recorded cell-element mapping, the problem of element linking in charts can be transformed into linking cells in the extracted data tables (as shown in Figure 4), whose results are then mapped directly back to the visual elements in the chart images. Thus, charts and tables share the same cell-linking pipeline in the following approaches.

TTC(VL-T5).
Since VL-T5 employs an encoder-decoder architecture, the model directly decodes quantities in cells (using commas to separate cells for composite alignment), e.g., "<table 3> [row 2] 33.9" or "<chart 2> [row 3] 8.2, [row 3] 3.2". Cells are generated in a top-to-bottom, left-to-right order. For evaluation, simple rules parse the table/chart ID and row ID, and the generated quantities are then mapped to the corresponding cells through edit distance.
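A minimal sketch of parsing such decoder outputs (the regex and output format are illustrative; the paper's exact parsing rules may differ, and the edit-distance mapping to cells is omitted):

```python
import re

PATTERN = re.compile(r"<(table|chart) (\d+)>|\[row (\d+)\] ([^,\[<]+)")

def parse_decoded(output):
    """Parse decoder output like '<chart 2> [row 3] 8.2, [row 3] 3.2' into
    a target (kind, id) and a list of (row, value) predictions."""
    target, cells = None, []
    for m in PATTERN.finditer(output):
        if m.group(1):
            target = (m.group(1), int(m.group(2)))
        else:
            cells.append((int(m.group(3)), m.group(4).strip()))
    return target, cells

print(parse_decoded("<chart 2> [row 3] 8.2, [row 3] 3.2"))
# -> (('chart', 2), [(3, '8.2'), (3, '3.2')])
```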

TTC(VisionTaPas).
VisionTaPas includes multiple customized classification heads for aggregation prediction, column selection, and cell selection [30]. We revise and extend these classification heads for the quantity alignment task. In our TTC setting, a document may contain one or more tables and charts, so the model needs to choose one to align with the target quantity. To this end, we introduce a new head for table/chart selection. To predict the aggregation operator, we use a two-layer fully-connected network that operates on the final hidden state of the special [CLS] token of each table and chart, trained with cross-entropy loss. Composite quantity alignment can involve multiple cells in a row or a column, so we introduce a parallel row selection head to close the gap left by using only a column selection head. We reuse the aggregation head with aggregation functions in two directions, such as sum-row, sum-column, division-row, and division-column, to coordinate the row and column selection heads. Losses from the heads are summed with equal weights during training. In the inference phase, the table/chart with the highest score from the table/chart selection head is chosen; column/row selection is done similarly. Cell selection uses a threshold of 0.5 to filter cells in the selected row or column. For operators with a fixed number of operands (e.g., the "none" operator needs exactly one operand and the "difference" operator needs exactly two), the top-scoring cells from the cell selection head are taken, which avoids selecting a wrong number of cells.
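A minimal sketch of this cell selection logic with fixed-arity operators (the operator arities and 0.5 threshold follow the text; the function itself is our illustration):

```python
import torch

OPERAND_COUNT = {"none": 1, "difference": 2, "division": 2}

def select_cells(cell_scores, operator, threshold=0.5):
    """Pick cells from the chosen row/column: thresholding for open-ended
    operators (e.g. sum), top-k for operators with a fixed operand count."""
    k = OPERAND_COUNT.get(operator)
    if k is not None:  # fixed arity: force exactly k cells
        return torch.topk(cell_scores, k).indices.tolist()
    return (cell_scores > threshold).nonzero(as_tuple=True)[0].tolist()

scores = torch.tensor([0.1, 0.9, 0.7, 0.2])
print(select_cells(scores, "difference"))  # -> [1, 2]
print(select_cells(scores, "sum"))         # -> [1, 2]
```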

EXPERIMENTS
Since existing methods only separately target text-table or text-chart joint modeling, we first compare baseline methods separately on the two sub-tasks, text-table alignment and text-chart alignment, and then present results on joint text-table-chart alignment.

Experiment Setup. The train (70%), dev (15%), and test (15%) sets are split by reports to avoid document-level overlap. We only use the text-table annotations for the text-table alignment sub-task and the text-chart annotations for the text-chart alignment sub-task. We keep at most 16 tokens for each cell and at most 96 tokens for each sentence. We restrict each sentence to have exactly one specified quantity to be aligned, splitting a sentence into sub-sentences if it contains multiple quantities, so that the model does not need additional indicators for the specified quantity. We extend the supported sequence length to 2,048 and use a batch size of 1 for all transformer-based methods for equal comparison. The beam size is 5 for both training and inference for all methods with transformer decoders. The temperature hyper-parameter in contrastive learning is set to 1.0. We adopt the PyTorch versions in Hugging Face for TaPas, BART [25], and T5-based models. T5, VL-T5, TTC(VL-T5), and ConTTC(VL-T5) take the large version of T5 as the backbone.
The loss functions of the classification heads are added together with equal weights when fine-tuning the TaPas series. We follow the basic model configuration of [30]; e.g., the learning rate is set to 0.00001. Pre-training takes 1 epoch and fine-tuning takes 10 epochs for ConTTC. We run experiments on 8 Nvidia V100 GPUs and 4 Nvidia A100 GPUs.

Text-table alignment
Given a sentence with a specified quantity and a table, the model needs to return one or more aligned cells with the specified quantity.The metric is defined as the proportion of quantities in the text that are correctly linked with cells.

Baselines.
Rule-based method. We use rules to compute the degree of matching between candidate cells in tables and quantity snippets in the text. The Levenshtein distance serves as the similarity metric, and the cell with the highest score is selected.
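A minimal sketch of this rule-based baseline (our illustration; since Levenshtein distance is a dissimilarity, the highest-scoring cell is the one with the minimum distance):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, single-row variant."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def best_cell(mention, cells):
    """Select the table cell most similar to the quantity mention."""
    return min(cells, key=lambda c: levenshtein(mention, c))

print(best_cell("33.9", ["1,323,876", "33.9", "36.0"]))  # -> "33.9"
```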
TaPas. We follow the standard input format and reuse the classification heads of TaPas for cell selection. To fit the TaPas input, we convert hierarchical tables into flat ones following [34]: we unmerge cells spanning multiple rows/columns in the left/top headers and duplicate their contents into the unmerged cells. We align the model configuration with the original TaPas and initialize from the pre-trained TaPas checkpoint.
BART. The raw sentence and flattened table are concatenated as the input, following TaPEx [28]. We align the model configuration with BART-large in Hugging Face. We use BART to decode one or more aligned quantities, like TTC(VL-T5). T5. T5 is introduced in Section 3. We use the same serialization and decoding settings as TTC(VL-T5), and the large version is adopted. TaPEx. TaPEx is based on BART; it encourages the model to learn SQL execution through pre-training and shows high effectiveness on table reasoning. We use the BART setting, and the model is initialized from the pre-trained TaPEx-large checkpoint.

Experiment Result and Analysis. As shown in Table 2, the rule-based method achieves 56.2% accuracy on single quantity alignment, but it cannot deal with composite alignment. TaPas, although pre-trained on a large number of unlabeled tables, is outperformed by TaPEx; we believe the reasons are twofold: (1) TaPEx can flexibly output multiple cells with its decoder, whereas the original TaPas can only select multiple cells within a column - but in our dataset, composite quantities may also span a row, which is out of scope for the original TaPas; (2) TaPEx uses the BART backbone, which is larger than the BERT backbone used by TaPas. Contrastive-learning-based pre-training improves the accuracy of TTC(VL-T5) and TTC(VisionTaPas) by 6.3% and 4.8%, respectively. Comparing ConTTC(VL-T5) and ConTTC(VisionTaPas), we find that ConTTC(VL-T5) performs better on single quantity alignment, while ConTTC(VisionTaPas) outperforms ConTTC(VL-T5) on composite quantity alignment, which we attribute to its carefully designed coarse-to-fine prediction heads for column/row/cell selection.

Text-Chart alignment
For text-chart alignment, given a sentence with a specified quantity and a chart, the model needs to return the elements aligned with the specified quantity. As introduced in Section 2.2, we use ChartOCR enhanced by [30] to recognize visual elements and extract the data table from the chart to support ConTTC and the baseline methods. Elements in a chart and cells in the extracted data table have explicit mapping relationships, which are recorded during the data extraction process of ChartOCR. Thus, the text-chart alignment process can be divided into (1) linking cells in the extracted data table and (2) mapping the linked cells to elements in the chart. The metric is defined as the proportion of quantities in text that are correctly linked with elements in charts; since the annotation process records the corresponding cell for each element, this equals the proportion of quantities in text that are correctly linked with the extracted data table.
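A minimal sketch of stage (2), mapping linked cells back to chart elements via the recorded mapping (the data structures are hypothetical):

```python
def link_chart_elements(linked_cells, cell_to_element):
    """Map cells linked in the extracted data table back to visual elements
    (e.g. bounding boxes) using the mapping recorded by the chart extractor."""
    elements = []
    for cell in linked_cells:
        if cell in cell_to_element:   # skip cells the extractor missed
            elements.append(cell_to_element[cell])
    return elements

mapping = {("row 2", "col 1"): "bar bbox (250, 310, 372, 480)"}
print(link_chart_elements([("row 2", "col 1")], mapping))
# -> ['bar bbox (250, 310, 372, 480)']
```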

Baselines.
PlotQA. PlotQA employs a table question-answering engine fed with a table extracted by detecting visual elements in the image. We fine-tune the [34] model pre-trained on PlotQA [31] for text-chart alignment. LayoutLM. Based on visual objects (labels, legends, bars, etc.) detected via ChartOCR, we use pre-trained LayoutLM-v2 to encode the chart objects with two-dimensional coordinates, and then use a binary classification head for element selection. Each element is attached with the text of its element type and the numerical value detected by ChartOCR. In this setting, the model does not need the extracted data table. VL-T5. We adopt VL-T5 introduced in Section 3.1 and use the large version of T5 for consistent comparison with TTC(VL-T5). We keep the implementation settings consistent with [30]. VisionTaPas. We adopt VisionTaPas introduced in Section 3.1 with 12-layer separate encoders and a 4-layer joint encoder. We keep the implementation settings consistent with [30].

Experiment Result and Analysis. The text-chart alignment task has distinct challenges compared with text-table alignment: (1) directly modeling the chart image and reasoning over it is challenging; (2) pre-extracting the data table from the chart may introduce errors, which propagate to the subsequent quantity linking. As Table 3 shows, the accuracy of text-chart alignment (e.g., by VisionTaPas) is accordingly lower than that of text-table alignment (e.g., by TaPas). PlotQA only achieves 42.2% total accuracy. LayoutLM, which does not use pre-extracted data tables, only achieves sub-optimal results with low composite alignment accuracy. VisionTaPas also does not perform well on composite alignment because its original implementation does not support selecting multiple cells in a row. ConTTC(VL-T5) and ConTTC(VisionTaPas) still outperform all baseline methods, showing the effectiveness of contrastive-learning-based pre-training.

Unified Text-Table-Chart alignment
Considering that multiple tables and charts usually co-exist in modern documents, we explore a setting that links quantities among one or more tables and charts within a document. New challenges include: (1) modeling multiple modalities together; (2) locating a single table or chart among multiple candidates. As introduced in the experiment setup, we restrict each sample to contain exactly one sentence with exactly one specified quantity to be aligned. We did not compare TTC with the baselines of Sections 4.1 and 4.2, since it is non-trivial to adapt them to the text-table-chart scenario without significantly decreasing accuracy. As Table 4 shows, even for TTC and ConTTC, accuracy drops compared with the single-table alignment results in Section 4.1, reflecting the difficulty of the unified setting.

RELATED WORKS
Quantity linking. In the domain of quantity alignment, previous research [16,37] has primarily focused on canonicalized quantities from knowledge bases or similar systems. [17] proposed a task to link quantity mentions in textual context with web tables that are semi-structured and usually lack standardized measures and proper units, devising a feature-based classifier and a graph-based random walk method to link quantities in tables. HiTab [5] provided annotations for table-text alignment but did not evaluate on this task. To the best of our knowledge, we are the first to include charts in the task of quantity alignment; the most related work highlights visual parts in charts associated with text segments [44].
Joint modeling of text, tables, and charts. Recent works that jointly model tables and text [9,14,18,28,43,45,47] are summarized in the survey [11]. There are also quite a few related works on chart QA, which requires joint understanding of text and charts [22,30,31,39], but the text is usually a question rather than a natural textual statement. Works like [31] use a multi-step framework to extract tabular data from charts and then apply question answering over the extracted tables, while others like [4,39] apply classification-based QA models directly to chart images. More chart QA works are surveyed in [15]. There are also many methods for deriving the underlying data from charts and storing it in tables [6,12,23,26,29,31,32,36] (more works are listed in the surveys [8,38]). Very recently, concurrent works have been proposed for ChartQA [27] and table-chart pre-training for ChartQA [46]. TTC-QuAli uniquely targets the task of quantity alignment among text, charts, and tables.

CONCLUSION
In this paper, we introduce TTC-QuAli, a new multimodal dataset and task that aims to link quantities across text, tables, and charts. TTC-QuAli is the first dataset to provide high-quality annotations for matching quantities among different modalities. To bridge the representation gaps among different modalities over a shared semantic space, we propose ConTTC, a transformer-based cross-modal contrastive learning framework. In addition to TTC-QuAli, we leverage large text-table and table-chart corpora for pre-training our framework. Our experiments show that TTC-QuAli is a challenging dataset and demonstrate the effectiveness of the joint text-table-chart encoding framework and pre-training with contrastive learning.

Figure 1: A statistical report contains textual descriptions, tables, and charts with cross-modal linked quantities.

Figure 2: Text-Chart annotation interface in the spreadsheet.

Figure 3: The figure illustrates the cross-modal encoder and contrastive learning framework, which is applicable to both VL-T5 and VisionTaPas. In this example, we use VL-T5 and present a serialization method for tables and charts using a special token for visual elements ([VET]). VisionTaPas, not shown in the figure, splits chart images into patches and uses ViT to encode them. Note that for simplicity, we have omitted the 95% confidence interval information in this chart.

Table 1: Dataset statistics of text-table-chart alignment.

Table 2: Results of text-table alignment.

Table 3: Results of text-chart alignment.