UVDoc: Neural Grid-based Document Unwarping

Restoring the original, flat appearance of a printed document from casual photographs of bent and wrinkled pages is a common everyday problem. In this paper we propose a novel method for grid-based single-image document unwarping. Our method performs geometric distortion correction via a fully convolutional deep neural network that learns to predict the 3D grid mesh of the document and the corresponding 2D unwarping grid in a dual-task fashion, implicitly encoding the coupling between the shape of a 3D piece of paper and its 2D image. In order to allow unwarping models to train on data that is more realistic in appearance than the commonly used synthetic Doc3D dataset, we create and publish our own dataset, called UVDoc, which combines pseudo-photorealistic document images with physically accurate 3D shape and unwarping function annotations. Our dataset is labeled with all the information necessary to train our unwarping network, without having to engineer separate loss functions that can deal with the lack of ground-truth typically found in document in the wild datasets. We perform an in-depth evaluation that demonstrates that with the inclusion of our novel pseudo-photorealistic dataset, our relatively small network architecture achieves state-of-the-art results on the DocUNet benchmark. We show that the pseudo-photorealistic nature of our UVDoc dataset allows for new and better evaluation methods, such as lighting-corrected MS-SSIM. We provide a novel benchmark dataset that facilitates such evaluations, and propose a metric that quantifies line straightness after unwarping. Our code, results and UVDoc dataset are available at https://github.com/tanguymagne/UVDoc.


INTRODUCTION
The task of physical document digitization, e.g. for financial administration, is increasingly being done in a casual setting with the use of smartphones rather than the more traditional in-office flatbed scanners.However, the appearance of these casually captured images usually differs greatly from flatbed-scans due to varying camera angles, unconstrained illumination conditions and physical deformations of the paper, such as folding, crumpling and curving.These appearance variations pose a problem for post-processing steps, such as optical character recognition (OCR).Document image rectification is therefore an important step in the modernized document digitization pipeline, making layout extraction and OCR performance comparable to that of the traditional pipeline.
Several research efforts have been undertaken to tackle the problem of document unwarping using either model-or data-driven approaches.The model-driven approaches typically try to fit a constrained, piecewise-smooth surface to the imaged document.This geometric optimization is generally slow and has limited approximation capabilities, making it unsuitable for practical applications.Data-driven approaches instead rely on training a neural network to perform the unwarping.These methods are fast at runtime but typically require a large amount of high-quality training data, which can be difficult to obtain.The available training data can roughly be categorized as either synthetic or in the wild document images.The former group is generated by rendering images using 3D scans of real document geometries, whilst the latter simply consists of photographs of actual deformed documents.The synthetic category has the problem that dense 3D capture is often noisy, and rendering photorealistic paper can be challenging, making the appearance of the generated data samples non-realistic as a result.The challenge with the latter category is that ground truth data, most notably the ground truth unwarping function, is difficult to obtain.
Our main contribution is UVDoc, a dataset that aims to decrease the domain gap between the synthetic Doc3D dataset [Ma et al. 2018] usually used to train models for the task of document unwarping, and real document photographs.Our dataset contains 20,000 pseudo-photorealistic images of documents, and is equipped with all the required information to train a coarse grid-based document unwarping neural net.We achieve photorealistic appearance by using image compositing instead of rendering, thereby retaining the shading and material appearance from the original image capture.As our dataset is tailored to a coarse grid-based approach, it is easy to produce even though it includes numerous ground-truth annotations.We offer a new benchmark dataset whose rich ground-truth annotations allow for evaluation of the unwarping performance without the entanglement of shading artifacts, as well as a new metric that measures the straightness of lines in the unwarped image.
We train a small deep convolutional neural network that performs document image unwarping from a single RGB image.Its design is chosen specifically to make use of our UVDoc dataset.It uses a dual-head approach to predict both a 3D grid mesh representing the 3D shape of the document, as well as a 2D unwarping grid representing a coarse backward map.The backward mapping acts as an inverse parameterization; it indicates at each output pixel, which pixel coordinates should be sampled from the input image to produce the unwarped image.This dual-task approach encodes an implicit coupling between the 2D and the 3D grid, just like there is a physical coupling between the 3D document shape and its 2D image.Since we learn a coarse 2D unwarping grid instead of a dense unwarping flow, our network size is greatly reduced compared to state-of-the-art methods.
Using our own relatively small model, and training on a combination of the large Doc3D synthetic dataset and our own custom UVDoc data, we obtain state-of-the-art performance on the Doc-UNet benchmark for most evaluation criteria.Moreover, we show that the addition of our UVDoc dataset improves the performance of existing document unwarping methods.

RELATED WORK
Document image unwarping is a widely studied topic We divide previous work into two categories: model-based and data-driven approaches.

Model-based document unwarping
Early works take a geometric modeling approach and try to unwarp document images by first creating a 3D reconstruction of the document surface, which is then flattened onto the plane by solving an optimization problem.These works commonly obtain an estimate of the 3D document surface with the help of auxiliary equipment, such as structured light [Brown andSeales 2001, 2004], two structured laser beams [Meng et al. 2014] or laser range scanners [Zhang et al. 2008].Other model-based methods use multi-view images instead of hardware to estimate the 3D shape of the document surface [Koo et al. 2009;Luo and Bo 2022;Tsoi and Brown 2007;Ulges et al. 2004;Yamashita et al. 2004;You et al. 2018].Finally, Tian and Narasimhan [2011] exploit the structure of the document, such as lines, to reconstruct the 3D geometry through a shape-from-texture method.
Once the 3D reconstruction of the document surface is in place, different methods are used to flatten it to the plane.Brown and Seales [2001;2004] and Zhang et al. [2008] flatten the document surface using a simulation of a stiff mass-spring system falling down to a plane under gravity.Another common technique is to fit a (piecewise-)smooth parametric surface to the estimated 3D document surface and flatten it according to a parameterization.This approach can involve generalized cylinders [Kil et al. 2017;Kim et al. 2015;Koo et al. 2009;Meng et al. 2018;Nachappa et al. 2023;Zhang et al. 2004], generalized ruled surfaces [Meng et al. 2014;Tsoi and Brown 2007], smooth developable surfaces [Liang et al. 2005[Liang et al. , 2008] ] and NURBS [Yamashita et al. 2004;Zhang and Tan 2005].
Parametric approaches often heavily rely on the texture flow of the text lines in the document to estimate parametric line directions, making them less suitable for documents that only contain sparse text.Additionally, their optimization-based nature makes these methods slow and unsuitable for real-time applications, and their dependence on auxiliary equipment makes their use in real-world scenarios inconvenient and costly.

Data-driven document unwarping
Data-driven document unwarping methods work directly on a single RGB image of a document, employing deep learning to infer a 2D displacement fields or a coarse grid that can be used to unwarp the distorted input image.Ma et al. [2018] are one of the first to propose such a network, using two chained U-Nets to predict the forward mapping (the first estimates an initial guess, and the second refines it).DewarpNet [Das et al. 2019] also employs two chained networks inferring first the 3D coordinates and then the backward mapping from the 3D coordinates.They create the large synthetic Doc3D dataset with rich annotations to make this possible.Xu et al. [2022] build on this approach, but use siamese losses to additionally encourage deformation and texture consistency.
Several other ideas have been implemented.Patch-based methods [Das et al. 2021;Li et al. 2019] predict the displacement field independently on different parts of the image, thus better handling local distortions at the cost of having to properly stitch the different patches together.Iterative methods [Feng et al. 2021b;Zhang et al. 2022] progressively refine the predicted warping flow field and predict a foreground segmentation mask before starting the iterative rectification process, removing the burden of localizing the document boundaries from the unwarping network.Several methods based on textlines also use foreground segmentation.Jiang et al. [2022] use these pieces of information as explicit constraints of an optimization problem.Feng et al. [2022] feed a concatenation of textlines and 3D shape features into a network that predicts the displacement map.Recent work by Das et al. [2022] learns a texture parameterization for neural representations through differentiable rendering, using multi-view input in a data-driven approach.
A variety of different network architectures have been proposed, such as fully convolutional neural networks [Xie et al. 2020], pyramid encoder-decoder networks [Liu et al. 2020], or transformers [Feng et al. 2021a].Other works use transformer architectures to tackle specific use cases, such as partially visible documents [Feng et al. 2023] or invoices [Hertlein et al. 2023].
Recently works by Xie et al. [2021], Xue et al. [2022] and Ma et al. [2022] follow the approach of predicting a coarse backward mapping.Some of these are capable of learning from images captured in the wild, either by direct comparison of the Fourier-filtered unwarped images [Xue et al. 2022], or by designing a specific loss on pairs of slightly perturbed images [Ma et al. 2022].
Our dual-task-based network architecture enforces the model to predict physically plausible shapes and unwarping grids.It processes input images in a single stage without any segmentation pre-processing and predicts a coarse backward mapping rather than a dense displacement field, making it very efficient.

Datasets.
The datasets used for training the methods mentioned above can be split into two categories; real and synthetic, Table 1: Comparison between the different document unwarping datasets.The last column indicates whether the ground-truth Backward Mapping (BM) between the distorted and the unwarped document is available.

Dataset
# Samples Type BM Doc3D [Das et al. 2019] 100,000 Synthetic ✓ DIW [Ma et al. 2022] 5,000 Real ✗ WarpDoc [Xue et al. 2022] 1,020 Real ✗ Ours 20,000 Pseudo-real ✓ with the latter being most commonly used.Earlier works [Ma et al. 2018;Xie et al. 2020Xie et al. , 2021] ] [Ma et al. 2018], since they are based on depth captures of actual deformed papers, they are heavily smoothed compared to the original document shapes.The rendered appearance is also not very photorealistic, which causes performance degradation when using the network on actual photographs.In contrast, our UVDoc dataset, which is made from real, captured sheets of paper, is more realistic both visually and geometrically.Datasets of real photographs of deformed documents, paired with their flatbed scans ground truth, have recently gathered increased interest.Some of these datasets provide segmentation information [Ma et al. 2022] but most do not come with further annotations [Xue et al. 2022].These datasets are closer in appearance to real document images, but since they are equipped with very few annotations, they require the design of custom loss functions to train models.In comparison, our dataset is equipped with a lot of annotations while being visually similar to in-the-wild data.See comparative summary in Table 1.

THE UVDOC DATASET
We create our own dataset, UVDoc, containing 20k pseudophotorealistic images of warped documents.Our motivation is to obtain a dataset of photorealistic document images that has more ground truth information available than document in the wild images, and more realistic appearance than synthetically generated renderings.This allows for a stronger supervision signal than what is available for general document in the wild data and benefits from more realistic appearance.We compare the main characteristics of our dataset against other available datasets in Table 1.
Capture.We print regular grids of dots, with grid size of 89 × 61, on A4-sized pieces of paper using an inkjet printer with UV ink that is invisible to an RGB camera in regular light, but becomes visible in UV light in an otherwise dark room.We opt for this grid aspect ratio to obtain an equally spaced grid in both horizontal and vertical direction, and to approximate the aspect ratio of A4 paper in portrait mode, the most common paper type that documents are printed on.Note that on the paper boundary, we deviate slightly from a perfectly regular grid by offsetting the border dots a little, so that they fully fit on the paper and can be detected more easily.We fold and bend the pieces of paper in various ways to emulate common deformations.We then capture pairs of RGB-D images of deformed papers using the Intel RealSense SR305 depth camera: one image in regular lighting and one in UV lighting (Fig. 2).We use two commercially available 30 W, 395 nm UV lamps and one 72 W, 395 nm UV lamp to reduce the amount of shadows in the UV-lit image.We also use a regular light with adjustable color temperature and brightness to create varying lighting conditions.We control the camera and the lights using a laptop and remote switches, so that there is no movement between the two captured frames, and the depth and pixel information is aligned.We capture various types of deformed paper, such as curved, folded, and crumpled, and we also vary the lighting conditions.The dataset contains a total of 1008 distinct geometries, which we augment to 4032 geometries by applying horizontal and/or vertical flips to each sample.
Recovering the grid.Using the UV-lit image, where the printed grid is visible, we obtain the pixel coordinates of the grid points on the deformed piece of paper.To detect them we use OpenCV's SimpleBlobDetector, coupled with manual annotation for extreme cases where the automatic detection fails (less than 0.5% of the points need to be manually annotated).Once all points have been detected, we compute their correspondences to the vertices of We call the ordered grid the 2D unwarping grid.Combining the coordinates of the 2D unwarping grid with the depth values at these same pixel coordinates and the intrinsics of the camera, we construct a 3D grid mesh corresponding to the 3D shape of the piece of paper.
Pseudo-photorealistic image generation.Since we have a known mapping between the 2D unwarping grid and the original regular 2D grid, we can construct a coarse -parameterization of the 3D grid mesh.We use bilinear interpolation for the -parameterization when applying a texture to the geometry and for the 2D unwarping grid when performing the unwarping, to obtain a full-resolution dense backward mapping.
The -parameterization is used to apply a document texture on top of the image of the blank warped paper.The document textures include books and scientific articles sampled from the web, as well as other types of documents such as magazines, invoices, and music sheets, generated using a text-to-image model [AI 2023].As illustrated in Fig. 3, we blend the document texture with the lightingbaked blank document image by multiplying the two images.This gives a pseudo-photorealistic combination between the lighting and the texture.We also replace the background in the image with a background sampled from the Describable Textures dataset [Cimpoi et al. 2014].Finally, we apply color correction to match the hue of the background to the hue of the document and we also equalize the brightness of the background to the foreground.Using this approach, we create a dataset of 20,000 images in total.We provide the original lighting-baked blank document images along with the -parameterization, so users of the dataset can easily replace the document and the background textures if desired.
At the end of our data capture pipeline, we are equipped with a ground-truth 2D unwarping grid, a -parameterization and a 3D grid mesh for each sample in our dataset.Since we use the physical -parameterization recorded via the 2D grid, rather than a parameterization designed by a rendering engine, our texture gets deformed and applied with greater physical accuracy.Additionally, by circumventing a rendering pipeline, our images look like real paper, which is hard to simulate when rendering.The full UVDoc dataset is available at https://github.com/tanguymagne/UVDoc.

METHOD
To completely unwarp a document, we assume that the input photograph is taken from a camera position in which the document's 3D shape can be represented as a height field, i.e., the entire document is visible and there are no occlusions and foldovers.
We use a dual-head network to predict a 45 × 31 2D unwarping grid  containing pixel coordinates, and a 45 × 31 grid mesh of 3D shape coordinates  from a warped 488-by-712 input image   .We do not predict  at the full ground-truth resolution in an attempt to keep our network as compact as possible.As illustrated in Fig. 4, the 2D unwarping grid  encodes the deformation that leads to the unwarped document: grid-point  , holds the pixel coordinates (relative to the image, in the range [−1, 1]) of the pixel that will be placed at position (, ) in the unwarped image (up to constant scaling).The grid  can also be seen as a coarse backward mapping.Finally,  is bilinearly interpolated to the original image size.This upsampled backward mapping is used to generate the full-resolution unwarped image.The 3D grid mesh  is not used for the unwarping, but we incorporate it in training with an  1 loss as a regularization term.This helps the network understand the underlying geometry of the document and improve the unwarping performance (see ablations studies in Sec.6.2 and Table 5).

Network architecture
We use a relatively straightforward dual-head, fully convolutional encoder architecture inspired by the encoder part of the architecture used in [Xie et al. 2020].The input image goes through two convolutional downsampling layers that each use a 5 × 5 kernel and reduce the image size by a factor of two.This is followed by three dilated residual blocks, which lead to a spatial pyramid with stacked dilated convolutions.Finally, two heads with two convolutional layers predict  and  , respectively.We give a detailed graphical overview of our architecture in the supplemental.

Training loss
We denote the ground-truth variables as their regular symbols (e.g., ) and their predicted counterparts with a hat (e.g., Ĝ).Our training loss is a combination of  1 losses on both the 2D unwarping grid  and the 3D grid mesh  , as well as an image reconstruction loss where , ,  are weights used to balance the influence of the individual loss terms.L  is an  1 loss between the ground truth unwarped image and the image unwarped using the predicted unwarping grid .For Doc3D samples, the reconstruction loss is computed directly on the unwarping of the input image   , which includes shading.For our UVDoc samples, to allow the network to focus on the content of the document rather than on shading artifacts, we compute the reconstruction loss using the unwarping of the unshaded document and the ground truth document texture.We provide further training details for our method in the supplementary material.

EVALUATION METRICS AND THE UVDOC BENCHMARK
We discuss common existing evaluation metrics for document unwarping and propose a new benchmark (UVDoc), along with a new metric that provides faithful evaluation even in the presence of varied shading.

UVDoc benchmark
To foster more detailed evaluation of document unwarping methods in the future, we create the UVDoc benchmark dataset.This benchmark is generated in a similar fashion to the UVDoc dataset but contains other geometries, document textures and backgrounds, not included in the main dataset.The benchmark consists of 50 images.Thanks to our pseudo-photorealistic data generation pipeline, we have access to pairs of warped images with and without lighting (and thus shadows) baked in, see Fig. 5.This setup provides new opportunities for meaningful metrics.For each sample in the benchmark, an unwarping pipeline can predict the unwarping function for the shaded image, and then apply it to the unshaded image.The unwarped unshaded image can then be compared to the groundtruth original image texture.This way the effect of illumination can be removed and the unwarping deformation can be better evaluated.

Evaluation metrics
In our objective evaluations, we employ image similarity metrics as well as optical character recognition (OCR) performance.Following [Ma et al. 2018] and [Das et al. 2019], we use multi-scale structural similarity (MS-SSIM) and local distortion (LD) as metrics for the image similarity evaluation.We also employ the aligned distortion (AD) metric (introduced by [Ma et al. 2022]), which corrects some of the flaws of the previous metrics.We evaluate OCR performance using the character error rate (CER) and edit distance (ED).
As OCR engines are typically targeted towards use with images originating from flatbed scanners, they are ill-suited for text recognition on images with lighting variations and shadows [tesseract-ocr 2023].The two images in Fig. 5 are identical up to shading but result in vastly different OCR performances.We therefore want to point out that OCR performance on the DocUNet dataset should be interpreted with care, since its baked-in shading plays such a large role.More details regarding our evaluation metrics can be found in the supplementary material.

Line straightness metric.
Our UVDoc benchmark is annotated with not just a ground-truth unwarped image but also the groundtruth warping function, which allows us to design a new metric that evaluates the straightness of lines in the unwarped image.We generate triplets of images consisting of a warped document image and two images containing warped horizontal and vertical lines.These three images are generated using the same geometry and thus correspond to the same ground-truth unwarping function.We can now predict the unwarping function from the warped document image, and then apply it to the warped line images, giving us the unwarped horizontal and vertical lines.A perfectly predicted unwarping function should map the lines (which are 1 pixel thick) back to exactly horizontal and vertical lines.By measuring the average standard deviation of the lines we obtain a measure of how well the unwarping function maps horizontal (resp.vertical) lines Table 2: Quantitative unwarping performance comparisons on the DocUNet benchmark dataset.Bold font indicates best, underline indicates second-best and italic indicates thirdbest score.The last column compares the network sizes, expressed in number of parameters (millions).We compare our results against DewarpNet [Das et al. 2019], DispFlow [Xie et al. 2020], DocTr [Feng et al. 2021a], PW Unwarping [Das et al. 2021], DDCP [Xie et al. 2021], FDRNet [Xue et al. 2022], RDGR [Jiang et al. 2022], Marior [Zhang et al. 2022], PaperEdge [Ma et al. 2022]  to horizontal (resp.vertical) lines; see Fig. 6 for a visual explanation of the process.Note that unlike other metrics that are commonly used in the document unwarping field to measure distortion (such as LD and AD), this metric does not rely on the usage of dense SIFT flow, which is slow to compute and can give unstable results due to shading artifacts.

EXPERIMENTS 6.1 Evaluation
We evaluate our network on the DocUNet benchmark dataset [Ma et al. 2018] as well as on our own UVDoc benchmark, described in Sec.5.1.The DocUNet benchmark is composed of 65 documents.
For each of them, 2 deformed images in a real-world scenario are provided.The ground truth flatbed-scans are also provided for comparison.Note that similarly to Feng et al. [2022] we exclude the two images of document 64, as the real world images are rotated by 180 degrees.We also exclude this document when computing the quantitative results for previous works.
Quantitative evaluation.We compare our method with several state-of-the-art deep learning methods.For each of them, we use the DocUNet result images published by the authors.We also evaluate the methods that additionally published their pre-trained models on our UVDoc benchmark.All metric scores are evaluated using Tesseract v4.0.0, pytesseract v0.3.10,MATLAB R2022a, Levenshtein v0.21.0 and jiwer v3.0.1.The results are presented in Tables 2 and 3.
Compared to previous works, our method achieves state-of-theart MS-SSIM, LD, AD and ED performance and a second-best CER score on the DocUNet benchmark.On the UVDoc benchmark our method achieves state-of-the-art performance on visual metrics (MS-SSIM and LD).On OCR metrics, our network performs close to state-of-the-art.Small differences should be interpreted with care, as OCR scores suffer from high standard deviation (see Table 5).This is due to the Tesseract OCR engine being sensitive to small changes in input, as explained in its documentation [tesseract-ocr 2023].Our method also ranks best on the horizontal and vertical line straightness metrics (Sec.5.2.1), indicating better unwarping of the geometric features.
Our approach builds on a grid-based unwarping method, thanks to which our network is significantly smaller in size than current state-of-the-art methods, whilst still achieving state-of-the-art performance.We compare our network size to previous works in the last column of Table 2.
In addition to the performance of our own method, we also evaluate the effect of adding our UVDoc data to the DewarpNet [Das et al. 2019] architecture.We compare the performance of the pre-trained DewarpNet models fine-tuned for 10 epochs on the Doc3D data with the performance of the pre-trained models fine-tuned for 5 epochs on a combination of Doc3D and UVDoc data.As shown in Table 4, adding the UVDoc data into the finetuning process greatly improves all metrics except for the OCR performance on the shaded DocUNet images.
Qualitative evaluation.In addition to the quantitative comparisons made in the previous section, we provide a qualitative comparison to previous works.We show a side-by-side comparison of unwarped images by several methods in Fig. 8 and Fig. 7.The leftmost column shows the input images.The images unwarped by our method are perceptually of high quality and have good unwarping at the borders of the document as well, even though we do not include explicit handling of borders or segmentation, in contrast to [Feng et al. 2021a[Feng et al. , 2022;;Ma et al. 2022;Zhang et al. 2022].We present more qualitative results on real-world images in Fig. 1.We include the unwarped images for all items in the DocUNet and UVDoc benchmarks in the supplemental material.

Ablation study
We show the effectiveness of the dual-task learning, i.e., the combination of predicting the 3D and the 2D grid meshes in the training process, the effectiveness of the reconstruction loss L  , as well as the benefit of combined training on both Doc3D and our UVDoc dataset, via ablation experiments.As we notice large variance in the OCR performance, we use averages of 10 repeated experiments with constant settings to perform the ablation study.
We first show in Table 5 that training on a combination of the Doc3D and UVDoc datasets considerably improves the performance on all metrics, compared to training only on the Doc3D data.To ensure a fair comparison between the two, we double the number of epochs for the Doc3D-only training (both the number of epochs at a constant learning rate, as well as the number of epochs with linearly decaying learning rate), such that they process the same number of samples and have an equal amount of optimizer steps.In particular, adding the UVDoc data to the training process leads to improvements of 8.9% for MS-SSIM, 12.9% for LD and 9.7% for AD on the DocUNet benchmark.We attribute the improvement in performance to the fact that our data (UVDoc) is closer in appearance to the real document photographs in the DocUNet dataset, as well as the fact that the 3D ground-truth data in our dataset is more physically accurate (albeit coarser), since we do not apply any smoothing to it.
We also show that the dual-task of learning the 3D grid  along with the 2D grid  improves the performance by comparing it against our full model but with the loss on  removed.Table 5 shows that including the  1 loss on  greatly improves all non-OCR metrics on both DocUNet and UVDoc benchmarks whilst the OCR metrics remain comparable.
Table 5 additionally shows the benefits of the reconstruction loss.Adding it improves performances on almost all metrics, and those that worsen only do so slightly.This loss helps the model to target content in the document, making the unwarping better in the areas that matter the most.

CONCLUSION
We presented UVDoc, a new document unwarping dataset that consists of pseudo-photorealistic images of warped documents along with annotated ground-truth 3D shapes and unwarping functions.Our proposed acquisition methodology is simple to implement and uses relatively inexpensive, commonly attainable equipment, enabling easy replication and further expansion of the dataset by others.Since our dataset includes both a shaded and unshaded version of each document image, it allows its users to evaluate unwarping performance without the influence of shading artifacts.
We show that with the addition of the UVDoc dataset, our dualtask deep learning approach that implicitly encodes the coupling between the document's 3D shape and its appearance in a 2D photograph achieves state-of-the-art performance on the commonly used DocUNet benchmark.Additionally, we introduce the new UVDoc benchmark and a new line straightness metric, on which we also achieve state-of-the-art results.
The shape-from-shading effect can remain quite strong in the unwarped documents and makes some of them appear more distorted to the human eye than they geometrically are.Research into the illumination correction process is therefore of great importance.Since the pseudo-photorealistic nature of our dataset allows us to decouple the deformation and shading of a warped document image, it could benefit research progress in this field.
Table 4: Quantitative comparison on the DocUNet and UVDoc benchmark datasets for DewarpNet [2019] finetuned with and without the UVDoc data.We finetuned the pre-trained models for 10 epochs with Doc3D only, and for 5 epochs with Doc3D+UVDoc to equalize the number of optimization steps.A UVDOC DATASET: ORDERING THE GRID Using the UV-lit image, where the printed grid is visible, we obtain the pixel coordinates of the grid points on the deformed piece of paper.We then need to compute their correspondences to the vertices of a regular grid, which is equivalent to ordering them as an 89 × 61 grid.We solve the ordering problem in 3 steps: (1) Finding the top-left corner.We first find the top-left corner of the grid.We compute the two principal components of the detected grid points and define the diagonal direction of the grid as the sum of these two vectors.For each point, we draw a line orthogonal to this diagonal direction and we count the number of points on each side of the line.The top-left corner is then the point that has exactly zero points to its left.The process is illustrated in Fig. 9. (2) Ordering border points.Next we detect all border points.To this end, we use a segmentation of the paper that we obtain by thresholding the UV-lit image.Based on this segmentation, we use OpenCV's findContours function to extract an ordered contour polyline.For each contour vertex, we find the nearest neighbor point in the set of grid points.We then define our grid border points as the 296 grid points -the number of points on the border of the grid -that are most frequently found as nearest neighbor.Finally, since the contour extracted using OpenCV is ordered, we can also order the detected grid border points.(3) Ordering interior points.The final step is to order the points that lie in the interior of the grid.We iteratively identify all points (, ) ∈ [2, 88] × [2, 60] in row-major ordering, starting from point (2, 2) (the top-left interior grid point).
We do this (for point (, )) by finding the three nearest yetunordered points for each of the previously-ordered points ( − 1,  − 1), (,  − 1), and ( − 1, ) (the points to the top-left, top and left of the point we are currently trying to identify).The point that is in the intersection of these three nearest-neighbor sets is chosen as point (, ).We use the average distance to the three reference points as a tiebreaker in case the intersection contains multiple points.This point is then considered ordered, and we move on to the next point.

B TRAINING DETAILS
We obtain the ground-truth  and  for the Doc3D dataset by sampling the ground truth backward maps at a regular grid of 45 × 31 points covering the entire backward map.For our UVDoc dataset (see Sec. 3 of the main paper) we slice the available highresolution ground truths by a factor of 2. We use the ADAM optimizer [Kingma and Ba 2015] with a batch size of 8.The initial learning rate is set to 2 × 10 −4 for 10 epochs and linearly decays to 0 over 10 further epochs.We alternate optimization steps based on a batch of Doc3D data with a batch of our UVDoc data, using the same loss function on both of them.
We visually augment both the Doc3D and our data with noise, color changes and other appearance transformations.Additionally, we augment our data with rotations, since our images are captured from a more uniform angle than the Doc3D data.All images are tightly cropped before being fed to the network.Empirically, we find that the best set of weights to balance the influence of the individual loss terms as defined in Eq. 1 in the main paper are  = 5 and  = 5.During training  is set to 0.0 for the first 10 epochs (first half) and then to 1.0 for the remaining 10 epochs.We give a detailed graphical overview of our model architecture in Fig. 11.

C EVALUATION METRICS
As explained in the main paper, we used image similarity metrics such as MS-SSIM, LD and AD as well as optical character recognition (OCR) performance measured with CER and ED.Details about these metrics are provided below.The structural similarity measure (SSIM) [Wang et al. 2004] quantifies the visual similarity between two images by measuring the similarity of mean pixel values and variance within image patches between the two images.The multiscale variant (MS-SSIM) repeats this process at multiple scales using a Gaussian pyramid and computes a weighted average over the different scales as its final measure.We use the same weights as described in the original implementation [Wang et al. 2003].
LD is computed using a dense SIFT flow mapping [Liu et al. 2008] from the ground truth image to the rectified image.Using this registration, LD is computed as the mean  2 distance between mapped pixels [You et al. 2018], essentially measuring the average local deformation of the unwarped image.
Aligned distortion (AD) is a more robust variant of the LD metric, introduced in [Ma et al. 2022].In contrast to LD, AD eliminates the error caused by a global translation and scaling of the image by factoring out the optimal affine transformation out of the SIFT flow distortion.Such a global affine transformation can cause large LD values but does not greatly impact human readability of the image.Additionally, AD weighs the error according to the magnitude of the gradient in the image, emphasizing interesting areas, such as text or image edges, rather than the background.Prior to computing these similarity metrics, we resize all images, both rectified and ground-truth, to a 598,400-pixel area, as suggested in [Ma et al. 2018].
In addition to the image similarity metrics, we evaluate OCR performance based on character error rate (CER) and editing distance (ED) [Navarro 2001].The CER is defined as the ratio between the ED (the edit distance between the recognized and reference text) and the number of characters in the reference text.We obtain the reference text by extraction from the flatbed scans of the documents.The full definition for the CER then becomes: CER = ( + +)/ , where , ,  are the number of substitutions, insertions and deletions, respectively, and  is the number of characters in the reference text.

D ADDITIONAL EXPERIMENTS
Mixed training.As shown in the main paper, we find that training models on a combination of the Doc3D and UVDoc datasets yields improved performance compared to training on Doc3D alone.However, models trained on a combination of both datasets see more samples and thus more variety than the ones trained on Doc3D only.To verify that the increased number of unique samples is not the cause of the performance gain, we train on a combination of Doc3D and UVDoc datasets, removing 20,000 samples from the Doc3D dataset.This way, the models trained on a combination of the two datasets see equally many samples as the ones trained on Doc3D only.The results of these experiments, along with the results of models trained on Doc3D only and on a combination of the full Doc3D and UVDoc datasets are presented in Table 6.
The models trained on a combination of the reduced Doc3D dataset and UVDoc have slightly worse performance than the models trained on the full datasets.This is expected, as the models are trained with fewer samples.However, the difference between the two is very small.More importantly, the models trained on the full Doc3D dataset alone give very poor results in comparison.Replacing samples from the Doc3D dataset with higher-quality ones from our UVDoc dataset improves its overall performance.

E LINE UNWARPING VISUALIZATION
Our new UVDoc benchmark, equipped with the ground-truth unwarping function, allows one to warp and unwarp not only the document image but any texture.We can warp the texture based on the ground truth deformation and unwarp it using the predicted deformation.This idea, which we apply to create our new line straightness metric, can also be used to better visualize the structural behavior of an unwarping function.By unwarping the unshaded document texture, we can remove the visual effect of shape-from-shading, giving a better visualization of the remaining geometric distortions.We apply this to visually compare our method with related works in Fig. 10.

Figure 2 :
Figure 2: An overview of our data capture setup and sample data acquired in the process.The top shows our capture setup: [1] UV lights, [2] SR305 depth camera, [3] deformed sheet of paper, [4] regular light.The bottom shows a capture sample including RGB images of the normally lit and UV-lit paper, and its depth image.

Figure 3 :
Figure 3: The pipeline used to create a sample of our UVDoc dataset.It combines the captured image of a blank paper, the texture and the background.

Figure 4 :
Figure4: Our unwarping pipeline.We start with an RGB image of a warped document and feed it into our encoderstyle network.The network predicts both a 3D grid mesh (top branch), as well as a 2D unwarping grid (bottom branch) in parallel.The 2D unwarping grid is then bilinearly interpolated to the desired output image resolution and is used to sample pixels from the input image to obtain the final unwarped document image.

Figure 5 :
Figure 5: The shaded and unshaded version of a sample from the UVDoc benchmark, identical up to shading.The shaded version and unshaded version have a CER of 0.439 and 0.004 respectively and ED of 959 and 14.Note that the unshaded version has non-zero CER and ED as it is compared to the original texture, while it has been warped and unwarped using our coarse bilinearly interpolated 2D grid, and thus includes some artifacts.

Figure 6 :
Figure 6: Our new horizontal line metric is the standard deviation of the  coordinate of warped horizontal lines (middle) unwarped using the predicted backward mapping (right in red, ground-truth in black).

Figure 9 :
Figure 9: Illustration of the top-left corner identification step.Cyan lines represent the principal components of the grid points, the yellow line is the diagonal direction, and the white line is the orthogonal line defining the dividing half-space.Red points are towards the left of the line and black points towards its right.(Left) There are several red points, this is not the top-left corner.(Right) There are no red points, the top-left corner is the point on top of which the white is located.

Figure 11 :
Figure 11: An overview of the architecture of our network.

Table 3 :
[Feng et al. 2022] et al. 2022].Quantitative unwarping performance comparisons on our UVDoc benchmark dataset.Bold font indicates best, underline indicates second-best and italic indicates thirdbest score.Refer to Table2for the list of referenced methods.

Table 5 :
line ↓ Ablations about losses and data used.The reported values are averages and standard deviations over 10 repetitions of training with otherwise constant parameters on the DocUNet and UVDoc benchmark datasets.Settings used in our final model are underlined.

Table 6 :
Ablations on training data.The reported values are averages and standard deviations over 10 repetitions of training with otherwise constant parameters.Settings used in our final model are underlined.We show performance on the DocUNet and UVDoc benchmarks.Doc3D reduced is a version of the Doc3D dataset with 20,000 samples removed to offset for the additional UVDoc samples.The underlined setting is the one we use.