Towards Exascale Computation for Turbomachinery Flows

A state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion Degree of Freedoms (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed towards solving the grand challenge problem outlined by NASA, which involves a time-dependent simulation of a complete engine, incorporating all the aerodynamic and heat transfer components.


JUSTIFICATION FOR GORDON BELL PRIZE
• High complexity: Aero-thermodynamics of high pressure turbine blades with geometrical features across multiple orders of magnitude are modelled using an unstructured Computational Fluid Dynamics (CFD) code.• High precision: The unstructured CFD code solves Navier-Stokes equations with numerical order of accuracy upto 8 ℎ order.
• High fidelity: The numerical results are validated with detailed experimental results.• High efficiency: A sustained peak performance of 115.8 DP-PFLOPs was achieved with strong scaling parallel efficiency measured at 89%.

OVERVIEW OF THE PROBLEM
Turbomachinery remains a significant consumer of fossil fuels worldwide.At current fossil fuel prices, any small gas turbine performance improvement can result in a multi-billion-dollar economic impact and a major  2 emission reduction to mitigate the effects of global climate change.The bladed components of gas turbines, commonly referred as Turbomachines, are typically exposed to extremely high temperature and pressure, especially for the turbines located downstream of the combustor (i.e.high pressure turbine as illustrated in Fig. 1).Given that the thermodynamic efficiency of a gas turbine is proportional to both the turbine entry temperature and compressor pressure ratio, there is a strong motivation to increase both parameters.However, this results in an elevated operating temperature (often above the material melting point) for high-pressure turbines.To mitigate the problem associated with high temperatures and pressures, intricate flow paths within blades are designed to channel cooling gas directly from the compressor outlet (as depicted in Fig. 1(e)).Reliable prediction of the blade temperature field is critical, as an error of just 2% in predicting the blade metal temperature can halve its operating lifespan [2].Currently, a conservative design approach is typically employed to maintain the material and structural integrity from thermal damage.A higher fidelity prediction method could yield more accurate results.For instance, a 1% reduction in cooling flow via a better thermal design could increase the overall efficiency of a gas turbine by 0.4% [2].This could result in significant fuel savings worth billions of dollars, considering the number of gas turbines currently in operation [3].
The high pressure turbine blade is widely considered to be one of the most challenging components to simulate and design in gas turbines.Due to the highly unstable flow and temperature at the combustor exit, particularly under off-design conditions, heat transfer coefficient prediction errors can exceed 100% [4].The richness and complexity of the flow behavior within high pressure turbines contribute to the difficulty of prediction.The internal cooling passages are long, narrow channels with widths on the millimeter scale and lengths on the centimeter scale.This wide range of scales, coupled with the complex geometry, presents a significant challenge in meshing the computational domain [5,6].
Advanced CFD methods, such as Large Eddy Simulation (LES) or Direct Numerical Simulation (DNS), have been proposed to predict the flow and temperature of high pressure turbines with great accuracy [7].Although these methods offer accurate and reliable simulation, they are computationally demanding.The cost of applying such high-fidelity methods is significantly higher compared to Reynolds-Averaged based methods.However, with the continuous advancement of high-performance computing power over the last decades, exascale computing is now within reach.By leveraging powerful computers and advanced numerical methods, more complex physics can be accurately captured and ultimately revealed to the designer.
To take full advantage of the computing power of exascale clusters, a novel code called ZJU-FR (ZFR) has been developed from scratch to ensure portability across heterogeneous systems.Developed initially at the University of Florida [8] and subsequently at Zhejiang University in China, ZFR is equipped with advanced turbomachinery flow capabilities, including rotating periodic boundary, sliding plane method, etc.The present study investigates the flow surrounding representative high pressure turbine blades from von Kármán institute [9] and NASA Lewis Research Center [10].

CURRENT STATE OF THE ART 4.1 Current State of Large Eddy Simulation
Reynolds-Averaged Navier-Stokes (RANS) simulations, which are widely used in industrial and academic applications, possess several drawbacks, including the loss of temporal information and the lack of universality of the models.Despite the inherent complexity and unsteadiness of flows in gas turbines, turbomachinery designers still rely heavily on RANS simulation for day-to-day analysis.The circumferential averaging performed by RANS simulation between each blade row results in the smoothing out many important flow behaviors, such as wake-blade interaction.It has been demonstrated that these flow features are important in determining compressor and turbine efficiency, particularly for multi-stage components with many blade rows [3,11].
As supercomputers are more accessible to research communities now, especially with the development of computation-focused devices like Graphics Processing Unit (GPU) or many-core devices, previously unaffordable DNS and LES are becoming more feasible for industrial applications.Researchers have used AVBP, a compressible Navier-Stokes solver developed by CERFACS, to conduct a full engine LES and investigate the complex flow features within gas turbines [12,13].The finite element method (FEM) based AVBP solver utilizes a Taylor-Galerkin discretization method to overcome the numerical instability of traditional FEM for convection equations.As part of the European Programming Environment for Heterogeneous Supercomputers (EPEEC) project, more advanced multi-layer parallel programming model, such as GASPI+MPSs and ArgoDSM+MPSs [14], have been integrated with AVBP to take advantage of the upcoming heterogeneous supercomputers.

Current State of Numerical Methods and Related Solvers
Computational mesh must be fine enough for LES to resolve the majority of turbulent scales and the time step size must be small enough to capture time-accurate flow dynamics, resulting in a computational cost orders of magnitude higher compared with RANS simulation [15].To address this challenge, it is essential to take advantage of the large modern massive parallel supercomputers.
Almost all flagship systems feature a high FLOP/s-per byte ratio, meaning that memory speed is increasingly becoming the bottleneck of the computation.To keep up with the fast advancement of computing facilities, new series of finite-element-based high-order numerical schemes with higher data locality have been proposed, such as Discontinuous Galerkin (DG) method [16], and the Flux Reconstruction (FR) method [17].These methods are hybrid structured/unstructured, with structurally aligned solution/flux points within each element while data associated with elements is unstructurally stored.This hybrid method offers flexibility when meshing complex domains, and fully utilizes the computing power of modern architectures due to its compactness and structuredness.
A few finite-element-based high-order CFD solvers have been developed over the past few decades.NEK5000 [18] and Alya [19] served as a pioneering work that inspired and initiated the effort to push finite element methods for more practical industrial flows.Later, solvers based on DG method, such as SU2 [20] and Flexi [21], were developed to enable the solution of high-speed compressible flows on unstructured grids.A more unified framework called FR method or Correction Procedure via Reconstruction (CPR), was introduced later by Huynh in 2007 [17] .The FR method can recover DG and spectral Difference (SD) methods [22] and is simpler and more computationally efficient than its predecessors [23].
PyFR is an open-source framework based on the FR approach to solve advection diffusion type problems.It was developed at Imperial College London [24], and is primarily written in Python.Mako template language is embedded in the code to generate numerical scheme kernel code for different hardware platforms at runtime, enabling developers to achieve high performance while still maintain a clean code base.The performance of FR methods is heavily limited by memory bandwidth on modern hardware.To alleviate this issue, a data structure facilitating cache blocking has been implemented, leading to a speedup of up to 2.8X for the Tyalor Green vortex test case [25].A petascale DNS simulation of compressible flow over a low pressure turbine (LPT) at Reynolds number of 200,000 was conducted.Meshes with up to 180 million hexahedron elements with fifth order accuracy were used, resulting in over 11 billion degree of freedoms per equation.The result was selected as one of the finalists for the 2016 Gordon Bell prize.

Current State of New Sunway Supercomputer
4.3.1 System Architecture.The latest generation of the Sunway supercomputer has far more computing power than its predecessor (Sunway TaihuLight) thanks to a larger number of chips and the improved high-performance heterogeneous many-core processor.
The system consists of over 100,000 SW26010 pro chips.Compared to the previous generation chip used in the Sunway TaihuLight, the upgraded processor includes six core-groups (CGs), each containing one MPE and one 8×8 CPE cluster with 16 GB DDR4 sharing memory (see Fig. 2).Each CPE has a local data memory (LDM) extended to 256 KB and supports wider 512 bit SIMD operation.More details of this new Sunway system can be found in [26,27].Sunway Accelerate Computing Architecture (SACA) is an essential component of the latest Sunway supercomputer.The SACA contains a parallel accelerated computing platform and programming model tailored to Sunway many-core architecture.SACA provides three levels of parallel acceleration, which are task (MPI), thread (Athread) and data (SIMD), that maximize system performance.By utilizing Direct Memory Access (DMA) and register communication, programmers can explicitly and effectively control data exchange on the LDM, which is crucial to minimize the communication cost of CPEs.

Parallelization
Challenge on Sunway for ZFR.The ZFR solver faced several challenges to be efficiently ported on the Sunway system.To fully utilise the computing processor on the heterogeneous system, a hybrid "MPI + Athread" strategy incorporating multi-level parallelism needs to be adopted.Additionally, as an unstructured solver, it is impossible to perform regular partition like stencil computations on CPE cores.Though FR method can significantly reduce data communication with high orders, bandwidth can still be the bottleneck, as Sunway has a relative lower byte-to-flop ratio compared to other heterogeneous systems.Thus, attention needs to be paid to data locality.As the mesh scales to billions of elements and the solver runs with tens of millions of cores, an efficient parallel mesh pre-processing is also required.With these challenges in mind, many innovative methods have been implemented on the pre-processor and the solver.

INNOVATIONS REALISED 5.1 ZFR Framework
This section provides an overview of the structure of the inhouse code -ZFR.A schematic of the solver is illustrated in Fig. 3.The code contains two main components: a low-level supporting framework and a high-level FR solver.The low-level framework incorporates algorithms and data structures that are independent of numerical schemes, offering an unified interface for high-performance physics simulation.The parallel partitioner is used to decompose the computational domain into non-overlapping sub-domains for parallel computation, with either ParMETIS [28] or PT-Scotch [29] working as the backend.
The FR solver comprises several components that define the geometrical data structures, numerical algorithms, and equation systems.Only the equation system encapsulates the physical problem while other components are problem independent.The current implementation supports the compressible flow equation system, with physical models including boundary conditions, Riemann solvers, body forcing, LES sub-grid scale models, and shock capturing methods.
5.1.1FR Algorithm.The governing equations are rewritten into the form of conservation law and transformed to reference domain within each element as where Q( , ) = |J|Q(, ) is the transformed solution variable,  element as where   is number of solution points inside each element.On the interfaces, the normal common numerical flux is calculated using an approximate Riemann solver with solutions interpolated to both sides of the interface.Then, a correction function  is used to correct the divergence of flux with the normal common numerical flux on the interface to recover a global solution.In this work,  is chosen so that the resulting scheme is equivalent to a nodal collocated DG method [17].The viscous flux is handled using the local discontinuous Galerkin (LDG) method, which is described in detail by Huynh [30].The semi-discretized form of the scheme takes the form dQ d where   and   are the number of interfaces on each element and number of flux points on the interface, respectively.The transformed numerical flux, and the normal transformed discontinuous flux on the interface are represented by F⊥ and F⊥ , respectively.Finally, the solution is advanced in time using an ODE solver such as strong-stability-preserving Runge-Kutta method [31].
5.1.2Pre-processor Parallelization.The preprocess module of the code is carefully designed to efficiently read and process large scale computational mesh in parallel.Mesh entities including cells and vertices to be read by each rank are determined using a cumulative storage format, so that each rank reads a non-overlapping chunk of data.For CGNS file format, this operation is performed using the parallel I/O support of HDF5 built on MPI-IO, either collectively or independently.
After reading the cell and vertex information, the mesh partitioner is called to construct a dual-graph, which includes cell-cell connectivity information.The face list is constructed on each rank using local cell connectivity, and the vertex list of each face is sorted in ascending order to create a unique numbering.The face list is then sorted in ascending order by vertex IDs to make sure that coupled faces are next to each other.Then, local internal faces are coupled and repeating entries are eliminated, so that the remaining uncoupled faces are either MPI internal faces or boundary faces.
To read the boundary faces, algorithm 1 presents the pseudo code that uses the cumulative storage format to read each boundary.According to the CGNS specification, each boundary condition corresponds to one or more face element sections.
MPI internal faces in each rank need to find their remote counterparts.The face matching information is used later in the solver to exchange data within each time step.The algorithm to find MPI matching faces is presented in Algorithm 2. Uncoupled MPI internal faces are redistributed linearly throughout the ranks by their leading vertex ID.This method makes sure that coupled faces will be sent to the same rank.Similar to the local internal face matching algorithm, the received faces are sorted and their counterparts are found.

Three-level Parallelization Design
To optimize computational performance on the Sunway system, we have implemented a three-level parallelization strategy using the SACA programming model.This approach, as illustrated in Fig. 4, involves decomposing the unstructured grid into different partitions for each MPI process, distributing computing tasks evenly across elements and interfaces for each CPE thread, and conducting vectorization on the data dimension of sequential points per element or interface.The success of the last two levels of parallelism relies on maximizing utilization of the limited LDM space on CPEs.To Algorithm 1 Method to read boundary faces with CGNS.if MPI_Issend finished then MPI_Issend(sbuffer,) for  ← 1,  update MPIFaceRank with rbuffer 29: end procedure achieve these, we have implemented the following three levels of parallelizatioin.

MPI Parallelization.
The ZFR solver utilizes a decomposition strategy to partition the entirety of an unstructured grid into multiple sub-computation domains.Each domain is assigned a partition weight that is based on the type of element it contains, aiming to optimize the load balance across MPI processes.

LDM
Blocking for CPE Thread.The SW26010 Pro's memory hierarchy presents a simple approach to second-level parallelism by equally distributing computational workload of the entire subdomain across 64 CPE threads, while LDM operates as a cache with 256B cache lines.This design is similar to traditional multicores processor.However, it can lead to significant performance degradation due to long data access latency and cache miss.To improve the efficiency of CPEs' access to main memory and avoid complex data layout design caused by cache lines, we employed a LDM blocking technique and managed the LDM explicitly, inspired by cache blocking.Our method involves selecting a consistent quantity of computation data (e.g., elements or interfaces) by local data id, partitioning computation data into small blocks along the order of memory address, and processing each block in a sub-loop of thread.After blocking, the size of each block is limited by the 256 KB LDM space and size of unit data.The data transfer between CPEs is highly efficient by leveraging the DMA channel.However, when launching point-wise kernels with indirect memory access (PI) which is unavoidable in many unstructured solvers, a noticeable portion of data block can not be filled along the memory address directly, resulting in a landslide in computing performance.To address this issue, we analyzed the layout of variables and found out address continuity of the lowest level of data (n_fpts_per_inter) on one interface.We then packed them as the basic unit of data communication by DMA.In computationally intensive kernels, we applied a double-buffering approach to reuse the LDM data and hide the memory latency.Based on the LDM blocking strategy, data blocks are divided into two smaller ones, allowing the computation of the current block while a CPE thread is prefetching the next block data in a pipeline way.

Vectorization.
Vectorization is an crucial step in parallel acceleration to achieve efficient utilization of the chip's performance.While the SACA programming model provides automatic vectorization and optimization schemes, the ZFR solver did not achieve the expected speedup after enabling this feature.To address this issue, we opted to perform an 8-double width loop tile on the lowest level of data layout and refactor a series of related subroutines with SIMD.

Point-wise Kernels Fusion Scheme
On the many-core architecture, the performance of FR method is significantly limited by the memory bandwidth.While we have employed various strategies, such as LDM blocking to maximize bandwidth utilization, double buffering in some kernels to hide variables communication, and shared data between CPEs through register communication to alleviate limited LDM space, memory access still dominates almost all of the point-wise kernel runtime after vectorization.Through analysis of these point-wise kernels, we found that a dataset transferred from LDM by one point-wise kernel is latter consumed by another.With this understanding, we propose a point-wise kernel-fusion scheme to reduce redundant memory bandwidth expenditure.
To describe the details of the fusion scheme, we first introduce a directed acyclic graph (DAG) that indicates the data dependencies.Fig. 5 shows the entire structure of the DAG, which includes nodes representing the main variable in the ZFR solver and edges determined by kernels representing the direction of the data flow.Three line-types were used to denote the different kernels categories: matrix multiplications (DGEMM), point-wise kernels with direct memory access patterns (PD), and point-wise kernels with some level of indirect memory access(PI) [25].The DAG provides a clear overview of the entire data flow and dependencies.The optimal kernel fusion is to consolidate all data onto a single node on the DAG and perform all computations using that kernel.However, with xMath [32] as the third-party library for high performance matrix multiplication, it is impossible to achieve a full kernel fusion without the original code of the library.In this work, we primarily focuses on PI and PD kernels for the fusion and find out two potential fusion region on the DAG, which is framed by red dotted line.We successfully fused five kernels into two optimized ones, as shown in detail in Fig. 6.

HOW PERFORMANCE WAS MEASURED 6.1 Test Case Design
The performance of the code are measured using two cases: the LS89 cascade and the vane of the first stage high pressure turbine of GE-E 3 [10].The LS89 cascade is relatively simpler with no cooling holes.However, it has more detailed measuremnt data.Therefore, scalling test and code validation are performed using the LS89 cascade.The GE-E 3 test case, on the other hand, is more representative to the real turbine blade with cooling holes and complex internal channels.2. The computational domain with the adopted boundary conditions are shown in Fig. 7.It is noted that two sponge zones are imposed near the inlet and outlet boundaries to damp out any fluctuations and prevent reflections from boundaries.A total of five sets of unstructured meshes were generated, with the element counts given in Table 3. Mesh1 was employed to validate the framework and case setup.This validation run enabled the time averaging, sponge zones, statistics collection, and parallel data output with fourth-order solution polynomials interpolation within each element.Mesh2 to Mesh5 were designed to run the weak scaling test, Mesh2 and Mesh3 were also used for the strong scaling test.The peak sustained performance was measured using the Mesh5.

Test Case 2.
To demonstrate the current solver can simulate complex practical cases, flow within the vane of the first stage high pressure turbine of GE-E 3 [10] is calculated.Fig. 9 shows the three-dimensional model of the turbine vane and the boundary condition implemented in the current study.With periodic boundary condition applied in the circumferential direction, one flow passage is solved.Relatively low temperature coolant is injected into the computational domain near the hub.The coolant channels into the internal passage of the turbine vane, and then ejects to the mainstream from the small cooling holes.The detailed boundary condition and mesh element counts can be found in Table 4.

FLOPs Measurement
All numerical tests were performed on the new Sunway supercomputer using double precision.The performance tests were designed    to evaluate the scaling and efficiency of the solver with no postprocessing module enabled (i.e.no I/O or statistical data collection).The performance is measured by averaging time taken for one time step, after running a test for 50 time steps.The total number of floating point operations (FLOPs) of one time step is estimated in a manner similar to that in [33].The CPE kernels were divided into two groups: matrix multiplications and point-wise calculation.In matrix-multiplication kernels, the floating point operations was counted as 2mnk, with m representing the number of rows in A and C matrix, n representing the number of columns in B and C matrix, k representing the common dimension of A and B matrix.This estimate was less than the actual FLOPs executed in the ZFR solver, resulting in a conservative estimation of the overall FLOPs.For point-wise kernels, we used two methods to count the number of operations.The first method counted all double-precision instructions in the assembly code, while the other counted the specific algorithm and operation of the numerical scheme.Both methods produced similar operation counts.Here, we chose the latter for the final number.7.1.2Weak scaling.For weak scaling, four sets of meshes (see Table 3) have been generated, with element counts ranging from 211 million to 1.69 billion.Fig. 12 shows the weak scaling test result,

Peak sustained performance for Test Case 1
The peak performance was benchmarked using Mesh5 with a seventh order of polynomial, resulting in 1.69 billion elements and 865 billion DOFs.The case was ported to 60% of the Sunway machine with 350 thousand MPE cores and 22.4 million CPE cores, achieving a sustained performance of 115.8 DP-PFLOPs.

Physics revealed through simulation with
high complexity, high fidelity, high precision, and high efficiency 7.3.1 Test Case 1.The physics run was performed on Mesh1 with a third order polynomial.The chosen Courant-Friedrichs-Lewy (CFL) value was 1, requiring approximately 100,000 time steps to complete one flow pass over the blade chord.The case commenced with a 1 st order of accuracy to flush out transients, and then switched to the fourth order for a high fidelity run.Following the low order run, a total of 300,000 time steps was initiated to achieve a fully developed state, after which time averaging was performed over another 300,000 time steps.The entire simulation was performed on 2000 nodes with 768,000 CPEs for 40 wall clock hours.Fig. 13 compares the simulation results with experimental data.It is noted that numerical results compare favourably with experimental and numerical results in literature for both heat transfer coefficient and isentropic Mach number over the blade surface.The convective heat transfer coefficient is notoriously difficult to predict with errors of over 50% for RANS simulations [7].The fact that the current high fidelity simulation can accurately predict the heat transfer coefficient is that the rich flow physics is captured.Fig. 14 shows a snapshot of the instantaneous flow with turbulent flow structures visualized by Q-criterion [34] and shock wave and wake propagation by pressure gradient in the background.The vortex shedded from the blade trailing edge generates strong pressure waves, which impinge on the neighbouring blade suction surface, creating a region with oscillating heat transfer coefficient (see fig. 13).On the suction surface near trailing edge, boundary layer transition triggered by the upstream moving pressure wave is also evident, which is consistent with the sharp increase of heat transfer coefficient near the trailing edge.simulation can confidently predict the blade surface temperature as more physics are captured, such as jet ejecting to the mainstream, endwall flows, wake shedding from the trailing edge, etc.

IMPLICATIONS
The present study details the development of a LES code, which is fully portable to the new Sunway architecture.The code includes a parallel pre-processor that can handle large meshes with billions of elements and efficiently find matching faces for MPI communication.The code was validated and profiled using high-pressure turbine testcases, and numerical results compared favorably with experimental data, particularly for the challenging prediction of heat transfer coefficient.The ability to accurately predict heat transfer is crucial for predicting blade temperature and material integrity in industrial turbomachinery applications, particularly for transitional and separated flows.With higher confidence in simulation, more boundary-pushing and groundbreaking designs can be made to increase overall engine efficiency and eventually reduce  2 .
Our work demonstrates the potential of high order FR methods and heterogeneous computing power for LES of turbomachinery flows.

Parallelization Optimization for FR method on Heterogeneous Computing Architecture
We have demonstrated in this work that the Sunway architecture can be efficiently utilized for high-order CFD simulations on unstructured meshes, through the implementation of multilevel parallelism technique.Our results demonstrate excellent scalabilty up to 300,000 MPI ranks with 19.2 million CPE cores.Furthermore, while the Sunway architecture differs from other heterogeneous platforms in terms of its memory model, we anticipate that such multi-level parallelism technique can be employed to optimize algorithms for other platforms as well.With support from other technologies, such as adaptive mesh generation with high-order elements, the flux reconstruction method can be a competitive candidate for the future whole engine simulation.

Large Eddy Simulation for Turbomachinery Flows
Nowadays, CFD plays an increasingly important role in turbomachinery's aerodynamic, thermal, and aeromechanic design, resulting in shorter design cycles, safer and more efficient performance, and reduced weight and costs.However, RANS based turbulence model often underestimates the diffusion-driven turbulent mixing.This affects the prediction of almost every turbomachinery component.LES, on the other hand, offers substantially more accurate and detailed information about the flow field than RANS, but requires drastically more computational resources.In our research, we demonstrated how LES can accurately model sizable sections of practical turbomachinery blades on the emerging exascale computing technology.Since the scale of computing used in cutting-edge HPC systems, like Sunway, is about 10-15 years ahead of that used by the leading industrial adopter [35], this pioneering research will open the door for future high-fidelity and complex numerical studies for turbomachinery flows, encouraging more innovative designs with short turnaround time to achieve safer and more efficient gas turbines with lower emissions.

Figure 1 :
Figure 1: Main components of gas turbine and turbomachinery flow features: (a) schematic of core components of a gas turbine; (b) illustration of a turbo-fan engine; (c) enthalpy field of hot streaks incoming from combustor into the high pressure turbine; (d) turbulent flow structures near the turbine endwall; (e) CAD model of a real turbine blade; (f) clean-up CAD model of a turbine blade with cooling holes for CFD calculation; (g) temperature field on high pressure turbine blade surface with cooling holes.

Figure 2 :
Figure 2: The architecture of SW26010Pro many-core processor on the new Sunway supercomputer.

Figure 3 :
Figure 3: Overview of the code structure.

Figure 7 :
Figure 7: Computational domain of the high pressure turbine testcase.

Figure 8 :
Figure 8: Cross-sectional view of the mesh around the blade trailling edge.

Figure 9 :
Figure 9: Geometrical model and boundary condition of high pressure turbine vane with cooling holes.

Figure 10 :
Figure 10: Surface mesh of the high pressure turbine vane on (a) internal channels and cooling holes and (b) the junction of leading edge and endwall and (c) the pressure surface and (d) the trailing edge.

1
Scaling tests for Test Case 1 7.1.1Strong scaling.Fig. 11 shows the strong scaling test results on the latest generation Sunway supercomputer.The 211 million and 422 million unstructured meshes of LS89 case were benchmarked and achieved the excellent parallel efficiency from 1.2 million to 19.2 million CPE cores.At 19.2 million CPE cores, the parallel efficiency of three configurations all dropped to below 90%.This is due to the fact that each CPE core is only allocated with 11 elements or 22 elements for Mesh2 and Mesh3 respectively, at which the CPE core cannot be fully loaded.Considering the configuration of Mesh3 with different polynomial orders, it is observed that the efficiency of order p = 6 is higher than that with p = 5.The strong scalling test result suggests that the element size on each CPE plays a vital role in the parallel performance.

Figure 11 :
Figure 11: Strong scaling of ZFR on the latest generation Sunway supercomputer for different meshes and polynomial orders.

where 2 .
4 million to 19.2 million CPE cores are used in the test.Excellent weak scaling performance is observed up to 19.2 million CPE cores with various orders.

Figure 12 :
Figure 12: Weak scaling of ZFR on the latest generation Sunway supercomputer for different polynomial orders.

Figure 15 :
Figure 15: Temperature over the suction surface of the high pressure turbine vane with coolant ejecting from cooling holes.

Table 1 :
Performance attributes on Sunway supercomputer.
• G are the transformed inviscid and viscous flux, ∇  • □ is divergence operator in the reference domain, and Ŝ = |J|S is the transformed source term.Here, J   =     is the transformation Jacobian, where  represents coordinates in physics domain, and  represents the coordinates in reference domain.The transformed solution and flux functions are approximated by piecewise discontinuous polynomials defined in the reference

Table 3 :
Mesh element counts of Test Case 1.

Table 4 :
Test Case 2 boundary condition and mesh element counts.