Accepted Papers List
-
Dimensionality Reduction-based Interactive Visual Analytics Approach for Investigating Ensemble Weather Simulations
Go Tamura (Photron Limited), Sena Kobayashi (Kobe University), Naohisa Sakamoto (Kobe University), Yasumitsu Maejima (Kobe University), Jorji Nonaka (RIKEN R-CCS)
-
Scalable Dual Coordinate Descent for Kernel Methods
Zishan Shao (Duke University), Aditya Devarakonda (Wake Forest University)
-
ITTPD: In-place Tensor Transposition with Permutation Decomposition on GPUs
Kai-Jung Cheng (National Tsing Hua University), Che-Rung Lee (National Tsing Hua University)
-
Scaling molecular dynamics for large-scale simulation of biological systems on AMD CPU/GPU supercomputers: Lessons from LUMI
Diego Ugarte La Torre (RIKEN Center for Computational Science), Jaewoon Jung (RIKEN Center for Computational Science, RIKEN Cluster for Pioneering Research), Yuji Sugita (RIKEN Center for Computational Science, RIKEN Center for Biosystems Dynamics Research)
-
When HPC Scheduling Meets Active Learning: Maximizing The Performance with Minimal Data
Jiheon Choi (Ajou University), Jaehyun Lee (Ajou University), Minsol Choo (Ajou University), Taeyoung Yoon (Ajou University), Oh-Kyoung Kwon (Korea Institute of Science and Technology Information), Sangyoon Oh (Ajou University)
-
Qsync: Extending Simplified SMR Protocol with Partial Network Partition Tolerance
Aoi Kida (Keio University), Hideyuki Kawashima (Keio University)
-
Lattice QCD code on GPUs: Implementation and performance comparison with OpenACC and CUDA
Wei-Lun Chen (High Energy Accelerator Research Organization (KEK), The Graduate University for Advanced Studies (SOKENDAI)), Issaku Kanamori (RIKEN Center for Computational Science), Hideo Matsufuru (High Energy Accelerator Research Organization (KEK), The Graduate University for Advanced Studies (SOKENDAI))
-
Large Scale Ensemble Coupling of Non-hydrostatic Atmospheric Model NICAM
Takashi Arakawa (The University of Tokyo), Hisashi Yashiro (National Institute for Environmental Studies), Shinji Sumimoto (The University of Tokyo), Kengo Nakajima (The University of Tokyo)
-
A Full-Path Priority Based Workflow Scheduling Approach
Wei-Cheng Tseng (National Taichung University of Education), Kai-Yang Rong (National Taichung University of Education), Kuo-Chan Huang (National Taichung University of Education)
-
Accelerating General Relativistic Radiation Magnetohydrodynamic Simulations with GPUs
Ryohei Kobayashi (Institute of Science Tokyo), Hiroyuki R. Takahashi (Komazawa University), Akira Nukada (University of Tsukuba), Yuta Asahina (University of Tsukuba), Taisuke Boku (University of Tsukuba), Ken Ohsuga (University of Tsukuba)
-
PBHS: A Prediction-Based Scheduler for Hyperparameter Tuning
Hong-Feng Yu (National Tsing Hua University), Jerry Chou (National Tsing Hua University)
-
Libra: A Python-Level Tensor Re-Materialization Strategy for Reducing Deep Learning GPU Memory Usage
Ling-Sung Wang (National Tsing Hua University), Sao-Hsuan Lin (National Tsing Hua University), Jerry Chou (National Tsing Hua University)
-
Using a Large Language Model as a Building Block to Generate Usable Validation and Verification Suite for OpenMP
Swaroop Pophale (Oak Ridge National Laboratory), Wael Elwasif (Oak Ridge National Laboratory), David E. Bernholdt (Oak Ridge National Laboratory)
-
Revisiting Memory Swapping for Big-Memory Applications
Shun Kida (Keio University), Satoshi Imamura (Fujitsu Limited), Kenji Kono (Keio University)
-
PCIe Bandwidth-Aware Scheduling for Multi-Instance GPU
Yan-Mei Tang (National Tsing Hua University), Wei-Fang Sun (NVIDIA AI Technology Center), Hsu-Tzu Ting (National Tsing Hua University), Ming-Hung Chen (IBM Research), I-Hsin Chung (IBM Research), Jerry Chou (National Tsing Hua University)
-
Lossy Compressed Collective Inter-FPGA Communications
Michihiro Koibuchi (National Institute of Informatics, The Graduate University for Advanced Studies), Yoshinobu Ishida (NSW Inc), Shoichi Hirasawa (National Institute of Informatics), Yao Hu (Information Technology Center, The University of Tokyo), Takumi Honda (Fujitsu), Yusuke Nagasaka (Fujitsu), Naoto Fukumoto (Fujitsu)
-
Fast Malicious Packets Inspection Framework Using Converged Accelerator
Chuan-Ming Ou (National Tsing Hua University), Yong-Xuan Huang (National Tsing Hua University), Ming-Hung Chen (IBM Research), I-Hsin Chung (IBM Research), Jerry Chou (National Tsing Hua University)
Accepted Posters List
-
Power-efficient Data Compression for Edge-Cloud Environment
Yiyu Tan, Peng Chen, Yusuke Tanimura, Truong Thao Nguyen
Abstract: A power-efficient data compression solution is proposed to minimize communication bandwidth and storage requirements for advanced edge-cloud computing platforms, which serve as a vital infrastructure for the post-5G era. Our solution leverages power-efficient Field Programmable Gate Arrays (FPGAs) to implement GZIP, a highly optimized lossless compression algorithm, within the constrained environment of edge-cloud systems, where computing resources such as power and processing capacity are limited. By deploying our method on the DE10-Agilex FPGA card and testing with the widely used enwik datasets, we demonstrated that the data size was reduced by 48.8% and 52.5% for enwik8 and enwik9, respectively, which is crucial for maintaining performance in bandwidth-sensitive edge-cloud environments. The FPGA-based data compression achieved approximately 5.6x the compression throughput and 5.2x the power efficiency of pigz executed on a desktop with an Intel Xeon Gold 6212U processor.
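As a point of reference for the ratios above, the size-reduction metric can be reproduced in a few lines of Python with software gzip; this is only an illustrative stand-in, since the poster's implementation runs GZIP on an FPGA:

```python
# Measure gzip's size reduction for a file; CPU gzip stands in for the
# FPGA pipeline described in the abstract (illustrative only).
import gzip

def size_reduction(path: str, level: int = 6) -> float:
    """Fraction by which gzip reduces the file size (0.488 = 48.8%)."""
    with open(path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw, compresslevel=level)
    return 1.0 - len(compressed) / len(raw)

# Example: size_reduction("enwik8") is roughly 0.5 for this corpus.
```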
-
Optimizing the Domain Decomposition of N-body Simulations and Performance Evaluation on the Supercomputer Fugaku
Aki Fujita, Tomoaki Ishiyama
Abstract: The degree of parallelism in modern supercomputers is enormous, but the execution time of a parallel program is not simply proportional to the number of parallel processes. To achieve improved execution performance, it is also important to elevate the efficiency of domain decomposition. We therefore developed a method for optimizing the domain decomposition step proposed in a previous study (Ishiyama, Fukushige, and Makino, 2009) and evaluated execution performance on Fugaku. As a result, the total execution and calculation times of our proposed approach were 3.7 times faster than those of the traditional approach for 64 processes; 5 times faster for 128, 256, and 512 processes; and 7 times faster for 1,024 or more processes. Communication time was also reduced across all process cases, and parallelization performance was higher than that of the traditional approach. Future work will include the evaluation of execution performance with even larger data sizes and numbers of nodes, as well as optimization by thread parallelization.
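For readers unfamiliar with domain decomposition in N-body codes, the sketch below shows a generic recursive-bisection decomposition in Python; it is a simplified illustration of the problem being optimized, not the authors' method:

```python
# Generic recursive-bisection domain decomposition (illustrative only):
# split particles along the longest axis so each process receives an
# equal share of the particles.
import numpy as np

def bisect(points: np.ndarray, n_procs: int) -> list:
    """Recursively split a point set into n_procs equal-sized domains."""
    if n_procs == 1:
        return [points]
    axis = np.argmax(points.max(axis=0) - points.min(axis=0))
    order = np.argsort(points[:, axis])
    left = n_procs // 2
    cut = len(points) * left // n_procs
    return (bisect(points[order[:cut]], left)
            + bisect(points[order[cut:]], n_procs - left))

domains = bisect(np.random.rand(100_000, 3), 64)  # e.g. 64 MPI processes
```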
-
Graphical Representation for Directed Acyclic Graph with Ray-tracing Cores Acceleration
Zhengyang Bai, Peng Chen, Zhuang Chen, Jing Xu, Emil Vatai, Mohamed Wahib
Abstract: Nowadays, GPUs are widely used in high-performance computing due to their ability to handle massive parallelism with thousands of cores and high memory bandwidth, making them ideal for tasks like simulations and machine learning. However, graph algorithms present challenges because of irregular data access patterns, complex dependencies, and dynamic workloads that do not align well with GPUs' SIMD architecture. These issues lead to inefficient parallelism, under-utilization of GPU cores, and difficulties in load balancing, particularly with large datasets on supercomputers. As highlighted in the latest TOP500 and GRAPH500 rankings, systems like Frontier and Aurora achieve less than 2.5% of their peak performance on graph problems, underscoring the need for more effective solutions. This proposal introduces a radically different approach to the analysis of directed acyclic graphs (DAGs), graphs in which all edges are directed and no cycles exist, by leveraging geometry rather than the discrete mathematics and linear algebra of conventional approaches. By treating DAG analysis as a rendering problem, this proposal applies ray casting techniques typically used in computer graphics to uncover relationships in DAGs, accelerated by the ray-tracing cores in the latest GPUs. This proposal aims to develop an open-source library for high-performance DAG analysis, with applications across various fields.
-
A highly scalable domain decomposition method for blood flow simulation in human artery with slip and leakage effects
Xiangdong Zhang, Li Luo, Ye Li, Xiao-Chuan Cai
Abstract: We consider the numerical simulation of blood flows in the human artery using the incompressible Navier-Stokes equations with slip and penetration boundary conditions on part of the arterial wall. Most existing approaches assume the no-slip and no-penetration boundary conditions for the entire wall, but such assumptions do not hold under certain pathological conditions. For these problems, we develop a fully implicit domain decomposition method for the incompressible Navier-Stokes equations with the proposed more realistic boundary conditions. A stabilized finite element method, with a high order boundary discretization, is used for the spatial discretization and a backward Euler method is used for the temporal discretization. The resulting large, sparse, and highly ill-conditioned nonlinear algebraic equations are scaled and solved by a Newton-Krylov method that is preconditioned by a parallel overlapping Schwarz method. Benchmark simulations for flows in 2D and 3D channels demonstrate good agreement with previous studies. An in-depth study is provided for the influence of the slip and penetration boundary conditions on the flow behavior and the parallel solver. The proposed method performs well for blood flows in a realistic patient-specific cerebral artery with some level of stenosis, and the parallel scalability is studied using up to 4,096 processor cores.
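The Newton-Krylov outer loop described above can be illustrated on a toy problem with SciPy's newton_krylov; this sketch omits the stabilized finite elements, scaling, and overlapping Schwarz preconditioning of the actual parallel solver:

```python
# Newton-Krylov on a toy 1D nonlinear problem (not the paper's
# Navier-Stokes system): solve u'' + 0.1*u**3 = f with u = 0 at the ends.
import numpy as np
from scipy.optimize import newton_krylov

def residual(u):
    r = np.zeros_like(u)
    r[1:-1] = u[:-2] - 2.0 * u[1:-1] + u[2:] + 0.1 * u[1:-1] ** 3 - 1e-4
    r[0], r[-1] = u[0], u[-1]   # Dirichlet boundary rows
    return r

u = newton_krylov(residual, np.zeros(100), method="lgmres")
```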
-
Parallel Simulation of Cardiac Electrophysiology of a Human Heart
Tianhao Ma, Yi Jiang, Xiao-Chuan Cai
Abstract: Numerical simulation of cardiac electrophysiology has emerged as a crucial tool in clinical research yet faces significant computational challenges due to the complexity of modeling human heart dynamics. This study presents a parallel implementation of cardiac electrophysiology simulation using the monodomain model coupled with detailed ionic models: the Grandi 2011 model for atria and the ten Tusscher 2006 model for ventricles. The approach incorporates a linear anisotropic conductivity model that accounts for fiber effects, with fiber orientations generated through a rule-based method. The numerical solution employs a parallel implicit-explicit finite element method, combining first-order finite elements for spatial discretization with an implicit scheme for temporal discretization. The gating variables are handled using the Rush-Larsen method, while ion concentrations are computed via the forward Euler method. The resulting system of algebraic equations is solved using a conjugate gradient method with additive Schwarz preconditioning. Performance analysis demonstrates strong scaling capabilities across thousands of computational cores, enabling efficient simulation of high-resolution heart models while maintaining accuracy. This work provides a robust framework for investigating complex cardiac conditions through computational modeling, offering potential applications in clinical research and patient-specific cardiac studies. The implementation achieves both computational efficiency and numerical accuracy, making it suitable for large-scale simulations of cardiac electrophysiology.
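The Rush-Larsen scheme mentioned above has a standard closed-form update; a minimal sketch for a single gating variable, with the model-specific rate values simply passed in as numbers:

```python
# Rush-Larsen exponential update for a gating variable y with rates
# alpha(V) and beta(V): y -> y_inf + (y - y_inf) * exp(-dt / tau).
import numpy as np

def rush_larsen_step(y, alpha, beta, dt):
    tau = 1.0 / (alpha + beta)   # gate time constant
    y_inf = alpha * tau          # gate steady-state value
    return y_inf + (y - y_inf) * np.exp(-dt / tau)
```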
-
Improving the Convergence of the Preconditioned Bi-CGSTAB Solver through Error Vector Sampling for a Sequence of Asymmetric Linear Systems
Hirotoshi Tamori, Takeshi Fukaya, Takeshi Iwashita
Abstract: This poster focuses on a solution process for a sequence of linear systems having an identical asymmetric coefficient matrix. Such sequences often appear in time-dependent simulations where each right-hand side vector depends on the solution of the previous system. These systems are usually solved using Krylov iterative methods, but slow convergence can significantly hinder performance. Therefore, improving the convergence rate of iterative solvers is important for enhancing computational efficiency. While various preconditioning techniques exist, handling slowly converging components remains particularly challenging for asymmetric systems. We present the error vector sampling subspace correction (ES-SC) preconditioning method for asymmetric systems.
The method identifies and corrects components that cause slow convergence, particularly those associated with small singular values. These components are captured through error vectors sampled during the first solution process, from which we obtain approximate left and right singular vectors using the Rayleigh-Ritz method. Using these singular vectors, we construct auxiliary matrices that define the subspace for SC preconditioning and accelerate the solution process for the subsequent linear systems in the sequence. A similar approach for symmetric linear systems has already been reported, and we extend the method to asymmetric systems in this research.
The ES-SC preconditioner is designed to work with other preconditioning techniques. We combine it with ILU preconditioning using the additive Schwarz method and implement a parallel BiCGSTAB method with block Jacobi for ILU. The ILU preconditioner improves solver convergence, and block Jacobi is well-suited for parallel processing, making them widely used in large-scale numerical simulations. Numerical experiments using matrices from the SuiteSparse Matrix Collection demonstrate that, under appropriate parameter settings, the ES-SC preconditioner reduces solver iterations across all test cases compared to ILU preconditioning alone. Additionally, it decreases computational time in 8 out of 14 cases by offsetting its overhead with faster convergence.
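As a baseline for the combination described above, ILU-preconditioned BiCGSTAB can be expressed compactly with SciPy; this is a serial toy without the ES-SC subspace correction or block Jacobi parallelism:

```python
# Serial ILU-preconditioned BiCGSTAB with SciPy; the ES-SC correction
# from the poster would be composed with M below (illustrative only).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = (sp.random(n, n, density=0.01, random_state=0) + 10.0 * sp.eye(n)).tocsc()
b = np.ones(n)

ilu = spla.spilu(A)                                # incomplete LU factors
M = spla.LinearOperator((n, n), matvec=ilu.solve)  # apply M^{-1}
x, info = spla.bicgstab(A, b, M=M)                 # info == 0 on success
```
-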
Building the Quantum-HPC Hybrid Platform: Perspectives on Environment Setup
Tomoya Yuki, Shin'ichi Miura, Takashi Uchida, Miwako Tsuji, Yuetsu Kodama, Mitsuhisa Sato
Abstract: Quantum computers are advancing rapidly, with increasing qubit counts gradually enabling practical applications. From an HPC perspective, quantum computers can act as accelerators for offloading specific tasks, making integrated QC-HPC infrastructure necessary. This work introduces a multi-site QC-HPC Hybrid Platform designed to facilitate cooperation between quantum and classical resources. The platform addresses key challenges such as job scheduling, resource prioritization, and authentication. A unified authentication system provides single sign-on (SSO) across all resources, while job coordinators ensure efficient management of hybrid jobs using existing schedulers. Additionally, user-friendly interfaces, including pre-configured quantum environments accessible via OpenOnDemand, enable seamless access to quantum programming resources. This work also discusses several challenges encountered during the development of the platform.
-
DSICE: Parameter Tuning Library using d-Spline Estimation
Yuga Yajima, Akihiro Fujii, Teruo Tanaka
Abstract: Software auto-tuning (AT) is a technology for automating software performance optimization. One AT approach replaces human tuning with automated trial and error. In this approach, programmers parameterize the factors that affect performance as “performance parameters” in advance. Then, the AT mechanism controls these parameters and searches for appropriate values. In parameter tuning, accurate estimation and intensive trials in promising areas are desirable. We have studied estimation methods based on the fitting function d-Spline, which keeps the tuning cost down. In this study, we created DSICE, a library implementing these methods. DSICE provides two types of search methods: sequential search and parallel search. The sequential search method is mainly for ordinary computers. With only minor changes to the program (no more than 5 lines in the minimal case), the search method with d-Spline estimation can be easily incorporated. The parallel search method, on the other hand, is mainly for supercomputers, especially for hyperparameter tuning of machine learning programs. When using the parallel search method, in addition to the tuner (with d-Spline estimation), DSICE provides job management functions and agent functions. The job management functions send values for trials and aggregate performance values. The agent functions receive values to be set as performance parameters and send performance values back to the job manager. With DSICE, low-cost tuning can be introduced easily.
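The general idea of fitting-function-guided search can be sketched as follows; note that this uses a generic SciPy spline as a stand-in for d-Spline and is not the DSICE API:

```python
# Fitting-function-guided parameter search (conceptual sketch, not the
# DSICE API): fit observed costs with a smooth curve, then measure the
# candidate the fit predicts to be best.
import numpy as np
from scipy.interpolate import UnivariateSpline

candidates = np.arange(1, 65)                  # e.g. a block-size parameter
tried = np.array([1, 16, 32, 48, 64])          # parameter values measured so far
costs = np.array([5.0, 2.1, 1.2, 1.8, 2.8])    # hypothetical runtimes

fit = UnivariateSpline(tried, costs, k=2, s=1.0)
next_trial = candidates[np.argmin(fit(candidates))]  # most promising point
```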
-
Application of Physics-Informed Neural Networks to Magnetohydrodynamic problem
Yen-Chen Chen, Hsiao-Hsuan Lin, Tsung-Che Tsai, Wei-Sheng Chen, Kun-Han Lee, Han Hu
Abstract: High-temperature plasma in a tokamak is unstable and challenging to control. We propose utilizing a novel approach known as physics-informed neural networks (PINNs) to address the control problem in tokamak operation. We apply PINNs to derive solutions for magnetohydrodynamics (MHD) problems and integrate the model into a real-time control system. In this presentation, we demonstrate an example of an MHD problem solved by PINNs, followed by a discussion of the results.
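For readers new to PINNs, the core idea of penalizing a PDE residual through automatic differentiation looks roughly like this; the sketch uses a toy 1D heat equation, not the MHD system of the poster:

```python
# Conceptual PINN residual loss for a toy PDE u_t = u_xx (not the MHD
# equations from the abstract); the network learns u(x, t).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

def pinn_loss(x, t):
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = net(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return ((u_t - u_xx) ** 2).mean()   # penalize the PDE residual

loss = pinn_loss(torch.rand(64), torch.rand(64))
```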
-
Fast Inference by Dynamically Selecting a Quantization Level
Shun Sasaki, Hideyuki Kawashima
Abstract: We propose a dynamic inference framework that utilizes both a non-quantized (Base) model and a 4-bit quantized (Quant) model. A difficulty classifier determines whether an input is "easy" or "difficult." Easy samples are processed by the Quant model for efficiency, while difficult samples are handled by the Base model for accuracy. The classifier is trained using sentence length, semantic similarity, syntactic complexity, and lexical diversity, achieving 91.28% accuracy on a training set.
Our approach is evaluated on the MRPC task from the GLUE benchmark. The all-Base strategy achieves 63.25% accuracy with an average inference time of 6.62s per sample, while all-Quant yields 61.28% accuracy at 3.19s. Our routed approach attains 61.39% accuracy at 3.98s, demonstrating a trade-off between accuracy and speed.
All experiments were conducted on a DGX system with eight H100 GPUs. For inference, we applied a greedy decoding strategy with a low temperature (0.0) and a maximum output length of 256 tokens. Quantization used 4-bit representation with double quantization and bfloat16 compute type, ensuring a balance between compression and stability.
While our method does not exceed the accuracy of the Base model, it surpasses the Quant model while improving inference speed. This work highlights the potential of dynamic quantization selection to optimize LLM efficiency, paving the way for further refinements that approach Base model accuracy while maintaining lower inference costs.
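The routing logic itself is simple; a hypothetical sketch follows, in which the classifier and model objects are placeholders rather than the authors' implementation:

```python
# Sketch of the routing idea: a difficulty classifier sends easy inputs
# to the 4-bit model and hard ones to the base model. All model and
# classifier objects here are hypothetical placeholders.
def extract_features(text):
    # Stand-ins for the abstract's features: sentence length and a crude
    # lexical-diversity proxy (the real classifier also uses semantic
    # similarity and syntactic complexity).
    words = text.split()
    return [len(words), len(set(words)) / max(len(words), 1)]

def route(text, classifier, base_model, quant_model):
    if classifier.predict([extract_features(text)])[0] == "easy":
        return quant_model.generate(text)    # fast 4-bit path
    return base_model.generate(text)         # accurate full-precision path
```
-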
CASSA: Cloud-Adapted Secure Silo Architecture
Masahide Fukuyama, Hideyuki Kawashima
Abstract: In today’s data science landscape, deploying databases on cloud platforms or supercomputers has become a common practice. Cloud providers typically protect data from external threats by encrypting information in transit (e.g., through secure network channels) and at rest (e.g., through storage encryption). However, data processed in memory remains at risk because it could theoretically be accessed by providers with full administrative privileges. Thus, ensuring the confidentiality of data on the server depends on absolute trust in these providers. A promising solution for ensuring both data and query confidentiality in untrusted environments is the use of a Trusted Execution Environment (TEE). EnclaveDB, a pioneering system in the database area, employs TEE to protect both stored data and queries. However, its logging mechanism is sequential and does not take full advantage of the parallel I/O capabilities available in modern storage systems. To increase performance while maintaining data confidentiality and log integrity, we introduce CASSA: Cloud-Adapted Secure Silo Architecture. CASSA provides a sufficient implementation of transactional record logging for recovery and supports Remote Attestation, a feature essential for commercial applications. By leveraging TEE (such as Intel SGX) to secure data during processing and integrating Silo, a high-performance optimistic concurrency control mechanism that enables parallel logging, CASSA achieves high throughput without compromising confidentiality. Our evaluations show that CASSA, particularly its transaction engine, performs competitively even when compared to non-SGX environments. Under the YCSB-A workload (50% read/write) using 60 worker threads, CASSA achieved a throughput of 2.27 million transactions per second with only 7.8% overhead, while maintaining the security guarantees of SGX.
-
Optimizing Epic Deterministic Concurrency Control with More Parallelism
Koki Matsushima, Hideyuki Kawashima
Abstract: In transaction processing systems, ensuring data consistency while maintaining high throughput requires efficient concurrency control. Recently, deterministic concurrency control methods have attracted considerable attention; by determining the execution schedule in advance, they avoid aborts and deliver stable performance even under high-contention workloads. However, in the initialization phase of Epic, one such method, a full scan is required to determine the processing order of batched transactions, and this process has become a bottleneck that impedes parallelization. In this study, we propose a fine-grained optimization technique that eliminates the need for a full scan in order to enhance parallelism in Epic's initialization phase. Specifically, parallelization is enabled by employing a record ID-based index rather than an epoch-based index when determining the execution order of operations. Evaluation of the proposed method demonstrated up to a 59% improvement in throughput compared with conventional Epic, with particularly notable performance enhancements under high-contention workloads. These results indicate that significant performance improvements in deterministic concurrency control can be achieved through the optimization of the initialization phase.
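The record ID-based indexing idea can be illustrated with a toy example (not Epic's or the authors' code): grouping a batch's operations by record yields per-record lists whose order can be fixed independently and in parallel:

```python
# Toy record-ID-based index: each record's operation list arrives in
# deterministic batch order and can be processed independently.
from collections import defaultdict

def index_by_record(batch):
    """batch: iterable of (txn_position, record_id, op) in batch order."""
    per_record = defaultdict(list)
    for pos, record_id, op in batch:
        per_record[record_id].append((pos, op))
    return per_record

ops = [(0, "a", "w"), (1, "b", "r"), (2, "a", "r")]
print(index_by_record(ops)["a"])   # [(0, 'w'), (2, 'r')]
```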
-
An Operational Data Analysis and Energy Cost-Saving Assessment of the Cogeneration System in an HPC Data Center
Masaaki Terai, Katsuyuki Tanaka, Yasuhiro Hiraki, Shin'ichi Miura, Fumiyoshi Shoji
Abstract: A Cogeneration System (CGS) is power generation equipment that supplies multiple forms of secondary energy (e.g., electricity, heat) from primary energy sources (e.g., oil, natural gas). At RIKEN R-CCS, two 6 MW-class CGS units have been installed since the K computer operational period. The system performs regular power generation with gas turbine engines using city gas alongside commercial power sources and serves as an alternative to an uninterruptible power supply (UPS). By combining waste heat with absorption chillers and once-through boilers, the system produces cooling energy and improves energy efficiency from 30% to approximately 75-80%.
This study evaluates the cost-effectiveness of CGS in a high-performance computing facility. Prior to assessing the cost-benefit performance, we introduced comprehensive unit rates to simplify complex utility rate structures. Using these comprehensive unit rates, our analysis compared the costs between CGS electricity generation through gas purchases and equivalent utility electricity purchases, while evaluating the reduction in electricity costs due to thermal energy utilization. The results revealed that while the system achieved cost benefits in five out of twelve operational years (FY2016, 2017, and 2019-2021), it operated at a loss during most periods. The thermal energy recovery consistently provided positive returns, but these benefits were often offset by negative returns from power generation. When considering the installation and maintenance costs of essential equipment such as absorption chillers and once-through boilers, even the profitable periods’ gains were substantially diminished.
This operational data analysis spanning more than a decade provides important implications for future facility planning and insights into CGS implementation in HPC data center environments amid widely fluctuating energy prices.
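In spirit, the comprehensive-unit-rate comparison reduces to simple arithmetic; the sketch below uses made-up rates purely for illustration, not RIKEN's figures:

```python
# Hypothetical back-of-envelope version of the cost comparison above;
# all rates are invented "comprehensive unit rates".
gas_cost_per_generated_kwh = 18.0   # yen/kWh to generate via CGS from gas
utility_rate = 17.0                 # yen/kWh to buy the same electricity
heat_credit_per_kwh = 3.0           # yen/kWh of cooling offset by waste heat

net = (utility_rate + heat_credit_per_kwh) - gas_cost_per_generated_kwh
print(f"net benefit: {net:+.1f} yen per generated kWh")  # +2.0 here
```
-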
Gato: Optimizing Multi-Version Deterministic Concurrency Control with Dynamic Partitioning
Jiyeon Lim, Haowen Li, Hideyuki Kawashima
Abstract: Multi-Version Deterministic Concurrency Control (MVDCC) ensures consistency by predefining transaction execution order, eliminating aborts and rollbacks. Bohm, an MVDCC-based protocol, assigns records to threads using static partitioning, ensuring deterministic execution. However, this approach can lead to significant workload imbalances, as certain threads may become overloaded while others remain underutilized. Additionally, in the execution phase, read spinning occurs when transactions must wait for unresolved write dependencies, increasing contention and degrading performance in high-contention workloads.
To address these inefficiencies, we propose Gato, an MVDCC protocol integrating dynamic partitioning to enhance load balancing in the concurrency control (CC) phase. Unlike Bohm's static approach, Gato continuously monitors workload distribution and dynamically redistributes partitions among threads in real time, mitigating contention and improving parallelism. Additionally, Gato incorporates split-on-demand, a mechanism designed to minimize read spinning by allowing transactions to proceed without unnecessary stalls when dependencies permit. Our current implementation focuses on Gato's CC phase, evaluating the impact of dynamic partitioning on system performance. Experimental results demonstrate that Gato achieves higher throughput (73,000-77,000 transactions per second) compared to Bohm, particularly under write-intensive workloads. While Bohm's static partitioning leads to severe load imbalances, reducing system efficiency as contention increases, Gato maintains stable execution by dynamically adapting to workload variations.
These results highlight the effectiveness of dynamic partitioning in improving MVDCC performance. Future work will focus on implementing the execution phase, incorporating the split-on-demand mechanism to further optimize transaction processing in high-contention environments.
Keywords: Concurrency control, Multi-versioned databases, Deterministic systems, Dynamic partitioning, High-contention environments.
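A toy version of the dynamic partitioning idea, greedy least-loaded reassignment from observed access counts, can be sketched as follows (an illustration only, not Gato's mechanism):

```python
# Periodically reassign records to threads by observed access counts,
# greedily filling the least-loaded thread first (illustrative only).
def rebalance(access_counts, n_threads):
    loads = [0] * n_threads
    parts = [[] for _ in range(n_threads)]
    for rec, cnt in sorted(access_counts.items(), key=lambda kv: -kv[1]):
        t = loads.index(min(loads))      # least-loaded thread so far
        parts[t].append(rec)
        loads[t] += cnt
    return parts

print(rebalance({"x": 90, "y": 50, "z": 40}, 2))  # [['x'], ['y', 'z']]
```
-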
A trial for energy-efficient operation by incentivizing user cooperation on Fugaku
Fumichika Sueyasu, Yuji Iguchi, Mitsuo Okamoto, Katsufumi Sugeta, Fumiyoshi Shoji
Abstract: Energy efficiency is a critical challenge in operating large-scale HPC systems. The RIKEN Center for Computational Science (R-CCS) has implemented system-side measures at various stages, including system design, manufacturing, and operation of the supercomputer Fugaku, to enhance energy efficiency. However, in FY2022, a sharp increase in electricity prices necessitated a reduction in overall system power usage, leading to the temporary shutdown of approximately one-third of the compute nodes. This experience demonstrated that system-side power-saving efforts alone are insufficient and highlighted the need for user cooperation in further reducing power consumption. To address this, starting in FY2022, we requested users to voluntarily cooperate in reducing the power consumption of their jobs, but the effectiveness of this volunteer-based approach was limited. Therefore, in FY2023, we introduced the Fugaku Point Program, establishing a system that provides users with incentives to lower job power consumption and encourages their cooperation in energy-efficient operations. This approach, which actively involves users by providing incentives to improve energy efficiency, is pioneering and unique among large-scale, practical systems ranked among the top positions on the TOP500 list. This poster presents an overview of the Fugaku Point Program and its operational results in FY2023. Additionally, we discuss the challenges identified during its implementation and the measures taken to address them in the future.
-
ML-Driven Prediction of Time-Varying Throughput with Darshan
Yutsen Tseng, Keichi Takahashi, Hiroyuki Takizawa
Abstract: Due to the growing complexity of their workloads, High-Performance Computing (HPC) systems increasingly face I/O bottlenecks. Existing I/O-aware schedulers predict aggregate demand but fail to capture time-varying I/O throughput patterns, leading to inefficient resource allocation. This research presents a machine learning-driven framework for dynamically predicting job-specific I/O trends and execution times, enhancing I/O-aware scheduling efficiency.
The proposed method leverages Darshan logs to extract job metadata and transform I/O volumes into time-series data. A quantile regression model estimates execution time variability, while a long short-term memory (LSTM) network captures long-term dependencies in I/O throughput. The framework effectively detects I/O bursts, achieves an execution time prediction accuracy of 80.56% within the 10th-90th percentile range, and identifies 71.22% of sudden peaks within two consecutive time intervals.
These results demonstrate the framework's potential to improve HPC scheduling by providing workload-specific I/O insights. Future work will enhance I/O peak detection accuracy and integrate the model into real-time scheduling systems for optimized resource allocation.
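A minimal PyTorch sketch of the LSTM component described above, mapping a window of past I/O volumes to the next interval's volume (hyperparameters are illustrative, not the poster's):

```python
# Minimal LSTM regressor for time-varying I/O throughput prediction
# (illustrative configuration only).
import torch
import torch.nn as nn

class IOForecaster(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # next-interval I/O volume

model = IOForecaster()
pred = model(torch.randn(8, 20, 1))       # 8 jobs, 20 past intervals each
```
-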
Performance Evaluation and Analysis of C++ Parallel STL on GPUs
Joanna R. Masikowska, Keichi Takahashi, Hiroyuki Takizawa
Abstract: The C++17 standard introduced a set of parallel algorithms, referred to as Parallel STL, that is designed to be programmer-friendly and portable across CPUs and GPUs. Since the model is still relatively new, there are few studies comparing it with other GPU programming models. Furthermore, the underlying reasons for performance differences between Parallel STL and GPU programming models are not discussed. This work thus compares the performance of Parallel STL and other GPU programming models and investigates what causes the performance differences. Three benchmarks are selected for the comparison: BabelStream, the Himeno benchmark, and TestSNAP. In the Himeno benchmark, Parallel STL reaches only 61% of the performance of OpenACC. Profiling reveals that Parallel STL has a smaller cache hit ratio than OpenACC and does not make use of shared memory, while OpenACC does. It was also found that the reduction operation was a bottleneck for Parallel STL. In TestSNAP, Parallel STL performance is worse than that of both Kokkos and OpenMP Target Offloading. The reason is likely the usage of managed memory, which Parallel STL needs to use. Overall, while the performance of Parallel STL is comparable to other GPU programming models, it still falls behind in some cases. The reasons for poorer performance include a lower cache hit ratio, the lack of shared memory usage, weak performance in the reduction operation, and the necessity of using managed memory.
About Us
On behalf of the HPC Asia 2025 committee and the National Center for High-performance Computing (NCHC), it is our pleasure to welcome you to HPC Asia 2025, hosted in the vibrant city of Hsinchu, Taiwan. This year's theme, “Chip-based Exploration and Innovation for HPC,” resonates deeply with Hsinchu's dual legacy of rich cultural heritage and cutting-edge technological advancements.