Nature Methods | Brief Communication
https://doi.org/10.1038/s41592-024-02562-6

Efficiently accelerated bioimage analysis with NanoPyx, a Liquid Engine-powered Python framework

Bruno M. Saraiva1,2,12, Inês Cunha1,3,4,12, António D. Brito1,5,12, Gautier Follain6, Raquel Portela5, Robert Haase7, Pedro M. Pereira5, Guillaume Jacquemet6,8,9,10 & Ricardo Henriques1,2,5,11

Received: 5 September 2023; Accepted: 7 November 2024; Published online: xx xx xxxx

1Instituto Gulbenkian de Ciência, Oeiras, Portugal. 2Gulbenkian Institute for Molecular Medicine, Oeiras, Portugal. 3Instituto Superior Técnico, Lisbon, Portugal. 4Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden. 5Instituto de Tecnologia Química e Biológica António Xavier, Universidade Nova de Lisboa, Oeiras, Portugal. 6Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland. 7DFG Cluster of Excellence "Physics of Life", TU Dresden, Dresden, Germany. 8Turku Bioimaging, University of Turku and Åbo Akademi University, Turku, Finland. 9Faculty of Science and Engineering, Cell Biology, Åbo Akademi University, Turku, Finland. 10InFLAMES Research Flagship Center, Åbo Akademi University, Turku, Finland. 11UCL-Laboratory for Molecular Cell Biology, University College London, London, UK. 12These authors contributed equally: Bruno M. Saraiva, Inês Cunha, António D. Brito. e-mail: r.henriques@itqb.unl.pt

The expanding scale and complexity of microscopy image datasets require accelerated analytical workflows. NanoPyx meets this need through an adaptive framework enhanced for high-speed analysis. At the core of NanoPyx, the Liquid Engine dynamically generates optimized central processing unit and graphics processing unit code variations, learning and predicting the fastest on the basis of input data and hardware. This data-driven optimization achieves considerably faster processing, becoming broadly relevant to reactive microscopy and computing fields requiring efficiency.

Super-resolution microscopy has revolutionized cell biology by enabling fluorescence imaging at an unprecedented resolution1–4. However, data collected from these experiments often require specific analytical procedures, such as image registration, resolution enhancement and quantification of data quality and resolution. Many of these procedures use open-source image analysis software, particularly ImageJ5/FIJI6 or napari7. The computational performance of each of these tools bears notable implications for processing time, which becomes especially salient given the increasing need for high-performance computing in bioimaging analysis. In this work we present NanoPyx, a Python framework for microscopy image analysis that exploits the Liquid Engine to massively accelerate analysis workflows.

With the increasing use of deep learning, many bioimaging analysis pipelines are now being developed in Python. Pure Python code often runs on a single central processing unit (CPU) core, impacting the performance and speed of Python frameworks. Alternative solutions, such as Cython8, PyOpenCL9 and Numba10, allow CPU and graphics processing unit (GPU) parallelization, which can reduce run times (Supplementary Note 1). However, the swiftest implementation depends on the hardware, the input data and the parameters. Figure 1 illustrates a case where denoising the largest image with a nonlocal means (NLM) algorithm11,12 on a professional workstation is approximately two times faster using an unthreaded CPU strategy than a pixel-wise threaded implementation on a GPU (Fig. 1c and Supplementary Note 2). Notably, the same algorithm cannot be run on the testing laptop's GPU with the same parameters due to architecture limitations (Fig. 1b). This means that certain acceleration strategies have hardware constraints and require a different approach. However, for other conditions (condition 2 on workstation and laptop and condition 3 on laptop), GPU-based processing is a faster alternative for the same NLM algorithm. Extended Data Figs. 1–5 further support these observations by illustrating run times for various implementations across distinct datasets and parameters on contrasting hardware set-ups. Another example is Catmull–Rom13 interpolations parallelized in a pixel-wise manner (Extended Data Fig. 2), in which choosing an OpenCL14 implementation for smaller images can increase run time by several orders of magnitude compared with parallelized CPU processing.
Similarly, threaded CPU processing for larger images performed up to 30 times more slowly than GPU processing on professional workstations. Supplementary Tables 1–4 present benchmarks across ten different hardware set-ups, highlighting the limitations of relying on a single implementation, because it may not universally offer the fastest performance.

Fig. 1 | Comparative run times of multiple implementations of an algorithm, run on a consumer-grade laptop and a professional workstation. a–c, The fastest implementation (Supplementary Note 3) depends on various factors such as the shape of the input data, method-specific parameters and the user device. a, Nonlocal means denoising is performed on images of varying shape using a collection of patch sizes and distances (d). b, Run times of various conditions when performing analysis on a consumer-grade laptop; condition 1 could not be run on the GPU due to hardware limitations. T. dynamic, threaded dynamic; T. guided, threaded guided; T. static, threaded static. c, On a professional workstation, the fastest implementation changes with each condition, illustrating how it is affected by the input data and method-specific parameters.

Here we introduce NanoPyx, a high-performance bioimaging analysis framework exploiting the Liquid Engine. It uses multiple variations (here called implementations; Supplementary Note 3) of the same algorithm to perform a specific task. These variations span multiple acceleration strategies, including PyOpenCL9, CUDA15 (using CuPy16), Cython8, Numba10, Transonic17 and Dask18 (Extended Data Figs. 1–5). Although these implementations provide numerically identical outputs for the same input, their computational performance differs because they exploit different computational strategies.
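To make the notion of interchangeable implementations concrete, the following minimal sketch (written for this text, not taken from the NanoPyx source; both function names are hypothetical) shows two implementations of a 2D convolution that return numerically identical results while differing greatly in run time, which is exactly the property the Liquid Engine exploits:

```python
# Illustrative sketch only: two interchangeable implementations of the
# same task, mirroring the Liquid Engine requirement that every
# implementation of an algorithm return numerically identical results.
# Function names are hypothetical, not part of the NanoPyx API.
import numpy as np
from scipy import ndimage

def convolve2d_unthreaded(image, kernel):
    """Naive pure-Python loop, analogous to an unthreaded CPU strategy."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="symmetric")
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d_vectorized(image, kernel):
    """Optimized library call, standing in for a threaded or GPU strategy."""
    return ndimage.correlate(image, kernel, mode="reflect")

rng = np.random.default_rng(0)
image = rng.random((256, 256))
kernel = np.ones((5, 5)) / 25.0
# Same input, numerically identical output; only the run time differs.
assert np.allclose(convolve2d_unthreaded(image, kernel),
                   convolve2d_vectorized(image, kernel))
```

On typical hardware the pure-Python loop is orders of magnitude slower than the library call even though both outputs pass the np.allclose check; selecting among such equivalents automatically is precisely the Liquid Engine's role.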
The Liquid Engine features three main components: (1) metaprogramming tools for multihardware implementation (using Mako templates19 and a custom script, named c2cl; Supplementary Note 4); (2) an automatic benchmarking system; and (3) a supervisor machine learning-based agent that determines the ideal combination of implementations to maximize performance (Fig. 2). The Liquid Engine uses a machine learning system (Supplementary Note 5) to predict the optimal combination of implementations while accounting for device-dependent performance variations (Fig. 1 and Extended Data Figs. 1–5). When a user does not have access to one of the implementations, the Liquid Engine ignores it, guaranteeing that the user will always be able to process their images.

Dynamic benchmarking substantially enhances computational speed for tasks involving input data of varying size. This technique predicts when to switch between different algorithmic implementations, resulting in processing up to 24-fold faster than the pixel-wise parallelization strategy (CPU threaded; Fig. 2). Even when compared with running both methods on a GPU, dynamic implementation selection still provides a 1.75-fold acceleration (Supplementary Table 5).

Fig. 2 | NanoPyx achieves optimal performance by exploiting the Liquid Engine's self-optimization capabilities. a, NanoPyx is built on top of the Liquid Engine, which automatically benchmarks implementations of all tasks in a specific workflow. The Liquid Engine retains a historical record of the run times of each task and input used, allowing a machine learning-based agent to select the fastest combination of implementations. b, The Liquid Engine dynamically chooses the fastest implementation for each method, based on its input parameters. For a workflow performing denoising on a 1,000 × 1,000 image using NLM11,12 (patch distance 50 pixels, patch size 50 pixels, sigma 1.0 and cut-off distance (h) 0.1), followed by super-resolution of the data with eSRRF21 (magnification ×5, radius 1.5, sensitivity 1 and using intensity weighting), the Liquid Engine selects the fastest combination of implementations to substantially reduce run times.

The Liquid Engine maintains a historical record of run times for each implementation. Manual benchmarking can be initiated by the user, prompting the Liquid Engine to profile the execution of each implementation and identify the fastest (Supplementary Table 5). The system uses fuzzy logic20 (Supplementary Note 6) to identify the benchmarked example with the most similar input properties, using it as a baseline for the expected execution time (Supplementary Table 5). This system enables NanoPyx to make instant decisions based on an initially limited set of records, progressively improving its performance as further data are obtained.
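This record-and-predict loop can be sketched in a few dozen lines of Python. The class below is a simplified illustration written for this text (the name MiniLiquidEngine and its methods are hypothetical and do not reflect the actual Liquid Engine API); a nearest-input-size heuristic stands in for the fuzzy-logic matching and machine learning-based agent described above:

```python
# Illustrative sketch of a Liquid Engine-style dispatcher: it records run
# times per implementation and input shape, then dispatches new inputs to
# the implementation predicted fastest from the most similar past record.
import time
import numpy as np

class MiniLiquidEngine:
    def __init__(self):
        self.implementations = {}   # name -> callable
        self.records = {}           # (name, input shape) -> list of run times

    def register(self, name, fn):
        self.implementations[name] = fn

    def _closest_key(self, name, shape):
        # Stand-in for fuzzy-logic matching: find the previously
        # benchmarked input shape closest in total size to the current one.
        keys = [k for (n, k) in self.records if n == name]
        if not keys:
            return None
        return min(keys, key=lambda k: abs(np.prod(k) - np.prod(shape)))

    def benchmark(self, image, repeats=3):
        # Manual benchmarking: time every implementation on this input.
        for name, fn in self.implementations.items():
            for _ in range(repeats):
                t0 = time.perf_counter()
                fn(image)
                self.records.setdefault((name, image.shape), []).append(
                    time.perf_counter() - t0)

    def run(self, image):
        # Predict the fastest implementation from the recorded run times on
        # the most similar input; fall back to the first registered one.
        best, best_time = next(iter(self.implementations)), float("inf")
        for name in self.implementations:
            key = self._closest_key(name, image.shape)
            if key is not None:
                avg = np.mean(self.records[(name, key)])
                if avg < best_time:
                    best, best_time = name, avg
        t0 = time.perf_counter()
        out = self.implementations[best](image)
        # Every run feeds back into the record, improving later predictions.
        self.records.setdefault((best, image.shape), []).append(
            time.perf_counter() - t0)
        return out

engine = MiniLiquidEngine()
engine.register("python_loop", lambda im: sum(float(v) for v in im.ravel()))
engine.register("numpy_sum", lambda im: float(im.sum()))
engine.benchmark(np.random.rand(100, 100))
result = engine.run(np.random.rand(512, 512))  # dispatched to predicted fastest
```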
Each time a workflow is scheduled to run, a supervisor agent is responsible for selecting the best implementation based on previous run times; this selection is made without imposing any substantial overhead (Supplementary Table 5). When users do not trigger manual benchmarking, the agent uses 'factory-default' benchmarks until sufficient run times have been recorded on the user's hardware. The agent constantly monitors the run times of all available methods and can adapt to unexpected delays, ensuring that the optimal implementation is selected. When a severe delay is detected, the agent predicts whether the optimal implementation has changed and calculates the likelihood of that delay being repeated in the future (Extended Data Fig. 6). Over the course of several sequential runs of the same method, we show that delay management improved average run time by a factor of 1.8 for a two-dimensional (2D) convolution and by a factor of 1.5 for an enhanced super-resolution radial fluctuations (eSRRF)21 analysis (Extended Data Fig. 7).

NanoPyx enhances and expands the super-resolution analysis methods previously included in the NanoJ21–24 plugin family and introduces additional bioimage analysis techniques, including example testing datasets (Supplementary Note 7). Extended Data Fig. 8 illustrates an example workflow in which NanoPyx starts by performing drift correction. NanoPyx then allows super-resolution reconstruction using SRRF23 or its improved version, eSRRF21. Next, quality assessment is performed by running Fourier ring correlation25 and decorrelation analysis26, and by calculating a SQUIRREL error map24. Besides the aforementioned methods, NanoPyx also includes channel registration (Extended Data Fig. 9), multiple interpolators, 2D convolution, denoising through NLM11,12 and several other bioimage analysis methods (Supplementary Table 6). Although not all of these methods yet exploit the advantages of the Liquid Engine (Supplementary Table 6), we are actively developing new parallelization strategies for the remaining methods.

NanoPyx is accessible as a Python library, which can be installed via either the Python Package Index or our GitHub repository (Supplementary Table 5). The Liquid Engine is also available as a standalone Python package that is readily integrated into other projects. Alongside these Python libraries, we provide cookiecutter (https://cookiecutter.readthedocs.io) template files to help developers implement their own methods using the Liquid Engine (Supplementary Note 8). Second, we provide Jupyter notebooks27 (Supplementary Fig. 1a and Supplementary Table 5); users of these notebooks are not required to interact with any code directly, because a graphical user interface is generated28. Lastly, we developed a plugin for napari7, a Python image viewer (Supplementary Fig. 1b). By offering these three distinct user interfaces, we ensure that NanoPyx can be readily utilized by users irrespective of their coding proficiency. In NanoPyx's repository, we provide usage guidelines for end users along with several tutorials, including videos (Supplementary Table 7), on how to run NanoPyx through any of its interfaces and how to implement new methods exploiting the Liquid Engine's optimization (Supplementary Note 8).

Looking ahead, a priority for NanoPyx is expanding support for emerging techniques such as artificial intelligence-assisted imaging and smart microscopes.
Because these methods involve processing data in real time during acquisition, NanoPyx's accelerated performance becomes critical. In addition, we aim to incorporate more diverse processing workflows beyond the currently implemented methods. Cumulatively, NanoPyx delivers adaptive performance optimization to accelerate bioimage analysis while retaining a modular design and easy adoption. This flexible framework is important and timely, given the expanding volumes of microscopy data and the need for data-driven reactive microscopy. The optimization principles embodied in its Liquid Engine can be extended to other scientific workloads requiring high computational efficiency. As data scales expand, NanoPyx offers researchers an actively improving platform to execute demanding microscopy workflows.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-024-02562-6.

References
1. Rust, M. J., Bates, M. & Zhuang, X. Stochastic optical reconstruction microscopy (STORM) provides sub-diffraction-limit image resolution. Nat. Methods 3, 793–795 (2006).
2. Bates, M., Huang, B. & Zhuang, X. Super-resolution microscopy by nanoscale localization of photo-switchable fluorescent probes. Curr. Opin. Chem. Biol. 12, 505–514 (2008).
3. Hell, S. W. & Wichmann, J. Breaking the diffraction resolution limit by stimulated emission: stimulated-emission-depletion fluorescence microscopy. Opt. Lett. 19, 780–782 (1994).
4. Guerra, J. M. Super-resolution through illumination by diffraction-born evanescent waves. Appl. Phys. Lett. 66, 3555–3557 (1995).
5. Schindelin, J., Rueden, C. T., Hiner, M. C. & Eliceiri, K. W. The ImageJ ecosystem: an open platform for biomedical image analysis. Mol. Reprod. Dev. 82, 518–529 (2015).
6. Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).
7. Sofroniew, N. et al. napari: a multi-dimensional image viewer for Python. Zenodo https://doi.org/10.5281/zenodo.7276432 (2022).
8. Behnel, S. et al. Cython: the best of both worlds. Comput. Sci. Eng. 13, 31–39 (2011).
9. Kloeckner, A. et al. PyOpenCL. Zenodo https://doi.org/10.5281/zenodo.7063192 (2022).
10. Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. In Proc. Second Workshop on the LLVM Compiler Infrastructure in HPC 1–6 (ACM, 2015); https://doi.org/10.1145/2833157.2833162
11. Buades, A., Coll, B. & Morel, J.-M. Non-local means denoising. Image Process. Line 1, 208–212 (2011).
12. Darbon, J., Cunha, A., Chan, T. F., Osher, S. & Jensen, G. J. Fast nonlocal filtering applied to electron cryomicroscopy. In Proc. 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro 1331–1334 (IEEE, 2008); https://doi.org/10.1109/ISBI.2008.4541250
13. Catmull, E. & Rom, R. A class of local interpolating splines. In Computer Aided Geometric Design (eds Barnhill, R. E. & Riesenfeld, R. F.) 317–326 (Academic Press, 1974); https://doi.org/10.1016/B978-0-12-079050-0.50020-5
14. Stone, J. E., Gohara, D. & Shi, G. OpenCL: a parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 12, 66–73 (2010).
15. CUDA Toolkit – free tools and training. NVIDIA Developer https://developer.nvidia.com/cuda-toolkit
16. Okuta, R., Unno, Y., Nishino, D., Hido, S. & Loomis, C. CuPy: a NumPy-compatible library for NVIDIA GPU calculations. In Proc. Workshop on Machine Learning Systems (LearningSys) at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017).
17. fluiddyn/transonic: make your Python code fly at transonic speeds! GitHub https://github.com/fluiddyn/transonic (2019).
18. Rocklin, M. Dask: parallel computation with blocked algorithms and task scheduling. In Proc. 14th Python in Science Conference (SciPy) 126–132 (2015); https://doi.org/10.25080/Majora-7b98e3ed-013
19. Bayer, M. Mako: templates for Python. BibSonomy www.bibsonomy.org/bibtex/aa47d818a1c2f889b7456117003b3d42 (2012).
20. Novák, V., Perfilieva, I. & Močkoř, J. Mathematical Principles of Fuzzy Logic (Springer, 1999); https://doi.org/10.1007/978-1-4615-5217-8
21. Laine, R. F. et al. High-fidelity 3D live-cell nanoscopy through data-driven enhanced super-resolution radial fluctuation. Nat. Methods 20, 1949–1956 (2023).
22. Laine, R. F. et al. NanoJ: a high-performance open-source super-resolution microscopy toolbox. J. Phys. D Appl. Phys. 52, 163001 (2019).
23. Gustafsson, N. et al. Fast live-cell conventional fluorophore nanoscopy with ImageJ through super-resolution radial fluctuations. Nat. Commun. 7, 12471 (2016).
24. Culley, S. et al. Quantitative mapping and minimization of super-resolution optical imaging artifacts. Nat. Methods 15, 263–266 (2018).
25. Nieuwenhuizen, R. P. J. et al. Measuring image resolution in optical nanoscopy. Nat. Methods 10, 557–562 (2013).
26. Descloux, A., Grußmayer, K. S. & Radenovic, A. Parameter-free image resolution estimation based on decorrelation analysis. Nat. Methods 16, 918–924 (2019).
27. Kluyver, T. et al. Jupyter Notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas 87–90 (IOS Press, 2016); https://doi.org/10.3233/978-1-61499-649-1-87
28. Haase, R., Bragantini, J. & Amsalem, O. haesleinhuepf/stackview: 0.6.2. Zenodo https://doi.org/10.5281/zenodo.7847336 (2023).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2025

Methods

Mammalian cell culture
Human umbilical vein endothelial cells (HUVEC) (PromoCell, catalog no. C-12203) were grown in endothelial cell growth medium (PromoCell, catalog no. C-22010) with a supplementary mix (PromoCell, catalog no. C-39215) and 1% penicillin/streptomycin (Sigma) (Fig. 1).
Endothelial primary cells from P0 (commercial vial) were expanded to a P3 stock frozen at −80 °C to standardize the experimental replicates. A549 cells (The European Collection of Authenticated Cell Cultures) were cultured in phenol red-free, high-glucose, L-glutamine-containing DMEM (Thermo Fisher Scientific), supplemented with 10% (v/v) fetal bovine serum (Sigma) and 1% (v/v) penicillin/streptomycin (Thermo Fisher Scientific), at 37 °C in an incubator with 5% CO2 (Extended Data Fig. 8).

Sample preparation for microscopy
HUVEC were seeded in an eight-well, glass-bottom µ-slide (Ibidi, catalog no. 80807) precoated with warm endothelial cell growth medium without antibiotics (50,000 cells per well). Cells were then grown for 48 h, fixed with prewarmed 4% paraformaldehyde in PBS (Thermo Fisher Scientific, catalog no. 28908) for 10 min at 37 °C and stained with DAPI. A549 cells were seeded on an eight-well, glass-bottom µ-slide (Ibidi) at a density of 0.05–0.10 × 10⁶ cells cm⁻². Following 24 h of incubation at 37 °C under 5% CO2, cells were washed once with PBS and fixed for 20 min at 23 °C in 4% paraformaldehyde in PBS. Following fixation, cells were washed three times in PBS (5 min each), quenched for 10 min in a solution of 300 mM glycine (in PBS) and permeabilized using a solution of 0.2% Triton-X (in PBS) for 20 min at 23 °C. Following three washes (5 min each) in washing buffer (0.05% Tween-20 in PBS), cells were blocked for 30 min in blocking buffer (5% BSA and 0.05% Tween-20 in PBS). Samples were then incubated with a mix of anti-α-tubulin antibodies (1 µg ml−1 clone DM1A (Sigma), 2 µg ml−1 clone 10D8 (BioLegend) and 2 µg ml−1 clone AA10 (BioLegend)) and anti-septin 7 antibody (1 µg ml−1, catalog no. 18991, IBL) for 16 h at 4 °C in blocking buffer. Following three washes (5 min each) in washing buffer, cells were incubated with Alexa Fluor 647-conjugated goat anti-mouse IgG and Alexa Fluor 555-conjugated goat anti-rabbit IgG (6 µg ml−1 in blocking buffer) for 1 h at 23 °C. Cell nuclei were counterstained with Hoechst 33342 (1 µg ml−1). Cells were then washed three times (5 min each) in washing buffer and once in 1× PBS for 10 min. Finally, cells were mounted using glucose oxidase and β-mercaptoethylamine (50 mM Tris, 10 mM NaCl, pH 8.0, supplemented with 50 mM β-mercaptoethylamine, 10% (w/v) glucose, 0.5 mg ml−1 glucose oxidase and 40 µg ml−1 catalase).

Data acquisition
HUVEC were imaged using a Marianas spinning-disk confocal microscope equipped with a Yokogawa CSU-W1 scanning unit on an inverted Zeiss Axio Observer Z1 microscope, controlled by SlideBook 6 (Intelligent Imaging Innovations) (Fig. 1). Images were acquired using an Evolve 512 EMCCD camera (chip size 512 × 512; Photometrics); the objective used was an M27 ×63/1.4 numerical aperture (NA), oil-immersion objective (Plan-Apochromat). Data acquisition was also performed with a Nanoimager microscope (Oxford Nanoimaging) equipped with an Olympus ×100/1.45 NA oil-immersion objective (Extended Data Fig. 8). Imaging was performed using 405-, 488- and 640-nm lasers for Hoechst 33342, Alexa Fluor 555 and Alexa Fluor 647 excitation, respectively. Fluorescence was detected using an sCMOS camera (ORCA-Flash, 16 bit). For channel 0, a dichroic filter with bands of 498–551 and 576–620 nm was used and, for channel 1, a 665–705-nm dichroic filter. Sequential multicolor acquisition was performed for Alexa Fluor 647, Alexa Fluor 555 and Hoechst 33342.
Using epifluorescence illumination, a pulse of high laser power (90%) of the 640-nm laser was applied and 10,000 frames were immediately acquired. The sample was then excited with the 488-nm laser (13.7% laser power) for acquisition of 500 frames, followed by 405-nm laser excitation (40% laser power) for acquisition of a further 500 frames. For all acquisitions, an exposure time of 10 ms was used.

Liquid Engine agent
Run times of methods implemented in NanoPyx through the Liquid Engine are stored locally in the user's home folder inside a folder titled .liquid_engine. For OpenCL implementations, the agent also stores an identification of the device and can detect hardware changes. Whenever a method is run through the Liquid Engine, the overseeing agent reads the 50 most recent recorded run times. If there are fewer than 50 recorded runs but more than three, the agent proceeds with the available recorded runs. However, if there are fewer than three recorded runs, all Liquid Engine methods revert to default benchmarks that can be either supplied with the package or defined by the user.

For each implementation, the agent then divides the available run times into two sets of equal length, one containing the fastest run times and the other the slowest. We then calculate the average and standard deviation of both sets, namely FastAverage, FastStdDev, SlowAverage and SlowStdDev (equations (1)–(5)). This split in run times helps identify the start or end of a delay. Comparison against the set of fastest run times ensures that previously delayed run times do not skew the normal average run time. The set of slowest run times, although not guaranteed to resemble a delayed run time exactly, helps estimate a lower bound on what a higher-than-average run time could look like.

Once the method has finished running, the agent checks whether there was a delay. A delayed implementation is defined as one whose measured run time exceeds the recorded average run time of the fastest runs plus four times the standard deviation of the fastest runs (equation (1)). If a delay is detected (Extended Data Fig. 6), the agent also calculates the delay factor (equation (2)) and activates a probabilistic approach that stochastically selects which implementation to run. This is performed using a logistic regression model that calculates the probability of the delay being present on the next run (P_delay), and by adjusting the expected run time of the delayed implementation according to equation (3), while still using FastAverage for all nondelayed implementations. The agent then picks which implementation to use based on probabilities assigned to each implementation k (equation (4)), proportional to the inverse square of its adjusted run time and normalized over the (adjusted or fast-average) run times of all implementations. This stochastic approach ensures that the agent still runs the delayed implementation from time to time to check whether that delay is still present.

During a subsequent run, the agent evaluates whether the delay persists. It considers the delay over when the measured run time is either (1) lower than the slow average minus the standard deviation of the slowest runs, or (2) lower than the fast average plus the standard deviation of the fastest runs (equation (5)). Once the delay is over, the agent reverts to selecting which implementation to use based on the fast average of each implementation (as shown in Extended Data Figs. 6 and 7).

$$\text{Delay} = \text{True} \quad \text{if} \quad \text{MeasuredRunTime} > \text{FastAverage} + 4 \times \text{FastStdDev} \tag{1}$$

$$\text{DelayFactor} = \frac{\text{MeasuredRunTime}}{\text{FastAverage}} \tag{2}$$

$$\text{AdjustedRunTime} = \text{FastAverage} \times (1 - P_{\text{delay}}) + \text{FastAverage} \times \text{DelayFactor} \times P_{\text{delay}} \tag{3}$$

$$P_{\text{RunTime}_k} = \frac{1 / \text{AdjustedRunTime}_k^{2}}{\sum_j 1 / \text{AdjustedRunTime}_j^{2}} \tag{4}$$

$$\text{Delay} = \text{False} \quad \text{if} \quad (\text{MeasuredRunTime} < \text{SlowAverage} - \text{SlowStdDev}) \lor (\text{MeasuredRunTime} < \text{FastAverage} + \text{FastStdDev}) \tag{5}$$
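The decision rules in equations (1)–(5) can be expressed directly in Python. The sketch below is an illustration written for this text rather than the Liquid Engine source; in particular, p_delay is passed in as a fixed probability, whereas the actual agent estimates it with a logistic regression model:

```python
# Minimal sketch of the delay-management rules in equations (1)-(5).
import numpy as np

def split_stats(run_times):
    """Split run times into fastest/slowest halves; return
    (FastAverage, FastStdDev, SlowAverage, SlowStdDev)."""
    s = np.sort(run_times)
    half = len(s) // 2
    fast, slow = s[:half], s[half:]
    return fast.mean(), fast.std(), slow.mean(), slow.std()

def is_delayed(measured, fast_avg, fast_std):
    # Equation (1): delay if run time exceeds FastAverage + 4 x FastStdDev.
    return measured > fast_avg + 4 * fast_std

def adjusted_run_time(measured, fast_avg, p_delay):
    # Equations (2) and (3): blend normal and delayed expectations.
    delay_factor = measured / fast_avg
    return fast_avg * (1 - p_delay) + fast_avg * delay_factor * p_delay

def selection_probabilities(expected_run_times):
    # Equation (4): probability proportional to 1 / run_time^2.
    weights = 1.0 / np.asarray(expected_run_times) ** 2
    return weights / weights.sum()

def delay_over(measured, fast_avg, fast_std, slow_avg, slow_std):
    # Equation (5): delay over once run time drops below either threshold.
    return (measured < slow_avg - slow_std) or (measured < fast_avg + fast_std)

# Example: a GPU implementation that suddenly slows down.
rng = np.random.default_rng(1)
gpu_history = rng.normal(0.5, 0.05, 50)        # seconds per run
cpu_history = rng.normal(2.0, 0.10, 50)
fa, fs, sa, ss = split_stats(gpu_history)
measured = 5.0                                  # a severe delay on the GPU
if is_delayed(measured, fa, fs):
    expected = [adjusted_run_time(measured, fa, p_delay=0.9),
                cpu_history.mean()]             # [delayed GPU, CPU]
    probs = selection_probabilities(expected)   # now strongly favors the CPU
    choice = rng.choice(["gpu", "cpu"], p=probs)
```

Because the selection remains probabilistic rather than deterministic, the delayed implementation is still chosen occasionally, which is what lets the agent detect when the delay has resolved.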
Benchmarking run times
For laptop benchmarks, a MacBook Air M1 Pro with 16 GB of random-access memory (RAM) and a 512-GB solid-state drive (SSD) was used. For the professional workstation, a custom-built desktop computer was used, containing an Intel i9-13900K CPU, an NVIDIA RTX 4090 GPU with 24 GB of dedicated video memory, a 1-TB SSD and 128 GB of DDR5 RAM. The first benchmark performed (Fig. 1 and Extended Data Fig. 2) was a fivefold upsampling of the input data using a Catmull–Rom13 interpolator. Benchmarks were performed on input images with shapes of 1 × 10 × 10, 10 × 10 × 10, 10 × 100 × 100, 10 × 300 × 300, 100 × 300 × 300 and 500 × 300 × 300 (time points × height × width). The second benchmarks (Extended Data Fig. 1) were nonlocal means denoising performed on images of 200 × 200, 500 × 500 and 1,000 × 1,000 pixels using, respectively, 10, 100 and 50 as patch distance, with varying patch size (5, 10, 20, 50 and 100). The third benchmarks (Extended Data Figs. 2–5) were 2D convolutions using kernels of varying size (1, 5, 9, 13, 17 and 21), where all kernel values are 1, on images of varying size (100, 500, 1,000, 2,500, 5,000, 7,500, 10,000, 15,000 or 20,000 pixels in both dimensions). Supplementary Tables 1–4 describe the ten different hardware set-ups used for benchmarking three different conditions of 2D convolution, Catmull–Rom interpolation and nonlocal means denoising.

Benchmarking delay management
To evaluate the Liquid Engine's delay management capabilities, we benchmarked its performance on 2D convolutions and eSRRF reconstructions under induced delay conditions. The hardware used was a high-end desktop with an Intel i9-13900K CPU, an NVIDIA RTX 4090 GPU, 128 GB of DDR5 RAM and a 1-TB SSD. For the 2D convolution task, we applied a 9 × 9 kernel to 6,000 × 6,000-pixel random images. To simulate a delay, we used a separate Python process that allocated >24 GB of GPU memory for irrelevant computations, thus overloading the GPU. We executed 400 sequential convolutions, introducing an artificial delay during convolutions 101–200, and compared run times with and without Liquid Engine optimization enabled. Similarly, for eSRRF, we reconstructed a 100 × 100 × 100-pixel random volume with parameters magnification = 5, radius = 1.5 and sensitivity = 1. An artificial delay was induced on reconstructions 51–100 of a total of 200. Run times were again collected and analyzed with the Liquid Engine on and off. In both tasks, the Liquid Engine detected the abnormal delay during the overloaded period based on run time spikes; it then switched its implementation preference probabilistically to avoid using the delayed GPU code.
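A benchmark sweep of the kind described under 'Benchmarking run times' can be reproduced in outline with standard tools. The sketch below, assuming scipy.ndimage.convolve as a stand-in for a single NanoPyx implementation (it does not call NanoPyx itself), times a 2D convolution across a subset of the image and kernel sizes listed above:

```python
# Sketch of a run-time sweep over image and kernel sizes, using scipy's
# ndimage.convolve as a stand-in for one implementation; swapping in other
# implementations yields per-condition comparisons like Extended Data Figs. 2-5.
import time
import numpy as np
from scipy import ndimage

image_sizes = [100, 500, 1000, 2500]   # subset of the sizes listed above
kernel_sizes = [1, 5, 9, 13, 17, 21]   # all kernel values set to 1

results = {}
for n in image_sizes:
    image = np.random.rand(n, n).astype(np.float32)
    for k in kernel_sizes:
        kernel = np.ones((k, k), dtype=np.float32)
        t0 = time.perf_counter()
        ndimage.convolve(image, kernel, mode="reflect")
        results[(n, k)] = time.perf_counter() - t0

for (n, k), dt in sorted(results.items()):
    print(f"image {n}x{n}, kernel {k}x{k}: {dt:.4f} s")
```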
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The datasets used in the figures are either listed in Supplementary Table 8 or available for download via Zenodo at https://zenodo.org/record/8318395 (ref. 29). Source data are provided with this paper.

Code availability
The NanoPyx Python library and Jupyter notebooks can be found in our GitHub repository (https://github.com/HenriquesLab/NanoPyx). The Liquid Engine Python library can be found in our GitHub repository (https://github.com/HenriquesLab/LiquidEngine). The cookiecutter templates can be found in their own GitHub repository (https://github.com/HenriquesLab/LiquidEngineCookieCutter). The napari plugin implementing all NanoPyx methods can be found in a separate GitHub repository (https://github.com/HenriquesLab/napari-NanoPyx).

References
29. Saraiva, B. M. et al. NanoPyx – figures' data. Zenodo https://doi.org/10.5281/zenodo.8318394 (2023).
30. Pylvänäinen, J. W. et al. Fast4DReg – fast registration of 4D microscopy datasets. J. Cell Sci. 136, jcs260728 (2023).

Acknowledgements
We thank the previous developers of the NanoJ framework, whose work inspired this study. In addition, we thank L. Royer and J. Nunez-Iglesias for their invaluable feedback and guidance in preparing our work. R. Henriques, P.M.P. and R.P. acknowledge support from the LS4FUTURE Associated Laboratory (no. LA/P/0087/2020). R. Henriques, B.M.S. and I.C. acknowledge the support of the Gulbenkian Foundation (Fundação Calouste Gulbenkian); the European Research Council under the European Union's Horizon 2020 research and innovation program (grant agreement no. 101001332); the European Commission through the Horizon Europe program (AI4LIFE project, grant agreement no. 101057970-AI4LIFE, and RT-SuperES project, grant agreement no. 101099654-RT-SuperES); the European Molecular Biology Organization Installation Grant (no. EMBO2020-IG-4734); and the Chan Zuckerberg Initiative Visual Proteomics Grant (no. vpi-0000000044; https://doi.org/10.37921/743590vtudfp). In addition, A.D.B. acknowledges the FCT 2021.06849.BD fellowship. R. Henriques and B.M.S. also acknowledge that this project has been made possible in part by a grant from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (Chan Zuckerberg Initiative napari Plugin Foundations Grants Cycle 2, no. NP2-0000000085). P.M.P. and R.P. acknowledge support from Fundação para a Ciência e Tecnologia (Portugal) project grant no. PTDC/BIA-MIC/2422/2020 and the MOSTMICRO-ITQB R&D Unit (nos. UIDB/04612/2020 and UIDP/04612/2020). P.M.P. acknowledges support from a La Caixa Junior Leader Fellowship (no. LCF/BQ/PI20/11760012), financed by 'la Caixa' Foundation (ID 100010434) and the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie grant agreement no. 847648, and from a Maratona da Saúde award. This study was supported by the Academy of Finland (no. 338537 to G.J.), the Sigrid Jusélius Foundation (to G.J.), the Cancer Society of Finland (Syöpäjärjestöt, to G.J.) and Solutions for Health strategic funding to Åbo Akademi University (to G.J.). This research was supported by the InFLAMES Flagship Program of the Academy of Finland (decision no. 337531).

Author contributions
B.M.S., P.M.P., G.J. and R. Henriques conceived the study in its initial form. B.M.S., I.C., A.D.B. and R. Henriques developed the NanoPyx framework, with code contributions from R. Haase and G.J. B.M.S., I.C., A.D.B. and R. Henriques designed the Liquid Engine optimization approach. B.M.S., I.C.
and A.D.B. implemented the Liquid Engine tools. G.F., R.P., P.M.P. and G.J. provided samples, data, critical feedback, testing and guidance. B.M.S., I.C., A.D.B., G.F. and G.J. performed experiments and analysis. B.M.S., P.M.P., G.J. and R. Henriques acquired funding. B.M.S., P.M.P., R. Haase, G.J. and R. Henriques supervised the work. B.M.S., I.C., A.D.B., G.J. and R. Henriques wrote the manuscript, with input from all authors.

Competing interests
The authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41592-024-02562-6.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41592-024-02562-6.

Correspondence and requests for materials should be addressed to Ricardo Henriques.

Peer review information Nature Methods thanks Christian Tischer and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Rita Strack, in collaboration with the Nature Methods team.

Reprints and permissions information is available at www.nature.com/reprints.

Extended Data Fig. 1 | Run times of non-local means denoising are dependent on their implementation. The fastest run time for non-local means denoising changes according to the parameters defined. The pixel-wise implementation thrives when both the patch distance and patch size are relatively small. The patch-wise implementation is virtually independent of the patch size and, although it has higher memory costs, its computational efficiency makes it an attractive option for bigger patch sizes.

Extended Data Fig. 2 | Ratio between the run times of OpenCL and other implemented run types. Run times of a 5× Catmull–Rom13 interpolation were measured across multiple input data sizes using either a MacBook Air M1 (A), a professional workstation (B) or Google Colaboratory (C). Using the fastest implementation can lead to up to 10× faster code execution. Areas within dashed lines correspond to kernel and image sizes where OpenCL is faster than other implementations.

Extended Data Fig. 3 | Ratio between the run times of a 2D convolution. Run times were measured across multiple input data sizes and kernel sizes using either a MacBook Air M1 (A) or a professional workstation (B). Areas within dashed lines correspond to kernel and image sizes where OpenCL is faster than threaded CPU.

Extended Data Fig. 4 | Run time of each implementation is dependent on the shape of input data. A 2D convolution was performed on images of increasing size using either a MacBook Air M1 Pro (A) or a professional workstation (B). A 21 × 21 kernel was used for the laptop and a 5 × 5 kernel for the workstation. The run times of each implementation vary according to the size of the input image. Bottom panels correspond to zoomed-in windows of the top panels, indicated by dotted boxes.
Extended Data Fig. 5 | Kernel size impacts which implementation is the fastest. A 2D convolution was performed on images with varying kernel sizes, ranging from 1 to 21 (every 4), using either a MacBook Air M1 Pro on a 500 × 500 image (A) or a professional workstation on a 2,500 × 2,500 image (B). While unthreaded is virtually always the slowest implementation, the threaded implementations are only the fastest for small kernel sizes, after which a GPU-based implementation becomes the fastest. Bottom panels correspond to zoomed-in windows of the top panels, indicated by dotted boxes.

Extended Data Fig. 6 | Schematic of the Agent decision making for delay management. The agent identifies delays when an implementation's run time exceeds the fastest average plus four standard deviations (equation (1)). Upon detection, it calculates a delay factor (equation (2)) and uses a probabilistic approach with logistic regression to adjust run times (equation (3)) and select implementations stochastically (equation (4)). This ensures that delayed implementations are periodically tested while favoring faster alternatives. A delay is considered resolved when the run time falls below the thresholds defined in equation (5), after which the agent reverts to selecting implementations based on their fast average run times.

Extended Data Fig. 7 | Example of delay management by the Liquid Engine. Multiple two-dimensional convolutions (A) and eSRRF analyses (B) were run sequentially on a professional workstation. Starting from two initial benchmarks, the Agent is responsible for informing the Liquid Engine of the best probable implementation. An artificial delay was induced by overloading the GPU with superfluous calculations in a separate Python interpreter. Dashed lines represent average run times.

Extended Data Fig. 8 | Microscopy image processing workflow using NanoPyx methods. Through NanoPyx, users can correct drift, generate a super-resolved image using eSRRF21 and assess the quality of the generated image using Fourier ring correlation (FRC)25, image decorrelation analysis26 and NanoJ-SQUIRREL24 metrics. NanoPyx methods are made available as a Python library, as Jupyter notebooks that can be run locally or through Google Colaboratory, and as a napari plugin. Scale bars, 10 µm.

Extended Data Fig. 9 | Example channel registration of a calibration slide. NanoPyx allows users to perform channel registration based on the NanoJ22 implementation. Example data of a calibration slide obtained from the freely available data in the Fast4DReg30 publication.