Please use this identifier to cite or link to this item:
http://hdl.handle.net/10662/24343
Title: | A model-driven approach for systematic reproducibility and replicability of data science projects |
Authors: | Melchor González, Francisco Javier; Rodríguez Echeverría, Roberto; Conejero Manzano, José María; Prieto Ramos, Álvaro; Gutierrez Gallardo, Juan Diego |
Keywords: | Reproducibility; Replicability; Process; Data science; Model-driven engineering |
Issue Date: | 2022 |
Publisher: | Springer |
Abstract: | In the last few years, there has been a significant increase in the number of tools and approaches for defining pipelines that support the development of data science projects. These tools support not only pipeline definition but also generation of the code needed to execute the project, providing an easy way to carry out projects even for non-expert users. However, there are still some challenges that these tools do not yet address, e.g. executing pipelines defined with different tools or executing them in different environments (reproducibility and replicability), or validating and verifying models by identifying inconsistent operations (intentionality). To alleviate these problems, this paper presents a model-driven framework for defining data science pipelines independently of the particular execution platform and tools. The framework relies on separating the pipeline definition into two modelling layers: conceptual, where the data scientist may specify all the data and model operations to be carried out by the pipeline; and operational, where the data engineer may describe the details of the execution environment in which the operations (defined in the conceptual layer) will be implemented. Based on this abstract definition and layer separation, the approach allows: the use of different tools (even in the same pipeline), thus improving process replicability; the automation of process execution (by using code generation engines), enhancing process reproducibility; and the definition of model verification rules, providing intentionality restrictions. |
Description: | Published in: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_9 |
URI: | http://hdl.handle.net/10662/24343 |
ISBN: | 978-3-031-07472-1 |
DOI: | 10.1007/978-3-031-07472-1 |
Appears in Collections: | DIAYF - Artículos |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
978-3-031-07472-1_9_preprint.pdf | | 1.57 MB | Adobe PDF |
This item is licensed under a Creative Commons License