REPORT ON THE PAPER: APPLICATION OF ALGORITHMS FOR SOFTWARE CONFIGURATION

Applied Statistics for Mining Problems

Universidad Nacional de Ingeniería

MINING SCIENTIFIC WORKFLOWS FOR ANOMALOUS DATA TRANSFERS

ARBOLEDA GUIVAR WALTER; CORTEZ ORDOÑEZ JOSE; MAMANI SIMEON DENNIS; PEÑA CORREA ERICK; MATOS SILVA RUBENS

Abstract

Anomalies and failures during workflow execution cause a loss of scientific productivity and inefficient use of infrastructure. Detecting, diagnosing, and mitigating these anomalies is therefore essential for reliable and efficient scientific workflows. Because these workflows rely heavily on high-performance network transfers with strict QoS constraints, accurate detection of abnormal network performance is crucial to reliable and efficient workflow execution. To address this challenge, X-FLASH, a network anomaly detection tool for faulty TCP workflow transfers, has been developed. X-FLASH incorporates new approaches to data mining and hyperparameter tuning to improve how accurately machine learning algorithms classify anomalous TCP packets. It uses XGBoost as an ensemble model and combines it with a sequential optimizer, FLASH, borrowed from search-based software engineering, to learn the optimal parameters of the model. As a result, X-FLASH found configurations that outperformed the existing approach by up to 28%, 29%, and 40% (relative) for F-measure, G-score, and recall, respectively, in fewer than 30 evaluations. Given the large improvement and the simplicity of the tuning, we recommend that future research treat an additional tuning study as a new standard, at least in the area of scientific workflow anomaly detection.

Keywords: scientific workflow, TCP signatures, anomaly detection, hyperparameter tuning, sequential optimization

Scientific workflow management systems are often used to orchestrate and execute complex scientific applications on distributed, high-performance computing infrastructure.
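The tuning loop mentioned in the abstract (evaluate a few configurations, then let a cheap surrogate decide which one to evaluate next) can be illustrated with a minimal sketch. This is not the X-FLASH implementation: the names `toy_score` and `sequential_search`, the two-parameter grid, and the budget of 30 are all hypothetical choices made for illustration. A 1-nearest-neighbour predictor stands in for whatever surrogate model FLASH actually uses, and a toy objective stands in for training XGBoost and measuring its F-measure.

```python
import random


def toy_score(config):
    """Hypothetical stand-in for 'train the model, return its F-measure'."""
    depth, rate_step = config
    return -((depth - 6) ** 2) - ((rate_step - 3) ** 2)


def sequential_search(pool, evaluate, budget=30, n_init=10, seed=0):
    """Surrogate-guided search: evaluate a few configurations, predict the rest."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)

    # Step 1: spend part of the budget on a random initial sample.
    evaluated = {cfg: evaluate(cfg) for cfg in pool[:n_init]}
    remaining = pool[n_init:]

    def predict(cfg):
        # Assumed surrogate: reuse the score of the closest evaluated config.
        nearest = min(
            evaluated,
            key=lambda seen: sum((a - b) ** 2 for a, b in zip(seen, cfg)),
        )
        return evaluated[nearest]

    # Step 2: repeatedly evaluate the config the surrogate rates highest.
    while remaining and len(evaluated) < budget:
        candidate = max(remaining, key=predict)
        remaining.remove(candidate)
        evaluated[candidate] = evaluate(candidate)

    best = max(evaluated, key=evaluated.get)
    return best, evaluated[best], len(evaluated)


# Hypothetical 10 x 10 grid of (tree depth, learning-rate step) settings.
pool = [(d, r) for d in range(1, 11) for r in range(1, 11)]
best, score, n_evals = sequential_search(pool, toy_score)
print(best, score, n_evals)
```

Only 30 of the 100 candidate configurations are actually scored; the surrogate ranks the rest for free, which is where a sequential optimizer saves expensive model-training runs.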

1. INTRODUCTION

Modern scientific workflows are data-driven and often run on distributed, heterogeneous, high-performance computing infrastructures.

Orchestrating and managing data movement for scientific workflows within and across this diverse infrastructure landscape is challenging. The problem is compounded by different kinds of failures and anomalies that can span every level of these highly distributed infrastructures (hardware infrastructure, system software, middleware, networks, applications, and workflows). Such failures impose additional overhead on scientists that hinders or completely obstructs

Today, computational science is increasingly data-driven, leading to the development of complex, data-intensive applications that access and analyze large, distributed datasets produced by sensors and scientific instruments. Scientific workflows have emerged as a flexible representation for declaratively expressing such complex applications along with their data and control dependencies.
