Self-Optimizing Distributed Data Pipelines Using Reinforcement Learning

Authors

  • Harish Chava harishchava@meta.com

DOI:

https://doi.org/10.36676/jrps.v14.i5.1659

Keywords:

Adaptive optimization, self-optimizing pipelines, reinforcement learning, distributed data systems, intelligent ETL, cloud-native data processing, real-time telemetry, dynamic resource management, workload-aware scheduling.

Abstract

The hypergrowth of data in today's distributed systems has made it necessary to build smarter, self-optimizing data pipelines that respond dynamically to workload fluctuations, resource availability, and performance constraints. Existing pipeline optimization techniques rely on static rules or manual tuning, which neither scale to nor adapt within heterogeneous, high-throughput systems. Prior research has explored heuristics and cost models for pipeline optimization, but these approaches remain limited in responsiveness, in generalizability across a broad spectrum of workloads, and in their ability to learn from execution feedback over time.
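To make the idea concrete, the following is a minimal, purely illustrative sketch (in Python) of the kind of control loop the abstract describes: an epsilon-greedy Q-learning agent observes a coarse telemetry signal (workload level) and adjusts pipeline parallelism to trade latency against resource cost. The simulated environment, action set, and reward shaping below are hypothetical assumptions for illustration, not the implementation evaluated in the paper.

# Illustrative sketch only: a minimal epsilon-greedy Q-learning loop that tunes
# pipeline parallelism from observed telemetry. The environment, action set,
# and reward are hypothetical stand-ins, not the paper's system.
import random

ACTIONS = [2, 4, 8, 16]                  # candidate worker counts (assumed)
LOAD_LEVELS = ["low", "medium", "high"]  # discretized telemetry signal (assumed)

q = {(s, a): 0.0 for s in LOAD_LEVELS for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration

def observe_load():
    """Stand-in for real telemetry (queue depth, input rate, etc.)."""
    return random.choice(LOAD_LEVELS)

def run_pipeline(load, workers):
    """Simulated pipeline execution: returns (latency_seconds, cost_units)."""
    demand = {"low": 4, "medium": 8, "high": 16}[load]
    latency = demand / workers
    cost = 0.05 * workers
    return latency, cost

def reward(latency, cost):
    # Reward favors low latency while penalizing over-provisioning.
    return -(latency + cost)

state = observe_load()
for step in range(5000):
    if random.random() < epsilon:        # explore a random parallelism setting
        action = random.choice(ACTIONS)
    else:                                # exploit the best known setting
        action = max(ACTIONS, key=lambda a: q[(state, a)])

    latency, cost = run_pipeline(state, action)
    r = reward(latency, cost)
    next_state = observe_load()

    # Standard Q-learning update from the observed execution feedback.
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
    state = next_state

for s in LOAD_LEVELS:
    best = max(ACTIONS, key=lambda a: q[(s, a)])
    print(f"load={s}: learned parallelism={best}")

In this toy setting the agent converges toward higher parallelism for heavier loads and lower parallelism for light loads, which is the adaptive, feedback-driven behavior the abstract contrasts with static rules and manual tuning.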

Published

31-12-2023

How to Cite

Chava, H. (2023). Self-Optimizing Distributed Data Pipelines Using Reinforcement Learning. International Journal for Research Publication and Seminar, 14(5), 456–479. https://doi.org/10.36676/jrps.v14.i5.1659