What We’ll Be Watching
On November 17th, Data + AI Summit heads overseas for three packed days of industry mind-melding. The summit covers a multitude of topics, most notably the rise and enterprise proliferation of the Apache Spark open-source community. Apache Spark is the modern data and analytics engine that unlocks machine learning, ETL, and stream processing capabilities, and it is putting data engineers in the driver’s seat of these business transformations. This event features over 100 breakout sessions spanning data engineering, machine learning, data science, and deep learning. To help you navigate the lengthy agenda, I have highlighted some of the sessions I am most excited about.
My Top 5 Session Picks for Data + AI Summit
1. Model Experiments Tracking and Registration using MLflow on Databricks
StreamSets Director of Platform and Technical Evangelism Dash Desai will be discussing the topics of MLOps and machine learning in this technical demonstration. Dash will show the design and process of real-time training, tracking and registering machine learning model experiments. A crucial aspect of effective machine learning is the acquisition and preparation of data. The complexity involved in this initial step often delays the delivery of machine learning projects and diminishes the opportunities for self-service. Join Dash to see how you can quickly design smart data pipelines for data ingestion as well as to create trackable, versioned machine learning model experiments on the best data set for your use case.
2. The Future of Finance with Barclays and ESG
In this session, data scientist and Finserv Technical Director Antoine Amend discusses how Barclays uses machine learning to extract the key ESG initiatives communicated in yearly PDF reports and compare them with the actual media coverage from news analytics data. For Barclays, both customer security and speed to market are key requirements for success. Join Antoine for a technical demo followed by a question and answer session.
3. The Pill for Your Migration Hell
Many companies are currently struggling to move to the cloud. Often it’s the big data systems and data warehouse environments that prove especially difficult to port. Roy Leven and Tomer Koren from Microsoft take you on their migration journey: migrating a machine learning-based product with thousands of paying customers that processes petabytes of network events a day. They will talk about their migration strategy and how they broke the system down into migratable parts, tested every piece of every pipeline, validated results, and overcame challenges.
4. Scaling Machine Learning with Apache Spark
Apache Spark 2.0 brought a wealth of added functionality to the engine, including Structured Streaming, the unified Dataset API, and DataFrames. Many data engineers have long anticipated the continued innovation ushered in by the Spark 3.0 release. In this session, Holly Smith and Niall Turbit explore the multitude of ways Spark can be used to scale machine learning applications. In particular, they cover distributed solutions for training and inference, distributed hyperparameter search, deployment issues, and new features for machine learning in Apache Spark 3.0.
5. Introducing MLflow for End-to-End Machine Learning on Databricks
MLflow on Databricks is an optimized way to track and version model training code, including hyperparameters and model evaluation metrics, and to register models. Top Apache Spark contributor Sean Owen will show us how health data can be used to predict life expectancy. The session will start with data engineering in Apache Spark and move on to data exploration, model tuning, and auto-logging with Hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and of simple deployment to production with MLflow as a job or REST endpoint. For more on how to easily leverage the MLflow service, check out this blog from StreamSets.
Apache Spark and StreamSets
StreamSets helps Apache Spark developers easily build powerful Apache Spark applications with no coding required and helps them operationalize existing Apache Spark development with full visibility and monitoring. Last November, StreamSets announced the release of Transformer, which opened up the accessibility of Apache Spark to every developer while de-risking the process of putting Apache Spark into production.
StreamSets Transformer is a data pipeline engine designed for any developer or data engineer to build ETL and ML pipelines that execute on Spark. It allows everyone to harness the power of Apache Spark by eliminating the need for Scala and PySpark coding. Transformer pipelines provide unparalleled visibility into the execution of Spark applications for easy troubleshooting, reducing the time to design and operate pipelines on Spark for developers of all skill levels.
The global pandemic has put a pause on in-person conferences and training sessions, events that we often look forward to for fresh, relevant training and a peek at where the industry is going. While the health and safety of ourselves and others is paramount, many data engineers are struggling to fill the gap left by conferences and keep advancing their knowledge. Luckily, some of these data conferences have been quick to move to digital platforms, allowing more people to attend and aiming to deliver the same level of technical and business content as an in-person data industry conference. Data+AI Summit EMEA will be a valuable multi-day event featuring the latest in machine learning and ETL practices. We hope you enjoy the show. Look out for our booth in the Dev Hub + Expo and drop by to see a demo, or contact us to schedule a demo at your convenience.