A joke you've likely heard before: "80 percent of time is spent preparing data. The other 20 percent is spent complaining about preparing the data."
Ask a data scientist if they’ve heard that joke and the answer will probably be “yes.”
Why? Because for most organizations, managing the massive amount of unstructured data necessary to make machine learning (ML) a valuable tool is a major hurdle.

While the public cloud has changed the ML landscape in many ways, the most common roadblocks organizations encounter when adopting ML are still:
- High set-up costs, including tools, expertise, and storage
- Siloed data that limits access to those who need it
- Complex and fragmented tools that get in the way of exploring data
- Deployment complexity that leads to difficulties in putting ML models into production
Overcoming these roadblocks takes more than the data needed to run ML models effectively. It also requires specific organizational resources and skills to identify and implement a solution for each challenge.
 
This includes, at the very least:
- A business stakeholder to drive the ML adoption process
- A data analyst to identify insights from the ML model and make them actionable
- A data engineer to store, clean, and manipulate data
- A data platform architect to build the data environment
- A data scientist to experiment with and construct ML models
In addition, you’ll find that tools for inspecting and visualizing data, such as Jupyter Notebook or pandas, can be invaluable during the process.
Getting your data ready for ML

ML all starts with the data. You may be spending 60 or 70 percent of your time on this initial data preparation, so it’s important to get it right from the start. There are four stages to readying your data:
Stage 1—Business assessment
Before you start looking for ML solutions, you need to understand your business objectives. Are you looking for customer insights, trend forecasts, or organizational efficiencies? Knowing what you want to accomplish will help you narrow down the pools of data your ML analysis swims in.
Stage 2—Develop and proof
Next, collect and catalog your data and assemble it into an accessible environment, such as a data warehouse platform. This includes cleaning the data so it’s high quality and filling any gaps. Then you can develop a proof-of-concept (POC) ML model on a small amount of data to verify that it produces useful results.
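As a rough illustration of the cleaning and sampling steps described above, here is a minimal Python sketch using only the standard library. The record layout, the field name "spend", and the helper names clean_records and poc_sample are all hypothetical, and filling gaps with the median is just one common choice, not a recommendation for every dataset.

```python
import random
import statistics


def clean_records(records, numeric_field):
    """Drop exact duplicate records and fill missing numeric values.

    Assumption: records are flat dicts with hashable values, and a
    missing value is represented as None.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # order-independent fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the input is not mutated

    # Fill gaps with the median of the values that are present.
    observed = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill_value = statistics.median(observed)
    for r in unique:
        if r[numeric_field] is None:
            r[numeric_field] = fill_value
    return unique


def poc_sample(records, fraction=0.1, seed=0):
    """Return a small, reproducible random subset for a POC model."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)
```

In practice a library such as pandas would usually handle this at scale. The point is that the POC runs on a small, cleaned subset before any larger investment.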
Stage 3—Pilot
Once you’ve tested your POC model, it’s time to integrate that model into your processes and tools. This involves running a side-by-side pilot with your existing analytics process and your new ML model, then comparing the effectiveness of each. If your ML model delivers better results, you’re ready to move on.
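One way to picture the side-by-side comparison is to score both approaches against the same actual outcomes. The sketch below assumes a numeric forecasting task and uses mean absolute error as the yardstick. Both are illustrative assumptions, as is the helper name compare_pilot; your own pilot may call for a different metric.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def compare_pilot(actual, baseline_predictions, model_predictions):
    """Score the existing process and the new ML model on the same data."""
    baseline_mae = mean_absolute_error(actual, baseline_predictions)
    model_mae = mean_absolute_error(actual, model_predictions)
    return {
        "baseline_mae": baseline_mae,
        "model_mae": model_mae,
        # Lower error is better for this metric.
        "model_is_better": model_mae < baseline_mae,
    }
```

A single metric rarely tells the whole story, so a real pilot would likely track several measures over a representative period before deciding.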
Stage 4—Production
With your pilot tests complete, it’s time to put your ML model into production. That means full integration, deployment, and then continuous improvement and refinement.
ML workflow and processes
ML development has a cycle: data preparation, data science, building models, testing and QA, and validation.
To successfully scale this cycle to multiple teams and hundreds of models, you need a workflow that is automated and uses DevOps-like practices in order to make quick iterations.
This means creating an operating model that encourages ongoing communication between your data scientists and engineers. That communication not only keeps both teams working in concert (a key component of successfully moving ML models into production) but also gives you visibility into what each group is doing at all times.
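To make the idea of an automated cycle more concrete, the following sketch chains a few stages so each one hands its output to the next and the run stops as soon as a stage fails. The stage functions are hypothetical placeholders (the "model" is just an average), and only three of the cycle's steps are shown. Real pipelines would typically rely on dedicated orchestration tooling.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_cycle(stages: List[Stage], payload: Any) -> Tuple[Any, List[str]]:
    """Run each stage in order, passing its output to the next stage."""
    completed: List[str] = []
    for name, step in stages:
        payload = step(payload)  # an exception here stops the run early
        completed.append(name)
    return payload, completed


# Hypothetical placeholder steps; real ones would call your own tooling.
def prepare(values):
    return [v for v in values if v is not None]


def build_model(values):
    # Stand-in "model": predict the mean of the prepared values.
    return {"values": values, "prediction": sum(values) / len(values)}


def validate(model):
    low, high = min(model["values"]), max(model["values"])
    if not low <= model["prediction"] <= high:
        raise ValueError("prediction outside the observed range")
    return model


STAGES: List[Stage] = [
    ("data preparation", prepare),
    ("building models", build_model),
    ("validation", validate),
]
```

Because every run follows the same ordered steps and records which ones completed, both data scientists and engineers can see where a given model stands.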
Regardless of how your internal operations take shape, it’s critical that you start your journey with small ML workflows. Pick a problem you want to address, create a single model, and move it all the way through to production.
Then, once that’s proven to be successful, you can build upon that success and gradually scale your ML workloads.
Want to get the most out of your unstructured data for technologies like ML and AI? Check out our free guide on managing and scaling your unstructured data through the hybrid cloud.