Data Products or getting models into Production

This is the first iteration of the slides I plan to give at PyData in Florence. I'll be discussing a case study of how I leveraged the ScienceOps platform to drive innovation and produce a novel data product for the air traffic industry. I'll discuss the challenges of data products too.

springcoil

April 17, 2015

Transcript

  1. Data Products, or how to get models into production

    PyData track at PyCon Italy, Friday 17th of April 2015. [email protected] All opinions my own.
  2. Who am I?

    I work as a Data Scientist for a large telecommunications company.
    Masters in Mathematics, specialized in Statistics and Machine Learning.
    Interned at Amazon.
    Was a consultant for a while.
    I've been an analytics product architect on one product.
    Occasional contributor to Pandas and other projects.
    @springcoil
  3. We can't agree what data science is

    I think a data scientist is someone with enough programming ability to leverage their mathematical skills and domain-specific knowledge to turn data into solutions. The solution should ideally be a product.
  4. To help the business most

    I believe that data science offers the most value when the models are in production. Some of us call this a 'Data Product'. In this talk I will explain how to use ScienceOps from Yhat to build a model in production. Why should Amazon or Google get all the fun? Or competitive advantage?
  5. The last mile problem

    Sean Taylor at Facebook calls this the 'last mile problem': how do you translate the insight into something people use?
  6. It is hard to incorporate data into day-to-day operations.
  7. Data scientists are not software engineers

    Although it is not acknowledged by some! Producing models in code is not the same as producing a good web application. You need domain-specific knowledge of model building and of the challenges that it presents.
  8. R and D != Engineering

    Many software engineers think that data science is just an engineering problem. However, a model building task is hard to scope: you never quite know in advance how much work it will take. Takeaway: make sure your stakeholders are ready for such high-risk, high-reward projects.
  9. Why?

    The data science process involves something like OSEMIC: Obtain, Scrub, Explore, Model, Interpret, Communicate. Building the model involved porting code from Matlab and understanding a new domain-specific problem. The API data sources were messy and hard to understand.
  10. Case study: Problem description

    A client was working on a visualization tool and needed to provide the results of a differential equation to users in a usable form. The research was already done, so once the code was prototyped in Python, what next? One key requirement was that the results of the 'mathematical engine' had to be incorporated quickly into a Ruby on Rails / JavaScript based product. The challenge is therefore one of interoperability.
  11. Possible Solutions (and their problems)

    Write models in Ruby --> Turned out Ruby doesn't have an ODE solver
    Port code to Java --> Cross language validation
    PMML --> Doesn't have great language support
    Batch jobs --> High maintenance and config
    More tools, more work, more time
  12. So I did what all data scientists do when stuck...
  13. I could use stuff from YHatHQ to build a model as a service...
  14. This is a much better solution!

    I used ScienceOps from YHatHQ. Key tenets:
    1. Work with the tools you already know
    2. Iterate quickly
    3. Low touch
    4. No rewriting code
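The "model as a service" idea from slides 13 and 14 can be sketched without any particular platform. The block below is not the ScienceOps API; it is a minimal standard-library WSGI sketch, and the names `predict` and `app`, plus the linear stand-in model, are assumptions made purely for illustration. The point is only that a Ruby on Rails or JavaScript client can POST JSON and get JSON back while the model stays in Python.

```python
import json


def predict(payload):
    """Hypothetical stand-in model: a fixed linear score of the input 'x'."""
    x = float(payload["x"])
    return {"score": 2.0 * x + 1.0}


def app(environ, start_response):
    """Minimal WSGI app: JSON in the request body, JSON prediction out."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        body = json.dumps(predict(payload)).encode("utf-8")
        status = "200 OK"
    except (KeyError, TypeError, ValueError) as exc:
        # Missing field, wrong type, or malformed JSON: report it to the caller.
        body = json.dumps({"error": str(exc)}).encode("utf-8")
        status = "400 Bad Request"
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

To try it locally, the app can be handed to `wsgiref.simple_server.make_server("127.0.0.1", 8000, app)` and served with `serve_forever()`; the host and port here are arbitrary choices.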
  15. Code!

    http://bit.ly/1J3T4qf

        import numpy as np
        from scipy import integrate

        # The constants bs, astr, N, c1, c2, tdS, tt, z0 and C are not shown
        # on the slide and must be defined before this point.
        A1 = bs * (astr * N) ** 2
        A2 = c1 / tdS
        A4 = A1 * z0  # moved up: A3 below refers to A4
        A3 = (1 + bs) * (A4 * N) ** 2
        A5 = A3 * z0
        A6 = C
        A7 = 0.5 * ((c2 / tt) + (c1 / tdS))
        A8 = (c2 / tt) - (c1 / tdS)

        def dX_dt(X, t=0):
            """Return the triple ODE calculations."""
            return np.array([
                -A1 * X[2] + A4,
                -A2 * X[1] + A3 * X[2] - A5,
                X[0] - X[1],
            ])

        t = np.linspace(0, 35, 1000)  # time
        X0 = np.array([0, 1, 0])  # initial conditions
        X, infodict = integrate.odeint(dX_dt, X0, t, full_output=True)
        infodict['message']
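The slide's code depends on constants that are not shown, so it cannot be run as it stands. A self-contained sketch of the same three-variable system might look like the following. The coefficient values are placeholders chosen only so that the code executes; they are not the values from the case study. The `solve_as_json` helper is likewise a hypothetical name, added to show how the trajectory could be handed to a web front end.

```python
import json

import numpy as np
from scipy.integrate import odeint

# Placeholder coefficients (assumptions, NOT the case-study values).
A1, A2, A3, A4, A5 = 0.5, 0.3, 0.2, 0.1, 0.05


def dX_dt(X, t=0):
    """Right-hand side of the three-variable system shown on the slide."""
    return np.array([
        -A1 * X[2] + A4,
        -A2 * X[1] + A3 * X[2] - A5,
        X[0] - X[1],
    ])


def solve_as_json(t_end=35.0, n_points=1000):
    """Integrate the system and return a JSON string a web client can parse."""
    t = np.linspace(0, t_end, n_points)
    X0 = np.array([0.0, 1.0, 0.0])  # initial conditions, as on the slide
    X = odeint(dX_dt, X0, t)
    # Convert NumPy arrays to plain lists so that json can serialize them.
    return json.dumps({"t": t.tolist(), "x": X.tolist()})
```

Returning plain JSON keeps the consumer independent of NumPy, which is what the interoperability requirement in the case study calls for.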
  16. What are the key takeaways?

    1. The 'magic quickly' problem
    2. Lack of a shared language between software engineers and data scientists - but investing in the right tooling by using open standards allows success.
    3. To help data scientists and analysts succeed, your business needs to be prepared to invest in tooling.
  17. Lack of a shared language

    Statisticians and software engineers don't necessarily have a shared language. Services like ScienceOps help bridge the gap. "Watch for high skew and kurtosis." Think about your team balance in your projects: math folk versus coders.
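For readers on the engineering side of that language gap, the quoted advice can be made concrete with a short sketch. The two samples below are synthetic and invented for illustration; `scipy.stats.skew` and `scipy.stats.kurtosis` are existing SciPy functions, and `kurtosis` reports excess kurtosis by default, so a normal sample scores close to zero.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not from the case study).
rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)
heavy_right_tail = rng.lognormal(size=10_000)

for name, sample in [("normal", symmetric), ("lognormal", heavy_right_tail)]:
    # High skew means a lopsided distribution; high kurtosis means heavy tails.
    print(name, stats.skew(sample), stats.kurtosis(sample))
```

The lognormal sample should show clearly larger values on both measures, which is the kind of signal the statistician's warning refers to.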
  18. Invest in tooling

    For your analysts and data scientists to succeed, you need to invest in infrastructure to empower them. Think carefully about how you want your company to spend its innovation tokens, and take advantage of the excellent tools available, like ScienceOps and AWS. I think there is great scope for entrepreneurs to take advantage of this arbitrage opportunity and empower data scientists by building good platforms and tooling. Contribute to Open Source Software such as the PyData stack!
  19. Lessons learned

    I can write a model in Python and have it deployed!
    Software engineers aren't data scientists and shouldn't be expected to write models in code.
    Models only provide value when they are in production.
    Getting information from stakeholders is really valuable in improving models.
  20. Successes

    Within a few months it was possible to have an analytics product in production, using information consumed from a variety of APIs. I have no idea how else I could have deployed the models, except maybe using PMML. Total development time was 3 months, with 5 people. Only two (including myself) were working full time on this project. That development time includes the time it took us to learn the domain-specific knowledge: models, API sources, etc.
  21. Other kinds of data science products

    Credit risk modelling
    Customer attrition modelling
    Recommendation engines
    Airline delay analysis
    The list goes on...