Data Science: A Primer For Software Developers

Data Science and Machine Learning are big topics. As with any buzzword, “Data Science” can mean many different things to different people.

For software developers, Data Science needs to be understood from the perspective of building applications for our business clients. This Primer takes a cut at explaining Data Science from the developers’ perspective.

Topics discussed include:

  1. What is Data Science, exactly?

  2. How do you “do” Data Science?

  3. How do you collect and analyze Big Data?

  4. What does it mean for Software Developers?

A different mindset

The unhelpful pedantic definition of Data Science focuses on the academic and research side of the field. From this angle, Data Science is about the creation of new approaches while Data Analysis is about the use of those techniques. This is like the difference between Computer Science and Software Development, a distinction which has made its way into common usage in the industry.

An example from the history of machine translation helps. Machine translation is the research term for automated translation between human languages, as provided by tools such as babelfish.com or Google Translate.

In the early days of this research, the approach was to create a set of rules and code them in software. In 1954, a set of six rules and a vocabulary of 250 words demonstrated Russian to English translation of 60 phrases. Researchers thought they could scale up this early success to full translation by coding more rules and expanding the vocabulary. About a decade later, they abandoned this unproductive problem-solving strategy. This algorithmic approach exemplifies a software development mindset to solving the problem.

In the 1980s, researchers from IBM started to investigate statistical phrase-based translation. (Canadian Heritage Moment: the parliamentary Hansard provided the necessary labeled training set of English/French phrases.) This statistical approach signified a move from coding complex rules that operate on limited data to learning simple rules from large amounts of data. By 2007, Google started to use its enormous corpus (or ‘big data’) to create a statistical translation model.

What distinguishes software development from Data Science?

“A more helpful definition is that Data Science is a hypothesis-driven approach to answering specific questions using data analysis methods.”

This research-flavoured definition of Data Science is not the current common usage of the term, except possibly at large corporations that fund extensive research such as Google, Facebook, and Apple.

Confusing Labels

Now we come to another confusion in terminology between software development and Data Science. For software developers, a model is often a domain abstraction used to solve a problem, whether or not software is involved.

For data scientists, a model is a product of their work (a data product). The model takes new inputs to create new outputs. For example, the resulting machine translation model from the previous example would take English input and create French output.

>> translation_model('the big red dog') -> 'le gros chien rouge'

In software development, we would call this a function or, if it were larger, a module, but not a model as such.

Setting aside the question of how to apply automated testing practices or even QA to the world of Data Science, we must recognize that validation has a particular meaning to Data Scientists that differs from what software developers would understand. Validation of a Data Science model is part of an iterative process of refining the parameters used to create the model. In software development, validation is part of confirming that the software meets the specified needs.
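To make the Data Science sense of validation concrete, here is a minimal sketch in Python. The toy data, the nearest-neighbour model, and the candidate values of k are all assumptions chosen for illustration, not a recommended setup. The point is only that each candidate setting is scored on data held back from training, and the best-scoring one is kept.

```python
import random

def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, held_out, k):
    """Fraction of held-out examples that the model labels correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in held_out)
    return correct / len(held_out)

# Hypothetical toy data: the label is 1 when x > 5, with about 10% of labels flipped as noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

# Hold back part of the data; the model never sees it during training.
train, validation = data[:150], data[150:]

# Score each candidate setting on the held-out data and keep the best one.
scores = {k: accuracy(train, validation, k) for k in (1, 5, 15)}
best_k = max(scores, key=scores.get)
print(scores, "-> chosen k:", best_k)
```

Nothing here confirms that the software "meets the specified needs". The loop only tunes a setting, which is why the two uses of the word validation should not be confused.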

These are only a couple of examples of overlapping terminology. They show part of the challenge in understanding the relationship between software development and Data Science. Accept that they’re different but related.

Process Diagram of Data Science

So, how do you do “Data Science”?

Data Science is a hypothesis-driven approach to answering specific questions using data analysis methods. Here are the different parts of a Data Science project.

1. Formulating a question that you can answer with data

The question-first approach tries to avoid the problems that can occur by finding spurious patterns in the data. Looking at the data first can be useful, but requires some rigour to avoid the dangers of data dredging where we misuse the data both to suggest hypotheses and then to confirm them.
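The danger of data dredging can be illustrated with a small sketch, assuming nothing but random noise as input. The row and feature counts and the random seed are arbitrary choices for illustration. None of the generated features has any real relationship to the target, yet searching through enough of them turns up one that looks convincingly correlated.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n_rows, n_features = 20, 200  # few rows, many candidate features

# Everything is pure noise: no feature truly predicts the target.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

# "Looking at the data first": search every feature for the strongest pattern.
correlations = [abs(pearson(feature, target)) for feature in features]
best_r = max(correlations)
print(f"strongest 'pattern' found in pure noise: |r| = {best_r:.2f}")
```

Using the same data to find this "pattern" and then to confirm it would be circular. Stating the question first, or checking the pattern against fresh data, guards against that.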

This part considers: What is the question? Do we have (or can we get) the data to answer it? Could we use the answer once we have it?

There are three types of questions. What happened? This could be enough to inform future decisions, given the assumption that the future will be like the past. What might happen? This allows us to make decisions given possible future circumstances. E.g., what will next year’s sales be if we add another store?

2. Collecting and cleaning the data

This can include traditional ETL work into a data warehouse or data lake. It includes handling missing values through imputation, and normalization and scaling of data.
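As a minimal sketch of two of these cleaning steps, the Python below fills missing values with the mean of the known values (one simple form of imputation) and rescales a column to the range 0 to 1. The function names and the sample `ages` column are hypothetical, and real projects often use other imputation and scaling strategies.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 60]        # hypothetical column with one missing value
filled = impute_mean(ages)       # the gap becomes the mean of 20, 40, and 60
scaled = min_max_scale(filled)   # every value now lies between 0.0 and 1.0
print(filled, scaled)
```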

This can also include feature engineering. Feature engineering is the process of applying domain knowledge of the data to extract or synthesize more features. E.g., a timestamp field might be more useful if we extract the day of week and the hour of day as separate fields when we are trying to model the expected ridership on public transit. E.g., the two fields lot width and lot depth might be more useful as a single derived field lot area when we are trying to predict real estate prices.
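Both examples can be sketched in a few lines of Python. The record layouts and field names (`timestamp`, `lot_width`, `lot_depth`, and the derived fields) are assumptions made up for illustration.

```python
from datetime import datetime

def add_time_features(record):
    """Derive day-of-week and hour-of-day fields from an ISO 8601 timestamp."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        **record,
        "day_of_week": ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour,
    }

def add_lot_area(record):
    """Combine two raw fields into one derived field."""
    return {**record, "lot_area": record["lot_width"] * record["lot_depth"]}

# Hypothetical transit record: 4 March 2024 was a Monday.
boarding = add_time_features({"timestamp": "2024-03-04T08:30:00", "riders": 112})

# Hypothetical real estate record.
listing = add_lot_area({"lot_width": 10.0, "lot_depth": 30.0, "price": 450000})

print(boarding["day_of_week"], boarding["hour_of_day"], listing["lot_area"])
```

The raw timestamp says little to a model on its own, while "Monday at 8 a.m." lines up directly with commuting patterns.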

3. Analyzing the data

We use unsupervised learning when labels (or example answers) aren’t provided. It is often used for discovering structure within data. Clustering finds groups of similar data. E.g., a clothing market segmentation problem might try to find three clothing sizes that are appropriate for the population’s combinations of height and weight. Clustering is also useful in document search applications finding a document “like” a given document. We use other unsupervised learning algorithms for visualizing high-dimensionality data in 2 or 3 dimensions.
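The clothing-size example can be sketched with k-means, one common clustering algorithm (the text above does not name a specific one). The measurements below are invented, the starting centroids are chosen by a simple deterministic rule rather than at random, and height and weight are left unscaled for brevity, so treat this as an illustration rather than a production approach.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_index(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, label for each point)."""
    # Deterministic start: k points spread evenly through the sorted data.
    ordered = sorted(points)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest_index(point, centroids) for point in points]
    return centroids, labels

# Hypothetical (height in cm, weight in kg) measurements; no size labels are given.
people = [
    (155, 50), (158, 54), (160, 52),
    (170, 68), (172, 70), (175, 72),
    (185, 88), (188, 92), (190, 90),
]
centroids, labels = k_means(people, k=3)
print(labels)      # which of the three discovered groups each person falls into
print(centroids)   # a "typical" height and weight for each group
```

Each centroid can be read as the target measurements for one clothing size. The algorithm was never told what small, medium, or large means.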

Supervised learning assumes a labeled data set of examples. In the machine translation case, the Hansard provided the correct French/English answers for each example. Regression and classification differ in whether the predictions are real valued or categorical. E.g., use regression to predict tomorrow’s temperature (real valued). Use classification to predict whether it will be sunny or not (categorical).
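The difference in output type shows up clearly in a small sketch. The weather numbers below are invented, and the two techniques (a least-squares line and a nearest-class-mean rule) are simple stand-ins picked for illustration. Both learn from labeled examples, but one returns a number and the other returns a category.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def fit_class_means(xs, labels):
    """Mean feature value for each class label."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(values) / len(values) for label, values in groups.items()}

def classify(x, class_means):
    """Pick the label whose mean is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))

# Regression: hypothetical pairs of (today's temperature, tomorrow's temperature).
today = [10.0, 12.0, 14.0, 16.0]
tomorrow = [11.0, 13.0, 15.0, 17.0]
slope, intercept = fit_line(today, tomorrow)
predicted_temp = slope * 18.0 + intercept   # a real-valued prediction

# Classification: hypothetical humidity readings labeled with the observed weather.
humidity = [30, 35, 40, 80, 85, 90]
weather = ["sunny", "sunny", "sunny", "not sunny", "not sunny", "not sunny"]
class_means = fit_class_means(humidity, weather)
predicted_weather = classify(45, class_means)  # a categorical prediction

print(predicted_temp, predicted_weather)
```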

4. Communicating the answer

This is the output of a Data Science project, sometimes called the Data Product. In some cases, the net result is a report or presentation. Some organizations emphasize repeatable analyses. In these cases, the report or presentation embeds the executable code used to create the analysis. These sorts of environments would likely use text-based documents in a version control system. This should be familiar to software developers. The analyses have to rely on repeatable data. So, these environments would also need the ability to recall exactly the same data set that was originally used. Another variation on reporting is an interactive application to allow the target audience to explore the data. This might be a custom application, or use of an off-the-shelf tool such as Tableau or Pentaho.
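One lightweight way to check that a repeatable analysis is running against exactly the same data is to record a fingerprint of the data set alongside the report. The sketch below is an assumption about how that might be done, not a description of any particular tool. The `dataset_fingerprint` helper and the sample rows are hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """SHA-256 digest of a canonical JSON encoding of the rows."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Hypothetical data set used by an analysis.
original = [{"store": "A", "sales": 120}, {"store": "B", "sales": 95}]
recorded = dataset_fingerprint(original)  # stored in version control with the report

# Later: the same rows, even with keys in a different order, reproduce the fingerprint.
reloaded = [{"sales": 120, "store": "A"}, {"sales": 95, "store": "B"}]
same_data = dataset_fingerprint(reloaded) == recorded

# A silent change to the data is detected.
edited = [{"store": "A", "sales": 121}, {"store": "B", "sales": 95}]
changed_data = dataset_fingerprint(edited) != recorded

print(same_data, changed_data)
```

The fingerprint only detects a change. Recalling the original data still requires keeping a snapshot or a versioned copy somewhere.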

In other cases, a Data Science team might produce a running software module. This might be a prototype that a downstream software development team will productionize, or it might already be in a form useful for production.

Where Data Science fits with Software Development

In addition to “pure” Data Science per se, there are several ways that software developers can work with and add value to ongoing Data Science efforts. At a high level, here are three key areas where developers can help data projects.

  1. Data engineering and feature engineering.

  2. Productionization of the models coming out of the data science team.

  3. Tooling and process support. Organizations that value repeatable data science projects will want to automate and provide tools for the work of Data Science.

Author’s Bio

Ben is a Technology Director and Senior Architect with over 20 years of software development and design experience. Ben has extensive experience delivering technology assessments, roadmaps and solutions. He works closely with IT and business stakeholders to define solutions, direction, and technical project scope. Ben has been the technical architect on projects for a wide range of clients in many different sectors, with a particular focus on enterprise level systems for large financial institutions. Ben holds an Honours Bachelor of Science from University of Toronto.
