Subscribe here to receive the Data Science Roundup every Sunday morning.

Star Maps

Star Wars: every scene from every movie charted | Australian Broadcasting Co.

In the lead-up to the release of the new Star Wars movie, The Force Awakens, Katie Franklin, Simon Elvery and Ben Spraggon were inspired to log every character interaction and build out charts for every scene from every Star Wars movie. Every chart is built with the horizontal axis representing time and each line representing a character. The vertical bars show when the corresponding characters appear together. The trio said they were inspired by the popular nerd comic xkcd’s hand-drawn charts of movie character interactions, but they wanted to write code that could adequately replace the artist’s illustrations.

RJMetrics Data Science Roundup: Every scene from every #StarWars movie charted https://goo.gl/hExRyt via @abcnews

How Clean is the Cloud?

The Environmental Toll of a Netflix Binge | The Atlantic

Ingrid Burrington reports on the data centers that power the internet and how they measure their power usage, specifically through “power-usage effectiveness” (PUE), the ratio of a data center’s total power usage to the power usage of its IT equipment. Yet Burrington points out that PUE looks only at the internal operations of the data center, not at where the energy comes from or how much energy is being used overall. “Ultimately, all of these varied metrics and environmental concerns introduce questions about time, maintenance, and scale that the tech industry doesn’t often face. The Cloud optimizes for real-time, not geologic time. While much of today’s network infrastructure was built atop the infrastructures of the 19th century, that infrastructure wasn’t necessarily built to last.”
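As a rough illustration of the metric described above, PUE is commonly computed as total facility energy divided by the energy used by IT equipment, so a value of 1.0 would mean every kilowatt-hour goes to computing. The sketch below is only an illustration: the function name and the sample figures are hypothetical and do not come from the article.

```python
def power_usage_effectiveness(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Return PUE: total facility energy divided by IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        # The facility total includes the IT load, so it cannot be smaller.
        raise ValueError("total facility energy cannot be less than IT energy")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical figures for one day of operation, not taken from the article.
pue = power_usage_effectiveness(total_facility_kwh=1500.0, it_equipment_kwh=1000.0)
print(f"PUE = {pue:.2f}")  # prints "PUE = 1.50"
```

Nothing in this calculation reflects where the 1,500 kWh came from or how large the total is in absolute terms, which is the gap Burrington highlights.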

RJMetrics Data Science Roundup: @lifewinning reports on the data centers that power the internet https://goo.gl/hExRyt via @TheAtlantic

How to Scale Data Ingestion

Data superpowers for the masses | Medium

Nick Larusso, Data Architect at Graphiq, breaks down how the Graphiq team confronted the challenge of having to build thousands of custom data pipelines. Larusso explains why and how their “entire product team (composed of 50 smart, capable individuals with no technical training, most of which are English, Political Science, and Economics majors) is writing code that is run in our production pipeline.” The Graphiq team set out to reduce the “overhead in communication costs” that comes out of telling a compelling story with data. “By empowering domain experts to own this entire process, we reduce that overhead significantly, allowing them to iterate much faster, which usually results in clearer, more compelling presentations of our data.”

RJMetrics Data Science Roundup: Nick Larusso on how the @GraphiqHQ team scales data ingestion https://goo.gl/hExRyt

Becoming a Data Scientist

Podcast Episode: 00 | Becoming a Data Scientist

Renee Teate created the Becoming a Data Scientist blog to document her path from “SQL Data Analyst pursuing an Engineering Master’s Degree” to “Data Scientist.” She has also given talks about data science, built a Data Science Learning Directory, and has now launched the Data Science Learning Club along with a new podcast. In the first episode, Teate shares details about her background and what led her to gravitate toward becoming a data scientist.

RJMetrics Data Science Roundup: The #datascience learning club and the 1st episode of @BecomingDataSci’s new podcast https://goo.gl/hExRyt

The End of Data Theory?

The Next Decade of Data Science | Medium

Vyacheslav Polonski, researcher at the Oxford Internet Institute, makes the case that in the coming years data scientists will need to resist the temptation to give in to the “wondrous promises of big data” and make certain they are staying true to the scientific method. Polonski references Chris Anderson’s popular Wired article which, in Polonski’s view, proclaimed that “the vast availability of digital traces and other data sources of unprecedented form and scale will fundamentally transform the realm of the social sciences and industry research. […] In other words, given the availability of fine-grained time-stamped records of human behaviour at the level of individual events, data analysts could increasingly succeed without ‘outdated’ theoretical models.” However, Polonski argues that it will be more important than ever to be aware of the challenges these theoretical models face, and suggests that instead of reaching an end of theory, we are in fact “entering a new renaissance for the scientific method in industry research and the social sciences.”

RJMetrics Data Science Roundup: @slavacm on the next decade of #datascience & being aware of the challenges to come https://goo.gl/hExRyt

The Data Scientist’s Crib Sheet

Common Probability Distributions | Cloudera

Sean Owen, Director of Data Science at Cloudera, jokes that the rise of the data scientist has led to engineers being “left out of the chat about confidence intervals instead of tutting at the analysts who have never heard of the Apache Bikeshed project for distributed comment formatting.” Owen continues that, in order to “fit in” and be the “life and soul of the party again,” engineers now need a crash course in statistics, and probability distributions are the best place to start. “Probability distributions are fundamental to statistics, just like data structures are to computer science.” Owen explains the 15 probability distributions that turn up most consistently in practice and also shares a more detailed map of all univariate distributions for those who want to dive even deeper.

RJMetrics Data Science Roundup: @sean_r_owen on probability distributions and the #datascientists crib sheet https://goo.gl/hExRyt

Each week we surface, summarize, and share the most interesting stories and biggest news from the world of data science. Have articles or podcasts that you think we should be covering in our Data Science Roundup? Send them to editor@rjmetrics.com.
