Subscribe here to receive the Data Science Roundup every Sunday morning.
Star Maps
Star Wars: every scene from every movie charted | Australian Broadcasting Co.
In the lead-up to the release of the new Star Wars movie, The Force Awakens, Katie Franklin, Simon Elvery and Ben Spraggon logged every character interaction and built charts for every scene from every Star Wars movie. In each chart, the horizontal axis represents time and each line represents a character. The vertical bars show when the corresponding characters appear together. The trio said they were inspired by the popular nerd comic xkcd’s hand-drawn movie character interactions, but they wanted to write code that could adequately replace the artist’s illustrations.
How Clean is the Cloud?
The Environmental Toll of a Netflix Binge | The Atlantic
Ingrid Burrington reports on the data centers that power the internet and how they measure their power usage, or more specifically, “power usage effectiveness” (PUE), the ratio of a data center’s total power usage to the power usage of the center’s IT equipment. Yet Burrington points out that PUE looks only at the internal operations of the data center, not at where the energy comes from or how much energy the facility uses overall. “Ultimately, all of these varied metrics and environmental concerns introduce questions about time, maintenance, and scale that the tech industry doesn’t often face. The Cloud optimizes for real-time, not geologic time. While much of today’s network infrastructure was built atop the infrastructures of the 19th century, that infrastructure wasn’t necessarily built to last.”
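For readers who want to see the arithmetic, here is a minimal sketch of the PUE ratio in Python. The function name and the energy figures are hypothetical, chosen only for illustration; they do not come from Burrington’s article. The example also shows why the metric says nothing about scale: two facilities of very different sizes can report the same PUE.

```python
def power_usage_effectiveness(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Return PUE: total facility energy divided by IT equipment energy.

    Hypothetical helper for illustration; the inputs are assumed to be
    energy totals measured over the same period.
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Made-up figures: a large and a small data center.
large_site = power_usage_effectiveness(total_facility_kwh=1500.0, it_equipment_kwh=1000.0)
small_site = power_usage_effectiveness(total_facility_kwh=150.0, it_equipment_kwh=100.0)

print(large_site)  # 1.5
print(small_site)  # 1.5, despite using a tenth of the energy
```

A value close to 1.0 means almost all of the energy goes to IT equipment rather than cooling and other overhead, but the number alone does not reveal where that energy comes from or how much is consumed in total.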
How to Scale Data Ingestion
Data superpowers for the masses | Medium
Nick Larusso, Data Architect at Graphiq, breaks down how the Graphiq team confronted the challenge of having to build thousands of custom data pipelines. Larusso explains why and how their “entire product team (composed of 50 smart, capable individuals with no technical training, most of which are English, Political Science, and Economics majors) is writing code that is run in our production pipeline.” The Graphiq team set out to reduce the “overhead in communication costs” that comes with telling a compelling story with data. “By empowering domain experts to own this entire process, we reduce that overhead significantly, allowing them to iterate much faster, which usually results in clearer, more compelling presentations of our data.”
Becoming a Data Scientist
Podcast Episode: 00 | Becoming a Data Scientist
Renee Teate created the Becoming a Data Scientist blog to document her path from “SQL Data Analyst pursuing an Engineering Master’s Degree” to “Data Scientist.” She has also given talks about data science, built a Data Science Learning Directory, and has now launched the Data Science Learning Club along with a new podcast. In the first episode, Teate shares details about her background and what drew her toward becoming a data scientist.
The End of Data Theory?
The Next Decade of Data Science | Medium
Vyacheslav Polonski, researcher at the Oxford Internet Institute, makes the case that in the coming years data scientists will need to resist the temptation to give in to the “wondrous promises of big data” and make certain they stay true to the scientific method. Polonski references Chris Anderson’s popular Wired article which, in Polonski’s view, proclaimed that “the vast availability of digital traces and other data sources of unprecedented form and scale will fundamentally transform the realm of the social sciences and industry research. […] In other words, given the availability of fine-grained time-stamped records of human behaviour at the level of individual events, data analysts could increasingly succeed without “outdated” theoretical models.” However, Polonski argues that it will be more important than ever to be aware of the challenges these theoretical models face, and suggests that instead of reaching an end of theory, we are in fact “entering a new renaissance for the scientific method in industry research and the social sciences.”
The Data Scientist’s Crib Sheet
Common Probability Distributions | Cloudera
Sean Owen, Director of Data Science at Cloudera, jokes that the rise of the data scientist has led to engineers being “left out of the chat about confidence intervals instead of tutting at the analysts who have never heard of the Apache Bikeshed project for distributed comment formatting.” Owen continues that, in order to “fit in” and be the “life and soul of the party again,” engineers now need a crash course in statistics, and probability distributions are the best place to start. “Probability distributions are fundamental to statistics, just like data structures are to computer science.” Owen explains the 15 probability distributions that turn up most consistently in practice and also shares a more detailed map of all univariate distributions for those who want to dive deeper.
Each week we surface, summarize, and share the most interesting stories and biggest news from the world of data science. Have articles or podcasts that you think we should be covering in our Data Science Roundup? Send them to editor@rjmetrics.com.
If you’re not signed up to receive the Data Science Roundup, subscribe here.