Subscribe here to receive the Data Science Roundup every Sunday morning.

Life in NYC from Pickup to Drop Off

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance | Todd W. Schneider

Todd Schneider published a comprehensive deep dive into a historical dataset released by the New York City Taxi & Limousine Commission covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Schneider’s analysis involved mapping the coordinates of every taxi trip to local census tracts and neighborhoods, incorporating publicly available Uber data covering 19 million Uber rides, building out multiple interactive maps, and then extracting stories and meaning from the data. Schneider notes that, until recently, the idea of “downloading, processing, and analyzing 267 GB of raw data containing 1.1 billion rows on a commodity laptop would have been almost laughably naive.” However, he was able to carry out his analysis on a MacBook Air using PostgreSQL and R, and he points out that “increasingly, the limiting factor of data analysis is not computational horsepower, but human curiosity and creativity.”

The Shrinking Bull’s-Eye

“Shrinking bull’s-eye” algorithm speeds up complex modeling from days to hours | MIT News

Jennifer Chu reports on a new algorithm developed by MIT researchers that significantly reduces the computation involved when working with complex computational models. “The algorithm can be applied to any complex model to quickly determine the probability distribution, or the most likely values, for an unknown parameter.” MIT associate professor Youssef Marzouk explains: “For an intractable problem, if you had two months and a huge computer, you could get some answer, but you would not necessarily know how accurate it was. Now for the first time, we can say that if you run our algorithm, you can guarantee that you’ll find the right answer, and you might be able to do it in a day. Previously that guarantee was absent.”

A Trivial Invention

The Discovery of Statistical Regression | Priceonomics

Dan Kopf of Priceonomics tells the story of the discovery of statistical regression, which became “one of the most famous priority disputes in the history of science” between two legendary mathematicians, Carl Friedrich Gauss and Adrien-Marie Legendre. Kopf explains how Gauss considered his original discovery of the least squares method (the essential element of statistical regression) to be “trivial,” and therefore assumed he was not the first to use it. It was only after Legendre published on the method that Gauss made public his own claim to having discovered what would evolve into the “crown jewel of statistical analysis.”

Plotly’s Graphing Library Goes Open Source

Plotly.js Open-Source Announcement | Plotly.js

Plotly announced that they have “open-sourced plotly.js, the core technology and JavaScript graphing library behind Plotly’s products.” The announcement on their blog states: “By open-sourcing Plotly’s core technology, everyone benefits from peer-review and Plotly’s products will continue to be the most cutting-edge offering for exploratory visualization. Plotly.js has the quality, accessibility, and scope to be the charting standard for the Web, but we can only achieve this breadth by working across communities and making the distribution truly unencumbered, portable, and free.”

Pinterest Introduces Visual Search

Introducing a new way to visually search on Pinterest | Pinterest Engineering

Pinterest rolled out a “visual search tool that lets you zoom in on a specific object in a Pin’s image and discover visually similar objects, colors, patterns and more.” Software engineer Andrew Zhai writes that the project was a joint effort between the Pinterest team and members of the Berkeley Vision and Learning Center. The team “built a distributed index and search system (using open-source tools) that allows us to scale to billions of images and find thousands of visually similar results in a fraction of a second.”

Bayesian Reasoning and Deep Learning

Memory-based Bayesian reasoning and deep learning | Google DeepMind

Shakir Mohamed, Research Scientist in Statistical Machine Learning at Google, shares slides from his recent talk exploring the convergence of deep learning and Bayesian inference.

Politics and the New Machine

What the turn from polls to data science means for democracy | The New Yorker

Jill Lepore reports on the history of polling in American politics, and how data science is playing a role in addressing the fact that polling “has never been less reliable or more influential than it is now.” Lepore looks at companies like Civis Analytics and Crowdpac, as well as the impact of Nate Silver’s work in aggregating and giving more weight to reliable polls. But ultimately, Lepore suggests that “data science can’t solve the biggest problem with polling, because that problem is neither methodological nor technological. It’s political.”

Machine Learning at Quora

Machine Learning at Quora | LinkedIn

Xavier Amatriain, VP of Engineering at Quora, shares his team’s approach to the machine learning challenges they face in attempting to “share and grow the world’s knowledge” through their question-and-answer paradigm.

The Future of AI is Data

Google Open-Sourcing TensorFlow Shows AI’s Future is Data | Wired

Cade Metz follows up on Google’s announcement last week that it is open-sourcing its machine learning system, TensorFlow, by talking with Lukas Biewald, founder of CrowdFlower. Biewald points out that Google’s decision indicates that “when it comes to AI, the real value lies not so much in the software or the algorithms as in the data needed to make it all smarter. Google is giving away the other stuff, but keeping the data.” For more on TensorFlow, check out last week’s roundup here.

Each week we surface, summarize, and share the most interesting stories and biggest news from the world of data science. Have articles or podcasts that you think we should be covering in our Data Science Roundup? Send them to editor@rjmetrics.com.

