Subscribe here to receive the Data Science Roundup every Sunday morning.
Life in NYC from Pickup to Drop Off
Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance | Todd W. Schneider
Todd Schneider published a comprehensive deep dive into a historical dataset released by the New York City Taxi & Limousine Commission covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Schneider’s analysis involved mapping the coordinates of every taxi trip to local census tracts and neighborhoods, incorporating publicly available Uber data covering 19 million Uber rides, building out multiple interactive maps, and then extracting stories and meaning from the data. Schneider notes that not long ago, the idea of “downloading, processing, and analyzing 267 GB of raw data containing 1.1 billion rows on a commodity laptop would have been almost laughably naive.” However, he was able to carry out his analysis on a MacBook Air using PostgreSQL and R, and he points out that “increasingly, the limiting factor of data analysis is not computational horsepower, but human curiosity and creativity.”
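To give a rough feel for what mapping trip coordinates to census tracts involves, here is a minimal sketch in plain Python. It is not Schneider’s code (his analysis used PostgreSQL and R); it swaps in a simple ray-casting point-in-polygon test, and the tract names and square boundaries are hypothetical placeholders rather than real NYC geography.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon?

    `polygon` is a list of (lon, lat) vertices in order. Points that lie
    exactly on an edge may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point toward +longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_tract(lon, lat, tracts):
    """Return the name of the first tract containing the point, else None."""
    for name, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical tracts: two adjacent unit squares (not real boundaries).
tracts = {
    "tract_a": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract_b": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

pickup_tract = assign_tract(0.5, 0.5, tracts)
dropoff_tract = assign_tract(1.5, 0.5, tracts)
outside_tract = assign_tract(5.0, 5.0, tracts)
```

A linear scan over every tract for every trip would be far too slow for 1.1 billion rows; at that scale a spatial index inside the database is the more practical route.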
The Shrinking Bull’s-Eye
“Shrinking bull’s-eye” algorithm speeds up complex modeling from days to hours | MIT News
Jennifer Chu reports on a new algorithm developed by MIT researchers that significantly reduces the computation involved when working with complex computational models. “The algorithm can be applied to any complex model to quickly determine the probability distribution, or the most likely values, for an unknown parameter.” MIT associate professor Youssef Marzouk explains: “For an intractable problem, if you had two months and a huge computer, you could get some answer, but you would not necessarily know how accurate it was. Now for the first time, we can say that if you run our algorithm, you can guarantee that you’ll find the right answer, and you might be able to do it in a day. Previously that guarantee was absent.”
A Trivial Invention
The Discovery of Statistical Regression | Priceonomics
Dan Kopf of Priceonomics tells the story of the discovery of statistical regression, which led to “one of the most famous priority disputes in the history of science” between two legendary mathematicians, Carl Friedrich Gauss and Adrien-Marie Legendre. Kopf explains how Gauss considered his original discovery of the least squares method (the essential element of statistical regression) to be “trivial,” and therefore assumed he was not the first to use it. It was only after Legendre published on the method that Gauss made public his own claim to having discovered the method that would evolve into the “crown jewel of statistical analysis.”
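For readers who have not seen it written out, the least squares method picks the line that minimizes the sum of squared vertical distances between the line and the observed points. The sketch below fits a straight line using the standard closed-form solution; the function name and the sample data are illustrative assumptions and do not come from the Priceonomics article.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross-deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

With noisy data the fitted line no longer passes through every point, but it remains the one with the smallest total squared error.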
Plotly.js’s Graphing Library Goes Open-Source
Plotly.js Open-Source Announcement | Plotly.js
Plotly announced that they have “open-sourced plotly.js, the core technology and JavaScript graphing library behind Plotly’s products.” The announcement on their blog states: “By open-sourcing Plotly’s core technology, everyone benefits from peer-review and Plotly’s products will continue to be the most cutting-edge offering for exploratory visualization. Plotly.js has the quality, accessibility, and scope to be the charting standard for the Web, but we can only achieve this breadth by working across communities and making the distribution truly unencumbered, portable, and free.”
Pinterest Introduces Visual Search
Introducing a new way to visually search on Pinterest | Pinterest Engineering
Pinterest rolled out a “visual search tool that lets you zoom in on a specific object in a Pin’s image and discover visually similar objects, colors, patterns and more.” Software engineer Andrew Zhai writes that the project was a joint effort between the Pinterest team and members of the Berkeley Vision and Learning Center. The team “built a distributed index and search system (using open-source tools) that allows us to scale to billions of images and find thousands of visually similar results in a fraction of a second.”
Bayesian Reasoning and Deep Learning
Memory-based Bayesian reasoning and deep learning | Google DeepMind
Shakir Mohamed, Research Scientist in Statistical Machine Learning at Google DeepMind, shares slides from his recent talk exploring the convergence of deep learning and Bayesian inference.
Politics and the New Machine
What the turn from polls to data science means for democracy | The New Yorker
Jill Lepore reports on the history of polling in American politics, and how data science is playing a role in addressing the fact that polling “has never been less reliable or more influential than it is now.” Lepore looks at companies like Civis Analytics and Crowdpac, as well as the impact of Nate Silver’s work in aggregating and giving more weight to reliable polls. But ultimately, Lepore suggests that “data science can’t solve the biggest problem with polling, because that problem is neither methodological nor technological. It’s political.”
Machine Learning at Quora
Machine Learning at Quora | LinkedIn
Xavier Amatriain, VP of Engineering at Quora, shares his team’s approach to the machine learning challenges they face in attempting to “share and grow the world’s knowledge” through their question and answer paradigm.
The Future of AI is Data
Google Open-Sourcing TensorFlow Shows AI’s Future is Data | Wired
Cade Metz follows up on Google’s announcement last week that it is open-sourcing its machine learning system, TensorFlow, by talking with Lukas Biewald, founder of CrowdFlower. Biewald points out that Google’s decision indicates that “when it comes to AI, the real value lies not so much in the software or the algorithms as in the data needed to make it all smarter. Google is giving away the other stuff, but keeping the data.” For more on TensorFlow, check out last week’s roundup here.
Each week we surface, summarize, and share the most interesting stories and biggest news from the world of data science. Have articles or podcasts that you think we should be covering in our Data Science Roundup? Send them to editor@rjmetrics.com.