Using statistics and data science to build a crowdsourcing data platform

by Ankur Gupta, Machine Learning Scientist, Premise Data, San Francisco.

Increased internet connectivity has allowed large numbers of people to work towards a single goal in a distributed fashion. This practice is called crowdsourcing and we see successful examples of crowdsourcing everywhere. The most famous example is perhaps Wikipedia, which allows anyone in the world to create and edit articles. The result of this crowdsourcing model is the largest and most popular encyclopedia.

The popularity of internet-connected smartphones with built-in GPS has taken crowdsourcing to another level. While global personal computer shipments have decreased in recent years, smartphone penetration has been increasing. This is especially true in developing countries where smartphones, being cheaper and more intuitive to use, are the first internet-connected digital devices for many users.

Unlike personal computers, smartphones are carried by people almost everywhere. People use smartphones for everything from taking pictures and browsing social media to getting real-time traffic information while driving. All of these activities generate data points that can be tagged with important meta information such as a timestamp and a geolocation (thanks to that nifty GPS chip in the smartphone), without requiring time-consuming manual input from the user. This meta information alone provides copious amounts of useful data, resulting in a flourishing mobile crowdsourcing industry.

One example of geolocation-based mobile crowdsourcing is Waze, which provides navigation with real-time traffic information including updates about traffic accidents. Simply driving around with the Waze app turned on allows drivers to share real-time traffic information. This information is then used to calculate metrics like average speed and to improve directions. Another example is Premise Data, a technology startup based out of San Francisco, which uses smartphones to collect various types of data from almost anywhere in the world.

Premise has built a platform to collect real-time, ground-truth data that may be difficult to obtain by other means. Premise has a global network of on-the-ground, everyday people who use their smartphones to collect individual data points through the Premise mobile app. Premise then collates this crowdsourced data to generate actionable insights that help other organizations, such as the World Bank, make informed policy decisions.


Premise has collected a wide variety of data, ranging from food prices in Nigeria to maps of mobile money access in several African countries to Zika virus surveillance in Colombia. Premise’s process of data collection is simple. Premise posts tasks on its mobile app for users to perform. Users select and perform the tasks they like and are paid a small amount of money as compensation. There are various kinds of tasks available in the Premise marketplace, each collecting a different type of information.

A simple example of a task is collecting food prices to generate a real-time food-price index. Users are asked to go to a store or a supermarket, take a picture of a particular food item (like milk or eggs), and key in the price using their smartphones. The smartphone records the time and geolocation of the picture and uploads them along with it to Premise’s cloud platform. Other tasks are more involved, such as those that require users to walk a specified route and report any points of interest along it, such as banks, ATMs, and mobile money stores. For epidemic monitoring, such as Zika surveillance in Cali, Colombia, Premise allows vector control workers to easily record on their smartphones which sewers (or sumideros) contain larvae of the mosquitoes known to spread Zika.

Like other crowdsourcing companies, Premise combines the individual data points reported by users into a coherent dataset and provides real-time dashboards showing useful metrics. As with any real-world data, data collected through crowdsourcing is messy and requires processing before it can be used. For example, prices entered by hand can contain typographical errors (such as an extra digit) that need to be distinguished from true price outliers. Text inputs, such as names of banks or the brand of a food item, can contain spelling mistakes that need to be corrected before individual data points from different users can be combined properly. Occasionally, the submitted information is unusable because of severe data entry mistakes or fraudulent users. Premise deals with these issues in two stages: first, an automatic, computer-only quality control stage that weeds out egregious fraud or mistakes without the need for a human, and second, a human-powered quality control process that guards against subtler or novel user mistakes and fraud.
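To make the two cleaning problems above concrete, here is a minimal sketch in Python. It is not Premise's actual pipeline, which the article does not describe in detail. The function names (`classify_price`, `normalize_name`), the log-scale tolerance, the similarity cutoff, and the sample bank names are all assumptions made for illustration. The price check compares a submission to the median of recent prices for the same item and treats a deviation of roughly a power of ten as a likely extra or missing digit. The name check uses fuzzy string matching from the standard library's `difflib` to map a misspelled name to a known one.

```python
import difflib
import math
import statistics


def classify_price(price, reference_prices, tolerance=0.5):
    """Label a submitted price as 'ok', 'likely_typo', or 'outlier'.

    reference_prices: recent accepted prices for the same item (hypothetical input).
    tolerance: allowed deviation from the median in natural-log units (assumed value).
    """
    median = statistics.median(reference_prices)
    if price <= 0 or median <= 0:
        return "outlier"
    ratio = price / median
    if abs(math.log(ratio)) <= tolerance:
        return "ok"
    # An extra or missing digit moves the price by roughly a power of ten.
    for factor in (10, 100, 0.1, 0.01):
        if abs(math.log(ratio / factor)) <= tolerance:
            return "likely_typo"
    return "outlier"


def normalize_name(raw, known_names, cutoff=0.8):
    """Map a possibly misspelled name to the closest known name, or None.

    known_names: a curated list of canonical names (hypothetical input).
    cutoff: minimum similarity ratio in [0, 1] required for a match (assumed value).
    """
    lookup = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    recent = [1.9, 2.0, 2.1, 2.05, 1.95]  # made-up prices for one item
    for submitted in (2.2, 20.5, 6.0):
        print(submitted, classify_price(submitted, recent))

    banks = ["Riverside Bank", "Hilltop Bank", "Lakeside Bank"]  # made-up names
    print(normalize_name("Riversde Bank", banks))
```

With reference prices clustered around 2.0, a submission of 20.5 is labeled `likely_typo`, while 6.0 is labeled `outlier`, since it is far from the median but not close to a power-of-ten multiple of it. In practice, a rule like this would only flag submissions for review; deciding whether a flagged price is a genuine outlier is the kind of judgment the human-powered stage is suited for.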

To clean messy data and to control for mistakes and fraud, the Premise Data Science team uses open source data science software in both the R and Python programming languages. R is used for more statistical problems and Python for problems that require more software engineering. Given the geospatial nature of most of the data collected via smartphones, the team uses various geospatial sampling and visualization packages in both R and Python, and also writes internal statistical software based on the latest research papers.

The curated data points are then used to build metrics such as food-price indices, dashboards showing heatmaps of Zika risk, or geospatial maps showing locations of financial institutions in developing countries. These metrics and dashboards are updated in real time as new data trickles in and are shared with other organizations, such as the World Bank, private banks, and local governments, to help them make decisions.
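As an illustration of how curated prices might be turned into an index, the sketch below computes a Jevons-type index: the geometric mean of the price ratios between a base period and the current period, scaled so that the base period equals 100. The article does not say which index formula Premise uses, so the choice of formula, the function name `jevons_index`, and the sample prices are assumptions.

```python
import math


def jevons_index(base_prices, current_prices):
    """Return a Jevons price index (base period = 100).

    Both arguments map an item name to its representative price in that
    period. Only items priced in both periods are used.
    """
    common = sorted(set(base_prices) & set(current_prices))
    if not common:
        raise ValueError("no items are priced in both periods")
    # Average the log price ratios, then exponentiate: a geometric mean.
    log_ratios = [math.log(current_prices[i] / base_prices[i]) for i in common]
    return 100.0 * math.exp(sum(log_ratios) / len(log_ratios))


if __name__ == "__main__":
    base = {"milk": 2.0, "eggs": 4.0}  # made-up prices
    current = {"milk": 2.2, "eggs": 4.0, "bread": 1.5}  # bread has no base price
    print(round(jevons_index(base, current), 2))
```

A geometric mean is less sensitive to a single extreme price ratio than an arithmetic mean, which is one reason it suits noisy crowdsourced prices. Recomputing the index as each new batch of cleaned prices arrives is one way a dashboard could stay current.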
