How is data science being used against crime?

There is no doubt that, over the past two decades, the way humanity obtains information has changed dramatically. Until the 2000s, data was generated mainly by companies, institutions, and the public service, and it was generated internally: knowledge could be extracted only from that small universe, and even then the technologies needed to explore it broadly were lacking. The still-expanding internet also contributed its share of data, especially after the creation of search indexers (AltaVista, Yahoo, Google, and the lesser-known Cadê are good examples), which allowed searches on a global scale, something that would have been considered impossible at any earlier point in history.
But what really consolidated the Information Age was the computational implementation of the behavioral concept behind social networks. These drastically changed the way humanity dealt with the content generated on its own devices: instead of creation being limited to professional settings, each person voluntarily became a data generator. All types of media began to be shared incessantly, in volumes and at velocities never seen before. Photos, audio, text, or video: the lives and habits of each user came to be displayed in a virtual showcase for anyone who wanted to "take a look".
Words that previously did not belong to everyday vocabulary, or were mere references to futuristic science fiction, have permeated the lives of ordinary citizens. Business Intelligence, Big Data, Data Science, and Data Lakes have become synonymous with generating business value, with behavioral and predictive analysis, and with pattern discovery. New business models began to emerge, based not on the value of a physical product, such as a car, a house, or a computer, but on a more ephemeral value: generating knowledge from information. Product suggestions based on purchase profiles, behavioral analysis of streaming-service users to recommend films, and machine learning to support business decision-making were some of the focuses adopted by these companies, which found in them a market niche with immense room for exploration.
The quote "knowledge is power; information is liberating", by Ghanaian diplomat Kofi Annan, has proved concrete through the exponential growth of those whose work is built on data. Relatively "classic" companies (on an admittedly short timeline) such as Google, Facebook, and Twitter built their trajectories on analyzing and selling information, while service providers such as Uber and Airbnb are considered among the largest companies in their respective fields. To reach these levels, however, these companies had to master data analysis and machine learning technologies.
There are differences between the two main concepts related to obtaining information (Business Intelligence and Data Science) that need to be clarified so that the maximum value of the data held by public security can be explored. Business Intelligence refers to the process of collecting, organizing, analyzing, sharing, and monitoring information that supports business management. It is a set of techniques and tools that help transform raw data into meaningful, useful information, allowing large volumes of data to be interpreted easily, new opportunities to be identified, and effective strategies to be implemented on top of the information generated, giving the business a competitive advantage in the market and long-term stability. This was one of the first concepts to define the more complex analyses of the Information Age, with its first recorded applications in relational databases and Data Warehouses. With the appearance of Big Data, however, the techniques had to evolve, and Business Intelligence became just one cog in a much more complex data exploration mechanism.
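To make this BI-style transformation of structured records into management information concrete, the sketch below uses Python with pandas; the table, column names, and values are hypothetical and serve only as illustration.

import pandas as pd

# Hypothetical, already-structured incident records (the kind of input BI typically consumes).
incidents = pd.DataFrame({
    "district": ["North", "North", "South", "South", "East"],
    "type": ["theft", "robbery", "theft", "theft", "robbery"],
    "month": ["2023-01", "2023-01", "2023-01", "2023-02", "2023-02"],
})

# A classic BI operation: aggregate raw rows into an indicator a manager can read at a glance.
report = (
    incidents.groupby(["district", "type"])
    .size()
    .rename("occurrences")
    .reset_index()
    .sort_values("occurrences", ascending=False)
)
print(report)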
According to Gartner Research, the term Big Data has become one of the most used in information technology worldwide, referring to the massive amounts of information that companies need to process. Data management long evolved around two fundamental problems: volume and processing capacity. The challenges have changed, however, and it is no longer only a question of storage or processing power: data has become more complex, comes from a variety of new sources, and is collected at record speed. This gives rise to the three dimensions captured in Gartner's definition of Big Data as "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making" (whose main concepts are presented in the figure below), deepening the complexity of data analysis and leading naturally to the development of Data Science.
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from data in its various forms. It draws on statistics, Artificial Intelligence, databases, and scientific visualization to achieve its objective: extracting the greatest possible value from stored data. On that basis, it can be said that while Business Intelligence works with data that has already been processed and made available in a structured form, Data Science also works with unstructured or semi-structured data. Applying it in an institution therefore tends to take longer, given the possible specificity of the data that institution stores (in contrast to the more superficial and generic approach of its predecessor).
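To illustrate the contrast, the sketch below turns unstructured input into something analyzable: hypothetical free-text incident descriptions are converted into a numerical feature matrix with scikit-learn (the texts and parameters are invented for illustration).

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unstructured data: free-text narratives from incident reports.
narratives = [
    "vehicle stolen near the central market at night",
    "house broken into while the owners were traveling",
    "motorcycle theft reported close to the market area",
]

# The Data Science step: convert unstructured text into a structured numerical matrix
# that statistical and machine learning methods can consume.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(narratives)

print(features.shape)                      # (documents, vocabulary terms)
print(vectorizer.get_feature_names_out())  # the terms that became columns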
With the possibility of exploring the concept of Big Data within companies, the velocity, volume, and variety of stored data grew exponentially, allowing techniques that uncovered information impossible to obtain before this technology. An institution may, however, have several sources of Big Data and extremely varied data types, such as those coming from social networks and sensor networks. Natural evolutions spread from these conditions, one of them being Data Lakes. In a broad sense, these can be described as huge repositories of data covering a range of heterogeneous topics and business domains. Such repositories need to be organized effectively so that value can be gained from them, and they require various techniques for extracting information and knowledge, preventing them from becoming a mass of unusable data (Data Swamps). In more specific and technical terms, a Data Lake is generally a central storage location in an organization, company, or institution, into which data of any type and size can be copied at any rate, using any import method (initial load, batch, streaming), in its original (native, raw) format. Take the Azure Data Lake model, implemented by Microsoft as a service, as an example: the technology collects the data (1), processes it using Hadoop or Spark (2), combines it in different ways (3), transforms it for analysis (4), prepares it for visualization and publication, generating information (5), and, finally, distributes it for on-demand use (6).
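A minimal sketch of stages (1), (2), and (4) of such a pipeline is shown below, using PySpark on hypothetical paths and column names; it is not tied to Azure specifically and only illustrates the pattern of reading raw data from the lake and writing back a curated result.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# (1) Collect: read raw, semi-structured records exactly as they landed in the lake.
raw = spark.read.json("/lake/raw/incident_reports/")  # hypothetical path

# (2) Process: clean and filter with Spark.
cleaned = (
    raw.dropna(subset=["incident_type", "district"])
       .withColumn("reported_at", F.to_timestamp("reported_at"))
)

# (4) Transform for analysis: aggregate into a table ready for analysts and dashboards.
summary = cleaned.groupBy("district", "incident_type").count()

# Persist the curated result in a columnar format for downstream consumption.
summary.write.mode("overwrite").parquet("/lake/curated/incident_summary/")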
But what does each of these technologies mean and, above all, what is their impact on public security? Once they are fully implemented and mastered, would it be possible to analyze suspicious patterns and predict or prevent certain criminal actions through police operations guided by data analysis? The answer is already positive, given the isolated use of algorithms by several governments, such as the crime prediction algorithm in Italy and the Interpol voice biometrics system: in recent years, governments have turned part of their focus to collecting, analyzing, and combining the data held in their systems, aiming to find patterns that can improve the services they provide. But what data does a public institution hold, and how can it be used in the best possible way?
To illustrate practical applicability, consider the data held by the Public Security Forces: documents, police reports, police operations, and vehicle records are some of the known stores. There is also data on the prison system and on the health conditions of everyone registered in it. With this institution's data alone, it is already possible to identify a myriad of patterns (the relationship between documents, vehicles, and thefts, for example). If this data were integrated into a single Data Lake, together with data from open sources (such as social networks), and a team of data scientists were responsible for implementing and applying algorithms dedicated to pattern discovery, this alone could change the impact of the entire police investigative system as it is known. Consider a prisoner's escape, for example. With the data interconnected, predictive analyses could be built to determine the detainee's health condition (and where he might need to go for medication, depending on the region), his connections, likely crimes based on previous ones, possible resting points (family and friends) or vehicles that could be used in the escape, and a percentage estimate of the actions he is likely to take based on his behavior. This technology would allow agents to focus their efforts on specific actions, avoid unnecessary travel, and provide a solid "starting point" for the individual's recapture. And this is just one of the possible applications.
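As a sketch of the kind of percentage estimate mentioned above, the example below trains a scikit-learn classifier on entirely synthetic data; the features, labels, and probabilities are invented for illustration and carry no real-world meaning.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic training data: each row is a past escape case described by hypothetical
# binary features (needs medication, has family nearby, prior vehicle theft).
X = rng.integers(0, 2, size=(200, 3))
# Synthetic label for the action taken (0 = hid with family, 1 = fled by vehicle).
y = rng.integers(0, 2, size=200)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# A new case: needs medication, has family nearby, no prior vehicle theft.
new_case = np.array([[1, 1, 0]])
probabilities = model.predict_proba(new_case)[0]
print(f"hid with family: {probabilities[0]:.0%}, fled by vehicle: {probabilities[1]:.0%}")

In practice the features would come from the integrated Data Lake rather than random numbers; the point is that the classifier's output is exactly the kind of probability that could guide where agents concentrate their efforts.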
In conclusion, data analysis technologies have the capacity to add great value to the public security service (and many others), allowing greater integration between agents and solid technologies that support their work, which increases efficiency, reduces risks, and saves public money. Those who think that Tom Cruise's Minority Report (in which police forces capture a criminal before he commits a crime) is pure fantasy fail to consider the patterns that can be found in a dataset about a person and the power of a regression or classification analysis. The police of the future are no longer a sci-fi plot, but a reality.
References
Alserafi, A., Calders, T., Abello, A., and Romero, O. (2017). DS-Prox: Dataset Proximity Mining for Governing the Data Lake. volume 8199, pages 284–299.
BBC News (2018). Can We Predict When and Where a Crime Will Take Place? URL: https://www.bbc.com/news/business-46017239.
Beyer, M. and Laney, D. (2012). The Importance of 'Big Data': A Definition. Gartner. URL: https://www.gartner.com/en/documents/2057415/the-importance-of-big-data-a-definition.
Carson, B. (2018). Old Unicorn, New Tricks: Airbnb has a Sky-High Valuation. Here’s its Audacious Plan to Earn It. Forbes. URL: https://www.forbes.com/sites/bizcarson/2018/10/03/old-unicorn-new-tricks-airbnb-has-a-sky-high-valuation-heres-its-audacious-plan-to-earn-it/?sh=78f39ee6fa30.
Chen, L. (2015). Uber wants to Conquer the World, but These Companies are Fighting Back. Forbes. URL: https://www.forbes.com/sites/liyanchen/2015/09/09/uber-wants-to-conquer-the-world-but-these-companies-are-fighting-back-map/?sh=436feeae4fe1.
Dhar, V. (2013). Data Science and Prediction. Commun. ACM, 56(12):64–73.
Koffman, A. (2018). Interpol Rolls Out International Voice Identification Database Using Samples from 192 Law Enforcement Agencies. The Intercept. URL: https://theintercept.com/2018/06/25/interpol-voice-identification-database/.
Rud, O. P. (2009). Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy. Wiley, 1st edition.