Balancing Security, Information Integrity, and UX

Atomicity, Consistency, Isolation, and Durability
Database management has many nuances that can make the difference between an efficient database and an unstructured data repository. One of the main purposes of indexing is to reduce the number of steps necessary to find a desired piece of information within a database. In one example, a database manager optimized an index and reduced the number of steps for a query from 300,000 to 30 (Wenzel, 2020). This kind of drastic improvement is just one example of how indexing improves query efficiency in large databases.
Another reason indexes are useful is that they make query results more accurate. When an index has been adequately defined and its subgroup trees properly articulated, a query will yield more accurate results than the same query against a database that has not been properly indexed. In the example using a deck of cards, creating two levels of subgroups to define different categories reduces the number of card flips needed to find a specific card from 26 to 9 (Wenzel, 2020).
The B+ tree is the standard structure for subdividing a database to establish indexed categories (Wenzel, 2020). By organizing keys into a node structure that points to pages of indexed data, the B+ tree makes it possible to drastically reduce the time and effort associated with a database query (Wenzel, 2020).
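As a rough illustration of this effect, the sketch below (which assumes a hypothetical customers table in SQLite, whose indexes are B-tree variants) creates an index and uses EXPLAIN QUERY PLAN to show the query switching from a full table scan to an index search.

```python
# Minimal sketch, assuming a hypothetical "customers" table in SQLite.
# SQLite implements its indexes as B-tree variants, so this illustrates
# trading a full table scan for an index search.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany(
    "INSERT INTO customers (last_name) VALUES (?)",
    [(f"name_{i}",) for i in range(100_000)],
)

query = "SELECT id FROM customers WHERE last_name = 'name_42'"

# Without an index: SQLite reports a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# With an index on the searched column: the plan switches to an index search.
conn.execute("CREATE INDEX idx_customers_last_name ON customers (last_name)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```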
While database indexes make queries more efficient, there is a point of diminishing returns on how deeply the B+ tree should be indexed. In most databases, administrators will not index every single column and table, as that would consume an inordinate amount of time and maintenance overhead. The database manager will usually try to find the level of indexing that optimizes query speed without spending too much time defining levels within the database. Every database is unique in that its data will be used in a context specific to its users, so each database should be indexed for those users and their needs rather than arbitrarily indexed down to every column, table, and cell.
Web DBMS
The Web presents a number of benefits and drawbacks for operating a DBMS. Since HTTP is a stateless protocol, it retains no memory of previous requests, which keeps the resources allocated to memory small (Begg, 2014). This presents a hurdle for the way DBMS protocols generally need to operate. Additionally, the open, dynamic nature of the Web makes a DBMS inherently more vulnerable to malicious actors than a DBMS operating within an intranet or local area network. While that openness is a risk, it also makes databases more accessible to users than exclusive access points.
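To make the statelessness hurdle concrete, here is a minimal sketch of a Web endpoint backed by a database; the handler class, header name, and database file are hypothetical, and the point is that the server reconstructs any "session" context from scratch on every request.

```python
# Minimal sketch: HTTP keeps no memory between requests, so any "session"
# with the DBMS must be rebuilt on every call, e.g. from a client-supplied
# token. The endpoint and token scheme here are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
import sqlite3

class StatelessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # No server-side session survives between requests; the client must
        # resend its session token every time (here via a custom header).
        token = self.headers.get("X-Session-Token", "anonymous")

        # A fresh connection per request: the DBMS sees each call in isolation.
        conn = sqlite3.connect("app.db")
        conn.execute("CREATE TABLE IF NOT EXISTS hits (token TEXT)")
        conn.execute("INSERT INTO hits (token) VALUES (?)", (token,))
        count = conn.execute(
            "SELECT COUNT(*) FROM hits WHERE token = ?", (token,)
        ).fetchone()[0]
        conn.commit()
        conn.close()

        body = f"token={token} requests_seen={count}\n".encode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StatelessHandler).serve_forever()
```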
Another benefit of operating a DBMS over the Web is that its lightweight protocols prevent unnecessary bloat in hardware usage. The nature of Web programming forces DBMS protocols to work around the restrictions of the Web, and the by-product is a more agile product for the user without increased hardware demands. A further benefit of a Web DBMS is interoperability between languages: it is not uncommon for HTML to be used in tandem with several other languages to create a DBMS, and this interoperability creates more dynamic capabilities within the limitations of the Web.
One of the downsides of a Web DBMS is reliance on the Internet (Begg, 2014). If a DBMS is only accessible via the Web and Internet access is disconnected, the DBMS is inoperative. Even though the proliferation of the smartphone has made Internet access more ubiquitous, over-reliance on a Web DBMS may still be a risk too dangerous for most businesses to take.
While it is not practical for a modern business to operate on a completely closed network, it is beneficial to delineate demilitarized zones and protected internal zones so that sensitive data does not end up in a Web DBMS unnecessarily. Since a Web DBMS can be leveraged for less sensitive data, it stands to reason that an enterprise would use a Web DBMS for low-level operations to cut overhead while still maintaining a closed network to protect sensitive data. This combination captures the benefits of both approaches while mitigating the drawbacks of each.
There are specific instances in which eXtensible Markup Language (XML) will be more useful for the functionality of a build than Hypertext Markup Language (HTML). If users need object persistence, in which the data produced by or for an object is stored with the object, XML is likely the better choice. Further, the needs of the users will determine whether XML is used as a native XML database (NXD) or whether a Relational Database Management System (RDBMS) is used as a layer that stores versions of the data so the data is not destroyed when it is modified (Westphal, 1999).
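As a minimal sketch of what object persistence with XML can look like (the Order class and file name are hypothetical, using only the standard-library ElementTree API), an object can write its state out as an XML document and be rebuilt from it later.

```python
# Minimal sketch of XML-based object persistence. The Order class and file
# name are hypothetical; only the standard-library ElementTree API is used.
import xml.etree.ElementTree as ET

class Order:
    def __init__(self, order_id, item, quantity):
        self.order_id = order_id
        self.item = item
        self.quantity = quantity

    def to_xml(self):
        # Serialize the object's state as an XML element.
        elem = ET.Element("order", attrib={"id": str(self.order_id)})
        ET.SubElement(elem, "item").text = self.item
        ET.SubElement(elem, "quantity").text = str(self.quantity)
        return elem

    @classmethod
    def from_xml(cls, elem):
        # Rebuild the object from its persisted XML representation.
        return cls(
            int(elem.get("id")),
            elem.find("item").text,
            int(elem.find("quantity").text),
        )

# Persist and restore: the XML document is the stored form of the object.
order = Order(1, "widget", 3)
ET.ElementTree(order.to_xml()).write("order.xml")
restored = Order.from_xml(ET.parse("order.xml").getroot())
print(restored.item, restored.quantity)
```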
While the difference may seem trivial, there is a significant functional difference between the RDBMS and NXD approaches. With an NXD, the data represented in the application-layer UI is the same data that sits in the XML data layer. The RDBMS approach differs in that at least one other language is used to modify and represent the data in the XML databases, and a developer might use several languages to achieve the desired functionality for a project. While XML is sufficient to display data and achieve basic functionality, an RDBMS will be far more effective at executing complex transactions or analyses that could not be achieved in the XML layer alone.
The pros and cons of each approach depend on the functional needs of the users and the capacity of the developers to apply each approach for the users' benefit. The NXD approach gives users more streamlined access to their data and will most likely use fewer computing resources to access the database. The RDBMS approach, on the other hand, is beneficial when users need more analysis or transactional relationships between data points without destroying or modifying the data in the database. An NXD is not well suited to analysis or transactions, and an RDBMS is not as effective at object persistence, so developers have to weigh the drawbacks of each against the needs of the users.
Business Intelligence
It is useful to articulate the differences between Business Intelligence (BI), Data Mining, Data Warehousing, and Online Analytical Processing (OLAP), and where these practices overlap. While BI draws on aspects of all of these practices in tandem to produce a holistic view of operations, mining and warehousing are essential pieces of the infrastructure that must exist to produce a useful OLAP report. Further, all three must be present together to provide accurate data to a Decision Support System (DSS).
To understand how these elements work together, it is helpful to understand the problems they solve. Enterprise-scale organizations with employees stationed around the globe need consistent data repositories across branches to reduce redundant work, and they need that data integrity just to keep daily operations possible. Beyond the need for data integrity, the required connectivity between branches becomes a major risk for a potential security breach (Zhuo, 2019). Because of these seemingly contradictory needs, it becomes necessary to compartmentalize the flow and analysis of data within an organization, making data management more accessible while simultaneously protecting sensitive data within the organizational flow (Zhuo, 2019).
In this context, the needs of a person in the organization who analyzes data about customers or products will differ from the needs of a manager analyzing the data flow between employees to improve efficiency (Begg, 2014). In establishing a unified BI interface, many organizations will likely need independent access to the OLAP, data warehouse, and DSS application layers to operate at enterprise scale. The Human Resources department, which may rely heavily on data warehousing, will have functionally different needs than the Marketing department, which may need more OLAP or data mining access rather than access to the warehouses themselves.
Additionally, a DSS will need high-level access to all of these elements to give decision-makers the proper information they would need to make an informed decision. While decisions about OLAP data could be made without referencing the data warehouse, the potential impacts to operations make it necessary to have all relevant information accessible to ensure all avoidable risks are mitigated.
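As a small, hypothetical illustration of the kind of OLAP-style roll-up such reports rely on, the sketch below pivots a toy fact table into a summary of sales by region and quarter; the data and column names are invented for the example.

```python
# Minimal sketch of an OLAP-style roll-up using pandas. The data and column
# names (region, quarter, sales) are hypothetical stand-ins for warehouse data.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1",   "Q2",   "Q1",   "Q2",   "Q1",   "Q2"],
    "sales":   [100,    120,    90,     110,    80,     95],
})

# Pivot the fact table into a summary slice of the cube: regions by quarters.
report = sales.pivot_table(values="sales", index="region",
                           columns="quarter", aggfunc="sum")
print(report)
```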
Schemas and Schemaless Databases
The discussion of NoSQL inherently addresses the differences between relational and schemaless databases. While both approaches to storing and analyzing data have their places, it is useful to establish the contexts in which a schemaless database would be more appropriate than a relational one.
One way to articulate the benefits of a schemaless database is to look at the drawbacks of a relational database and ask whether the schemaless approach offers a superior function. One of the most widely known drawbacks of the relational approach is the format conversion required as data moves between its in-memory representation and the database (Fowler, 2012). That approach is useful when there is time to develop a schema and move data between states for proper analysis, but it is less appropriate when data moves quickly and must be analyzed against other data sets more often. While a schemaless database may not be the best approach for financial services, it can be much more beneficial in an environment of overlapping development teams where the database has to keep up with an iterative, scrum-style workflow.
In a NoSQL deployment meant to be used by multiple departments on a single development project, interaction with the data would need to go through a single application layer to prevent data inconsistency. The administrators of the database would also need to articulate operating protocols so that redundant data does not create inconsistencies or ruin data integrity. This is one of the biggest weaknesses of schemaless databases: data integrity can quickly break down if the database is accessed from multiple application interfaces (Fowler, 2012). Because multiple application interfaces are prone to conflict, developers sometimes opt to use a single web service as the middle layer between the NoSQL database and the applications, as sketched below. This approach gives users the flexibility of the NoSQL database with less likelihood of data corruption due to incompatible application layers.
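A minimal sketch of that single-middle-layer idea follows; the in-memory dictionary stands in for a schemaless document store, and the class and method names are hypothetical.

```python
# Minimal sketch of a single access layer in front of a schemaless store.
# The in-memory dict stands in for a document database; class and method
# names are hypothetical.
import json
import threading

class DocumentService:
    """The one interface every application is expected to go through."""

    def __init__(self):
        self._store = {}          # document_id -> JSON string
        self._lock = threading.Lock()

    def put(self, document_id, document):
        # Centralized validation keeps inconsistent writes from different
        # applications out of the store.
        if "type" not in document:
            raise ValueError("every document must declare a 'type'")
        with self._lock:
            self._store[document_id] = json.dumps(document)

    def get(self, document_id):
        with self._lock:
            raw = self._store.get(document_id)
        return json.loads(raw) if raw is not None else None

service = DocumentService()
service.put("customer:1", {"type": "customer", "name": "Ada", "orders": []})
print(service.get("customer:1"))
```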
Comparing a distributed database management system (DDBMS) to a centralized one reveals clear benefits and trade-offs. One framework that describes the limitations of a DDBMS is the CAP theorem, which states that of the three properties of consistency, availability, and partition tolerance, a DDBMS can guarantee only two at any one time (Sadat, 2018).
Partition tolerance is an integral property of databases deployed in the wild, because it means the system keeps operating even when communication between nodes breaks down, so a malfunctioning node does not take the rest of the network with it (Sadat, 2018). In a system that preserves partition tolerance but sacrifices availability, the database stays consistent by proxy, because nodes that cannot coordinate stop serving requests instead of returning potentially stale data (Sadat, 2018).
Centralized vs. Decentralized Concurrency Control
While centralized databases have their benefits, one of the superior aspects of distributed databases is near-100% uptime, compared to centralized data centers that frequently have to sever access to complete system updates (Sadat, 2018). Additionally, because decentralized systems scale horizontally more easily than centralized systems, more powerful processing becomes accessible to users without raising the financial barrier to entry.
On the other hand, users might need a distributed database that is highly available and also partition tolerant. In that case, concurrent input may cause conflicting or confusing output that does not reflect the chronological order of record entry, but the records will remain highly available to users (Sadat, 2018).
Considering the CAP theorem, a distributed database approach might be more suited for applications like collaborative work sandboxes. This may come in the form of a spreadsheet or bulletin board. These types of information exchanges that allow input, output, and distributed access to records give the opportunity for many different types of teams to collaborate in this type of digital space. This is in comparison to other types of global transactions that need to be more centralized such as financial records. Financial records may track exchanges between multiple parties around the world, but the databases need to be maintained and monitored in such a way that makes centralized DBMS the more appropriate approach.
Performing a nested transaction on an optimistically concurrent database versus a conservatively concurrent database carries different benefits and drawbacks, and understanding them requires contextualizing the nuances of each type of concurrency control. Under optimistic concurrency control, a transaction is assumed not to disrupt the integrity of the database and is serialized even if multiple transactions are happening at the same time (Block, 2018). By contrast, a conservative concurrency control approach time-stamps each transaction to ensure that no conflicts occur as the data is serialized.
The optimistic approach is much faster when dealing with large volumes of transactions, and for an individual nested transaction it updates the final database faster than a conservative approach would. On the other hand, the conservative approach is likely to yield more accurate throughput, with fewer cases in which the database must be rolled back to a previous state. While that means more accurate databases, it is not ideal for databases that must store constantly changing data for large volumes of users.
While the optimistic approach is better suited to high-frequency, high-volume transaction scenarios, the assumption that incoming data will not conflict with the state of the database makes it more vulnerable to spam attacks, in which malicious transactions deliberately trigger rollbacks until the database is stuck in a perpetual rollback loop. The context of the data being stored should determine which approach is more appropriate. For example, a conservative concurrency approach would likely be more appropriate for a database handling financial transactions, whereas an optimistic approach would suit an application layer such as the comments on a social media site, where the database is updated frequently with many simultaneous entries and users cannot create a dangerous state change that would trigger a rollback.
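The sketch below illustrates the optimistic side of that trade-off with a hypothetical record type: reads proceed without locking, and a version check at commit time detects conflicts that would force a rollback and retry (a conservative scheme would instead lock or timestamp-order the transaction up front).

```python
# Minimal sketch of optimistic concurrency control. The record layout and
# retry policy are hypothetical; the point is the version check at commit.
import threading

class OptimisticRecord:
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()   # only guards the commit itself

    def read(self):
        # Optimistic: read freely, remember which version was seen.
        return self.value, self.version

    def commit(self, new_value, expected_version):
        # Validate at commit time: if another writer got there first, the
        # caller must roll back and retry rather than overwrite newer data.
        with self._lock:
            if self.version != expected_version:
                return False            # conflict detected -> rollback/retry
            self.value = new_value
            self.version += 1
            return True

record = OptimisticRecord(value=10)
value, version = record.read()
# Simulate a concurrent writer committing first.
record.commit(new_value=99, expected_version=version)
# Our commit now fails validation and would have to be retried.
print(record.commit(new_value=value + 1, expected_version=version))  # False
```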
The Two-Phase Commit (2PC) protocol and the Three-Phase Commit (3PC) protocol are two approaches to committing and validating transactions in a distributed database. 2PC was designed as a method for preparing distributed databases to apply new transaction information (Atif, 2009). However, the original approach had a design flaw that could lead to a system-wide resource freeze, known as blocking (Atif, 2009). In 2PC, if a node malfunctions or a transaction is compromised so that the participating nodes are left in an uncertain state, the participants lock their resources until the next message is received from the coordinating node (Atif, 2009).
Another problem with 2PC is the state inconsistency that can occur if a rollback takes place while one of the nodes is in the prepared, ready-to-commit state and then fails (Atif, 2009). A failure at this point can leave the system showing both the commit and abort commands, creating an inconsistent global state (Atif, 2009).
3PC was introduced to address these issues by adding a new state known as the pre-commit (Atif, 2009). The pre-commit state adds another layer of consistency: the distributed databases all agree on pre-commit readiness before moving to the commit-writing phase. Relative to 2PC, 3PC also gives more opportunities for a transaction to be terminated. Because 2PC only allows committing or aborting, there is really only one opportunity for a transaction to be actively terminated without trying to cause a node failure and a system-wide rollback. 3PC, on the other hand, provides the chance to terminate a transaction in the pre-commit state so that the network never moves to commit it.
Even after the pre-commit state has been passed, there would theoretically still be another opportunity to terminate a transaction by causing a system-wide rollback. A system-wide rollback is not optimal, but a complete power failure raises exactly that scenario. In 2PC, if there are no active commits, a power failure may not cause problems; however, any abort or commit commands queued before the outage may cause a blocking problem on start-up, and the data that was in transit may be lost forever. The pre-commit state in 3PC should, in theory, make it more resilient to blocking after a power outage, with all nodes likely rolling back if any were in a pre-commit state before the outage.
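To ground the terminology, here is a minimal sketch of the coordinator's decision logic in 2PC, with participants simulated as plain objects; message passing, timeouts, and recovery are omitted, and 3PC would insert an extra pre-commit round between the vote and the final commit.

```python
# Minimal sketch of two-phase commit from the coordinator's point of view.
# Participants are simulated with plain objects; networking, timeouts, and
# recovery are deliberately omitted.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "INIT"

    def prepare(self):
        # Phase 1: the participant votes and then holds its locks while it
        # waits for the coordinator's decision (this wait is where 2PC can
        # block if the coordinator fails).
        self.state = "READY" if self.will_vote_yes else "ABORTED"
        return self.will_vote_yes

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"

def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every participant voted yes, otherwise abort.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMITTED"
    for p in participants:
        p.abort()
    return "ABORTED"

nodes = [Participant("db-1"), Participant("db-2", will_vote_yes=False)]
print(two_phase_commit(nodes), [p.state for p in nodes])
```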
Aggregate Modeling
The aggregate model treats the aggregate as a single data-capturing unit. The data that defines the aggregate is treated as the values and, together with their keys, as key-value pairs (Fowler, 2012). An example would be an aggregate representing a customer and all of that customer's orders (Fowler, 2012). Because the data is inherently clustered, the aggregate's internal relationships establish which data points can be referenced, and this becomes problematic when attempting to establish connections between data that lives in two separate aggregates. The aggregate model works well for examining data within an aggregate, but it requires modification when higher-level relationships need to be examined.
The column-family aggregate establishes a group of columns called a "column family," which is then broken down by row identifiers that give each row a specific data value (Fowler, 2012). A key-value database lets a user store data under a key, while a document database lets users store documents in the database with no restrictions on structure (Fowler, 2012).
The aggregate model also provides a way to store metadata that defines relationships between aggregates, as in key-value stores such as Riak (Fowler, 2012). This higher-level metadata capability lets aggregate-oriented databases retain the atomicity of aggregates without preventing the user from establishing relationships between separate aggregates.
There are benefits and drawbacks to the aggregate approach. One of the most significant benefits is the capacity to store large clusters of closely related data: as mentioned earlier, it benefits a merchant to keep all of a customer's orders in a single aggregate while separating different customers into different aggregates. Further, a store like Riak gives a merchant the ability to analyze customers' purchases or preferences across aggregates via the metadata layer. On the other hand, this approach may slow relationship analysis between aggregates compared to a relational database designed explicitly for relationship analysis.
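A small, hypothetical example of such an aggregate is sketched below: one customer and all of that customer's orders are stored and retrieved as a single document, with metadata links pointing to other aggregates rather than embedding them.

```python
# Minimal sketch of an aggregate as a single document: one customer and all
# of that customer's orders travel together. Field names are hypothetical.
import json

customer_aggregate = {
    "id": "customer:42",
    "name": "Ada Lovelace",
    "orders": [
        {"order_id": 1, "items": ["keyboard", "mouse"], "total": 75.00},
        {"order_id": 2, "items": ["monitor"], "total": 199.99},
    ],
    # Metadata links point to *other* aggregates without embedding them,
    # in the spirit of the link metadata a key-value store like Riak keeps.
    "links": [{"relation": "referred_by", "target": "customer:7"}],
}

# The whole aggregate is read and written as one unit.
stored = json.dumps(customer_aggregate)
loaded = json.loads(stored)
print(sum(order["total"] for order in loaded["orders"]))
```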
Machine Learning
Supervised and unsupervised machine learning (ML) both have their places in data analysis. Supervised ML is best suited to situations in which the desired outcome has a set of metrics already known for the subject of research (Lieber et al., 2013). An example would be predicting the likelihood of events like weather or traffic patterns. If a researcher has pre-existing data that the ML program can analyze for patterns, the algorithm can produce predictive analyses that form the most accurate basis for a forecast.
Unsupervised ML is best suited to situations in which there are no known patterns to observe; the algorithm discovers whatever important non-random structure exists. Unsupervised methods such as visualization, clustering, outlier detection, and dimension reduction are often used as the first line of analysis when mining complex data sets. This type of ML is more appropriate for user behavior analysis or for exploring a data lake for emergent patterns.
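The sketch below contrasts the two modes on a synthetic data set using scikit-learn; the data and model choices are illustrative only, with a classifier trained on known labels and a clustering algorithm left to find structure on its own.

```python
# Minimal sketch contrasting supervised and unsupervised learning with
# scikit-learn on toy data. The dataset and models are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))

# Supervised: labels are known up front, so the model learns to predict them.
labels = (features[:, 0] + features[:, 1] > 0).astype(int)
classifier = LogisticRegression().fit(features, labels)
print("predicted class:", classifier.predict([[0.5, 0.5]])[0])

# Unsupervised: no labels are given; the algorithm looks for structure itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print("cluster sizes:", np.bincount(clusters))
```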
In analyzing these ML approaches, there are straightforward applications for each strategy in which one type has benefits over the other. One of the issues related to unsupervised ML is an algorithm’s potential to bloat and create an imbalanced load on resources. If an unsupervised ML algorithm is not capped somehow, it can potentially overload resources to stall a query or completely crash a system. Conversely, a supervised ML algorithm could have its capacity to discover data impeded by its restricted parameters.
Between these two significant drawbacks, there is potential to use supervised and unsupervised ML in tandem, making supervised ML more robust while reining in unsupervised ML's use of resources. Additionally, the availability of ML cloud services makes it much more reasonable to use both. Cloud services give organizations the capacity to use these ML platforms without expanding infrastructure specifically for the purpose, so unless an organization is developing an application that needs on-board ML, cloud ML platforms will likely be the most viable option.
When balancing security and usability, there is a fine line between securing a system and making the user experience more difficult. The problem with crossing that line is that users often respond by working around the cumbersome security measures. If users feel too inconvenienced by security measures and work around them, the security team has unintentionally made users less secure, since the goal is to make a system usable in the most secure manner possible. If usability is impeded, the security measures are a problem regardless of how secure the platform is in theory.
References:
Atif, M. (2009, October). Analysis and verification of two-phase commit & three-phase commit protocols. In 2009 International Conference on Emerging Technologies (pp. 326–331). IEEE.
Begg, T.C. C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management. [Colorado Technical University (CTU)]. Retrieved from https://coloradotech.vitalsource.com/#/books/9781323135761/
Block, S. (2018, June 10). Conservative concurrency vs. optimistic concurrency control. Retrieved from https://shannonblock.org/conservative-concurrency-control-vs-optimistic-concurrency-control/
Ezhilchelvan, P., Aldweesh, A., & van Moorsel, A. (2018, June). Non-blocking two phase commit using blockchain. In Proceedings of the 1st Workshop on Cryptocurrencies and Blockchains for Distributed Systems (pp. 36–41). Retrieved from https://doi-org.proxy.cecybrary.com/10.1145/3211933.3211940
Fowler, P.J.S. M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. [Colorado Technical University (CTU)]. Retrieved from https://coloradotech.vitalsource.com/#/books/9781323137376/
Lieber, D., Stolpe, M., Konrad, B., Deuse, J., & Morik, K. (2013). Quality prediction in interlinked manufacturing processes based on supervised & unsupervised machine learning. Procedia CIRP, 7, 193–198. https://doi.org/10.1016/j.procir.2013.05.033
Reimer, M. (1983, October). Solving the Phantom Problem by Predicative Optimistic Concurrency Control. In VLDB (Vol. 83, pp. 81–88).
Sadat, S. (2018, April 24). CAP theorem and distributed database management systems. Medium. Retrieved from https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
Block, S. (2018, July 7). Comparison of persistence mechanisms (XML, RDBMS, NXD). Retrieved October 8, 2020, from https://shannonblock.biz/?p=62
Wenzel, K. (2020). Database indexes explained. Retrieved from https://www.essentialsql.com/what-is-a-database-index/
Westphal, R. (1999, September 08). Using XML for Object Persistence. Retrieved October 08, 2020, from https://www.xml.com/pub/a/1999/09/serialization/index.html
Zhuo, C. (2019). An approach for developing platform of OLAP. DEStech Transactions on Computer Science and Engineering, (aicae).