*Are you fluent in geek? Download the top terms in our handy Geek Glossary!*

**ACID Test:** A test applied to data for atomicity, consistency, isolation and durability.

**Aggregation:** A process of searching, gathering and presenting data.

**Algorithm:** A mathematical formula or statistical process used to perform analysis of data.

**Alpha Risk:** The maximum probability of making a Type I error. This probability is established by the experimenter and often set at 5%.

**Alternative Hypothesis (Ha):** Statement of a change or difference; assumed to be true if the null hypothesis is rejected.

**Anomaly Detection:** The process of identifying rare or unexpected items or events in a dataset that do not conform to other items in the dataset and do not match a projected pattern or expected behavior. Anomalies are also called outliers, exceptions, surprises or contaminants and they often provide critical and actionable information.

**Anonymization**: Making data anonymous; severing of links between people in a database and their records to prevent the discovery of the source of the records.

**ANOVA**: One-way ANOVA is a generalization of the 2-sample t-test, used to compare the means of more than two samples to each other.

**ANOVA Table**: The ANOVA table is the standard method of organizing the many calculations necessary for conducting an analysis of variance.

**API (Application Programming Interface)**: A set of programming standards and instructions for accessing or building software applications, which are often web-based.

**Application**: Software that enables a computer to perform a certain task.

**Artificial Intelligence**: The apparent ability of a machine to apply information gained from previous experience accurately to new situations in a way that a human would.

**Batch Processing**: Batch data processing is an efficient way of processing high volumes of data where a group of transactions is collected over a period of time. Hadoop is focused on batch data processing.

**Bayes Theorem**: A theorem based on conditional probabilities. It determines the probability of an event given relevant evidence, by combining prior knowledge of conditions that might be related to the event with the likelihood of observing that evidence.
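
As a rough illustration of the theorem, the sketch below computes the probability of a hypothesis given a piece of evidence. The function name `posterior` and the numbers (1% prior, 95% likelihood, 5% false-positive rate) are hypothetical and chosen only for the example.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(H | E) using Bayes' theorem.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of observing the evidence, P(E)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: the event is rare, the evidence is fairly reliable
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with fairly reliable evidence, the low prior keeps the resulting probability modest, which is the kind of correction the theorem is meant to provide.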

**Beta Risk**: The risk or probability of making a Type II error.

**Big Data**: Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

**Business Intelligence**: The general term used for the identification, extraction and analysis of data.

**Cassandra**: A popular open source database management system managed by The Apache Software Foundation. Designed to handle large amounts of distributed data across commodity servers while providing a highly available service. It is a NoSQL solution that was initially developed by Facebook.

**Classification Analysis**: A systematic process for obtaining important and relevant information about data (metadata) and assigning data to a particular group or class.

**Clickstream Analytics**: The analysis of users’ web activity through the items they click on a page.

**Cloud**: A broad term that refers to any internet-based application or service that is hosted remotely.

**Cloud Computing**: A distributed computing system hosted and running on remote servers and accessible from anywhere on the internet.

**Cluster Computing**: Computing using a ‘cluster’ of pooled resources of multiple servers. In more technical terms, this involves nodes, a cluster management layer, load balancing and parallel processing.

**Clustering Analysis**: The process of identifying objects that are similar to each other and clustering them in order to understand the differences as well as the similarities within the data.

**Coefficient of Variation**: Standard deviation normalized by the mean: σ/μ.
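
A minimal sketch of this ratio, using Python's standard `statistics` module. The helper name `coefficient_of_variation` and the sample data are assumptions made for the example, and the population standard deviation is used here.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.pstdev(values) / mean


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, population standard deviation is 2
cv = coefficient_of_variation(data)
print(cv)  # 0.4
```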

**Columnar Database or Column-oriented Database**: A database that stores data by column rather than by row. In a row-based database, a row might contain a name, address and phone number. In a column-oriented database, all names are in one column, addresses in another and so on. A key advantage of a columnar database is faster hard disk access.
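
The difference between the two layouts can be sketched with plain Python data structures. This is only an illustration of the idea, not how any particular database stores data, and the records are made up.

```python
# Row-oriented layout: each record keeps all of its fields together
rows = [
    {"name": "Ada", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Ben", "address": "2 Oak Ave", "phone": "555-0101"},
    {"name": "Cy", "address": "3 Elm Rd", "phone": "555-0102"},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented layout: all values of one field sit next to each other
columns = to_columns(rows)
names = columns["name"]  # reading one column never touches addresses or phones
print(names)  # ['Ada', 'Ben', 'Cy']
```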

**Comparative Analysis**: Data analysis that compares two or more data sets or processes to detect patterns within very large data sets.

**Confidence Interval**: A range of values which is likely to contain the population parameter of interest with a given level of confidence.
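
One common way to build such a range for a population mean is sketched below. It assumes a normal approximation (a z-value rather than a t-value), which is reasonable for larger samples; for small samples a t-distribution would usually be preferred. The function name and data are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, based on the sample standard deviation
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


measurements = [10, 12, 11, 13, 9, 11, 12, 10]
low, high = mean_confidence_interval(measurements)
print(round(low, 2), round(high, 2))
```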

**Continuous Data**: Data from a measurement scale that can be divided into finer and finer increments (e.g. temperature, time, pressure). Also known as variable data.

**Correlation Analysis**: A means to determine a statistical relationship between variables, often for the purpose of identifying predictive factors among the variables. A technique for quantifying the strength of the linear relationship between two variables.
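
The strength of a linear relationship is often quantified with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it from first principles; the function name and the data are assumptions for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
r_up = pearson_r(hours, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(hours, [10, 8, 6, 4, 2])  # perfectly decreasing line
print(r_up, r_down)
```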

**Dashboard**: A graphical representation of analyses performed by algorithms.

**Dark Data**: All the data that is gathered and processed by enterprises but not used for any meaningful purposes.

**Data**: A quantitative or qualitative value. Common types of data include sales figures, marketing research results, readings from monitoring equipment, user actions on a website, market growth projections, demographic information and customer lists.

**Data Aggregation**: The process of collecting data from multiple sources for the purpose of reporting or analysis.

**Data Analyst**: A person responsible for the tasks of modelling, preparing and cleaning data for the purpose of deriving actionable information from it.

**Data Analytics**: The process of examining large data sets to uncover hidden patterns, unknown correlations, trends, customer preferences and other useful business insights. The end result might be a report, an indication of status or an action taken automatically based on the information received. Businesses typically use the following types of analytics:

- *Behavioral Analytics*: Using data about people’s behavior to understand intent and predict future actions.
- *Descriptive Analytics*: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story: rather than listing every single number and detail, it conveys the general thrust and narrative.
- *Diagnostic Analytics*: Reviewing past performance to determine what happened and why. Businesses use this type of analytics to complete root cause analysis.
- *Predictive Analytics*: Using statistical functions on one or more data sets to predict trends or future events. In big data predictive analytics, data scientists may use advanced techniques like data mining, machine learning and advanced statistical processes to study recent and historical data and make predictions about the future. It can be used to forecast weather or to predict what people are likely to buy, visit or do, or how they may behave in the near future.
- *Prescriptive Analytics*: Builds on predictive analytics by recommending actions and supporting data-driven decisions based on the likely impact of each action.

**Data Architecture and Design**: How enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes: (1) conceptual representation of business entities, (2) the logical representation of the relationships among those entities and (3) the physical construction of the system to support the functionality.

**Data as a Service (DaaS)**: Treating data as a product. DaaS providers use cloud solutions to give customers on-demand access to data.

**Data Center**: A physical facility that houses a large number of servers and data storage devices. Data centers might belong to a single organization or sell their services to many organizations.

**Data Cleansing**: The process of reviewing and revising data to delete duplicate entries, correct misspelling and other errors, add missing data and provide consistency.

**Data Ethical Guidelines**: Guidelines that help organizations be transparent with the data, ensuring simplicity, security and privacy.

**Data Feed**: A means for a person to receive a stream of data such as a Twitter feed or RSS.

**Data Governance**: A set of processes or rules that ensure data integrity and that data management best practices are met.

**Data Integration**: The process of combining data from different sources and presenting it in a single view.

**Data Integrity**: The measure of trust an organization has in the accuracy, completeness, timeliness and validity of the data.

**Data Lake**: A large repository of enterprise-wide data in raw format. Supposedly data lakes make it easy to access enterprise-wide data. However, you really need to know what you are looking for and how to process it and make intelligent use of it.

**Data Mart**: The access layer of a data warehouse used to provide data to users.

**Data Warehouse**: A repository for enterprise-wide data but in a structured format after cleaning and integrating with other sources. Data warehouses are typically used for conventional data (but not exclusively).

**Data Mining**: Finding meaningful patterns and deriving insights in large sets of data using sophisticated pattern recognition techniques. To derive meaningful patterns, data miners use statistics, machine learning algorithms and artificial intelligence.

**Data Modelling**: A data model defines the structure of the data for the purpose of communicating between functional and technical people to show data needed for business processes, or for communicating a plan to develop how data is stored and accessed among application development team members.

**Data Scientist**: Someone who can make sense of big data by extracting raw data, massaging it and coming up with insights. The skills needed are statistics, computer science, creativity, story-telling and an understanding of business context.

**Data Science**: A discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning and database engineering to solve complex problems.

**Data Set**: A collection of data, very often in tabular form.

**Database**: A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system.

**Database Management System (DBMS)**: Software that collects and provides access to data in a structured format.

**Demographic Data**: Data relating to the characteristics of a human population.

**Discrete Data**: Data which is not measured on a continuous scale. Examples are binomial (pass/fail), counts per unit, ordinal (small/medium/large) and nominal (red/green/blue). Also known as attribute or categorical data.

**Discriminant Analysis**: A statistical analysis technique used to predict cluster membership from labelled data.

**Distributed File System**: A data storage system that stores large volumes of data across multiple storage devices, helping to decrease the cost and complexity of storing large amounts of data.

**Empirical Model**: An equation derived from the data that expresses a relationship between the inputs and an output (Y=f(x)).

**ETL (Extract, Transform and Load)**: The process of extracting raw data, transforming it by cleaning and enriching it to fit operational needs, and loading it into the appropriate repository for the system’s use. Although ETL originated with data warehouses, ETL processes are also used when ingesting data from external sources into big data systems.
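
The three stages can be sketched end to end with Python's standard library. Everything here is hypothetical: the inline CSV text stands in for an external source, the `sales` table stands in for the target repository, and an in-memory SQLite database is used only to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: stray whitespace and one duplicated row
RAW_CSV = """name,amount
 Alice ,10.5
Bob,3
Bob,3
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim names, convert amounts to numbers, drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        record = (row["name"].strip(), float(row["amount"]))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into the target table."""
    connection.execute("CREATE TABLE sales (name TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", records)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(row_count, total)  # 2 13.5
```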

**Event**: A set of outcomes of an experiment (a subset of the sample space) to which a probability is assigned.

**Exploratory Analysis**: An approach to data analysis focused on identifying general patterns in data, including outliers and features of the data that are not anticipated by the experimenter’s current knowledge or preconceptions. Exploratory data analysis (EDA) aims to uncover underlying structure, test assumptions, detect mistakes and understand relationships between variables.

**External Data**: Data that exists outside of a system.

**F-test**: A hypothesis test for comparing variances.

**Fit**: The average outcome predicted by a model.

**Grid Computing**: Connecting different computer systems from various locations, often via the cloud, to reach a common goal.

**Hadoop**: An open source software framework administered by Apache that allows for storage, retrieval and analysis of very large data sets across clusters of computers.

**High Performance Computing**: Using supercomputers to solve highly complex and advanced computing problems.

**Histograms**: Representation of frequency of values by intervals.
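
A minimal sketch of the counting behind a histogram, assuming equal-width intervals labelled by their lower edge. The function name and the data are made up for the example.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each interval [edge, edge + bin_width)."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


ages = [21, 25, 34, 38, 39, 42, 47, 58]
bins = histogram(ages, 10)
print(bins)  # {20: 2, 30: 3, 40: 2, 50: 1}
```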

**In-database Analytics**: The integration of data analytics into the data storage layer.

**In-memory Computing**: A technique of moving the working datasets entirely into a cluster’s collective memory and avoiding writing intermediate calculations to disk. This results in very fast processing, storing and loading of data.

**IoT (Internet of Things)**: The network of physical objects or “things” embedded with electronics, software, sensors and connectivity that enable them to achieve greater value and service by exchanging data with the manufacturer, operator and/or other connected devices. Each thing is uniquely identifiable through its embedded computing system but is able to interoperate within the existing Internet infrastructure.

**Juridical Data Compliance**: Use of data stored in a country must follow the laws of that country. Relevant when using cloud solutions with data stored in different countries or continents.

**Key Value Databases**: Databases that store each record under a primary key that uniquely identifies it, which makes lookups easy and computationally efficient.

**Latency**: Any delay in a response or delivery of data from one point to another.

**Load Balancing**: The process of distributing workload across a computer network or computer cluster to optimize performance.
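
Round-robin assignment is one simple distribution strategy among many. The sketch below is only an illustration of that strategy; the request and server names are hypothetical.

```python
from itertools import cycle


def assign_round_robin(requests, servers):
    """Hand each request to the next server in turn, wrapping around."""
    pool = cycle(servers)
    return {request: next(pool) for request in requests}


assignments = assign_round_robin(["r1", "r2", "r3", "r4", "r5"], ["node-a", "node-b"])
print(assignments)
# {'r1': 'node-a', 'r2': 'node-b', 'r3': 'node-a', 'r4': 'node-b', 'r5': 'node-a'}
```

Real load balancers typically also account for server health and current load, which this sketch ignores.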

**Location Analytics**: Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses. It allows you to associate geospatial information with datasets.

**Location Data**: GPS data describing a geographical location.

**Log File**: A file that a computer, network or application creates automatically to record events that occur during operation. For example, the time a file is accessed.

**Logistic Regression**: Investigates the relationship between response (Y’s) and one or more predictors (X’s) where Y’s are categorical, not continuous and X’s can be either continuous or categorical. Types of logistic regression are:

- *Binary Logistic Regression*: Y variable takes on one of two outcomes (levels), e.g. pass/fail, agree/disagree.
- *Ordinal Logistic Regression*: Y variable can have more than two levels. Levels are rank ordered, e.g. Low/Medium/High, 1-5 preference scale.
- *Nominal Logistic Regression*: Y variable can have more than two levels. There is no implied order to the levels, e.g. Blue/Yellow/Green, Company A/B/C/D.
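
For the binary case with a single predictor, a fitted model turns an X value into a probability through the logistic (sigmoid) function. The sketch below assumes the coefficients are already known; the intercept, slope and 0.5 cut-off are hypothetical values chosen for the example, not the result of fitting real data.

```python
import math


def predict_probability(x, intercept, slope):
    """Probability of the 'pass' outcome for predictor value x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def predict_label(x, intercept, slope, threshold=0.5):
    """Turn the probability into one of two levels."""
    return "pass" if predict_probability(x, intercept, slope) >= threshold else "fail"


# Hypothetical coefficients: probability reaches 0.5 at x = 4
p_at_4 = predict_probability(4, intercept=-4.0, slope=1.0)
p_at_6 = predict_probability(6, intercept=-4.0, slope=1.0)
print(p_at_4, round(p_at_6, 3))  # 0.5 0.881
print(predict_label(2, intercept=-4.0, slope=1.0))  # fail
```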

**Machine-generated Data:** Data automatically created by machines via sensors or algorithms or any other non-human source.

**Machine Learning**: A method of designing systems that can learn, adjust and improve based on the data fed to them. Using predictive and statistical algorithms that are fed to these machines, they learn and continually zero in on “correct” behavior and insights and they keep improving as more data flows through the system.

**MapReduce**: A programming model for processing and generating large data sets. This model does two distinct things. First, the “Map” includes turning one dataset into another, more useful and broken down dataset made of parts called tuples. Tuples may typically be processed independently from each other across multiple processors. Second, “Reduce” takes all of the broken down, processed tuples and combines their output into a usable result. The result is a practical breakdown of processing.
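
The two phases can be sketched on a single machine with a word count, the customary introductory example. This is an illustration of the programming model only, not of Hadoop's API; the function names and documents are hypothetical, and the `shuffle` step stands in for the grouping a framework performs between the two phases.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def map_phase(document):
    """Map: emit one (word, 1) tuple per word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(add, values) for key, values in groups.items()}


documents = ["big data big insights", "data beats opinions"]
# Each document could be mapped independently, e.g. on a different processor
mapped = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
# {'big': 2, 'data': 2, 'insights': 1, 'beats': 1, 'opinions': 1}
```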

**Massively Parallel Processing (MPP)**: Using many different processors (or computers) to perform certain computational tasks at the same time.

**Mean**: The arithmetic average of the data. The population mean is denoted by μ (Greek letter mu) and the sample mean is denoted by x̄.

**Median**: The middle value of a data set when arranged in order of magnitude.

**Metadata**: Data about data; it gives information about what the data is about. For example, where data points were collected.

**Mode**: The measurement that occurs most often in a data set.
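
The mode is often reported alongside the mean and median defined earlier in this glossary. A short sketch of all three, using Python's standard `statistics` module on made-up data:

```python
import statistics

values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # average of the two middle values: 3 and 5
mode = statistics.mode(values)      # the value that occurs most often

print(mean, median, mode)  # 5 4.0 3
```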

**Multi-dimensional Databases**: Databases optimized for online analytical processing (OLAP) applications and for data warehousing.

**Naïve Bayes**: A classification technique based on Bayes Theorem with an assumption of independence among predictors. In simple terms, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

**Network Analysis**: Analyzing connections between nodes in a network and the strength of their ties.

**Neural Network**: Models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

**Normal Distribution**: The most important continuous probability distribution in statistics is the normal distribution (a.k.a. Gaussian distribution). The normal distribution is the familiar bell curve. Once the mean μ and standard deviation σ are specified, the entire curve is determined.

**NoSQL (Not ONLY SQL)**: A broad class of database management systems identified by non-adherence to the widely used relational database management system model. NoSQL databases are not built primarily on tables and generally do not use SQL for data manipulation. They are designed to handle large volumes of data and are often well-suited for big data systems because of the flexibility and distributed-first architecture that large unstructured databases need.

**Null Hypothesis (H0)**: Statement of no change or difference; assumed to be true until sufficient evidence is presented to reject it.

**One Sample T-test**: Statistical test to compare the mean of one sample of data to a target. Uses the t-distribution.

**Operational Databases**: Databases that carry out regular operations of an organization that are generally very important to the business. They typically use online transaction processing that allows them to enter, collect and retrieve specific information about the organization.

**Optimization Analysis**: The process of finding optimal problem parameters subject to constraints. Optimization algorithms heuristically test a large number of parameter configurations in order to find an optimal result, determined by a characteristic function (also called a fitness function).

**Outlier Detection**: The process of identifying objects that deviate significantly from the general average within a dataset or a combination of data. An outlier is numerically distant from the rest of the data; it indicates that something unusual has occurred and generally requires additional analysis.

**Paired T-test**: A test used to compare the average difference between two samples of data that are linked in pairs. Special case of the 1-sample t-test. Uses the t-distribution.
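
Because the paired test is a 1-sample t-test applied to the within-pair differences (against a target of 0), its test statistic can be sketched in a few lines. The function name and the before/after measurements are hypothetical; turning the statistic into a p-value would additionally require the t-distribution with n - 1 degrees of freedom, which is not shown here.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """t statistic for the mean of the paired differences against 0."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = statistics.fmean(diffs)
    # Standard error of the mean difference
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


before = [10, 12, 9, 11]
after = [12, 15, 10, 14]  # differences: 2, 3, 1, 3
t = paired_t_statistic(before, after)
print(round(t, 2))  # 4.7
```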

**Pattern Recognition**: Identifying patterns in data via algorithms to make predictions about new data coming from the same source.

**Pig**: A data flow language and execution framework for parallel computation.

**Population**: A dataset that consists of all the members of some group. Descriptive parameters (such as μ, σ) are used to describe the population.

**Power (1-beta)**: The ability of a statistical test to detect a real difference when there is one; the probability of correctly rejecting the null hypothesis. Determined by alpha, sample size, the size of the difference to be detected and the variability of the data.

**Predictive Modelling**: The process of developing a model that will most likely predict a trend or outcome.

**Probability**: The likelihood of a given event’s occurrence, which is expressed as a number between 0 and 1.

**Probability Distribution**: A statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. Probability distributions may be discrete or continuous.

**Public Data**: Public information or data sets that were created with public funding.

**Query**: Asking for information to answer a certain question.

**R**: An open source programming language used for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques and is highly extensible. It is one of the most popular languages in data science.

**Range**: Difference between the largest and smallest measurement in a data set.

**Real Time Data**: Data that is created, processed, stored, analyzed and visualized within milliseconds.

**Regression Analysis**: A modelling technique used to define the association between variables. It assumes a one-way causal effect from predictor variables (independent variables) to a response of another variable (dependent variable). Regression can be used to explain the past and predict future events.

**Residual**: The difference between reality (an actual measurement) and the fit (model output).
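
A small sketch of the calculation, assuming a hypothetical model y = 2x + 1 and made-up measurements:

```python
def fit(x):
    """Hypothetical model output for input x."""
    return 2 * x + 1


def residuals(inputs, actual):
    """Actual measurement minus the model's fit, for each observation."""
    return [y - fit(x) for x, y in zip(inputs, actual)]


xs = [1, 2, 3]
measured = [3.2, 4.9, 7.1]
errors = residuals(xs, measured)
print([round(e, 2) for e in errors])  # [0.2, -0.1, 0.1]
```

A positive residual means the model under-predicted the measurement; a negative one means it over-predicted.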

**Sample**: A data set which consists of only a portion of the members from some population. Sample statistics are used to draw inferences about the entire population from the measurements of a sample.

**Scalability**: The ability of a system or process to maintain acceptable performance levels as workload or scope increases.

**Semi-structured Data**: Data that is not structured by a formal data model, but provides other means of describing the data hierarchies (tags or other markers).

**Sentiment Analysis**: The application of statistical functions and probability theory to comments people make on the web or social networks to determine how they feel about a product, service or company.

**Significant Difference**: The term used to describe the results of a statistical hypothesis test where a difference is too large to be reasonably attributed to chance.

**Single-variance Test (Chi-square Test)**: Compares the variance of one sample of data to a target. Uses the Chi-square distribution.

**Software as a Service (SaaS)**: Enables vendors to host an application and make it available via the internet (cloud servicing). SaaS providers provide services over the cloud rather than hard copies.

**Spark (Apache Spark)**: A fast, in-memory open source data processing engine to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. Spark is generally a lot faster than MapReduce.

**Spatial Analysis**: Analyzing spatial data such as geographic data or topological data to identify and understand patterns and regularities within data distributed in a geographic space.

**SQL (Structured Query Language)**: A programming language for retrieving data from a relational database.

**Standard Deviation**: The positive square root of the variance:

- Population: σ
- Sample: s

**Stream Processing**: Stream processing is designed to act on real-time and streaming data with “continuous” queries. Combined with streaming analytics (i.e. the ability to continuously calculate mathematical or statistical analytics on the fly within the stream), stream processing solutions are designed to handle high volumes in real time.

**Structured Data**: Data that is organized according to a predetermined structure.

**Sum of Squares**: In ANOVA, the total sum of squares helps express the total variation that can be attributed to various factors. From the ANOVA table, %SS is the Sum of Squares of the Factor divided by the Sum of Squares Total. Similar to R² in regression.

**Terabyte**: 1024 gigabytes. A terabyte can store approximately 300 hours of high-definition video.

**Test for Equal Variance (F-test)**: Compares the variance of two samples of data against each other. Uses the F distribution.

**Test Statistic**: A standardized value (Z, t, F, etc.) which represents the likelihood of H0 and is distributed in a known manner such that the probability for this value can be determined.

**Text Analytics**: The application of statistical, linguistic and machine learning techniques on text-based data-sources to derive meaning or insight.

**Time Series Analysis**: Analysis of well-defined data measured at repeated points in time to identify time-based patterns.

**Topological Data Analysis**: Analysis techniques focusing on the theoretical shape of complex data with the intent of identifying clusters and other statistically significant trends that may be present.

**Transactional Data**: Data that relates to the conducting of business, such as accounts payable and receivable data or product shipments data.

**Two Sample t-test**: A statistical test to compare the means of two samples of data against each other. Uses the t-distribution.

**Type I Error**: The error that occurs when the null hypothesis is rejected when, in fact, it is true.

**Type II Error**: The error that occurs when the null hypothesis is not rejected when it is, in fact, false.

**Unstructured Data**: Data that has no identifiable structure, such as email message text, social media posts, audio files (recorded human speech, music), etc.

**Variance**: The average squared deviation for all values from the mean:

- Population: σ²
- Sample: s²
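
The population and sample versions differ only in the divisor: N for a population, N - 1 for a sample. The sketch below uses Python's standard `statistics` module on made-up data to show both, together with the standard deviation as the square root of each.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

pop_variance = statistics.pvariance(data)    # 32 / 8
sample_variance = statistics.variance(data)  # 32 / 7

pop_std = statistics.pstdev(data)    # square root of the population variance
sample_std = statistics.stdev(data)  # square root of the sample variance

print(pop_variance, pop_std)  # population variance 4, standard deviation 2.0
print(round(sample_variance, 3), round(sample_std, 3))  # 4.571 2.138
```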

**Variety**: The different types of data available to collect and analyze in addition to the structured data found in a typical database. Categories include machine generated data, computer log data, textual social media information, multimedia social and other information.

**Velocity**: The speed at which data is acquired and used. Not only are companies and organizations collecting more and more data at a faster rate, they want to derive meaning from that data as soon as possible, often in real time.

**Veracity**: Ensuring that data used in analytics is correct and precise.

**Visualization**: A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively. Visuals created are usually complex, but understandable in order to convey the message of data.
