Age and Earnings
Cubic Approximation
Credit Score
Patient Classification
Distribution Fitting
Wine Quality
Credit Wizard
Median Height
Clean CSV
Birthday Cards
Free Throws
Bacterial Growth
Python Data Science
The Python programming language and its libraries offer a great deal of functionality that is useful to data scientists. Libraries such as NumPy, pandas, and SciPy are valuable tools for data scientists who use Python.
Grouping is the process of separating items into different groups. Developers and data scientists often need to group data so that they can examine each group separately.
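As a minimal sketch of grouping, the snippet below collects values into one list per key using only the standard library. The records and the helper name group_by_key are hypothetical, invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (city, age) pairs invented for illustration.
records = [("Paris", 31), ("Oslo", 45), ("Paris", 27), ("Oslo", 52), ("Rome", 38)]

def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

ages_by_city = group_by_key(records)
```

Each group can then be examined on its own, for example by computing a statistic per city.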
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy is an essential library for any data scientist who works with Python.
General Data Science
Pandas is a library for the Python programming language that’s used for data manipulation and analysis. It is an essential library for any data scientist who works with Python.
When an organization needs to discover the information hidden in vast amounts of data, or to make smarter decisions and deliver better products, data scientists hold the key to the answers it needs.
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring within a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event. As one of the most widely used distributions, it is important for all data scientists to be familiar with it.
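The probability described above can be sketched with the standard Poisson formula, P(k) = rate^k * e^(-rate) / k!. The function name poisson_pmf and the example rate are assumptions chosen for illustration.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count per interval is `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

# Hypothetical example: an average of 2 events per interval.
prob_no_events = poisson_pmf(0, 2.0)
prob_three_events = poisson_pmf(3, 2.0)
```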
Probability theory is the foundation of most statistical and machine-learning algorithms.
Linear regression is one of the most frequently used methods for data analysis due to its simplicity and applicability to a wide variety of problems.
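For a single predictor, a least-squares line can be sketched in a few lines of plain Python. The function name fit_line and the age and earnings figures are hypothetical; real analyses would more likely use a library such as scikit-learn or SciPy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ages and yearly earnings (in thousands), invented for illustration.
ages = [25, 30, 35, 40, 45]
earnings = [30.0, 36.0, 41.0, 47.0, 52.0]
slope, intercept = fit_line(ages, earnings)
predicted_at_50 = slope * 50 + intercept
```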
Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It’s important for all tasks where it’s infeasible to construct conventional algorithms, which is often the case in Data Science.
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. Since many problems are not linear, nonlinear regression is important for machine learning practitioners.
Scikit-learn (or sklearn) is a machine learning library for the Python programming language. Every data scientist who works with Python and tasks such as classification, regression, and clustering algorithms should know how to use it.
Classification is the problem of identifying to which set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known. As one of the common tasks in machine learning, it’s important for all data scientists.
An important data science algorithm, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. For classification, the output is the class most common among those neighbors; for regression, it is the average of their values.
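A minimal sketch of the classification case follows, assuming Euclidean distance and a simple majority vote. The function name knn_classify and the training points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points closest to `query`.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-cluster training set, invented for illustration.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label_near_origin = knn_classify(train, (0.2, 0.2))
```

For regression, the same neighbor search would be followed by averaging the neighbors' numeric values instead of voting.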
A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate against the false positive rate at each threshold setting. It is useful for selecting candidate models and discarding suboptimal ones before a decision threshold is fixed.
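The construction of the curve can be sketched as follows, assuming labels are 1 for positive and 0 for negative and that a score at or above the threshold counts as a positive prediction. The function name roc_points and the scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Thresholds are the distinct scores, visited from strictest to loosest.
    Assumes both classes are present in `labels`.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true labels, invented for illustration.
curve = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```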
In a binary classification problem, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of independent yes/no experiments, each of which yields success with the same given probability.
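The definition above corresponds to the formula P(k) = C(n, k) * p^k * (1 - p)^(n - k), sketched here with the standard library. The function name binomial_pmf and the free-throw figures are hypothetical.

```python
import math

def binomial_pmf(trials, successes, p):
    """Probability of exactly `successes` successes in `trials` independent attempts."""
    return math.comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)

# Hypothetical example: a player who makes 70% of free throws takes 5 shots.
prob_exactly_four = binomial_pmf(5, 4, 0.7)
```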
An important concept, the p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, assuming that the null hypothesis is true.
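As an illustrative sketch, a one-sided p-value can be computed exactly for a coin-flip experiment, where "more extreme" is taken to mean "at least as many successes as observed". The function name binomial_p_value and the example counts are assumptions.

```python
import math

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        math.comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical observation: 9 heads in 10 flips, null hypothesis "the coin is fair".
p_value = binomial_p_value(9, 10)
```

A small p-value here says that 9 or more heads would be rare if the coin were fair; it is not the probability that the null hypothesis is true.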
The Cauchy distribution is the distribution of the ratio of two independent, zero-mean, normally distributed random variables. It is important for all data scientists to be familiar with it.
The exponential distribution is the probability distribution that describes the time between events in a process in which events occur continuously and independently at a constant average rate. As one of the most widely used distributions, it is important for all data scientists to be familiar with it.
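The waiting-time interpretation can be sketched with the cumulative distribution function 1 - e^(-rate * t) and the standard library's sampler. The function name exponential_cdf and the rate of 2 events per minute are hypothetical.

```python
import math
import random

def exponential_cdf(t, rate):
    """P(waiting time <= t) when events arrive at `rate` per unit time."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

# Hypothetical rate: 2 events per minute, so the mean wait is 0.5 minutes.
rate = 2.0
prob_within_one_minute = exponential_cdf(1.0, rate)
sample_wait = random.expovariate(rate)  # one simulated waiting time
```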
The normal distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. As one of the most widely used distributions, it is important for all data scientists to be familiar with it.
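Python's standard library includes statistics.NormalDist, which is enough for a quick sketch. The mean of 170 and standard deviation of 10 below are hypothetical height figures chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical model: heights in cm with mean 170 and standard deviation 10.
heights = NormalDist(mu=170.0, sigma=10.0)

share_below_180 = heights.cdf(180.0)
share_within_one_sigma = heights.cdf(180.0) - heights.cdf(160.0)
```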
SciPy is a Python library used for scientific and technical computing. Every data scientist who uses Python as a programming language should know how to use it for tasks such as optimization, linear algebra, integration, etc.
Correlation is any statistical relationship, whether causal or not, between two random variables or two sets of data. As one of the fundamentals of data science, correlation is an important concept for all data scientists to be familiar with.
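One common measure of linear correlation is the Pearson coefficient, sketched below in plain Python. The function name pearson and the sample values are hypothetical; other measures (such as rank correlation) exist and are not shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])     # perfectly increasing
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly decreasing
```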
Multicollinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. As such, it’s important for all data scientists to check for collinear variables when looking at individual predictor variables in multiple regression models.
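One way to check for collinear predictors is the variance inflation factor (VIF), which is not named in the text above and is brought in here as an illustration. With exactly two predictors it reduces to 1 / (1 - r^2), where r is their Pearson correlation. The function names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors.

    Undefined (division by zero) if the predictors are perfectly collinear.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors, invented for illustration.
vif = vif_two_predictors([1, 2, 3, 4], [2, 1, 4, 3])
```

A VIF near 1 suggests little collinearity; larger values indicate that one predictor is largely explained by the other.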
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences. It is one way to display an algorithm that contains only conditional control statements, and every data scientist should know it.
Data cleaning or data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records. Data scientists should be familiar with it to avoid incorrect records that can affect analysis.
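A minimal sketch of detecting and removing bad records is shown below. The validity rules (non-empty name, integer age between 0 and 120), the function name clean_records, and the sample rows are all assumptions made for illustration.

```python
def clean_records(records):
    """Keep (name, age) records with a non-empty name and an integer age in 0..120."""
    cleaned = []
    for name, raw_age in records:
        name = name.strip()
        try:
            age = int(raw_age)
        except (TypeError, ValueError):
            continue  # age is missing or not a number: drop the record
        if name and 0 <= age <= 120:
            cleaned.append((name, age))
    return cleaned

# Hypothetical raw rows, invented for illustration.
raw = [(" Ana ", "34"), ("Ben", "n/a"), ("", "20"), ("Cy", "-5"), ("Di", None)]
cleaned = clean_records(raw)
```

Whether to drop or repair a bad record depends on the analysis; this sketch only drops.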
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. Processing CSV files is a common task when working with tabular data.
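Reading such a file can be sketched with Python's built-in csv module. The in-memory sample text and its column names are hypothetical; a real file would be opened with open(path, newline="") instead of io.StringIO.

```python
import csv
import io

# Hypothetical CSV content, invented for illustration.
sample = "name,height_cm\nAna,170\nBen,182\n"

# DictReader uses the first line as field names and yields one dict per record.
rows = list(csv.DictReader(io.StringIO(sample)))
heights = [int(row["height_cm"]) for row in rows]
```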
Every programmer should be familiar with data-sorting methods, as sorting is very common in data-analysis processes.
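In Python, the built-in sorted function with a key argument covers many everyday cases. The records below are hypothetical.

```python
# Hypothetical (name, age) records, invented for illustration.
people = [("Ana", 34), ("Ben", 27), ("Cy", 34), ("Di", 19)]

# Sort by age, youngest first; sorted() is stable, so ties keep their input order.
by_age = sorted(people, key=lambda person: person[1])

# Sort by age descending, breaking ties alphabetically by name.
oldest_first = sorted(people, key=lambda person: (-person[1], person[0]))
```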
Data aggregation is the process of gathering and summarizing information in a specified form. It is a common component of most statistical analysis processes.
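A small sketch of aggregation follows: values are gathered per key and then summarized as a count, total, and mean. The function name summarize and the sales figures are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarize(pairs):
    """Summarize (key, value) pairs as count, total, and mean per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {"count": len(values), "total": sum(values), "mean": mean(values)}
        for key, values in groups.items()
    }

# Hypothetical sales by region, invented for illustration.
sales = [("north", 120.0), ("south", 80.0), ("north", 100.0), ("south", 40.0)]
summary = summarize(sales)
```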
Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points. Every data scientist should know the basics of it.
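As one illustrative case, an exponential curve y = a * exp(b * x) can be fitted by taking logarithms and fitting a straight line to log(y). This linearized approach is a simplification, not a general nonlinear least-squares fit, and it requires every y to be positive. The function name fit_exponential and the data are hypothetical.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on log(y); requires every y > 0."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b

# Hypothetical growth measurements generated from a = 3, b = 0.5 without noise.
hours = [0, 1, 2, 3, 4]
counts = [3.0 * math.exp(0.5 * h) for h in hours]
a, b = fit_exponential(hours, counts)
```

With noisy data, the log transform weights small values more heavily, so a direct nonlinear fit (for example with SciPy) may give different parameters.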