Monday, April 25, 2016
Python setting error
IntelliJ PyCharm error
PyCharm: Py_Initialize: can't initialize sys standard streams
Try this: File -> Settings -> Editor -> File Encodings, then change the Project Encoding to UTF-8.
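A quick way to check, from the interpreter PyCharm is running, which encodings are actually in effect (a minimal sketch using only the standard library; the printed values depend on your system):

```python
import locale
import sys

# If any of these report a non-UTF-8 codec (e.g. ascii or cp949), the
# "can't initialize sys standard streams" failure becomes more likely.
print("default encoding :", sys.getdefaultencoding())
print("stdout encoding  :", sys.stdout.encoding)
print("preferred locale :", locale.getpreferredencoding())

# A possible workaround outside the IDE settings is the PYTHONIOENCODING
# environment variable (e.g. PYTHONIOENCODING=utf-8 in the run configuration),
# which forces the encoding of the standard streams.
```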
Monday, April 11, 2016
Top 10 Essential Books for the Data Enthusiast
http://www.kdnuggets.com/2016/04/top-10-essential-books-data-enthusiast.html
A unique top 10 list of book recommendations, for each of 10 categories this list provides a top paid and top free book recommendation. If you're interested in books on data, this diverse list of top picks should be right up your alley.
By Matthew Mayo, KDnuggets.
The true data enthusiast has a lot to read about: big data, machine learning, data science, data mining, etc. Besides these technology domains, there are also specific implementations and languages to consider and keep up on: Hadoop, Spark, Python, and R, to name a few, not to mention the myriad tools for automating the various aspects of our professional lives which seem to pop up on a daily basis. There are a lot of topics to keep abreast of. Fortunately (unfortunately?) there is no shortage of books available on all of these subjects.
There are a lot of lists available of the top books in particular categories related to data. In fact, KDnuggets has previously, and rather recently, put together such lists on data mining, databases & big data, statistics, AI & machine learning, and neural networks. But these were based on Amazon top sellers in narrow categories, without editorial discretion or consideration for freely-available content and e-books.
First off, let's get this out of the way: the title of this post is misleading. This inclusive list of essential books for the data enthusiast (or practitioner) recommends a top paid and free resource in each of 10 categories. Let's face it: though we may work or be otherwise directly involved in a limited number of data avenues, we generally tend to have an understanding of a greater number of these avenues, as both a practical matter and one of interest.
So, while a Hadoop expert may not need expert-level insight into deep learning, chances are that they have a more-than-passing interest in the subject. This post is a chance to solidify these interests and provide material suggestions for the data enthusiast looking to widen their knowledge base.
Editor's note: It is important to point out that KDnuggets receives no incentive, financial or otherwise, related to any of these recommendations, nor does it take part in any affiliate sales programs. These recommendations are made solely in the interest of our readers.
Keep in mind that there may be overlap in many of these categories, which is inevitable (see: The Data Science Puzzle, Explained). Often the focus of the material determines its categorization, as opposed to simply the material itself.
Data Science
Top Paid Recommendation: Data Science for Business
When trying to learn about a new field, one of the most common difficulties is to find books (and other materials) that have the right "depth". All too often one ends up with either a friendly but largely useless book that oversimplifies or a heavy academic tome that, though authoritative and comprehensive, is condemned to sit gathering dust on one's shelves. "Data Science for Business" gets it just right.
Top Free Recommendation: The Art of Data Science
This book describes the process of analyzing data in simple and general terms. The authors have extensive experience both managing data analysts and conducting their own data analyses, and this book is a distillation of their experience in a format that is applicable to both practitioners and managers in data science. - Official Website
Big Data
Top Paid Recommendation: Big Data: Principles and Best Practices of Scalable Realtime Data Systems
I have rarely seen a thorough discussion of the importance of data modelling, data layers, data processing requirements analysis, and data architecture and storage implementation issues (along with other "traditional" database concepts) in the context of big data. This book delivers a refreshingly comprehensive solution to that deficiency.
Top Free Recommendation: Big Data Now: 2015 Edition
In the four years that O’Reilly has produced its annual Big Data Now report, the data field has grown from infancy into young adulthood. Data is now a leader in some fields and a driver of innovation in others, and companies that use data and analytics to drive decision-making are outperforming their peers. - Official Website
Apache Hadoop
Top Paid Recommendation: Hadoop: The Definitive Guide
I appreciate that this book covers high-level concepts as well as dives deep into the technical details that you will need to know for the design, implementation and day-to-day running of Hadoop and its various associated technologies.
Top Free Recommendation: Hadoop Explained
Hadoop is one of the most important technologies in a world that is built on data. Find out how it has developed and progressed to address the continuing challenge of Big Data with this insightful guide. - Official Website
Apache Spark
Top Paid Recommendation: Learning Spark
The information that is available on the Internet is great, but this book brings much of it together in one place. If you want to learn to think like a Spark programmer--*not* the same as thinking like a programmer--this is the place to begin.
Top Free Recommendation: Mastering Apache Spark
This collection of notes (what some may rashly call a "book") serves as the ultimate place of mine to collect all the nuts and bolts of using Apache Spark. The notes aim to help me design and develop better products with Spark. - Official Website
Theoretical Machine Learning
Top Paid Recommendation: Pattern Recognition and Machine Learning
The author is an expert, as evidenced by the excellent insights he gives into the complex math behind machine learning algorithms. I have worked for quite some time with neural networks and have had coursework in linear algebra, probability, and regression analysis, and still found some of the material in the book quite illuminating.
Top Free Recommendation: Elements of Statistical Learning
The good news is, this is pretty much the most important book you are going to read in the space. It will tie everything together for you in a way that I haven't seen any other book attempt.
Practical Machine Learning
Top Paid Recommendation: Python Machine Learning
This is a fantastic book, even for a relative beginner to machine learning such as myself. The first thing that comes to mind after reading this book is that it was the perfect blend (for me at least) of theory and practice, as well as breadth and depth.
Top Free Recommendation: An Introduction to Statistical Learning with Applications in R
This book provides an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students, and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real-life settings, and should be a valuable resource for a practicing data scientist. - Official Website
Deep Learning
As the selection of paid deep learning books is slim at the moment, here are a pair of free selections.
Top Free Recommendation #1: Neural Networks and Deep Learning
Neural Networks and Deep Learning is a free online book. The book will teach you about:
- Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data
- Deep learning, a powerful set of techniques for learning in neural networks
- Official Website
Top Free Recommendation #2: Deep Learning
The in-preparation, likely to-be definitive deep learning book of the near future, written by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. The development version is updated monthly, and will be freely available until publication.
Data Mining
Top Paid Recommendation: Data Mining: Concepts and Techniques, Third Edition
Data Mining is a comprehensive overview of the field, and I think it is best for a graduate class in data mining, or perhaps as a reference book. The book's focus is on technique (i.e., how to analyze data, including preparation), and it addresses all the major topics in the field including data storage and pre-processing. However, the book is really about classification methods, and the 2 chapters on cluster analysis are particularly strong and thorough.
Top Free Recommendation: Mining of Massive Datasets
The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references. - Official Website
SQL
Top Paid Recommendation: Learning SQL, Second Edition
If you're writing any type of database driven code and you think that you don't need to understand SQL, read this book. You do need to understand it, and this book teaches it very well.
Top Free Recommendation: Learn SQL The Hard Way
This book will teach you the 80% of SQL you probably need to use it effectively, and will mix in concepts in data modeling at the same time. If you've been fumbling around building web, desktop, or mobile applications because you don't know SQL, then this book is for you. It is written for people with no prior database, programming, or SQL knowledge, but knowing at least one programming language will help. - Official Website
Statistics for Data Science
Top Paid Recommendation: Statistics in Plain English, Third Edition
I work as a Data Analyst and deal with statistics on a daily basis. I am expected to know all the models and algorithms. Although statistical software does everything for me, figuring out the numbers the software chews out becomes the tricky part. I majored in Biotechnology and was alien to these statistics for the major part of my life. Long story short, I required a solid foundation guide that would help me get acclimatized to the concepts.
Top Free Recommendation: Think Stats: Probability and Statistics for Programmers, Second Edition
Think Stats emphasizes simple techniques you can use to explore real data sets and answer interesting questions. The book presents a case study using data from the National Institutes of Health. Readers are encouraged to work on a project with real datasets. - Official Website
21 Must-Know Data Science Interview Questions and Answers
http://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html
Q1. Explain what regularization is and why it is useful.
Answer by Matthew Mayo.
Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. (See also KDnuggets posts on Overfitting.)
This is most often done by adding a constant multiple to an existing weight vector. This constant is often either the L1 norm (Lasso) or the L2 norm (ridge), but can in actuality be any norm. The model predictions should then minimize the mean of the loss function calculated on the regularized training set.
Xavier Amatriain presents a good comparison of L1 and L2 regularization here, for those interested.
Fig 1: Lp ball: As the value of p decreases, the size of the corresponding Lp space also decreases.
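As a concrete illustration of the L1/L2 idea above, here is a minimal sketch using scikit-learn's Lasso and Ridge on synthetic data; the data, alpha values, and comparison are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first 5 of 20 features carry signal, and the
# sample is small, so an unregularized fit tends to overfit the noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, 4.0]
y = X @ true_w + rng.normal(scale=0.5, size=50)

for name, model in [
    ("OLS (no penalty)      ", LinearRegression()),
    ("Ridge (L2, alpha=1.0) ", Ridge(alpha=1.0)),  # shrinks all weights toward zero
    ("Lasso (L1, alpha=0.1) ", Lasso(alpha=0.1)),  # drives some weights exactly to zero
]:
    model.fit(X, y)
    print(name, "sum of |weights| =", round(float(np.abs(model.coef_).sum()), 2))
```

The regularized fits report a smaller total weight magnitude than plain OLS, which is exactly the "smoothness" the answer describes.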
Q2. Which data scientists do you admire most? which startups?
Answer by Gregory Piatetsky:
This question does not have a correct answer, but here is my personal list of 12 Data Scientists I most admire, not in any particular order.
Geoff Hinton, Yann LeCun, and Yoshua Bengio - for persevering with Neural Nets and starting the current Deep Learning revolution.
Demis Hassabis, for his amazing work on DeepMind, which achieved human or superhuman performance on Atari games and recently Go.
Jake Porway from DataKind and Rayid Ghani from U. Chicago/DSSG, for enabling data science contributions to social good.
DJ Patil, First US Chief Data Scientist, for using Data Science to make US government work better.
Kirk D. Borne for his influence and leadership on social media.
Claudia Perlich for brilliant work on ad ecosystem and serving as a great KDD-2014 chair.
Hilary Mason for great work at Bitly and inspiring others as a Big Data Rock Star.
Usama Fayyad, for showing leadership and setting high goals for KDD and Data Science, which helped inspire me and many thousands of others to do their best.
Hadley Wickham, for his fantastic work on Data Science and Data Visualization in R, including dplyr, ggplot2, and RStudio.
There are too many excellent startups in the Data Science area, but I will not list them here to avoid a conflict of interest.
Here is some of our previous coverage of startups.
Q3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
Answer by Matthew Mayo.
Proposed methods for model validation:
- If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy.
- If the values seem to be reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or observed inconsistency when the model is fed new data.
- Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a model validity measure.
- Use data splitting to form a separate dataset for estimating model parameters, and another for validating predictions (a minimal sketch of this approach follows the list).
- Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
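A minimal sketch of the data-splitting and R²/MSE checks above, using scikit-learn on synthetic data; the dataset, split size, and model are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative regression data; in practice use the real outcome and predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=200)

# Hold out data that the model never sees during estimation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("held-out R^2 :", round(r2_score(y_test, pred), 3))
print("held-out MSE :", round(mean_squared_error(y_test, pred), 3))

# Cross-validation gives a more stable estimate than a single split.
print("5-fold CV R^2:", round(cross_val_score(model, X, y, cv=5, scoring="r2").mean(), 3))
```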
Q4. Explain what precision and recall are. How do they relate to the ROC curve?
Answer by Gregory Piatetsky:
Here is the answer from KDnuggets FAQ: Precision and Recall:
Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:
- TN / True Negative: case was negative and predicted negative
- TP / True Positive: case was positive and predicted positive
- FN / False Negative: case was positive but predicted negative
- FP / False Positive: case was negative but predicted positive
Makes sense so far? Now you count how many of the 10,000 cases fall in each bucket, say:
|                | Predicted Negative | Predicted Positive |
|----------------|--------------------|--------------------|
| Negative Cases | TN: 9,760          | FP: 140            |
| Positive Cases | FN: 40             | TP: 60             |
Now, your boss asks you three questions:
- What percent of your predictions were correct? You answer: the "accuracy" was (9,760 + 60) out of 10,000 = 98.2%.
- What percent of the positive cases did you catch? You answer: the "recall" was 60 out of 100 = 60%.
- What percent of positive predictions were correct? You answer: the "precision" was 60 out of 200 = 30%.
See also a very good explanation of Precision and recall in Wikipedia.
Fig 4: Precision and Recall.
The ROC curve represents the relation between sensitivity (recall) and specificity (not precision): it plots the true positive rate against the false positive rate, and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance. See also this Quora answer: What is the difference between a ROC curve and a precision-recall curve?
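The worked numbers above (TP = 60, FP = 140, FN = 40, TN = 9,760) can be reproduced in a few lines; the label arrays below are constructed purely to match those counts:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Label/prediction arrays built to match the worked example:
# TP = 60, FP = 140, FN = 40, TN = 9,760 (10,000 cases total).
y_true = np.array([1] * 60 + [0] * 140 + [1] * 40 + [0] * 9760)
y_pred = np.array([1] * 60 + [1] * 140 + [0] * 40 + [0] * 9760)

print("accuracy :", accuracy_score(y_true, y_pred))   # (9760 + 60) / 10000 = 0.982
print("recall   :", recall_score(y_true, y_pred))     # 60 / (60 + 40)      = 0.60
print("precision:", precision_score(y_true, y_pred))  # 60 / (60 + 140)     = 0.30
```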
Q5. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
Answer by Anmol Rajpurohit.
Often it is observed that in the pursuit of rapid innovation (aka "quick fame"), the principles of scientific methodology are violated, leading to misleading innovations, i.e. appealing insights that are confirmed without rigorous validation. One such scenario: given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement.
An obvious human urge is to announce these ideas ASAP and ask for their implementation. When asked for supporting data, often limited results are shared, which are very likely to be impacted by selection bias (known or unknown) or a misleading global minimum (due to lack of appropriate variety in test data).
Data scientists do not let their human emotions overrun their logical reasoning. While the exact approach to prove that one improvement you've brought to an algorithm is really an improvement over not doing anything would depend on the actual case at hand, there are a few common guidelines:
- Ensure that there is no selection bias in test data used for performance comparison
- Ensure that the test data has sufficient variety in order to be symbolic of real-life data (helps avoid overfitting)
- Ensure that "controlled experiment" principles are followed i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same while running original algorithm and new algorithm
- Ensure that the results are repeatable, with near-identical results across runs
- Examine whether the results reflect local maxima/minima or global maxima/minima
One common way to achieve the above guidelines is through A/B testing, where both versions of the algorithm are kept running in a similar environment for a considerably long time and real-life input data is randomly split between the two. This approach is particularly common in Web Analytics.
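A minimal sketch of how the outcome of such an A/B comparison might be checked for statistical significance, here using a two-proportion z-test from statsmodels; the success counts and traffic sizes are made-up numbers, not from the original answer:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results after running both variants on randomly split traffic:
# successes (e.g. correct outcomes or conversions) and trials per variant.
successes = [460, 530]     # A = original algorithm, B = modified algorithm
trials = [10000, 10000]

z_stat, p_value = proportions_ztest(count=successes, nobs=trials)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the difference is unlikely to be an
# artifact of this particular random split; repeat runs to confirm it holds.
```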
Q6. What is root cause analysis?
Answer by Gregory Piatetsky:
According to Wikipedia,
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event's outcome, but is not a root cause.
Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas, such as healthcare, project management, or software testing.
Here is a useful Root Cause Analysis Toolkit from the state of Minnesota.
Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question, "Why?", until you find the root of the problem. This technique is commonly called "5 Whys", although it can involve more or fewer than 5 questions.
Fig. 5 Whys Analysis Example, from The Art of Root Cause Analysis.
Q7. Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
Answer by Gregory Piatetsky:
Those are economics terms that are not frequently asked of Data Scientists but they are useful to know.
Price optimization is the use of mathematical tools to determine how customers will respond to different prices for a company's products and services through different channels.
Big Data and data mining enable the use of personalization for price optimization. Companies like Amazon can now take optimization even further and show different prices to different visitors, based on their history, although there is a strong debate about whether this is fair.
Price elasticity in common usage typically refers to
- Price elasticity of demand, a measure of price sensitivity. It is computed as:
Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price.
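Applying the formula above to two hypothetical (price, quantity) observations; all numbers are illustrative:

```python
def price_elasticity_of_demand(p0, q0, p1, q1):
    """Elasticity = % change in quantity demanded / % change in price."""
    pct_change_q = (q1 - q0) / q0
    pct_change_p = (p1 - p0) / p0
    return pct_change_q / pct_change_p

# Hypothetical example: raising the price from $10 to $11 (a 10% increase)
# drops weekly demand from 1,000 to 850 units (a 15% decrease).
e = price_elasticity_of_demand(p0=10.0, q0=1000, p1=11.0, q1=850)
print(f"elasticity = {e:.2f}")  # -1.5 => demand is elastic (|e| > 1)
```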
Similarly, Price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price.
Inventory management is the overseeing and controlling of the ordering, storage, and use of components that a company will use in producing the items it sells, as well as the overseeing and controlling of quantities of finished products for sale.
Wikipedia defines competitive intelligence as the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization.
Tools like Google Trends, Alexa, and Compete can be used to determine general trends and analyze your competitors on the web.
Here are useful resources:
- Competitive Intelligence Metrics, Reports by Avinash Kaushik
- 37 Best Marketing Tools to Spy on Your Competitors from Kissmetrics
- 10 best competitive intelligence tools from 10 experts
Q8. What is statistical power?
Answer by Gregory Piatetsky:
Wikipedia defines the statistical power (or sensitivity) of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.
To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).
Here are some tools to calculate statistical power.
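A minimal sketch of a power calculation with statsmodels' TTestIndPower; the effect size, alpha, and target power below are illustrative choices:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
# with alpha = 0.05 and 80% power (i.e. a 20% chance of a Type II error).
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required n per group: {n_per_group:.1f}")

# Conversely, the power achieved with 50 observations per group.
achieved = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)
print(f"power with n=50: {achieved:.2f}")
```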
Q9. Explain what resampling methods are and why they are useful. Also explain their limitations.
Answer by Gregory Piatetsky:
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology based upon repeated sampling within the same sample.
Resampling refers to methods for doing one of the following:
- Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
- Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
- Validating models by using random subsets (bootstrapping, cross validation)
See more in Wikipedia about bootstrapping, jackknifing.
See also How to Check Hypotheses with Bootstrap and Apache Spark
Here is a good overview of Resampling Statistics.
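A minimal bootstrap sketch using only NumPy, estimating a confidence interval for the median of a synthetic sample; the data and the number of resamples are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)  # synthetic skewed sample

# Bootstrap: repeatedly resample with replacement from the same sample
# and recompute the statistic of interest (here, the median).
n_boot = 5000
medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])

lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"sample median = {np.median(data):.3f}")
print(f"95% bootstrap CI for the median: ({lo:.3f}, {hi:.3f})")
# Limitation: the bootstrap only "sees" values present in the original sample,
# so it can be unreliable for very small samples or extreme-tail statistics.
```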
Q10. Is it better to have too many false positives, or too many false negatives? Explain.
Answer by Devendra Desale.
It depends on the question as well as on the domain for which we are trying to solve the question.
In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent, when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So, in this setting it is preferable to have too many false positives.
For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. So, here we prefer too many false negatives over too many false positives.
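One way to make this trade-off concrete is to vary a classifier's decision threshold and watch false positives and false negatives move in opposite directions; the sketch below uses synthetic scores and is purely illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic classifier scores: higher score = more likely "positive" (e.g. spam).
rng = np.random.default_rng(7)
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.normal(0.3, 0.15, 900), rng.normal(0.7, 0.15, 100)])

# Lowering the threshold catches more positives (fewer FN, more FP);
# raising it does the opposite. The "right" threshold depends on which
# error is costlier in the domain (medicine vs. spam filtering).
for threshold in (0.4, 0.5, 0.6):
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold:.1f}  FP={fp:4d}  FN={fn:3d}")
```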
Q11. What is selection bias, why is it important and how can you avoid it?
Answer by Matthew Mayo.
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.
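As a small illustration of the weighting idea mentioned above, here is a sketch that computes "balanced" class weights for a sample with the 60/20/15/5 split from the example; the features and estimator are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# A sample with the 60/20/15/5 class split from the example above.
y = np.array([0] * 60 + [1] * 20 + [2] * 15 + [3] * 5)
X = np.random.default_rng(3).normal(size=(len(y), 4))  # dummy features

# "Balanced" weights up-weight rare classes so the fit is not dominated by
# whichever class happened to be over-represented in the sample.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y).tolist(), np.round(weights, 2))))

# The same idea applied directly inside an estimator:
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```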