Data dredging

Data dredging is the unscrupulous search for 'statistically significant' relationships in large quantities of data. The activity was formerly known in the statistical community as data mining, but that term is now widely used with a substantially different meaning: the practice of automatically searching large stores of data for patterns, also known as knowledge discovery in databases (KDD). The term data dredging is therefore used instead.


Conventional statistical procedure is to formulate a research hypothesis (such as 'people in higher social classes live longer'), then collect relevant data, and then carry out a statistical significance test to see whether the results could be due to chance. A result is called significant if it would be unlikely to occur by chance were a presumed null hypothesis true.
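As an illustration of the final step, here is a minimal sketch of one possible significance test, a permutation test for a difference in means between two groups. The article does not prescribe any particular test; the function name, the parameters and the example figures are assumptions made for this sketch.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: the group labels are irrelevant, so reshuffling
    them should produce differences as large as the observed one
    fairly often.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical lifespans (years) for two social classes, invented for the sketch.
higher_class = [81, 79, 84, 77, 86, 80, 83, 78]
lower_class = [74, 76, 71, 79, 73, 75, 72, 77]
p = permutation_p_value(higher_class, lower_class)
print(f"p-value: {p:.4f}")
```

The hypothesis (the two groups differ in mean lifespan) is fixed before the test is run, which is what makes the resulting p-value interpretable.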


A key point is that the hypothesis must not be formulated as a result of seeing the data on which it will be tested. If you want to work this way, you have to collect a data set and partition it into two subsets, A and B. Subset A is held back and subset B is examined for interesting hypotheses. Once a hypothesis has been formulated, it can be tested on subset A, since subset A was not used to construct the hypothesis.
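The partition described above can be sketched as follows. This is a minimal illustration, assuming the records are independent and that a random 50/50 split is acceptable; the function name and the `explore_fraction` parameter are hypothetical.

```python
import random


def split_explore_confirm(records, explore_fraction=0.5, seed=0):
    """Randomly partition records into (subset_a, subset_b).

    subset_b is examined freely to generate hypotheses.
    subset_a is held back and used only once, to test the hypothesis
    that was formulated on subset_b.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    subset_b = shuffled[:cut]  # exploration set
    subset_a = shuffled[cut:]  # held-back confirmation set
    return subset_a, subset_b


records = list(range(100))  # stand-in for real observations
subset_a, subset_b = split_explore_confirm(records)
print(len(subset_a), len(subset_b))
```

The split has to be made before any exploration takes place. If subset A is inspected while hypotheses are still being formed, it no longer provides an independent test.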


Any large data set contains some chance features which will not be present in similar data sets, and simply to declare these as 'facts' is spurious. A typical example would be a TV marketing campaign intended to drive up sales. The campaign is run in one television area but not in another, which serves as a control group. Suppose that upon analysis it is found that sales in the treatment group are not significantly higher than in the control group. The analyst, fearful of telling the bad news to the sales director, analyses subgroups of the data and finds that sales did go up for left-handed Chinese males in the month of August, and that the result is 'statistically significant'. This is then reported to the sales director in an attempt to offset the overall bad news.


It is important to realise that the alleged statistical significance here is completely spurious: significance tests do not protect against data dredging. The hypothesis was chosen precisely because it appeared to hold in this data set, so the data set is not a representative test of it, and any resulting significance levels are meaningless.
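A small simulation can make the point concrete. The sketch below is an illustration only: it assumes normally distributed sales figures with a known standard deviation, and every name and parameter in it is hypothetical. The campaign has no effect at all in any subgroup, yet testing many subgroups at the 5% level still turns up 'significant' results.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(treated, control, sigma=1.0):
    """Two-sided z-test for equal means, assuming a known standard deviation."""
    se = sigma * (1 / len(treated) + 1 / len(control)) ** 0.5
    z = (mean(treated) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_spurious_hits(n_subgroups=200, n_per_arm=30, alpha=0.05, seed=0):
    """Count subgroups that look 'significant' when there is no real effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_subgroups):
        # Treatment and control are drawn from the same distribution,
        # so any apparent difference is due to chance alone.
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        if z_test_p_value(treated, control) < alpha:
            hits += 1
    return hits


hits = count_spurious_hits()
print(f"{hits} of 200 no-effect subgroups passed the 5% significance test")
```

On average about 5% of the subgroups pass the test by chance, so an analyst who searches enough subgroups will almost always find one to report.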




 

The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.