A serious shortcoming of this index of inter-rater reliability is that it does not account for agreement expected by chance and therefore overestimates the level of agreement. This is the main reason why percentage of agreement should not be used for scientific work (e.g., doctoral theses or scientific publications). Third, the researcher must specify the unit of analysis to which the ICC results apply, that is, whether the ICC quantifies the reliability of ratings based on the averages of several coders' ratings or on the ratings of a single coder. In studies where all subjects are coded by multiple raters and the average of their ratings is used for hypothesis testing, average-measures ICCs are appropriate. However, in studies where a subset of subjects is coded by multiple raters and the reliability of those ratings must be generalized to subjects rated by a single coder, single-measures ICCs should be used. Just as the average of several measurements tends to be more reliable than a single measurement, average-measures ICCs tend to be higher than single-measures ICCs. In cases where single-measures ICCs are low but average-measures ICCs are high, the researcher may report both ICCs to demonstrate this disparity (Shrout & Fleiss, 1979).
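For illustration, the following minimal Python sketch (not part of the original analyses; the ratings matrix and variable names are hypothetical) computes a two-way random-effects, absolute-agreement ICC in both its single-measures and average-measures form from the mean squares of a subjects-by-coders layout, in the spirit of Shrout and Fleiss (1979):

    import numpy as np

    # Hypothetical ratings matrix: rows = subjects, columns = coders
    X = np.array([[3, 4, 3, 4],
                  [2, 2, 3, 2],
                  [4, 5, 4, 5],
                  [1, 2, 2, 2],
                  [5, 4, 5, 5],
                  [2, 3, 2, 3]], dtype=float)

    n, k = X.shape
    grand = X.mean()
    ss_total = ((X - grand) ** 2).sum()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between-subjects
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between-coders
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    # Two-way random-effects, absolute-agreement ICCs (Shrout & Fleiss, 1979)
    icc_single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    icc_average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

    print(f"single-measures ICC:  {icc_single:.2f}")
    print(f"average-measures ICC: {icc_average:.2f}")

With data of this kind, the average-measures value will typically exceed the single-measures value, which is the disparity discussed above.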
Cohen (1968) proposed an alternative, weighted kappa, which allows researchers to penalize disagreements differently according to the magnitude of the disagreement. Cohen's weighted kappa is typically used for categorical data with an ordinal structure, for example in a coding system that categorizes the presence of a particular attribute as high, medium, or low. In this case, a subject rated as high by one coder and low by the other should reduce the IRR estimate more than a subject rated as high by one coder and medium by the other. Norman and Streiner (2008) show that weighted kappa with quadratic weights for ordinal scales is identical to a two-way mixed, consistency ICC, and that the two may be used interchangeably. This interchangeability is particularly advantageous when three or more coders are used in a study, since ICCs can accommodate three or more coders, whereas weighted kappa can accommodate only two (Norman & Streiner, 2008).
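As a sketch only (the category labels and ratings below are hypothetical, not taken from the sources cited here), weighted kappa with quadratic weights for two coders can be computed as follows:

    import numpy as np

    def quadratic_weighted_kappa(r1, r2, n_categories):
        """Cohen's weighted kappa with quadratic disagreement weights for two coders.
        r1, r2: integer category codes (0 .. n_categories - 1) for the same subjects."""
        # Observed joint proportions of the two coders' ratings
        obs = np.zeros((n_categories, n_categories))
        for a, b in zip(r1, r2):
            obs[a, b] += 1
        obs /= obs.sum()
        # Expected proportions under independence of the two coders
        exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
        # Quadratic weights: penalty grows with the squared distance between categories
        idx = np.arange(n_categories)
        w = ((idx[:, None] - idx[None, :]) / (n_categories - 1)) ** 2
        return 1.0 - (w * obs).sum() / (w * exp).sum()

    # Hypothetical ordinal ratings: 0 = low, 1 = medium, 2 = high
    coder_a = [2, 1, 0, 2, 1, 0, 2, 1]
    coder_b = [2, 1, 1, 1, 1, 0, 2, 0]
    print(quadratic_weighted_kappa(coder_a, coder_b, 3))

The same quantity can be obtained with scikit-learn's cohen_kappa_score using weights="quadratic".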
When raters tend to agree, the differences between their observations are close to zero. When one rater is consistently higher or lower than the other by a constant amount, the bias differs from zero. When raters tend to disagree, but without a consistent pattern of one rating being higher than the other, the mean of the differences is again close to zero even though individual differences may be large. Confidence limits (usually 95%) can be computed both for the bias and for each of the limits of agreement; a sketch of this computation follows the next paragraph.

In contrast to the validity of parents' and teachers' ratings of expressive vocabulary, their reliability has not been sufficiently demonstrated, especially with regard to caregivers other than parents. Given that a significant number of young children regularly receive care outside their families, the ability of different caregivers to provide reliable ratings of behaviour, skills, or performance levels with established instruments is relevant to the screening and monitoring of many developmental characteristics (e.g., Gilmore & Vance, 2007).
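Returning to the bias and limits of agreement described above, the following sketch (an illustration with invented paired measurements, not data from the cited studies) computes the bias, the 95% limits of agreement, and approximate 95% confidence limits for each, in the manner of Bland and Altman; the standard error of a limit of agreement is approximated as sqrt(3/n) times the standard deviation of the differences:

    import numpy as np
    from scipy import stats

    # Hypothetical paired measurements from two raters on the same subjects
    rater1 = np.array([4.2, 5.1, 6.3, 5.8, 7.0, 4.9, 6.1, 5.5])
    rater2 = np.array([4.6, 5.0, 6.8, 6.1, 7.4, 5.2, 6.0, 6.0])

    d = rater1 - rater2
    n = d.size
    bias = d.mean()                       # mean difference between raters
    sd = d.std(ddof=1)                    # standard deviation of the differences
    loa_low = bias - 1.96 * sd            # lower 95% limit of agreement
    loa_high = bias + 1.96 * sd           # upper 95% limit of agreement

    t = stats.t.ppf(0.975, n - 1)
    se_bias = sd / np.sqrt(n)             # standard error of the bias
    se_loa = sd * np.sqrt(3.0 / n)        # approximate standard error of each limit

    print(f"bias: {bias:.2f} (95% CI {bias - t * se_bias:.2f} to {bias + t * se_bias:.2f})")
    print(f"limits of agreement: {loa_low:.2f} to {loa_high:.2f}")
    print(f"95% CI, lower limit: {loa_low - t * se_loa:.2f} to {loa_low + t * se_loa:.2f}")
    print(f"95% CI, upper limit: {loa_high - t * se_loa:.2f} to {loa_high + t * se_loa:.2f}")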