Evaluating the Construct Validity of an EFL Rating Scale Using Multitrait-Multimethod Analysis

Amy D. Yamashiro

Speech communication skills, essential for success in school and work, require careful planning and research for instruction. Currently in Japan, there is heightened interest in teaching public speaking and other forms of speech communication within English as a foreign language (EFL). The revised Guidelines for Study in Senior High School by the Japanese Ministry of Education (Monbusho, 1989) recommend including public speaking and debate within Japanese secondary school curricula. However, many secondary and post-secondary teachers in Japan may lack the requisite knowledge and experience in the field of speech communication; thus, they may need supplementary training, information, and practical advice to help them incorporate public speaking into their EFL courses. This paper investigates the construct validity of a performance test to determine the extent to which the public speaking rating scale (Yamashiro & Johnson, 1997) provides reliable and valid assessments. There are three key elements to performance assessment (McNamara, 1996): the performance, a method for assessment (e.g. a rating scale), and the rater. To implement performance testing, EFL educators need to understand the connection between behavioral objectives (i.e. the content or skills included in the syllabus; Brown, 1995) and criterion-referenced testing (i.e. measuring the extent to which the students learned the content or skills; see Brown, 1995, 1996).

Speaking Area	Comments
Voice Control
1. Projection	Speaking loud enough (not too loud nor too soft)
2. Pace	Speaking at a good rate (not too fast nor too slow)
3. Intonation	Speaking using proper pitch patterns and pauses
4. Diction	Speaking clearly (no mumbling or interfering accent)
Body Language
5. Posture	Standing with back straight and looking relaxed
6. Eye Contact	Looking each audience member in the eye
7. Gesture	Using few, well-timed gestures, nothing distracting
Content of Oral Presentation
8. Introduction	Including attention-getting device, thesis statement
9. Body	Using academic writing structure and transitions
10. Conclusion	Including restatement/summation & closing statement
Effectiveness
11. Topic Choice	Picking a topic that is interesting to the audience
12. Language Use	Varying types of clear and correct sentence forms
13. Vocabulary	Using vocabulary appropriate to the audience
14. Purpose	Fulfilling the purpose of the speaking task

Figure 1: 14 Points for Public Speaking (Yamashiro & Johnson, 1997, p. 14)

Educators in the field of speech communication have researched the implementation of behaviorally-stated performance objectives (e.g. Grunner, 1968; Kibler, Barker, & Cegala, 1970a, 1970b) for the past thirty years since the introduction of criterion-referenced assessment. Yamashiro and Johnson's (1997) EFL public speaking course uses 14 points for public speaking as the behavioral objectives (see Figure 1) and employs a corresponding 14-point rating scale as the criterion-referenced assessment of student performance (see Appendix).

Because it is essential to establish the reliability and validity of a rating scale that is used as part of an EFL course, in this case for teacher, peer, and self ratings, this study was conducted to address this issue.

In the field of first language (L1) speech communication, researchers such as Carpenter (1986) have investigated rating scales, while others such as Tatum (1992) focused on trying to control for judge differences. Williams and Stewart (1994) looked at the differences between panel ratings and individual instructor ratings of student public speaking ability. Zeman (1986) looked at using peer assessment as part of a public speaking course. However, few studies look at the reliability and/or the validity of ratings in an L2 speech communication course (Griffee, 1998; Yamashiro, 1998a, 1998b, 1999). The problem for language teachers in the Japanese EFL context has been to develop viable ways to implement speech communication goals in the form of public speaking and debate activities as well as ways to assess student performance.

Speech Communication in EFL

In speech communication, speaking is just one of the four skills employed. EFL teachers often start by focusing on voice quality and may concurrently introduce aspects of body language: posture, eye contact, and gestures.¹ Because delivery is the most physical and basic part of public speaking, students mistakenly think this is the only language skill developed in the course. For instance, it is essential for L2 students to develop their listening comprehension to become more effective public speakers (Yamashiro & Johnson, 1997). Furthermore, EFL students can also benefit from judging in-class speeches and debates (Yamashiro, 1998c; Yamashiro & McLaughlin, 1996, 1998) if they understand the elements of public speaking.² Thus, EFL students need to develop their listening comprehension skills, so that they can participate by making peer and self assessments (Yamashiro & Johnson, 1997). To improve listening comprehension, many EFL instructors may prescribe strategy training to direct students to identify main and supporting ideas, to recognize discourse markers, and to develop notetaking skills for academic survival. EFL students' listening skills should be developed to recognize a variety of grammatical structures and guess the meaning of key vocabulary from context.

For teachers who value process-oriented activities, such as peer and self assessment, the time spent on directing student attention to organizational features, rhetorical signaling devices, and notetaking seems worthwhile. Yamashiro's (1998c) study on the notetaking behavior of EFL peer debate judges offers support for raising student awareness of the speakers' usage of rhetorical signaling devices and clear organization.³ Dunkel (1986) synthesized the available L2 research and proposed that it is essential to help EFL students develop script competence (i.e. recognizing genre and types of discourse) and to activate student schemata to aid comprehension when listening to lectures or speeches. Through a greater awareness of the organizational structure of formal speeches, EFL students are better able to comprehend aural input, and thus may be more capable of providing accurate feedback, or ratings, as peer judges (Yamashiro, 1998a, 1998b, 1999).

The EFL Public Speaking Course: Objectives and Assessment

Within speech communication there are common elements (see Figure 1) in public speaking that are useful for EFL students to learn. It is recommended to have a sequence of two or more presentations, such as a short news brief and an informative speech (Yamashiro & Johnson, 1997), to build student confidence through experiential learning. By applying Ur's (1997, p. 65) model for optimal teacher learning to the L2 classroom, EFL students can benefit from an enhanced experiential-learning process whereby the teacher provides models before the oral presentations (concrete experience); offers constructive feedback in addition to the teacher and peer rating sheets as each student does the self-assessment (reflective observation); directs the students to review the 14 points to decide three goals to improve for the next oral presentation (abstract conceptualization); then, has the students prepare another oral presentation with the new goals in mind (active experimentation). In this way, students not only benefit from the teacher's expertise and experience, but the students also profit from the practical aspects of a student-centered approach, since students individually decide their research topics and can spend more class time developing their language and delivery skills.

Rating Scale MTMM Construct Validation Study

After discussing construct validation along with the classic correlational design employing the multitrait-multimethod (MTMM) matrix (Bachman, 1990; Bachman & Palmer, 1981, 1982, 1989; Buck, 1992; Campbell & Fiske, 1959; Henning, 1987), I will introduce the research questions and methods for collecting the data before analyzing the MTMM matrix and discussing the results and limitations.

Construct Validation: Multitrait-Multimethod Design

Construct validity is central to the appropriate interpretation of the extent to which performance on a given test is consistent with theoretical views of abilities (Bachman, 1990). In other words, we must identify and define what we intend to measure, and through this process we define the construct. Within language testing, Bachman (1990) explains that construct validity provides the evidential support for interpreting a proficiency or criterion-referenced test as an indicator of ability. Perhaps, the most familiar approach for construct validation uses the MTMM matrix (Campbell & Fiske, 1959). In this approach, the researcher examines a matrix of correlations (see Table 5) created from correlating multiple traits and multiple methods to determine construct validity.

Within L2 research, MTMM data has been analyzed through direct inspection of the matrix (Bachman, 1990; Bachman & Palmer, 1981, 1982; Buck, 1992; Yamashiro, 1998b) and confirmatory factor analysis (Bachman & Palmer, 1989). In this paper, the MTMM data will be analyzed using two methods outlined by Henning (1987): 1) the Eyeball Method and 2) Means and Ratio Method. In these methods, the researcher examines the matrices for patterns of convergence (i.e. high positive correlations among indicators for a particular trait or method) and discrimination (i.e. low or zero correlations between measures of different traits using different methods of measurement). These methods will be discussed in depth in the analysis sections.

Research Questions:

To what extent is the public speaking rating scale reliable for teacher, peer, and self-assessment when used as designed within an EFL course?
To what extent is the public speaking rating scale valid for teacher, peer, and self-assessment when used as designed within an EFL course?

Participants

The data for this multitrait-multimethod (MTMM) analysis study were gathered during two public speaking courses as outlined in the syllabus provided in Yamashiro and Johnson's (1997) article. The teacher and peer ratings were conducted in-class as an on-line listening task. Each speech was videotaped. For this study, two additional teacher ratings were collected for each student speech to determine inter-rater reliability and generalizability. The student self ratings and the additional teacher ratings were done from watching the videotaped speech performances. The 61 participants attended a highly-ranked private university in the Kanto Region of Japan. They were enrolled in elective mixed-proficiency English language courses having a speech communication focus and admitting students ranging from lower-intermediate to near-native speaker returnee students (kikokushijo, students who lived and studied overseas for two or more years due to parental work obligations). The public speaking syllabus was taught in classes ranging from 25 to 30 students meeting once a week for 90-minute lessons over one academic semester.

Factor Analysis

Yamashiro (1998a) used item analysis to reduce the rating scale to 12 items and proceeded to use principal components analysis, which yielded three traits (see Table 1): non-verbal delivery (nvd), verbal delivery (vd), and organization/purpose (op).

Table 1. Three Traits identified through Factor Analysis (N=61)

Non-verbal Delivery (nvd)	Verbal Delivery (vd)	Organization/Purpose (op)
Pace (Pace)	Diction (Dict)	Introduction (Intr)
Posture (Post)	Intonation (Into)	Body (Body)
Eye Contact (EyeC)	Language Use (Lang)	Conclusion (Conc)
Gestures (Gest)	Vocabulary (Voca)	Purpose (Purp)

Principal components analysis was used to investigate the constructs found by Yamashiro (1998a) using the rating scale data collected in this study for the three teacher ratings (Table 2). For the factor analysis below, Factor 1 represents Verbal Delivery, Factor 2 is Organization and Purpose, and Factor 3 is Non-Verbal Delivery. A star has been placed next to the highest loading factor for each item (i.e. to indicate which factor each item falls under). The italicized column under communality (h²) indicates the amount of variance that is accounted for per item. The italicized row at the bottom of Table 2 indicates the amount of variance that can be attributed to each of the three factors. The results of this factor analysis replicate the three factors found by Yamashiro (1998a).

Table 2. Rotated Factor Loadings for Three Teacher Ratings (N=61)

	Factor 1	Factor 2	Factor 3	h²
DICT	.76069 *	.29766	.11965	.68156
INTO	.73996 *	.28677	.33755	.74372
LANG	.82043 *	.00719	.25528	.73832
VOCA	.76948 *	.26074	.11324	.67291
PACE	.50335	.20699	.62387 *	.68542
POST	.44337	.08881	.69583 *	.68864
EYEC	-.05858	.36907	.70806 *	.64099
GEST	.33495	.2477	.62765 *	.56749
INTR	.08856	.76219 *	.15515	.61285
BODY	.23859	.75474 *	.23624	.68237
CONC	.15053	.75312 *	.17626	.62092
PURP	.35515	.74223 *	.20908	.72076
% of variance	47.4	12.3	7.4	67.1

* Highest loading.
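The relationship between the loadings and the communality column in Table 2 can be checked with a short sketch. It assumes the usual definition for an orthogonal rotation, in which an item's communality is the sum of its squared loadings across the retained factors; the variable and function names are illustrative only and are not taken from the original analysis.

```python
# Rotated loadings (Factor 1, Factor 2, Factor 3) and reported h2 from Table 2.
TABLE_2 = {
    "DICT": ((0.76069, 0.29766, 0.11965), 0.68156),
    "INTO": ((0.73996, 0.28677, 0.33755), 0.74372),
    "LANG": ((0.82043, 0.00719, 0.25528), 0.73832),
    "VOCA": ((0.76948, 0.26074, 0.11324), 0.67291),
    "PACE": ((0.50335, 0.20699, 0.62387), 0.68542),
    "POST": ((0.44337, 0.08881, 0.69583), 0.68864),
    "EYEC": ((-0.05858, 0.36907, 0.70806), 0.64099),
    "GEST": ((0.33495, 0.2477, 0.62765), 0.56749),
    "INTR": ((0.08856, 0.76219, 0.15515), 0.61285),
    "BODY": ((0.23859, 0.75474, 0.23624), 0.68237),
    "CONC": ((0.15053, 0.75312, 0.17626), 0.62092),
    "PURP": ((0.35515, 0.74223, 0.20908), 0.72076),
}


def communality(loadings):
    """Sum of squared loadings for one item (assumes orthogonal factors)."""
    return sum(value ** 2 for value in loadings)


def percent_variance_explained(table):
    """Mean communality across items, expressed as a percentage."""
    total = sum(communality(loadings) for loadings, _ in table.values())
    return 100 * total / len(table)


if __name__ == "__main__":
    for item, (loadings, reported) in TABLE_2.items():
        print(f"{item}: computed {communality(loadings):.5f}, reported {reported:.5f}")
    print(f"Total variance explained: {percent_variance_explained(TABLE_2):.1f}%")
```

Under that assumption, the mean communality across the 12 items corresponds to the total percentage of variance in the bottom row of Table 2.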

The Reliability of the Rating Scale

To establish the inter-rater reliability of the rating scale among EFL teachers, three teacher ratings were collected for each speech. Rater training was provided to explain how to use the rating scale, which uses a five-point Likert measure. In addition, each rater received detailed descriptors for rating each of the 14 points and practiced rating sample speeches from videotape. Yamashiro (1998a) found inter-rater reliability for the three teachers at a strong .89 (N=37, p < .05). In this study, the three methods of assessment (teacher, peer, and self) were correlated to determine the inter-rater reliability, which was an acceptable .76 (N=61, p < .05).

In addition, the three rating methods were analyzed to find the alpha reliabilities for the three traits: non-verbal delivery (nvd), verbal delivery (vd), and organization/purpose (op). These are reported in Table 3.

Table 3. Descriptive Statistics and Reliabilities for Three Public Speaking Traits
Measured by Three Rating Methods (N=61)

	Description	k	M	SD	Total Poss.	α
Tnvd	Teacher Rating--Non-Verbal Delivery	4	14.18	2.43	20	0.81
Tvd	Teacher Rating--Verbal Delivery	4	15.3	2.34	20	0.88
Top	Teacher Rating--Organization/Purpose	4	15.62	2.44	20	0.9
Pnvd	Peer Rating--Non-Verbal Delivery	4	15.47	1.77	20	0.76
Pvd	Peer Rating--Verbal Delivery	4	16.44	1.57	20	0.95
Pop	Peer Rating--Organization/Purpose	4	16.86	1.37	20	0.86
Snvd	Self Rating--Non-Verbal Delivery	4	13.49	2.72	20	0.63
Svd	Self Rating--Verbal Delivery	4	13.61	2.77	20	0.85
Sop	Self Rating--Organization/Purpose	4	14.62	2.44	20	0.74
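For readers who wish to compute alpha reliabilities like those in Table 3 from their own rating sheets, the following is a minimal sketch of Cronbach's alpha for a k-item subscale. The function name and the sample scores are hypothetical and do not come from this study's data; the sketch only illustrates the standard formula.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows, one row of k item scores per speaker."""
    k = len(ratings[0])
    # Sample variance of each item across speakers.
    item_variances = [variance([row[i] for row in ratings]) for i in range(k)]
    # Sample variance of the summed subscale score.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical five-point ratings on a four-item trait for six speakers.
    sample = [
        [4, 4, 3, 4],
        [3, 3, 3, 2],
        [5, 4, 5, 5],
        [2, 3, 2, 2],
        [4, 5, 4, 4],
        [3, 2, 3, 3],
    ]
    print(f"alpha = {cronbach_alpha(sample):.2f}")
```

With four items per trait (k = 4), as in Table 3, alpha rises as the items within a trait covary more strongly relative to their individual variances.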

GENOVA (Generalized Analysis of Variance)

Yamashiro (1998a) conducted a GENOVA (Generalized Analysis of Variance) D Study on three teachers' ratings using the 12-item rating scale (N=37) to yield the following generalizability coefficients: .80 for one rater, .88 for two raters, and .91 for three. In this study, the researcher used a D Study to investigate the differing levels obtained for one, two, or three raters using the rating scale with the 12 items. By examining the column entitled "GEN. COEF." (generalizability coefficient) and looking at the bottom three numbers for the 12-item rating scale, the generalizability is .74 for one rater, .83 for two raters, and .86 for three raters (see Table 4).

Table 4. GENOVA D Study for Teacher, Peer, and Self Rating Methods

Multitrait Multimethod (MTMM) Analysis

Henning (1987) discusses two methods for conducting MTMM analyses: the Eyeball Method and the Means and Ratio Method. Although multitrait-multimethod analysis may seem conceptually simple at the surface, there are many procedural steps, so they are detailed clearly in outline form. In this way, researchers interested in applying either methodology to future studies could refer to this section or consult Henning (1987).

1. Prepare a full correlation matrix to report the correlations among all traits and all methods (see Table 5).

Table 5. Full correlation matrix for the three methods of assessment and the three
traits on the public speaking rating scale (N=61)

2. Partition the correlation matrix into triangles (see Table 5) with solid or broken lines. The solid-line triangles enclose correlations among different methods within the same trait or monotrait-heteromethod correlations. The broken-line triangles enclose correlations among different traits and different methods and are termed heterotrait-heteromethod correlations. The diagonals outside the broken-line triangles report the correlations among the different traits using the same method and are termed heterotrait-monomethod correlations.

3. Convergent Validity (Validity of Traits)

Inspect the monotrait-heteromethod correlation coefficients (in the diagonals, italicized in Table 5), which are also called the convergent validity coefficients. These should be significantly higher than zero (see Table 5 for the significance levels). All of the convergent validity coefficients in this study show validity using this criterion.

4. Heterotrait-Monomethod Discriminant Validity (Inter-rater Reliability)

Eyeball Method: Compare the convergent validity coefficients in the diagonals with the heterotrait monomethod coefficients within the solid triangles (see Table 5). The r1,4 convergent validity coefficient (.6639, in bold) in Table 5 is compared with variable one heterotrait-monomethod coefficients (.6472, .6727, bold line) and with variable four heterotrait-monomethod coefficients (.7163, .7004, bold line). If the convergent validity coefficient exceeds all of these corresponding heterotrait-monomethod coefficients, it would be said to manifest heterotrait-monomethod discriminant validity. In this example, the r1,4 convergent validity fails to exceed three of the heterotrait-monomethod coefficients.

Means & Ratio Method: If the eyeball method fails, it is necessary to compute the heterotrait-monomethod means (convert each correlation into a Fisher-Z score, calculate the mean, and convert the mean Fisher-Z score back into a correlation coefficient), then divide each mean into its respective monotrait-heteromethod mean for each trait (see Table 7) to produce the heterotrait-monomethod mean discriminant validity ratio. When ratings are used, such as in this study, this validity ratio indicates inter-rater reliability. In other studies, each resulting ratio would need to exceed 1.00 to claim heterotrait-monomethod discriminant validity, or support the validity of each method.

5. Heterotrait-Heteromethod Discriminant Validity (Independence: Each Trait & Method)

Eyeball Method: Compare the convergent validity coefficients with the heterotrait-heteromethod coefficients within the broken-line triangles (see Table 5). The r1,4 convergent validity coefficient (.6639, in bold) in Table 5 would be compared with variable one heterotrait-heteromethod coefficients (.6056, .5448, .3156, .3592, double lines) and with variable four heterotrait-heteromethod coefficients (.4212, .6212, .3266, .3609, double lines). If the convergent validity coefficient exceeds all of these corresponding heterotrait-heteromethod coefficients, it would be said to manifest heterotrait-heteromethod discriminant validity. In this example, the r1,4 convergent validity coefficient exceeds all of the heterotrait-heteromethod coefficients.

Means & Ratio Method: If the eyeball method fails, it is necessary to compute the heterotrait-heteromethod means (convert each correlation into a Fisher-Z score, calculate the mean, and convert the mean Fisher-Z score back into a correlation coefficient), then divide each mean into its respective monotrait-heteromethod mean for each trait (see Table 7) to produce the heterotrait-heteromethod discriminant validity ratio. If each resulting ratio exceeds 1.00, then heterotrait-heteromethod discriminant validity could be claimed to exist.

6. A final procedural step advocated by Campbell and Fiske (1959) is to inspect the pattern of the magnitudes of the correlation coefficients within the triangles. If the pattern of magnitude is the same within triangles, this final criterion of validity will be met. However, Henning (1987) argues that since this rarely happens, as was the case in this study (the pattern is not consistent in all triangles), it may be an unproductive exercise.

7. In other MTMM studies, where reliabilities differ among the tests being used for deriving the correlation matrix, it may be necessary to correct for attenuation before the procedures are followed for comparing the correlation coefficients. Because this study did not readily pass the eyeball method, a matrix was produced with the correlations corrected for attenuation (see Table 6). From this matrix of correlations corrected for attenuation, another means and ratio table was produced (see Table 8).
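The correction in step 7 can be illustrated with a brief sketch. It assumes the standard correction for attenuation, in which an observed correlation is divided by the square root of the product of the two measures' reliabilities, and it assumes that the diagonal of Table 6 holds the alpha reliabilities; the function name is illustrative.

```python
import math


def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Disattenuated correlation: r_xy / sqrt(reliability_x * reliability_y)."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Observed Tnvd-Tvd correlation and the two reliabilities as they appear in Table 6.
    corrected = correct_for_attenuation(0.647, 0.813, 0.882)
    print(f"Tnvd-Tvd corrected: {corrected:.3f}")
    # Observed Tnvd-Pnvd correlation with the Tnvd and Pnvd reliabilities.
    corrected = correct_for_attenuation(0.664, 0.813, 0.760)
    print(f"Tnvd-Pnvd corrected: {corrected:.3f}")
```

Because reliabilities are below 1.00, the corrected coefficient is always larger in magnitude than the observed one, and the adjustment is greatest for the least reliable measures (here, the self ratings).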

Table 6. Multitrait-Multimethod Matrix, Corrected for Attenuation: Three Traits by Three Methods (N=61)

	Teacher Ratings			Peer Ratings			Self Ratings
	Tnvd	Tvd	Top	Pnvd	Pvd	Pop	Snvd	Svd	Sop
Tnvd	0.813	0.764	0.786	0.845	0.689	0.651	0.491	0.379	0.461
Tvd	0.647	0.882	0.629	0.514	0.66	0.554	0.312	0.34	0.339
Top	0.673	0.561	0.902	0.75	0.595	0.678	0.401	0.337	0.533
Pnvd	0.664	0.421	0.621	0.76	0.842	0.865	0.57	0.406	0.48
Pvd	0.606	0.605	0.551	0.716	0.952	0.874	0.348	0.523	0.401
Pop	0.545	0.483	0.598	0.7	0.792	0.862	0.389	0.406	0.432
Snvd	0.352	0.233	0.303	0.395	0.27	0.287	0.632	0.753	0.819
Svd	0.316	0.295	0.296	0.327	0.472	0.348	0.553	0.854	0.774
Sop	0.359	0.275	0.437	0.361	0.338	0.346	0.562	0.617	0.745

The Eyeball-Method for MTMM Analysis

This section discusses the results of the Eyeball-Method for analyzing the MTMM correlation matrix using the three criteria outlined by Campbell and Fiske (1959).

Criterion 1. Coefficients in the validity diagonals should be significantly greater than zero. For the three methods and traits, all of the correlations are significantly greater than zero (p<.05, two-tailed significance) and sufficient to indicate convergent validity.

Criterion 2. Coefficients in the validity diagonals should be higher than correlations between that variable and other variables with neither trait nor method in common. Although more than half of the heterotrait-heteromethod coefficients, both corrected and not corrected for attenuation, are lower than the coefficients in the validity diagonals, not all of them are, so these results do not meet this first requirement of discriminant validity.

Criterion 3. Coefficients in the validity diagonals should be higher than correlations between that variable and other variables that measure different traits using the same method. Very few of the validity coefficients meet this second requirement of discriminant validity for each of the three traits and methods.

In summary, the Eyeball-Method MTMM analysis supported only convergent validity. Neither the original MTMM correlation matrix nor the one corrected for attenuation met all of the criteria outlined in Campbell and Fiske (1959) for convergent and discriminant validity.
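The eyeball comparisons can also be automated. The sketch below is an illustration only, not the procedure used in this study: it takes the below-diagonal entries of Table 6 (which agree, after rounding, with the uncorrected values quoted in the text for r1,4) and counts how many heterotrait-monomethod and heterotrait-heteromethod coefficients a given convergent validity coefficient fails to exceed. All names in the code are hypothetical.

```python
VARS = ["Tnvd", "Tvd", "Top", "Pnvd", "Pvd", "Pop", "Snvd", "Svd", "Sop"]

# Below-diagonal correlations read from Table 6 (three decimals);
# row i holds the correlations of VARS[i] with VARS[0] .. VARS[i-1].
LOWER = [
    [],
    [0.647],
    [0.673, 0.561],
    [0.664, 0.421, 0.621],
    [0.606, 0.605, 0.551, 0.716],
    [0.545, 0.483, 0.598, 0.700, 0.792],
    [0.352, 0.233, 0.303, 0.395, 0.270, 0.287],
    [0.316, 0.295, 0.296, 0.327, 0.472, 0.348, 0.553],
    [0.359, 0.275, 0.437, 0.361, 0.338, 0.346, 0.562, 0.617],
]


def method(var):
    """Rating method: 'T' (teacher), 'P' (peer), or 'S' (self)."""
    return var[0]


def trait(var):
    """Trait label: 'nvd', 'vd', or 'op'."""
    return var[1:]


def r(a, b):
    """Look up the correlation between two different variables."""
    i, j = sorted((VARS.index(a), VARS.index(b)), reverse=True)
    return LOWER[i][j]


def eyeball(a, b):
    """For a monotrait-heteromethod pair, return the validity coefficient and
    the number of heterotrait-monomethod and heterotrait-heteromethod
    coefficients that it fails to exceed."""
    validity = r(a, b)
    mono, hetero = [], []
    for var in (a, b):
        for other in VARS:
            if trait(other) == trait(var):
                continue  # same trait: not a discriminant comparison
            target = mono if method(other) == method(var) else hetero
            target.append(r(var, other))
    mono_fails = sum(validity <= c for c in mono)
    hetero_fails = sum(validity <= c for c in hetero)
    return validity, mono_fails, hetero_fails


if __name__ == "__main__":
    for a in VARS:
        for b in VARS:
            if VARS.index(a) < VARS.index(b) and trait(a) == trait(b):
                print(a, b, eyeball(a, b))
```

For the Tnvd-Pnvd pair, this reproduces the worked example in steps 4 and 5: the validity coefficient fails to exceed three heterotrait-monomethod coefficients but exceeds every heterotrait-heteromethod coefficient.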

The Means and Ratio Method for MTMM Analysis

Henning (1987) provides an excellent introduction to computing the mean convergent and discriminant validity coefficients (means and ratios) for multitrait-multimethod analysis and how to interpret the results.
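As a rough illustration of the means and ratio computation, the sketch below averages correlations through the Fisher-Z transformation and forms a discriminant validity ratio. The function names are hypothetical, and the input values are the rounded coefficients quoted earlier for variable one (Tnvd); the sketch is not claimed to reproduce the exact groupings behind Tables 7 and 8.

```python
import math


def fisher_mean(correlations):
    """Average correlations by converting to Fisher-Z, averaging, and converting back."""
    z_scores = [math.atanh(value) for value in correlations]
    return math.tanh(sum(z_scores) / len(z_scores))


def validity_ratio(monotrait_heteromethod, comparison):
    """Monotrait-heteromethod mean divided by a heterotrait comparison mean."""
    return fisher_mean(monotrait_heteromethod) / fisher_mean(comparison)


if __name__ == "__main__":
    # Tnvd with the same trait rated by peers and by self (rounded values).
    convergent = [0.664, 0.352]
    # Tnvd with different traits rated by different methods (rounded values).
    heterotrait_heteromethod = [0.606, 0.545, 0.316, 0.359]
    ratio = validity_ratio(convergent, heterotrait_heteromethod)
    print(f"MHM = {fisher_mean(convergent):.3f}")
    print(f"HHM = {fisher_mean(heterotrait_heteromethod):.3f}")
    print(f"MHM/HHM = {ratio:.2f}")
```

A ratio above 1.00 indicates that, on average, measures of the same trait agree more closely across methods than measures of different traits do.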

Table 7. Mean Convergent and Discriminant Validity Coefficients
for the Three Methods and the Three Traits on the Public Speaking
Rating Scale (N=61) (Matrix Not Corrected for Attenuation)

	Monotrait-Heteromethod Mean (MHM) Convergent Validity: Validity of Traits	Heterotrait-Monomethod Mean (HMM) Discriminant Validity Ratio [MHM/HMM]: Indicates Inter-rater Reliability	Heterotrait-Heteromethod Mean (HHM) Discriminant Validity Ratio [MHM/HHM]: Indicates Independence for Traits & Methods
I. NVD	.48 *	.74 *	1.22
1. T	.52 *	.81 *	1.32
2. P	.54 *	.84 *	1.38
3. S	.41 *	.59 *	0.96
II. VD	.47 *	.72 *	1.24
4. T	.46 *	.72 *	1.22
5. P	.54 *	.84 *	1.43
6. S	.44 *	.67 *	1.03
III. OP	.46 *	.71 *	1.15
7. T	.52 *	.79 *	1.27
8. P	.48 *	.74 *	1.20
9. S	.44 *	.60 *	0.98

* p < .05

Validity of the three traits on the public speaking rating scale. Examining the significance levels of the convergent validity coefficients in Tables 5 and 6 and the monotrait-heteromethod mean convergent validity coefficients in Tables 7 and 8, it appears evident that this study can claim convergent validity, so the validity of the three traits is supported.

Validity of the three rating methods. The analyses performed in Step 4 of the previous section produced the heterotrait-monomethod mean discriminant validity ratio (see Tables 7 and 8, middle column). All of these ratios, which indicate inter-rater reliability, are significant (p < .05), so the validity of the three methods is supported.

Validity of independence for traits and methods. From the analyses performed for Step 5, which produced the heterotrait-heteromethod mean discriminant validity ratio (see Tables 7 and 8, column on the right), all of the ratios in Table 7 for teacher and peer assessment exceed the requisite 1.00, but two of the self-assessment ratios (NVD and OP) do not. In a similar study using a three-point oral presentation rating scale, Griffee (1998) found that peer assessments were more highly correlated than self assessments; however, the self assessments were reflective and the performances were not videotaped. Blanche (1988) and Yamashita (1996) found that more advanced language learners showed a tendency to underrate their performance in roleplays. In terms of Japanese cultural identity, Yamashita (1996) suggests that there is a "Japanese tendency to be modest about their own performance skills" (p. 73).

Table 8. Mean Convergent and Discriminant Validity Coefficients (N=61) (Matrix Corrected for Attenuation)

	Monotrait-Heteromethod Mean (MHM) Convergent Validity: Validity of Traits	Heterotrait-Monomethod Mean (HMM) Discriminant Validity Ratio [MHM/HMM]: Indicates Inter-rater Reliability	Heterotrait-Heteromethod Mean (HHM) Discriminant Validity Ratio [MHM/HHM]: Indicates Independence for Traits & Methods
I. NVD	.66 *	.83 *	1.44
1. T	.71 *	.88 *	1.54
2. P	.74 *	.91 *	1.60
3. S	.53 *	.66 *	1.15
II. VD	.52 *	.66 *	1.27
4. T	.52 *	.66 *	1.27
5. P	.59 *	.76 *	1.43
6. S	.43 *	.56 *	1.06
III. OP	.55 *	.69 *	1.15
7. T	.60 *	.76 *	1.26
8. P	.56 *	.70 *	1.18
9. S	.48 *	.60 *	1.00

* p < .0

However, in Table 8 each heterotrait-heteromethod mean discriminant validity ratio equals or exceeds 1.00; thus, this study can claim heterotrait-heteromethod discriminant validity with limited support for self-assessments.

Discussion

1. To what extent is the public speaking rating scale reliable for teacher, peer, and self-assessment when used as designed within an EFL course?

Interrater reliability of the 12 item rating scale for the three rating methods is fairly respectable at .76. The alpha reliabilities of the three traits are good overall for teacher ratings: NVD .81, VD .88, and OP .90 (see Table 3). For the peers, they are less consistent: NVD .76, VD .95, and OP .86. The self ratings are the least consistent overall: NVD .63, VD .85, and OP .74. This may suggest a need for more instruction and/or practice on using the rating scale for in-class peer and self evaluation. From GENOVA on the 12 item rating scale, the generalizability is .74 for one rater, .83 for two raters, and .86 for three raters: teacher, peer, and self (see Table 4). These results indicate that language programs that use the public speaking rating scale should consider having two or more raters for high-stakes testing.

2. To what extent is the public speaking rating scale valid for teacher, peer, and self-assessment when used as designed within an EFL course?

Factor analysis of the three teachers' ratings offers support for Yamashiro's (1998a) three traits on the public speaking rating scale. From the MTMM analysis, the significance levels of the convergent validity coefficients in the diagonals in Tables 5 and 6 and the monotrait-heteromethod mean convergent validity coefficients in Tables 7 and 8 indicate that this study can claim convergent validity, so the validity of the three traits is supported. All of the ratios in the heterotrait-monomethod mean discriminant validity ratio, which indicate inter-rater reliability, are significant (p < .05), so the validity of the three methods is supported. All of the ratios in Table 7 for teacher and peer assessment exceed 1.00, but two of the self-assessment ratios (NVD and OP) do not. In Table 8, however, each heterotrait-heteromethod mean discriminant validity ratio equals or exceeds 1.00; thus, this study can claim heterotrait-heteromethod discriminant validity with limited support for the rating scale traits as measured by the self rating method. In other words, the results of this study alone do not fully support the claim that students are capable of performing the self assessments.

Limitations and Implications

This study was performed in an elective English language course with a speech communication focus having students of wide-ranging proficiency from lower-intermediate to near-native speaking returnees at a private university. It is not clear how far the results can be generalized to the wider EFL population in Japan. In a similar study, Griffee (1998) found that peer assessments agreed with the teacher's assessments but the self assessments did not, even after correcting for attenuation. Griffee (1998) identified the following problems with his study: using a restricted three-point Likert scale, having a small sample size (N=19), and using an unvalidated instrument. However, research in the speech communication field (Hirshfield, 1968; Porter & King, 1972; Roberts, 1972; Smythe, Kibler, & Hutchings, 1973) advocates the use of videotape for student self-assessment to increase accuracy in self-ratings of performance. This study suggests that for EFL students electing to study speech communication at a highly-ranked private university, the teacher, peer, and self-assessments using the public speaking rating scale have acceptable levels of reliability and validity.

Conclusion

This study examined and supported the validity of Yamashiro's reduced 12-item rating scale, when used as described for teacher, peer, and self-assessment in Yamashiro and Johnson's (1997) EFL public speaking course. Future research on rating EFL oral presentations should look into the other construct validity inquiry factors outlined by Chapelle (1998) as well as the raters and ratings in oral language assessment (Reed & Cohen, in press). Investigations are also needed into the relationship between EFL speech communication and test-taker characteristics, such as gender (Hill, 1998), language proficiency, attitudes, motivation (Kunnan, 1995, 1998), and cognitive strategies (Purpura, 1998a, 1998b), along with continued inquiry into creating effective methods for providing reliable and valid performance assessments in the L2 classroom. To this end, further research using structural equation modeling for confirmatory factor analysis on MTMM data (Bachman & Palmer, 1989) should be employed. As language educators, it is our responsibility to determine the reliability and validity of the assessments we make within our courses and curricula.

Notes

1 For more information on how to teach public speaking, consult Ayres and Miller (1983); Kovacs, Mortensen, Remes, Tunstall, and Wulff (1984); Payne and Carlin (1994); Harrington and LeBeau (1996); Lenning (1996); and Yamashiro and Johnson (1997).
2 Consult Yamashiro (1998a, 1999) and Yamashiro and Johnson (1997) for rating speeches in EFL. See Ulrich (1991), Yamashiro (1998c), and Yamashiro and McLaughlin (1996, 1998b) for an in-depth discussion on judging academic debates.
3 Amato and Ecroyd (1975) provide a detailed discussion on organization and rhetoric in speech communication.

References

Amato, P. P., & Ecroyd, D. H. (1975). Organizational patterns and strategies in speech communication. Skokie, IL: National Textbook Company.
Ayres, J., & Miller, J. (1983). Effective public speaking (4th ed.). Madison, WI: Brown & Benchmark Publishers.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L., & Palmer, A. S. (1981). The construct validation of the FSI oral interview. Language Learning, 31 (1), 67-86.
Bachman, L., & Palmer, A. S. (1982). The construct validation of some components of communicative proficiency. TESOL Quarterly, 16 (4), 449-465.
Bachman, L., & Palmer, A. S. (1989). The construct validation of self-ratings of communicative language ability. Language Testing, 6 (1), 14-29.
Blanche, P. (1988). Self-assessment of foreign language skills: Implications for teachers and researchers. RELC Journal, 19 (1), 75-96.
Brown, J. D. (1995). The elements of language curriculum: A systematic approach to program development. Boston, MA: Heinle & Heinle Publishers.
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.
Buck, G. (1992). Listening comprehension: Construct validity and trait characteristics. Language Learning, 42 (3), 313-357.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Carpenter, E. C. (1986, April). Measuring speech communication skills. Paper presented at the Annual Meeting of the Central States Speech Association, Cincinnati, OH.
Cohen, A. D. (1994). Assessing language ability in the classroom (2nd ed.). Boston, MA: Heinle & Heinle Publishers.
Dunkel, P. (1986). Developing listening fluency in L2: Theoretical principles and pedagogical considerations. The Modern Language Journal, 70 (2), 99-106.
Griffee, D. T. (1998). Classroom self-assessment: A pilot study. JALT Journal, 20 (1), 115-125.
Grunner, C. R. (1968). Behavioral objectives for the grading of classroom speeches. The Speech Teacher, 17, 207-209.
Harrington, D., & LeBeau, C. (1996). Speaking of speech: Basic presentation skills for beginners. Tokyo: Macmillan Languagehouse, Ltd.
Henning, G. (1987). A guide to language testing: Development, evaluation, research (MTMM analysis pp. 101-105). Boston, MA: Heinle & Heinle Publishers.
Hill, K. (1998). The effect of test-taker characteristics on reaction to and performance on an oral English proficiency test. In A. J. Kunnan (Ed.), Validation in language assessment: Selected papers from the 17th Language Testing Research Colloquium, Long Beach (pp. 209-229). Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Hirshfield, A. G. (1968). Videotape recordings for self-analysis in the speech classroom. The Speech Teacher, 17, 116-118.
Kibler, R. J., Barker, L. L., & Cegala, D. J. (1970a). Behavioral objectives and speech-communication instruction. Central States Speech Journal, 21, 71-80.
Kibler, R. J., Barker, L. L., & Cegala, D. J. (1970b). A rationale for using behavioral objectives in speech-communication. The Speech Teacher, 19, 245-256.
Kovacs, M. A., Mortensen, R., Remes, H., Tunstall, J. P., & Wulff, D. H. (1984). Speech: Skill, process, practice. USA: Center for Learning.
Kunnan, A. J. (1995). Test taker characteristics and test performance: A structural modeling approach. Cambridge: Cambridge University Press.
Kunnan, A. J. (1998). An introduction to structural equation modelling for language assessment research. Language Testing, 15 (3), 295-332.
Lenning, M. (1996). Getting started in speech communication. Lincolnwood, IL: National Textbook Company.
McNamara, T. (1996). Measuring second language performance. London: Longman.
Monbusho. (1989). Guidelines for study in senior high school. Tokyo: Japanese Ministry of Education.
Payne, J., & Carlin, D. P. (1994). Getting started in public speaking (3rd ed.). Lincolnwood, IL: National Textbook Company.
Purpura, J. E. (1998a). The development and construct validation of an instrument designed to investigate selected cognitive background characteristics of test-takers. In A. J. Kunnan (Ed.), Validation in language assessment: Selected papers from the 17th Language Testing Research Colloquium, Long Beach (pp. 111-139). Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Purpura, J. E. (1998b). Investigating the effects of strategy use and second language test performance with high- and low-ability test takers: A structural modelling approach. Language Testing, 15 (3), 333-379.
Reed, D. J., & Cohen, A. D. (in press). Revisiting raters and ratings in oral language assessment. In C. Elder, K. Hill, T. McNamara, N. Iwashita, E. Grove, A. Brown, T. Lumley, & K. O'Loughlin (Eds.), Experimenting with uncertainty. Cambridge: Cambridge University Press.
Roberts, C. (1972). The effects of self-confrontation, role playing, and response feedback on the level of self-esteem. The Speech Teacher, 21, 22-38.
Smythe, M. J., Kibler, R. J., & Hutchings, P. W. (1973). A comparison of norm-referenced and criterion-referenced measurement with implications for communication instruction. The Speech Teacher, 22 (1), 1-17.
Tatum, D. S. (1992, April). Controlling for judge differences in the measurement of public speaking ability. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA. ED 353 273.
Ulrich, W. (1991). Judging academic debate. Lincolnwood, IL: National Textbook Company.
Ur, P. (1997). Teacher training and teacher development: A useful dichotomy? The Language Teacher, 21 (10), 59-67.
Williams, D. E., & Stewart, R. A. (1994). An assessment of panel vs. individual instructor ratings of student speeches. Basic Communication Course Annual, 6, 87-104. (ERIC Document Reproduction Service No. ED 378 631)
Yamashiro, A. D. (1998a). Item analysis on an EFL public speaking rating scale. Proceedings of the 9th Conference on Second Language Research in Japan (pp. 1-16). International University Japan.
Yamashiro, A. D. (1998b, May). Validating a rating scale using multitrait-multimethod analysis. Paper presented at the 1998 Temple University Applied Linguistics Colloquium, Tokyo.
Yamashiro, A. D. (1998c). Redefining speech communication: Integrating debate in EFL. Speech Communication Education, Vol. XI (pp. 105-123). The Communication Association of Japan.
Yamashiro, A. D. (1999). Integrating American speech communication research into EFL. Speech Communication Education, Vol. XII (pp. 111-131). The Communication Association of Japan.
Yamashiro, A. D., & Johnson, J. (1997). Public speaking in EFL: Elements for course design. The Language Teacher, 21 (4), 13-17.
Yamashiro, A. D., & McLaughlin, J. W. (1996). Adapting debate to the EFL classroom: From activities to tournaments. English Education for Developing Communication '96 (pp. 65-78). Tokyo: Kanda Institute of Foreign Languages.
Yamashiro, A. D., & McLaughlin, J. W. (1998a). Testing listening comprehension: Implications for teaching speech communication. Journal of Heisei International University, 2, 1-14.
Yamashiro, A. D., & McLaughlin, J. W. (1998b). Getting started in debate: An EFL teachers guide. JALT 1997 Conference Proceedings (pp. 153-160). Summer 1998. JALT.
Yamashita, S. O. (1996). Six measures of JSL pragmatics. Honolulu, HI: Second Language Teaching & Curriculum Center, University of Hawai'i at Manoa.
Zeman, J. V. (1986, November). A method of using student evaluation in the basic [speech] course. Paper presented at the 72nd Annual Meeting of the Speech Communication Association, Chicago, IL. (ERIC Document Reproduction Service No. ED 279 035)

Appendix: 14-Item Rating Sheet and Three Goals for Improvement
(adapted from Yamashiro & Johnson, 1997, p. 15)

Rating Sheet

Rater's Name: ________________________ Class:______________

Student ID Number: __________________ Date: ___ / ___ / ___

Speaker's Name: __________________________________________________________

Title/Topic of Speech: ______________________________________________________

Use the following five-point scale: 5 (very good), 4 (good), 3 (okay), 2 (so so), and 1 (needs work).

Voice Control

Projection 5 4 3 2 1

Pace 5 4 3 2 1

Intonation 5 4 3 2 1

Diction 5 4 3 2 1

Body Language

Posture 5 4 3 2 1

Eye Contact 5 4 3 2 1

Gesture 5 4 3 2 1

Content of Oral Presentation

Introduction 5 4 3 2 1

Body 5 4 3 2 1

Conclusion 5 4 3 2 1

Effectiveness

Topic Choice 5 4 3 2 1

Language Use 5 4 3 2 1

Vocabulary 5 4 3 2 1

Purpose 5 4 3 2 1

Total : [____________/70]

Three Goals for Improvement

After rating your performance (from video), decide on three areas that you want to improve when you present your next speech. This is due: _____/_____/_____.

Temple University, Japan Campus
Graduate College of Education
