Introduction
An explosion of visual images characterizes the 21st-century public sphere. Individuals capture 57,000 photos every second (Broz, 2023), share over 3 billion images on social media daily (JasenkaG, 2023), and post one million memes on Instagram each day (Mcgil, 2024). Of the more than 400 million terabytes of data generated online daily, video content accounts for more than half (Duarte, 2024). On YouTube alone, users upload more than 500 hours of video every minute (Ceci, 2024). Today, visuals constitute an essential mode of messaging for virtually any communicator seeking to reach target audiences (Knobloch et al., 2003).
Visuals are also key to strategic, persuasive messaging. Compared to text, visuals appear closer to the truth, rendering them a ready means of proof and authentication (Barthes, 1981; Messaris & Abraham, 2001). They communicate ideological propositions (Edwards & Winkler, 1997), create positive images of political causes (Sontag, 2003), evoke emotional responses (Perlmutter, 1998), constitute cultural memories and meanings (McClancy, 2013; Trachtenberg, 1985), and mobilize supporters (Hariman & Lucaites, 2007; Mattoni & Teune, 2014).
While manual coding methods have helped better understand visual messaging strategies for decades, the sheer number of digital images poses a challenge. By minimizing the time and resources needed to manually code images, computational methods are more efficient. Unsupervised learning can help reveal visual themes and categories within big datasets via a bag-of-visual-words model (Torres, 2023; Zhang & Peng, 2022). Scholars have used such automated approaches to detect clusters in politicians’ social media posts (Joo & Steinert-Threlkeld, 2022; Peng, 2021), smartphone screen activities (Muise et al., 2022), online protest images (Zhang & Peng, 2022), right-wing memes (Lokmanoglu et al., 2023), and migrant photographs (Torres, 2023). Supervised machine learning, by contrast, involves training an AI model using labeled visual data before examining similar imagery. Studies have used this approach to identify protest depictions (Zhang & Pan, 2019), image-text mismatches (Ha et al., 2020), differentiations between state and protester violence (Steinert-Threlkeld et al., 2022), and objects and faces including their emotions, gender, age, and visual aesthetics (e.g., Bakhshi & Gilbert, 2015; Dietrich & Ko, 2022; Peng, 2018; Peng & Yingdan, 2023). Nonetheless, computational methods also come with their own fair share of challenges, such as validation, algorithmic bias, and privacy concerns (Chen et al., 2024; Williams et al., 2020; Zou & Schiebinger, 2018).
Acknowledging that W. J. T. Mitchell’s (1995) pictorial turn has now entered a computational phase, this article manually examines several thousand images that Islamist extremist groups distributed to provide insights for improving AI’s usefulness in understanding online extremist content. After highlighting the usefulness of visual framing practices and discussing key challenges facing computational visual analyses today, we explicate the study’s comparative methodology. Then, we lay out five lessons emergent from the comparison useful for improving visual computational methods. The study concludes with a discussion of how mixed visual methods of AI and manual coding can bring better understandings to the levels of visual framing.
Visual Framing
Rodriguez and Dimitrova’s (2011) four-tiered model is one of the most cited works in the literature on visual framing (Walter & Ophir, 2024), as it moves beyond atheoretical or exclusively semiotic approaches to engage with how image selection, cropping, and editing can capture attention, evoke emotions, carry meanings, and influence perceptions (Coleman, 2010; Geise & Baden, 2015). Their model disaggregates visual framing into four tiers. The denotative level constitutes meaning by identifying salient frames through who and what an image depicts based on the scene and surrounding texts (Schwalbe, 2006). Denotative frames can be context-specific (Goffman, 1974, 1979) or function as equivalency-based dyads, such as gain versus loss and war versus peace (Cacciatore et al., 2016). The semiotic level suggests social meanings by examining visual grammar and conventions, including camera angles, perceived viewer distance, body posture, eye contact, and facial expressions (Forgas & East, 2008; Hall, 1966; Kress & Leeuwen, 2006). The connotative tier considers abstract and figurative symbols (e.g., metaphors) in the shot that can “combine, compress and communicate social meaning” (Rodriguez & Dimitrova, 2011, p. 96). Finally, the ideological level expands on the visual elements, stylistic choices, and symbols to provide a holistic interpretation that underscores the political, religious, economic and/or demographic underpinnings of the visual constructions. Combined, the four-tiered model allows for a nuanced understanding of visual messaging.
Studies of visual media campaigns by Islamist extremist groups have examined all four levels of the four-tiered model, albeit with less emphasis on semiotics. At the denotative level, for example, Al-Qaeda’s visual campaign encompassed predominantly militant visual frames, such as training, operations, and martyrdom (Farwell, 2010; Center, 2005), highlighting political Islamist connotations associated with objects like flags and swords (Coleman, 2006). At the semiotic level, studies of ISIS dissected several pictorial stylistic conventions, including viewer distance, camera angle, eye contact, facial expressions, point-of-view shots, and dynamic versus static imagery (e.g., Impara, 2018; Winkler et al., 2019). At the connotative level, ISIS and al-Qaeda utilized symbols like AK-47s, the monotheism hand gesture, and depictions of death and dying to convey meaning (Wignell et al., 2017; Winkler et al., 2018). Ideologically, both al-Qaeda and ISIS espoused an extreme Islamist lens steeped in a clash of civilizations narrative that promoted religious rule as an alternative to the nation-state system (Ciovacco, 2009), but held differing views on the nature and timing of the Caliphate (Kuznar, 2015). Yet, existing literature on al-Qaeda and ISIS’s media campaigns, while providing nuanced understandings of the four visual framing tiers, has been largely limited to manual coding.
Computational Visual Analysis
Computational visual analysis on its own is not yet capable of fully engaging the four-tier model of visual framing. Computational attempts identify some denotative and semiotic elements (e.g., humans, race, age, color, public figures, rifles, umbrellas, facial expressions) (Chen et al., 2022; Joo & Steinert-Threlkeld, 2022; Muise et al., 2022; Zhang & Peng, 2022). Yet, such analyses stop short of gauging key symbolic and ideological visual components of scenes and their contexts due to what Peng, Lock, and Salah (2024) rightly argue is an automation-theoretical disconnect. This study addresses this gap by comparing manual and automated coding in the online extremism sphere in pursuit of a more effective, hybrid approach.
Traditional computational image analysis typically relies on task-specific computer-vision architectures such as YOLO-style object detectors, Mask R-CNN segmentation models, or transformer-based vision–language models like BLIP-2 or Flamingo. These systems are designed to extract concrete visual features (e.g., objects, faces, scene layouts) and perform discrete tasks with high accuracy when trained on large, labeled datasets. However, they are not built to apply multi-layered interpretive schemas such as Rodriguez and Dimitrova’s four-tier model without extensive task-specific fine-tuning. Because our study evaluates GPT-4o in a zero-shot setting—asking a general-purpose vision-language model to follow a human-developed codebook—we position this approach as complementary to, rather than a replacement for, traditional CV pipelines.
AI techniques for the scanning, sampling, and quantization of visual images are evolving rapidly, rendering accurate summaries of developments difficult (Zhang & Dahu, 2019). Complicating the quickly changing terrain is that machines now produce most images, often for other machines rather than the human eye (Paglen, 2019). Nonetheless, AI remains data-driven, rather than image-driven, meaning that understanding image-data relationships should remain a priority (Anderson, 2017). Whether computerized or not, visual cultures influence and are influenced by human biases in both the production and consumption of online messaging (Bridle, 2023; Sezen, 2020). Combining quantitative and qualitative analyses of big data can augment visual framing. Dondero, for example, maintains that large-scale diagrams produced through quantitative and semiotic analysis can assist in identifying “contrasting areas, opposite areas, or superposing of images on the plane of expression” useful for further quantitative and qualitative analysis (Dondero, 2019, p. 140). Such a combined approach preserves the importance of visual context within a specified corpus and across the image components. In short, she advocates for scholars to use her process as a metavisual device for four reasons:
(1) these visualizations are images of images; (2) the parameters used to arrange them are visual; (3) the automatic distribution of the images is visualized spatially in a presentation governed by abscissas and ordinates; (4) the content analysis…remains within the realm of images (filiation, tradition, citation, genre, etc.) and not the abstract realm of verbal description (Dondero, 2019, p. 141).
Here, we agree with Dondero (2019) about the value of computerized quantitative analysis for assisting qualitative results. However, we add that quantitative human coding analysis, combined with statistical assessments, can further strengthen the tracked meaning of results. We begin by asking:
RQ1: How effectively can a vision-language model (GPT-4o) apply a human-developed content-coding schema to Rodriguez and Dimitrova’s four-tiered visual framing analysis?
Large language models and NLP pipelines can efficiently process large text corpora, grouping semantically similar responses and surfacing recurring patterns. Gamieldien et al. (2023) find that transformer-based tools can generate highly granular codes across thousands of student reflections, substantially reducing human labor. This aligns with earlier infrastructure-oriented work showing NLP can automatically classify predictable text categories. Similarly, Morgan (2023) reports that ChatGPT performs well when themes are concrete and descriptive, requiring little interpretive inference. In hybrid interfaces, rule-based suggestions can be systematically extended to unseen data (Rietz & Maedche, 2021), increasing agreement rather than replacing human interpretation and underscoring that automation is most reliable for patterned, literal, and structurally evident meaning.
Despite scalability gains, current AI systems appear to underperform when meaning depends on tacit knowledge, ambiguity, or socio-cultural interpretations. Gamieldien et al. (2023) note the need for AI researcher oversight when semantic nuance matters. Studies of thematic automation note that disagreement among humans themselves reflects interpretive pluralism (Armstrong et al., 1997; Mackieson et al., 2019) — something models are poorly equipped to resolve. Transformer architectures excel at long-range dependencies (Lakretz et al., 2020), but still primarily attend to textual features rather than framing context, affective tone, or symbolic cues. These limitations indicate that interpretive coding requires judgment beyond probabilistic associations, particularly in indexical, connotative, or historically situated domains.
Across studies, researchers express caution about fully delegating thematic interpretation to automated systems. Marathe and Toyama (2018) report reluctance rooted in opacity, loss of theoretical accountability, and few opportunities for questioning (Chen et al., 2018). Rietz and Maedche (2021) similarly find researchers use automated suggestions not to accelerate coding, but to reflect on needed codebook refinements. Such reflection aligns with iterative qualitative traditions emphasizing continuous interpretation (e.g., Braun & Clarke, 2006; Saldana, 2021, cited in Gamieldien et al., 2023). Thus, for tasks requiring contextual inference, socio-cultural reading, or interpretive framing, human coders continue to outperform computational models. Because visual framing often requires reading symbolism, composition, affect, and implied narratives, we ask:
RQ2: Which dimensions of visual framing remain resistant to automation, and what do these limitations reveal about the strengths of human coding in visual analysis?
Methodology
To assess the effectiveness of GPT-4o for applying a human-developed content-coding schema to visual framing analysis, we began by conducting a human-coded content analysis of 7,292 images from al-Qaeda and ISIS’s English and Arabic magazines and newsletters distributed between 2009 and 2020 (see Table 1). For al-Qaeda, the English issues included Inspire (1-17) and Jihad Recollections (1-4), and the Arabic issues were al-Masra (1-57). ISIS’s English issues included Dabiq (1-15) and Rumiyah (1-13), and al-Naba (1-229) in Arabic. All items were publicly available through Google, Jihadology (Zelin, 2021), or archive.org.
Table 1:
Image Count in Al-Qaeda and ISIS Online Publications
We recruited 13 expert coders from Egypt, Afghanistan, Turkey, Saudi Arabia, Syria, Poland, Vietnam, and the United States to create, refine, and apply a visual analysis codebook. Our human coders had doctorates or graduate training in Communication Studies, Psychology, Political Science, and Education. The pilot phase involved three US and Egyptian coders who created coding categories inductively from images in Dabiq’s first issue until intercoder reliability exceeded 0.80 on Cohen’s kappa for all variables. Coders met weekly to identify and resolve discrepancies and cross-cultural differences that produced unacceptable reliability levels. In cases of disagreement, a bias toward the Middle Eastern perspective prevailed, in line with the primary target audience. With a reliable codebook, 13 coders received oral and written training and analyzed each image in the dataset. The average intercoder reliability score using Cohen’s kappa across all categories was 0.91 (see Table 2). A third coder resolved discrepancies for statistical analysis.
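For readers unfamiliar with the statistic, Cohen’s kappa corrects raw percent agreement for the agreement two coders would reach by chance given their marginal label frequencies. A minimal Python sketch (illustrative only, not the software our team used) shows the computation for two coders’ nominal labels:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' nominal labels (same items, same order)."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of items both coders labeled identically
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement expected from each coder's marginal label frequencies
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values above 0.80, the threshold used in our pilot phase, are conventionally read as strong agreement.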
Table 2:
Intercoder reliability and description of the manual coding instrument
We sorted our human coding categories into four levels of visual framing. While Rodriguez and Dimitrova (2011) note that the four tiers can overlap, we focused on the level that three coders considered most suited to the images’ characteristics. The meaning of denotative elements (or objects that bore an indexical relationship with an individual, thing, or place) involves two interrelated processes. First, the coding process accounts for elements the viewers can see in the shot, rendering them “closer to the truth than other forms of communication” (Messaris & Abraham, 2001, p. 217). Fully gauging the denotative meaning, however, requires generating frames that derive from a second process that looks not only at the textual context (Rodriguez & Dimitrova, 2011), but also into the syntactic relationships between images. Compared to words, images lack an explicit propositional syntax, or the ability to express clear causal relationships, similarities, or other forms of connections (Messaris & Abraham, 2001). Our denotative categories included military fighters, death, humans, leaders, flags, and destruction.
The second level of semiotics (or stylistic, technical, and subject conventions) focuses on “signs and symbols, sign systems, and sign processes” (Moriarty, 2002, p. 20). Visual semiotics fulfills three meta functions: compositional, representative, and interactive (Kress & Leeuwen, 1996, 2006), as it emphasizes four types of signification: arbitrary (by convention), mimetic (by iconic representations), evidential (by cues and codes), and signaling (by recognition) (Moriarty, 2002). Metaphors or metaphorical thinking (Feng & O’Halloran, 2013), and visual metonymics linked to abstract concepts and objects/events are also associated with semiotics (Feng, 2017). Accordingly, our semiotic categories included viewer position, image position, viewer distance, eye contact, and facial expressions.
The third framing level of connotation involves visual symbols linked to ideas or concepts associated with individuals, things, or places. Turner insists symbology draws its data from “cultural genres or sub systems of expressive culture…as well as narrative genres, such as myth, epic, ballad, the novel and ideological systems [and t]hey would also include non-verbal forms” (Turner, 1979, p. 12). Symbols can allude to power, authority, faith, rituals, and death, among others, to achieve personal or group goals (Turner, 1974). At the iconographical level, visual symbols go beyond the depicted object or person to connote ideas and concepts; they begin to reveal ideological meanings derived from backgrounds and the surrounding context (Panofsky, 1955; Leeuwen, 2001). Our connotative categories included about to die, religion, and state, with the latter disaggregated into state-building, law enforcement, allegiance pledges, and media propaganda for a more fine-grained analysis (see Appendix A for category meanings; Table A1 for examples).
We removed several manual coding categories from our computational model. We excluded image size and position because the workflow operated on individual images rather than publication layouts. We omitted gender because the overwhelming number of individuals displayed were male, with females only as occasional outliers. We removed age and social infrastructure as neither produced significant results across more than a dozen papers addressing how message strategies intersected with situational factors.
To compare manual content coding with AI visual coding, we built a lightweight, fully reproducible inference pipeline that (1) ingested image files, (2) encoded each image and sent it to GPT-4o together with a structured labeling prompt derived from our codebook, and (3) compiled the model’s outputs into a standardized dataset for evaluation against human annotations. The pipeline did not fine-tune model weights; it relied on constrained prompting and schema validation to ensure consistent, interpretable results (see Figure 2).
Step 1: Image Preparation and Alignment
All images were extracted from the full corpus of al-Qaeda and ISIS publications and matched to their manual coding entries. Each file was saved using a standardized naming convention that included publication, issue number, and image number (e.g., D_12_04), allowing a one-to-one linkage between visual material and its metadata to ensure tracking capacity back to images’ sources and manually coded attributes.
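The naming convention can be enforced programmatically. The sketch below is illustrative: the D_12_04 pattern follows the scheme described above, but the prefix-to-publication map shown here is a hypothetical example rather than our exact lookup table.

```python
import re

# Hypothetical prefix map for illustration; the study's actual table
# covered every publication in the corpus.
PUBLICATIONS = {"D": "Dabiq", "R": "Rumiyah", "I": "Inspire",
                "M": "al-Masra", "N": "al-Naba"}

FILENAME_RE = re.compile(r"^([A-Z]+)_(\d+)_(\d+)$")

def parse_image_id(stem):
    """Split a file stem like 'D_12_04' into publication, issue, and image number,
    preserving the one-to-one linkage between image and metadata."""
    m = FILENAME_RE.match(stem)
    if m is None:
        raise ValueError(f"unrecognized image id: {stem}")
    prefix, issue, image = m.groups()
    return {
        "publication": PUBLICATIONS.get(prefix, prefix),
        "issue": int(issue),
        "image": int(image),
    }
```

Rejecting malformed stems at ingest time ensures every AI output can later be traced back to its source publication and manual coding entry.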
Step 2: Preprocessing and Encoding
The repository contained PNG, JPG, JPEG, GIF, BMP, and WEBP formats. Images were maintained at their original resolution without resizing or alteration of embedded metadata to preserve visual detail integrity. Each image was converted into a text-based data format through base64 encoding, which allows reliable transmission of visual information as text while retaining the original pixel structure (Lokmanoglu & Walter, 2025). Detailed logs documented each image’s processing status to ensure completeness and traceability.
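Base64 encoding is a standard-library operation in most languages. A minimal Python illustration shows that the mapping is lossless: decoding recovers the original bytes exactly, so no pixel information is altered in transit.

```python
import base64
from pathlib import Path

def encode_image_bytes(data: bytes) -> str:
    """Map raw image bytes to ASCII text; fully reversible, pixels untouched."""
    return base64.b64encode(data).decode("ascii")

def encode_image_file(path) -> str:
    """Read an image file and return its base64 text form for text-only APIs."""
    return encode_image_bytes(Path(path).read_bytes())
```

The same `base64` module decodes the text back to identical bytes (`base64.b64decode`), which is what makes the round trip through a text-based API safe.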
Step 3: Model Inference and Structured Prompting
We provided each encoded image directly to GPT-4o (OpenAI, 2024) along with a structured prompt adapted from the visual framing codebook. The prompt specified the exact coding categories (e.g., military role, human figures, eye contact, state-building, and religious symbolism) and required the model to classify each image according to those predefined labels (see Appendix B). Category definitions were embedded in the prompt to guide consistent decision-making. The model’s responses were constrained to a fixed output schema composed of numeric identifiers corresponding to each category.
This study employed a zero-shot prompting design. The model received the structured labeling prompt and visual input simultaneously without prior exposure, calibration subset, or iterative refinement. The objective was not to train or improve GPT-4o’s performance but to evaluate how an off-the-shelf vision-language model applied an existing human content-coding schema. Each API call contained a single image and the associated codebook prompt without captions, metadata, or textual context.
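A single zero-shot call of this kind might be assembled as below. This is a hedged sketch, not our production code: the message structure follows the OpenAI chat-completions format for vision input, while the three category names and the JSON validation step are illustrative stand-ins for the full codebook schema in Appendix B.

```python
import json

# Illustrative subset of codebook categories and their allowed numeric codes
CATEGORIES = {
    "military_role": [0, 1, 2],
    "eye_contact": [0, 1],
    "religious_symbolism": [0, 1],
}

def build_request(image_b64: str, codebook_prompt: str) -> dict:
    """Assemble one zero-shot request: a single image plus the codebook prompt,
    with no captions, metadata, or prior examples accompanying the image."""
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": codebook_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }

def validate_output(raw: str) -> dict:
    """Enforce the fixed output schema: every category present, value in range."""
    labels = json.loads(raw)
    for category, allowed in CATEGORIES.items():
        if labels.get(category) not in allowed:
            raise ValueError(f"schema violation for {category!r}")
    return labels
```

The request dictionary would be passed to the OpenAI client’s chat-completions endpoint; validating each response against the fixed schema is what keeps the model’s free-form generation constrained to interpretable numeric labels.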
To ensure AI response integrity, we monitored outputs for potential misclassifications or omissions related to sensitive or graphic imagery. System logs were reviewed after each coding run to identify uncoded images flagged as indeterminate or potentially withheld due to ethical safeguards. No systematic filtering or suppression of graphic content occurred, but the logging process allowed for post hoc verification should discrepancies arise.
Step 4: Output Aggregation and Reliability Assessment
The model’s outputs were compiled into a unified dataset and compared with the manual coding results using standard reliability and performance metrics, including precision (the proportion of correct positive predictions), recall (the proportion of actual positives correctly identified), and F1-score (the harmonic mean of precision and recall). We also report F2-scores (which weight recall more heavily than precision), macro-averaged and per-class metrics, as well as two measures of intercoder reliability between AI and human coders. Reports of percent agreement served as an intuitive measure of alignment.
Following automated content analysis research standards, F1 scores above 0.80 indicated excellent agreement between AI and human coding, scores between
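These evaluation metrics reduce to counts of true positives, false positives, and false negatives per category. A short Python sketch of the binary-label case, where F-beta generalizes both F1 (beta = 1) and F2 (beta = 2):

```python
def precision_recall_f(y_true, y_pred, beta=1.0):
    """Precision, recall, and F-beta for binary labels (1 = category present)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    # F-beta: beta > 1 weights recall more heavily than precision
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

def percent_agreement(y_true, y_pred):
    """Raw proportion of items on which AI and human labels match."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Libraries such as scikit-learn provide equivalent functions (`precision_score`, `recall_score`, `fbeta_score`); the sketch makes the arithmetic explicit.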
Findings
RQ1 asked how effectively a vision-language model (GPT-4o) could apply a human-developed content-coding schema to Rodriguez and Dimitrova’s four tiers of visual framing. The AI coding performance differed in substantial ways across variables and framing tiers (see Table 3 and Appendix B Table B1). Denotative variables showed the strongest alignment between AI and human coding. Humans, Destruction, Leaders, and Flags demonstrated the most consistent performance across metrics. Humans yielded an F1 of 0.80 and a Balanced Accuracy of 0.83. Destruction performed similarly (F1 = 0.79, Balanced Accuracy = 0.83). Leaders and Flags showed moderately strong performance, with F1 values generally ranging from 0.70 to 0.78 and Balanced Accuracy typically above 0.75. Krippendorff’s α
Denotative variables requiring additional contextual interpretation showed more modest alignment. Military Role achieved an F1 of 0.38, a Balanced Accuracy of
Semiotic variables requiring interpretations of bodily, expressive, or relational cues showed weaker correspondence. Viewer Position, Viewer Distance, Eye Contact, Stance, and Facial Expression produced F1 values generally 0.60-0.72, Balanced Accuracy values 0.72-0.76, and Krippendorff’s α
Connotative variables tied to symbolic or ideological meanings showed the weakest correspondence. State-building performed poorly (F1
Table 3:
F1 and F2 scores, balanced accuracy, percent agreement, Krippendorff’s alpha, and Cohen’s kappa for each visual framing variable
RQ2 asked which dimensions of visual framing remain resistant to automation, and what these limitations reveal about the strengths of human coding in visual analysis.
Lesson #1: Denotative Interactions
Despite AI’s stronger performance at the denotative level, inaccuracies and omissions were present. Military roles, for example, showed only modest agreement levels, as manual coders examined taglines and accompanying text to distinguish al-Qaeda and ISIS militants from enemy fighters. Without these textual cues, AI would require additional training on uniform types or human intervention. Additionally, AI could detect objects like ISIS’s coins, outdoor markets, and competing currencies, but required researchers to recognize their connotative meaning, such as ISIS’s desired frame of economic independence. Similarly, AI could detect bottles of alcohol, drugs, and cigarettes (see Figure 3), but required human coders to recognize them as critical components of ISIS’s moral policing apparatus. In short, the addition of human coding can render computational methods less likely to miss key objects or elements in big datasets and more likely to properly assess their functions.
Another key contribution of manual coding to AI processing involves aggregation of denotative elements. Our AI learning approaches generated disparate visual elements, such as militants fighting or training, raised index fingers, swords, maps, doctors treating patients, and sunsets. Human coding, however, captured subtle visual relationships that revealed broader themes of military prowess and state-building, identified how relationships created cohesive narratives, showed how symmetry, repetition, or alignment influenced viewer interpretations, and unveiled the visual syntax strategy. For example, human coders identified the interrelationship between images of beheadings, piles of cigarettes, and checkpoints as ISIS’s law enforcement apparatus, complete with punishments for alleged spies, the moral policing apparatus, and the access points for determining who could enter the caliphate. Entman’s framing associations were useful to unlock the messaging strategy. The visual law enforcement frame communicated that sins were rampant (problem definition) because of people smuggling contraband, committing treason, and ignoring Islamic rules (causal interpretation), which stained the society (moral evaluation), thus requiring punishments and crackdowns on contraband to ensure community safety (treatment recommendation). An unsupervised learning computational approach on its own would not fully reveal the visual narrative.
Figure 3:
Photo from the 10th issue of Dabiq magazine showing ISIS’s hisba agents burning cigarettes and alcohol – Released July 2015
Lesson #2: Semiotic Interactions
For the most part, AI performed poorly on visual semiotic elements, rendering human coding highly valuable in this domain. With Krippendorff’s
Human coding could also help narrow categories down to coding options most useful for understanding visual strategies. For example, semiotic understandings of viewer distance focus on four categories: intimate, personal, public, and social (Jewitt & Oyama, 2008). Each conveys specific meanings associated with standard human interactions (e.g., intimate distances associated with distraught photo subjects; photo subjects shot at a public distance conveying group rather than individual identities). However, human coding and statistical analysis revealed that significant findings primarily appeared only after combining intimate and personal distance (with photo subjects photographed from less than four feet) and comparing them with the combined categories of public and social distance (i.e., greater than four feet). Such findings help avoid overlooking important patterns that could be missed with strict adherence to viewer distance’s four standards.
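Category collapsing of this kind is trivial to implement once human analysis identifies the meaningful cut point. A minimal sketch, assuming the four Jewitt and Oyama labels and the four-foot threshold described above:

```python
# Collapse the four viewer-distance categories (Jewitt & Oyama, 2008) into
# the binary contrast that produced significant findings in our analysis.
CLOSE = {"intimate", "personal"}   # subject photographed from under four feet
FAR = {"public", "social"}         # subject photographed from four feet or more

def collapse_distance(category: str) -> str:
    """Map a four-way viewer-distance label onto the close/far dyad."""
    if category in CLOSE:
        return "close"
    if category in FAR:
        return "far"
    raise ValueError(f"unknown viewer distance: {category!r}")
```

The same pattern applies whenever statistical analysis reveals that a finer-grained semiotic scheme only becomes significant after aggregation.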
Lesson #3: Connotative Interactions
While computational coding could efficiently identify key symbols, human content expertise helped interpret needed cultural genres and subsystems. The black flag, for example, is a transhistorical object al-Qaeda and ISIS used as a symbol of adherence to Islam and the Prophet Muhammad’s path. It typically features the words “No God but Allah” with a white circle beneath carrying the words “Muhammad is Allah’s Messenger.” Yet, al-Qaeda often deviated from standard black flag depictions, also featuring white and other emblems used in Afghanistan and elsewhere (see Figure 4). Without such insights, computational extractions of flags as denotative elements would miss other symbolic variations essential for understanding the nuance of extremist groups’ visual messaging. Another frequent symbol in al-Qaeda and ISIS photographs was militants raising their index fingers. The diverse makeup of our manual coding team, including Muslim researchers, identified the gesture as part of Islamic culture, connoting monotheism and the dedication of deeds to the one God. Culturally informed, human content expertise was instrumental for properly training and validating computational visual analyses to generate insightful media campaign understandings.
Figure 4:
Photo from the 17th issue of Al-Masra newspaper showing a militant holding a white flag with the same text that appears on the black banner – Released July 2016
When collaborating with AI, another beneficial area for human coders is adding useful insights about tropes and other visual commonplaces. A prime example in al-Qaeda and ISIS’s media was the excessive reliance on the about to die visual trope, which appeared in 75 percent of their images. Yet, AI missed many instances of the trope. About to die images assume three forms: presumed (i.e., showing implements of death like weapons and destruction), possible (i.e., showing photo subjects potentially dying without a confirmation their death occurred), or certain (i.e., showing photo subjects with accompanying text confirming their deaths) (Zelizer, 2010). An unsupervised learning method would likely group all three forms under a military visual frame, hence not distinguishing between the three constructs nor accounting for the groups’ emphasis on the three disaggregated strategies. Instead, breaking down each form into its own core denotative components could facilitate training and validation. Labeled data for supervised learning could account for objects or elements, such as blood, swords, knives, guns, AK-47s, tanks, armored vehicles, rockets, ammunition, fire, smoke, explosions, and sniper crosshairs. Human coding could then complement the computational analysis by grouping the three about to die clusters, each with its own unique viewer interactions (Zelizer, 2010).
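A rule-based sketch shows how the three forms might be disaggregated downstream of object detection. The object vocabulary and decision rules here are illustrative assumptions, not a validated classifier; in particular, textual confirmation of death would come from human coders, since pixels alone cannot establish it.

```python
# Hypothetical object vocabulary for the "about to die" trope (Zelizer, 2010)
DEATH_IMPLEMENTS = {"blood", "sword", "knife", "gun", "ak47", "tank",
                    "armored_vehicle", "rocket", "ammunition", "fire",
                    "smoke", "explosion", "sniper_crosshair"}

def classify_about_to_die(objects, human_subject_present, text_confirms_death):
    """Map detected denotative elements onto presumed / possible / certain forms.

    'objects' is a set of labels a supervised detector might emit; the two
    boolean flags encode judgments supplied by human coders.
    """
    if not (set(objects) & DEATH_IMPLEMENTS):
        return None  # no about-to-die trope detected
    if text_confirms_death:
        return "certain"   # accompanying text confirms the death occurred
    if human_subject_present:
        return "possible"  # subject shown potentially dying, unconfirmed
    return "presumed"      # implements of death without a depicted subject
```

Splitting the trope this way lets supervised detection handle the concrete objects while human coders supply the contextual judgments that distinguish the three forms.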
Lesson #4: Ideological Interactions
Increasingly, scholars have recognized the linkages between ideology and discourse. Fabiszak et al. define ideology as “systems of beliefs, shared by a social group, with the power to evaluate and explain the social world” (Fabiszak et al., 2021, p. 409). McGee adds that “ideology in practice is a political language, preserved in rhetorical documents, with the capacity to dictate decision and control public belief and behavior” (McGee, 1980, p. 5). Van Dijk (1998) explains that ideologies can influence the human mind’s cognitive structures. McGee (1980) posits that a full set of ideological propositions can be summed up in a single term, while Edwards and Winkler (1997) extend such reasoning to a small set of visual images.
Human coding can help AI users distinguish between visual markers of culture and other images not performing ideological functions. One example of how this process could work concerns ideographs. In his study of American culture, McGee (1980) defines a small subset of positive and negative words as ideographs (e.g., freedom, liberty, slavery, and terrorism). They serve as ordinary terms in political discourse, have abstract meanings that allow for collective commitment, warrant the use of power, guide behavior, and have culture-bound meanings. Consideration of the ideograph’s characteristics within social groups can assist AI users in narrowing large corpuses of visual images to specific objects that serve as cultural markers. With al-Qaeda and ISIS, for example, one visual ideological code is the group’s display of the monotheism gesture, whereby Muslims point their index fingers upwards toward heaven to connote their adherence to Allah. Al-Qaeda and ISIS frequently utilize images of the same gesture to signal potential Muslim recruits sympathetic to their groups’ causes.
However, a key function of human coders in the AI training process involves understanding the interactions of visual elements of ideographs within an image. Returning to the monotheism gesture example, AI would be able to scrape all images showing humans pointing their index finger upward. Such an approach, absent human coders' insights, would initially yield many images of Muslims demonstrating their Islamic faith with no affiliation with extremism. AI might also retrieve images of athletes or other individuals raising their index finger to signify success and victory (see Figures 5 & 6). Rather than confound the analysis with too much noise to produce meaningful results, AI trainers could refine the scraping process by asking for the gesture along with the presence of militants, males, direct eye contact in photo subjects, personal distance, black flags, and the number of humans, as each of these variables has a documented relationship with the monotheism gesture in extremism photographs (Winkler, 2022). By considering element constellations rather than single objects or elements, the dataset becomes more accurate in discerning ideological frames.
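The constellation logic described above can be sketched as a simple co-occurrence filter: an image counts as a candidate ideological frame only when the gesture appears alongside enough of the human-identified contextual elements. The label names, the `is_ideological_candidate` helper, and the threshold are illustrative assumptions, not the study's implementation.

```python
# Illustrative "element constellation" filter: keep an image as a candidate
# ideological frame only if the monotheism gesture co-occurs with enough of
# the contextual elements noted in the text (Winkler, 2022). Labels and the
# threshold are assumptions for illustration.

CONSTELLATION = {"militant", "male", "direct_eye_contact",
                 "personal_distance", "black_flag"}

def is_ideological_candidate(detected, min_cooccurring=2):
    """Return True only if the gesture appears with co-occurring elements.

    detected: set of labels produced by a (hypothetical) vision model
    min_cooccurring: how many constellation elements must also be present
    """
    if "index_finger_raised" not in detected:
        return False
    return len(detected & CONSTELLATION) >= min_cooccurring

# A soccer celebration showing only the gesture is filtered out, while a
# propaganda photo pairing it with militants and a black flag passes.
```

Filtering on constellations rather than a single object is what keeps the retrieved corpus from drowning in gesture images with no ideological function.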
Figure 5:
Photo from the 1st issue of Rumiyah magazine showing militants before an attack in Iraq, one of whom (on the right) is signaling the monotheism gesture – Released September 2016
Figure 6:
Photo showing two indoor soccer players from the Moroccan national team celebrating by signaling the monotheism gesture – Released by Equipe du Maroc/Facebook September 2024
Lesson #5: Image-Context Interactions
Context is a multifaceted resource involving a myriad of forms that can influence interpretations of texts, whether discursive or nondiscursive (Linell, 1998). It both shapes and is shaped by its textual interactions. As McGee notes, “Failing to account for ‘context,’ or reducing ‘context’ to one or two of its parts, means quite simply that one is no longer dealing with discourse as it appears in the world” (McGee, 1990, p. 283).
The complicated interactions between images and contexts in relation to al-Qaeda and ISIS emphasized the need for human coders to supplement AI for meaningful, efficient message processing. To begin, human coding aided in the identification of image codes that corresponded to changes in context factors over time. As Appendix C, Table C1 shows, 18 of 30 variables in our manual codebook bore a significant relationship to context variables (e.g., troop withdrawal announcements, online account suspensions, attack lethality, etc.). Those relationships suggest a productive, efficient training regimen for AI, as the context factors reveal much about the groups' response interactions over time. The remaining coding categories, while potentially useful for assessing image meaning or relationships with other images, did not significantly change in frequency over time, making them less of an AI priority.
Human coding can also help verify the appropriate AI scope for assessing the impact of context variables on strategic image use. The table reveals that the relationships with image strategies vary according to the context factor under consideration. As a result, the outputs of human coding point to context variables that need verification checks before any premature conclusion that a single context variable is responsible for shifts in image characteristics. For example, censorship of militant groups' online accounts, announcements of anticipated troop withdrawals, and the relative rise in standing of competing ideological groups all relate to significant changes in the use of photo subject distance. Users of AI should therefore consider whether these context factors overlap during the timespan under evaluation before drawing conclusions about the potential influence of any single context element.
Human coding can also yield insights regarding appropriate AI disaggregation of context elements in relation to image characteristics over time. For example, our human-coded assessment of leader loss revealed that the deaths of ISIS leaders at different levels and of different types corresponded to different changes in the group's image characteristics. Political and military leader deaths corresponded to different image changes, as did the deaths of leaders at different statuses within the media hierarchy. Thus, fine-tuned human coding analysis can aid in the development of a more robust, efficient AI system for analyzing the image-context strategies of groups like al-Qaeda and ISIS.
Conclusion
This analysis demonstrates the advantages of having humans and AI work together to understand the visual framing of extremist group messaging. Human coding yielded benefits for the retrieval, analysis, and validation of results related to denotative, semiotic, connotative, and ideological framing. It also aided in understanding how AI-human interactions can maximize text-context relationships. Yet, validation checks between the human and AI coding revealed that percent agreement levels varied considerably across coding categories, suggesting the need for a more robust AI training process, particularly for subjective variables, like facial expressions and perceived distance, and for identifying visual constellations linked to connotative and ideological framing.
Examining the extremist visual context provided an ideal opportunity to test and compare manual and computational coding for MENA-based violent groups, but it does not necessarily apply to other types of violent protest visuals. The generalizability of the findings derived from al-Qaeda and ISIS photographic campaigns should be tested in relation to other forms of political violence (e.g., electoral protests, climate activism, and public vigils). Future studies should determine if the reliability of AI visual framing is transferable to these other settings.
Additionally, variables with inherently nuanced or subjective definitions, such as stance or eye contact, pose significant challenges for consistent annotation. These complexities are reflected in the low recall and F1-scores observed in these categories, as the AI model struggles to align with human coders’ interpretations. The limitations of computational methods in capturing subtle cultural or contextual cues further exacerbate these discrepancies, particularly for categories like state, religion, and impending death that rely heavily on contextual understandings. Future studies should examine alternative training protocols to maximize efficient and reliable extraction processes.
Data Availability: Replication materials for this study are hosted on the Open Science Framework (OSF). Due to the presence of violent and potentially harmful imagery, the image data are archived as a restricted-access component on OSF and may be accessed upon request, subject to review and approval: https://osf.io/r9tjm, project DOI: 10.17605/OSF.IO/R9TJM. All non-sensitive replication materials—including the coding instrument, model prompts, variable definitions, documentation, and analysis scripts—are publicly available on OSF and are also mirrored on GitHub for ease of access and version control: https://github.com/aysedeniz09/Visual-Framing-in-the-AI-Era.