Fostering Linguistic Diversity : Enhancing AI Models for Inclusive Global Accessibility
Maximiliano Juang (Los Angeles Center for Enriched Studies), Nathanaël Moore (Daniel Webster Middle School), Carl Moore, Philo Juang
Abstract
As middle school students in bilingual education, we are keenly aware of the rapid integration of artificial intelligence (AI) into our lives. AI's limited proficiency in languages other than English presents a pressing challenge, impacting communication, information access, and education. This issue not only hampers AI's effectiveness but also perpetuates the dominance of English, deepening linguistic imbalances in global AI utilization. This is especially important for us, the first generation to witness AI's widespread impact, as it directly affects our ability to maximize these technological advancements.
Furthermore, the consequences of English-language dominance in AI usage are substantial. If AI models perform better in English, non-English speakers may increasingly rely on translation tools or default to English, reinforcing the language's prevalence. This cyclical pattern exacerbates linguistic disparities and further marginalizes non-English languages. Though this paper uses Midjourney for purposes of illustration, the same limitations extend to other AI models.
To address these challenges, we propose a novel model wherein governments and non-profit organizations collaborate to curate linguistically diverse and accurate datasets. Contributors, including newspapers, social media platforms, and universities, should be fairly compensated for their linguistic contributions. These collaborative efforts aim to create comprehensive datasets encompassing numerous languages, empowering AI models to understand and generate content proficiently across linguistic boundaries.
Context
Generative AI is a cutting-edge field of artificial intelligence that focuses on creating machines capable of generating content, such as text, images, or even music, that closely resembles human-created output. This technology leverages deep learning techniques to enable computers to produce creative and novel content, making it invaluable in various applications, from natural language generation to creative art and beyond.
Training data serves as the lifeblood of AI, profoundly shaping its capabilities and performance. The quality and diversity of training data directly influence an AI model's accuracy and ability to generalize to new situations. Biases in the data can result in biased predictions, underscoring the importance of careful data curation. Moreover, the quantity of data impacts the scale and complexity of AI models, with more data often leading to more robust and powerful systems. In essence, training data is the foundation upon which AI builds its understanding of the world, making it a pivotal factor in the development and behavior of artificial intelligence.
Current image generation services may exhibit limitations in their ability to generate culturally relevant and contextually accurate images for non-English languages due to a deficiency in the quantity and diversity of non-English training data. These services often rely heavily on English-centric datasets, which could result in a lack of understanding of linguistic nuances, cultural symbols, and regional contexts specific to non-English languages. Consequently, the dearth of non-English training data may hinder the capacity of AI services to cater effectively to a global audience, potentially leading to inaccuracies, misinterpretations, and underrepresentation in the visual content they produce for non-English-speaking users.
Background
The vast majority of training data for language models, including large-scale ones like GPT-3, are heavily biased towards English. This dominance limited the proficiency of these models in understanding and generating content in other languages, leading to disparities in performance across languages.
On Wikipedia, for example, as of August 2023, English has 2.56M articles. German is next with 808k, with French at 709k. Spanish is in 9th place with 402k articles. Estimates claim that of ChatGPT3’s training data, 90% is in English.
Many languages, especially those with fewer speakers or resources, are underrepresented in training data. This has made it challenging to develop AI models that could effectively work with low-resource languages, hindering global accessibility to AI technologies. Language models can inadvertently perpetuate biases present in training data, particularly for languages and cultures that were underrepresented. We must make an effort to mitigate these biases and improve the fairness of AI systems across languages.
Limitations in Generative AI
We show as our example language-constrained limitations in image generation via Generative AI, such as Dall-E and Midjourney.
Generative AI employs advanced neural network architectures like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) to generate images. These models learn intricate patterns, textures, and structures from large datasets of existing images. Using this learned knowledge, they can produce new and often highly realistic images by either sampling from a latent space (in VAEs) or engaging in an adversarial process where a generator network creates images to deceive a discriminator network (in GANs). Generative AI finds applications in diverse fields, from generating creative artworks and data augmentation for machine learning to simulating visual scenarios and enhancing data visualization.
Generative AI, including cutting-edge technologies like stable diffusion models, is instrumental in producing high-quality images. Stable diffusion models are a type of generative model that progressively refines images by iteratively adding noise to them. Through a sequence of transformations, these models can generate images with exceptional realism and diversity, capturing intricate details and textures. This approach is invaluable for tasks such as creating lifelike visual content, enhancing image generation quality, and even simulating realistic scenarios in applications like computer graphics and simulation.
Examples include DALL-E and Midjourney. DALL-E is an AI model created by OpenAI that combines elements of language and image generation. It is a sibling model to GPT-3 and is designed to generate images from textual descriptions. Unlike traditional image generation models that rely on datasets of existing images, DALL-E can create entirely new images based on textual prompts. For example, if given a prompt like "an armchair in the shape of an avocado," DALL-E can generate a unique and creative image that matches the description. DALL-E demonstrated the potential of AI to generate novel and imaginative visual content based on human-written text.
Midjourney is an AI-driven image generator in active development, gaining popularity for its capacity to create high-quality images from text descriptions. It employs various AI techniques, including diffusion models for gradual image generation, text-to-image transformers to understand user-provided descriptions, attention mechanisms for focusing on key details, contrastive learning to distinguish between similar and dissimilar images, and GANs (Generative Adversarial Networks) for enhancing image quality through adversarial competition.
Hypothesis
Current image generation services may face challenges in producing diverse and contextually accurate images for non-English languages due to a shortage of relevant training data. Image generation models heavily rely on expansive and varied datasets to comprehend visual patterns and generate lifelike images. However, a significant portion of existing datasets predominantly features English-centric content, posing obstacles such as limited linguistic understanding, difficulties in interpreting non-English textual descriptions, challenges in incorporating cultural nuances, and potential bias and underrepresentation in generated images.
To assess this hypothesis, experiments comparing the quality of images generated for non-English prompts using models trained on English-centric data versus those enriched with diverse non-English training data could reveal insights into the impact of training data diversity on the image generation capabilities of AI systems across languages.
Example
As an example, we show a passage from the Hunchback of Notre Dame, which was originally written in French but since has had many high quality translations in other languages.
We used Midjourney, as DALL-E produced only a small part of the scene. Using the same random seed (1831), we used four prompts : Original French, human translated French-to-English, human translated French-to-Spanish, Google-translated French-to-English.
See the appendix for the full text of the prompts. The images produced are below :
English (human translated)
Source : Project Gutenberg
Above our heads is a double ogive vault, panelled with wood carving, painted azure, and sown with golden fleurs-de-lis; beneath our feet a pavement of black and white marble, alternating…
French (original)
Source : Project Gutenberg
Au-dessus de nos têtes une double voûte en ogive, lambrissée en sculptures de bois, peinte d'azur, fleurdelysée en or; sous nos pieds, un pavé alternatif de marbre blanc et noir…
Spanish (human translated)
Source : WikiSource
Encima de nuestras cabezas, una doble bóveda ojiva, artesonada con esculuras de maderas, pintada de azul celeste, flordelisada de oro; debajo de nuestros pies un pavimento alternativo de mármol blanco y negro…
English (machine translated)
Source : Google Translate
Above our heads a double pointed vault, paneled with wooden sculptures, painted azure, fleur-de-lysée in gold; under our feet, an alternating pavement of white and black marble…
The passage describes the interior of a cathedral, and quite clearly the English versions are more accurate – despite being translated from the original French. French appears to be confused by the phrase “nos têtes une double” – nos têtes en double would mean “doubled/duplicated heads.” Spanish seems to think the principal colors are blue and gold (azul celeste, en oro).
Experimental Setup
To evaluate our hypothesis, we generated ~10 images. The prompts were conceived by the two principal authors, in English, and sought topical diversity across culture, education, sports, and entertainment.
The prompts tested were :
Inside a stadium full of people there is a team holding a flag of their nation.
A mech walking through a big destroyed city
Kids in science class with the teacher at the front
A kid at the World Series hitting a home run
Houses on fire and someone standing right in front of the fire
Taking a break during math class in middle school after 30 minutes of work
Me shooting at goal and hitting the crossbar into the goal
Katniss sitting in a tree with an arrow locked into her bow
A Jedi knight fighting two clone soldiers
A tree blooming with purple flowers on its leaves
The prompts were then translated to French and Spanish via Google Translate, to account for regional and individual differences. Prompts were checked by native speakers and re-run through Midjourney. The same random seed was used for all images, to ensure fairness and repeatability, and to avoid results caused by differing random seeds.
Note that both native speakers expressed concern about the quality of the translation, acknowledging the Google translations were technically understandable but not common methods of expression.
Inside a stadium full of people there is a team holding a flag of their nation. Seed 12345678.
Inside a stadium full of people there is a team holding a flag of their nation.
Dans un stade plein de monde, une équipe tient le drapeau de son pays
Dentro de un estadio lleno de gente hay un equipo sosteniendo una bandera de su nación
The results were generally good. In English, an American (or American style) flag was used, whereas in French the model produced French and French-looking flags. Spanish rendered flags vaguely corresponding to Spanish speaking countries (blue and white, yellow/blue).
This may indicate that English and French are strongly anchored on the United States and France, despite many countries using English and French as primary languages, whereas Spanish is less fixed to one country.
Equal results in all languages.
A mech walking through a big destroyed city. Seed 20110422.
A mech walking through a big destroyed city
Un robot traverse une grande ville détruite
Un robot caminando por una gran ciudad destruida.
“Mech” in this case was treated as “robot”. “Mech” in English has a strong connotation to movie–style combat robots (such as Transformers) whereas “robot” in French is linked more strongly to helpful or “toy” robots from the 20th century. Spanish produced results more closely to English, and closer to what was expected.
French also bypasses “ville détruite” several times, as there is likely stronger links between “mechs” and “destroyed cities” in English training data.
English and Spanish are the most accurate in this case.
Kids in science class with the teacher at the front. Seed 20110422.
Kids in science class with the teacher at the front
Enfants en classe de sciences avec le professeur devant
Niños en clase de ciencias con el profesor al frente
The results here were generally accurate, with English and Spanish versions focusing more on chemistry whereas the French versions harken toward electricity/lighting. This distinction is strengthened by the clothing, the English and Spanish versions are much more modern than the French versions, which appear more from the 19th century. Interestingly, the top two Spanish versions produced people that are visibly Hispanic.
A kid at the World Series hitting a home run. Seed 20110422.
A kid at the World Series hitting a home run
Un enfant aux World Series qui frappe un home run
Un niño en la Serie Mundial pegando un jonrón
Though both versions render a small child (“kid”), the English version captures a celebratory moment, in a large stadium, in all cases. The French version renders varied situations, however, including in a room or what looks to be a construction yard — this is likely because “World Series” is a very American thing, and may not have a lot of representation in French training data.
Spanish, however, renders a soccer game. Despite the presence of the word “jonrón,” it appears that “Mundial” is so strongly anchored to the World Cup that the model produced an image about soccer, though usually in a destroyed landscape for some reason.
English is the most accurate in this case.
Houses on fire and someone standing right in front of the fire. Seed 20110422.
Houses on fire and someone standing right in front of the fire
Des maisons en feu et quelqu'un debout juste devant le feu
Casas en llamas y alguien parado justo frente al fuego
Results were generally representative, though interestingly the English versions depict an American suburban/exurban house, the French versions anchored on older European architectural styles, and the Spanish versions more small town pueblo style. The French versions also typically placed multiple people in front, despite the prompt asking for the singular (someone/quelqu’un). Spanish skipped the person (alguien) in most cases.
English is the most accurate in this case.
Taking a break during math class in middle school after 30 minutes of work. Seed 20110730.
Taking a break during math class in middle school after 30 minutes of work
Faire une pause pendant un cours de mathématiques au collège après 30 minutes de travail
Tomar un descanso durante la clase de matemáticas en la escuela secundaria después de 30 minutos de trabajo
While all versions depict a single student, the English and Spanish versions render a typical classroom (“math class”) with students of different ages. However the Spanish version shows a student studying rather than taking a break.
The French versions differed the most, producing an older student studying (not taking a break) with curiously strong focus on the clock (“30 minutes de travail”). Likely the French version anchored on time (30 minutes) and work (“travail”), overriding the earlier part of the prompt regarding a break (“une pause”).
English is the most accurate in this case.
Me shooting at goal and hitting the crossbar into the goal. Seed 20110730.
Me shooting at goal and hitting the crossbar into the goal
Je tire au but et je frappe la barre transversale dans le but
Yo tiro a portería y pego al travesaño en la portería
Initially the results are confusing. In English, the scene is soccer as expected, though “hitting the crossbar” is not clearly depicted.
In French, however, the scene is not at all related to soccer – instead, the images produced referred to cafes or drinks; what we believe occurred is instead of the French verb “to hit” – frapper, or frappe as conjugated here – the “Frappe” drink first popularized by Starbucks dominated the training data, and thus overrode the verb.
Spanish produced equally peculiar images, rendering what appears to be something like the Wild West. It is unclear what the model fixated on as the theme for the image.
English is the most accurate in this case.
Katniss sitting in a tree with an arrow locked into her bow. Seed 20110730.
Katniss sitting in a tree with an arrow locked into her bow
Katniss assise dans un arbre avec une flèche verrouillée dans son arc
Katniss sentada en un árbol con una flecha clavada en su arco
Results were generally good in this case. Katniss (from “The Hunger Games”) is unique and the underlying training images are likely equally good in French, Spanish, and English.
A Jedi knight fighting two clone soldiers. Seed 20110730.
A Jedi knight fighting two clone soldiers
Un chevalier Jedi combattant deux soldats clones
Un caballero Jedi luchando contra dos soldados clon
Both English and French produce clone troopers, but only English adds a Jedi-like figure (identifiable by the lightsaber). Spanish did not produce any recognizable clone troopers, and moreover only produced one figure rather than the two requested. Clone soldiers are likely strongly identified with Star Wars. While not totally accurate, the French and Spanish versions are incomplete.
English is the most accurate in this case.
A tree blooming with purple flowers on its leaves. Seed 20110730.
A tree blooming with purple flowers on its leaves
Un arbre fleuri avec des fleurs violettes sur ses feuilles
Un árbol que florece con flores de color púrpura en sus hojas
Results in all languages were generally good. This is a basic query.
Analysis
Already in a few short training examples, English was always equal or more accurate to other languages. This is likely due to two reasons : abundance of training data, and development/evaluation models themselves.
English prompts are often considered superior in Generative AI due to the abundance of training data available in English. The vast volume and diversity of English-language datasets results in higher performance and more accurate responses compared to languages with less extensive training data. This advantage positions English as a dominant language for training AI models.
The majority of development and evaluation of AI models also typically occurs in English first, leading to a first-mover advantage as well as more time spent overall in English language AI features.
The result of this is that using Google translate to take a foreign language prompt and translate to English often produces superior results. This means that foreign language speakers will either rely on English or use Google Translate, which translates into Gen AI applications using English prompts getting better and better – and foreign languages being left farther and farther behind.
Recommendation
To address language inequality in AI applications, it is crucial for foreign governments to actively contribute high-quality training data in their languages, fostering inclusivity in AI development. By making such datasets accessible to AI companies, governments can play a pivotal role in ensuring accuracy across diverse languages, promoting fairness in global AI deployment. Public, governmental communications like speeches, press releases, and legislative documents provide rich language examples for training models, enhancing context-aware language processing in AI applications.
Universities and news organizations can further contribute to comprehensive AI model training by sharing academic papers, educational materials, and real-world news data. This collaborative effort between academia and media outlets enriches training datasets with diverse linguistic styles, ensuring AI models handle a broad range of topics and language nuances, contributing to a more equitable global impact.
To facilitate this collaboration, we recommend that a non-governmental organization (NGO) establishes transparent processes for receiving, cleaning, and storing training data from various sources. This NGO should create a secure and privacy-compliant data infrastructure, charging companies for access to this valuable data through a subscription or licensing model to ensure sustainable funding. Clear agreements and ethical guidelines will maintain integrity, trust, and provide incentives for data contributors, fostering responsible sharing and innovation in AI development.
Appendix
Hunchback of Notre Dame. Random seed : 1831.
English (human translated)
Source : Project Gutenberg
Above our heads is a double ogive vault, panelled with wood carving, painted azure, and sown with golden fleurs-de-lis; beneath our feet a pavement of black and white marble, alternating. A few paces distant, an enormous pillar, then another, then another; seven pillars in all, down the length of the hall, sustaining the spring of the arches of the double vault, in the centre of its width. Around four of the pillars, stalls of merchants, all sparkling with glass and tinsel; around the last three, benches of oak, worn and polished by the trunk hose of the litigants, and the robes of the attorneys. Around the hall, along the lofty wall, between the doors, between the windows, between the pillars, the interminable row of all the kings of France, from Pharamond down: the lazy kings, with pendent arms and downcast eyes; the valiant and combative kings, with heads and arms raised boldly heavenward. Then in the long, pointed windows, glass of a thousand hues; at the wide entrances to the hall, rich doors, finely sculptured; and all, the vaults, pillars, walls, jambs, panelling, doors, statues, covered from top to bottom with a splendid blue and gold illumination, which, a trifle tarnished at the epoch when we behold it, had almost entirely disappeared beneath dust and spiders in the year of grace, 1549, when du Breul still admired it from tradition.
French (original)
Source : Project Gutenberg
Au-dessus de nos têtes une double voûte en ogive, lambrissée en sculptures de bois, peinte d'azur, fleurdelysée en or; sous nos pieds, un pavé alternatif de marbre blanc et noir. À quelques pas de nous, un énorme pilier, puis un autre, puis un autre; en tout sept piliers dans la longueur de la salle, soutenant au milieu de sa largeur les retombées de la double voûte. Autour des quatre premiers piliers, des boutiques de marchands, tout étincelantes de verre et de clinquants; autour des trois derniers, des bancs de bois de chêne, usés et polis par le haut-de-chausses des plaideurs et la robe des procureurs. À l'entour de la salle, le long de la haute muraille, entre les portes, entre les croisées, entre les piliers, l'interminable rangée des statues de tous les rois de France depuis Pharamond; les rois fainéants, les bras pendants et les yeux baissés; les rois vaillants et bataillards, la tête et les mains hardiment levées au ciel. Puis, aux longues fenêtres ogives, des vitraux de mille couleurs; aux larges issues de la salle, de riches portes finement sculptées; et le tout, voûtes, piliers, murailles, chambranles, lambris, portes, statues, recouvert du haut en bas d'une splendide enluminure bleu et or, qui, déjà un peu ternie à l'époque où nous la voyons, avait presque entièrement disparu sous la poussière et les toiles d'araignée en l'an de grâce 1549, où Du Breul l'admirait encore par tradition.
Spanish (human translated)
Source : WikiSource
Encima de nuestras cabezas, una doble bóveda ojiva, artesonada con esculuras de maderas, pintada de azul celeste, flordelisada de oro; debajo de nuestros pies un pavimento alternativo de mármol blanco y negro. A pocos pasos de nosotros un enorme pilar, luego otro, y luego otro; total, siete pilares en la longitud de la sala, sosteniendo en su mayor latitud las recaidas de la doble bóveda.
Alrededor de los cuatro primeros pilares, puestos ambulantes, lucientes con sus vidrios y oropeles; alrededor de los cuatro últimos bancos de madera de encina, desgastados y pulimentados por las calzas de los litigantes y las togas de los procuradores. En torno de la sala, á lo largo de la alta pared, entre las puertas, entre las ventanas, entre los pilares, la interminable hilera de las estátuas de todos los reyes de Francia, desde Faramundo, los reyes holgazanes con los razos colgando y la vista baja; los reyes valientes y batalladores, la cabeza y las manos levantadas al cielo con osadia. Y en las largas ventanas ojivas, vidrios pintados de mil colores, en las anchas salidas de la sala, ricas puertas delicadamente esculpídas; y en el conjunto bóvedas, pilares, paredes, jambas, dinteles, artesones, puertas, estátuas, y todo ricamente iluminado de arriba abajo de oro y azul, colores que ya, algun tanto ajados en la época en que los vemos, habian desaparecido casi del todo bajo el polvo y las telarañas en el año de gracia 1549, en que Du Breul las admiraba por tradicion.
English (machine translated)
Source : Google Translate
Above our heads a double pointed vault, paneled with wooden sculptures, painted azure, fleur-de-lysée in gold; under our feet, an alternating pavement of white and black marble. A few steps from us, a huge pillar, then another, then another; in all seven pillars in the length of the hall, supporting in the middle of its width the overhangs of the double vault. Around the first four pillars, merchants' shops, all sparkling with glass and tinsel; around the last three, benches of oak wood, worn and polished by the breeches of the litigants and the robes of the attorneys. Around the room, along the high wall, between the doors, between the windows, between the pillars, the endless row of statues of all the kings of France since Pharamond; the lazy kings, their arms hanging down and their eyes lowered; the valiant and battalion kings, their heads and hands boldly raised to heaven. Then, in the long pointed windows, stained-glass windows of a thousand colors; at the broad exits of the room, rich doors finely carved; and the whole, vaults, pillars, walls, jambs, paneling, doors, statues, covered from top to bottom with a splendid blue and gold illumination, which, already a little tarnished at the time when we see it, had almost entirely disappeared under dust and cobwebs in the year of grace 1549, where Du Breul still admired it by tradition.
Special thanks to our native translators, Élodie Moore (French) and Madeli Castruita (Spanish).
Carl Moore employer : DirecTV. The opinions in this paper do not reflect those of DirecTV.
Philo Juang employer : Google. The opinions in this paper do not reflect those of Googel.



















