AI Tools Beyond English Language, PCDN AI for Impact Newsletter, June 18, 2025

The Rise of Multilingual AI: Bridging the Digital Language Divide

While nearly half of all public web pages are in English, fewer than one in five people globally can comfortably read the language. This disparity has profoundly shaped mainstream Artificial Intelligence, which consequently learns far more about Silicon Valley slang than Yoruba proverbs or Quechua farming traditions. However, a new wave of community-built projects is demonstrating that multilingual AI can be lighter, cheaper, and more culturally attuned than offerings from major tech companies. These initiatives present enormous potential for improving everything from crop advice to financial services, but they also face significant challenges like chronic under-funding and data extractivism.

The implications of this linguistic imbalance extend far beyond simple translation issues. When AI systems are predominantly trained on English content, they absorb not just the language but the cultural assumptions, economic models, and social structures embedded within that content. This creates a form of digital colonialism where Western perspectives become the default lens through which AI interprets and responds to the world.

The "English-Only" Blind Spot

The dominance of English online creates a significant blind spot in AI development. With about half of the internet's content in English, a language spoken by only 17% of the global population, the Large Language Models (LLMs) that learn from this data develop a baked-in bias toward Anglophone concepts and norms. Researchers at Stanford have noted that AI performance drops sharply when applied to "low-resource" languages because the models have insufficient data to learn from. This has a steep cultural cost, as UNESCO warns that approximately 40% of the world's 7,000 languages are endangered, with one disappearing every two weeks.

The economic consequences are just as severe. An English-trained chatbot might give irrelevant advice to a cassava farmer in Benin by defaulting to data for Iowa's climate, or a credit algorithm could unfairly score an entrepreneur who writes a business plan in Nigerian Pidgin. These are not minor issues but fundamental design flaws stemming from a single-language worldview.

Consider the broader implications: when AI systems fail to understand local languages and contexts, they can perpetuate existing inequalities. A farmer in rural Guatemala might receive agricultural advice optimized for North American growing conditions, potentially leading to crop failures. A small business owner in Mumbai might be denied a loan because an English-centric algorithm can't properly assess their business plan written in Hindi mixed with local terminology.

The problem extends to cultural nuances as well. Many languages contain concepts that don't translate directly to English – like the Japanese concept of "ikigai" (life's purpose) or the Portuguese "saudade" (a deep emotional longing). When AI systems can't grasp these cultural subtleties, they lose the ability to provide truly relevant and meaningful assistance to speakers of these languages.

Global Innovations in Multilingual AI

Around the world, local communities and developers are building AI models that address these gaps with impressive results. These grassroots efforts are proving that effective AI doesn't always require massive datasets or billion-dollar budgets – sometimes it requires deep cultural understanding and community engagement.

Africa's Low-Power, High-Impact Models

In South Africa, Lelapa AI developed InkubaLM, a compact model that understands Swahili, Yoruba, Hausa, isiXhosa, and isiZulu while being efficient enough to run on a Raspberry Pi. This efficiency is crucial for deployment in areas with limited internet connectivity or computing resources. The model's small size means it can run locally on mobile devices, providing AI assistance even in remote areas without reliable internet access.

Nigeria is developing its first state-backed LLM through the Lagos startup Awarri, which created the LangEasy app. The app pays citizens to record sentences in languages like Hausa, Igbo, and Yoruba to create a model that doesn't treat "Nigerian English like an accent to be fixed." This approach recognizes that Nigerian English isn't broken English – it's a legitimate variety with its own grammar, vocabulary, and cultural significance.

The grassroots network Masakhane coordinates volunteer researchers to publish free datasets and translation models, working to prevent any single company from controlling African Natural Language Processing (NLP). The name "Masakhane" means "we build together" in isiZulu, reflecting the collaborative spirit driving this movement. The network has grown to include researchers from over 20 African countries, creating a truly pan-African approach to AI development.

Concrete gains are already visible across the continent. In Rwanda, health chatbots trained on Kinyarwanda cut malaria triage times by 30%, enabling faster treatment and potentially saving lives. Kenyan microlenders, recognizing that entrepreneurs who can articulate their vision in their native language often have stronger community ties and local market understanding, saw repayment rates rise after using a Swahili sentiment model to score local-language business plans.

In Ghana, farmers using AI-powered agricultural advice in Twi have reported 25% higher crop yields compared to those relying on English-language resources. The system understands local farming terminology and can provide advice tailored to regional growing conditions and traditional practices.

Asia's Government-Scale Language Initiatives

India's Project Bhashini represents one of the most ambitious multilingual AI initiatives globally, offering public APIs for speech recognition and translation across all 22 of the nation's official languages. This allows courts to instantly translate judgments, civil service exams to be offered in multiple languages, and government services to be accessible to citizens regardless of their linguistic background.

The project's scope is staggering – India is home to over 1,600 languages, with 22 official languages and hundreds of dialects. Bhashini aims to break down language barriers that have historically prevented millions of Indians from accessing government services, educational resources, and economic opportunities.

The Jugalbandi chatbot, built on WhatsApp, allows villagers to ask about government services in their own dialects, using an NLU engine from the open-source lab AI4Bharat and reasoning layers from Microsoft Azure. The chatbot has processed over 2 million queries in its first year, with users praising its ability to understand regional variations and provide relevant, actionable information.

Philanthropy is also playing a key role, with Infosys co-founder Nandan Nilekani pledging significant funding to AI4Bharat to help bridge India's language divide. This funding supports research into low-resource Indian languages and the development of tools that can preserve and digitize linguistic heritage.

In China, despite the dominance of Mandarin, efforts are underway to preserve and digitize minority languages like Tibetan, Uyghur, and Mongolian. These projects face unique challenges, as they must navigate both technical hurdles and sensitive political considerations around cultural preservation and autonomy.

Latin America and Indigenous Data Sovereignty

In the Amazon, the Tainá chatbot works offline via Telegram, answering questions in Portuguese, Spanish, and local languages like Tikuna. Crucially, community elders decide which cultural stories and medicinal knowledge are uploaded, preventing the data scraping that has marked past research efforts. This approach recognizes that Indigenous knowledge isn't just data to be harvested – it's sacred information that belongs to specific communities.

The Tainá project represents a new model of AI development where communities maintain sovereignty over their cultural and linguistic data. Rather than extracting information for external use, the project ensures that Indigenous communities benefit directly from AI systems trained on their knowledge.

In New Zealand, Te Hiku Media used decades of Māori radio archives to train a highly accurate speech recognizer while ensuring the community retains full ownership of its data. The group has turned down lucrative corporate offers, establishing a model for other Indigenous tech cooperatives focused on data sovereignty. Their work has inspired similar projects across the Pacific, with Aboriginal communities in Australia and Native American tribes in North America developing their own language preservation AI initiatives.

Mexico's National Institute of Indigenous Languages has partnered with local universities to create AI tools for 68 Indigenous languages, including Nahuatl, Maya, and Zapotec. These tools help preserve oral traditions, translate government documents, and provide educational resources in native languages.

Europe's Multilingual Renaissance

Europe, despite its linguistic diversity, has also grappled with English dominance in AI. The European Union's Digital Single Market strategy includes significant investment in multilingual AI capabilities. Projects like the European Language Equality initiative aim to ensure that all EU languages have equal access to AI technologies by 2030.

In Ireland, efforts to revitalize Irish Gaelic include AI-powered language learning tools and automated translation systems. These tools help connect younger generations with their linguistic heritage while making Irish-language content more accessible.

Catalonia has developed AI systems that can distinguish between Catalan and Spanish, addressing the unique challenges faced by minority languages that exist alongside dominant national languages. This work has implications for similar linguistic situations worldwide, from Welsh in Wales to Basque in Spain and France.

The Future of Multilingual AI: Opportunities and Risks

The growth of inclusive AI presents both a promising future and potential dangers. As these technologies mature, they could either democratize access to information and services or create new forms of digital inequality.

What Could Go Right

Cultural Lifelines: Digitizing endangered languages can create a vital survival layer for cultures at risk. AI systems can help preserve oral traditions, create educational materials, and connect diaspora communities with their linguistic heritage. For languages with few remaining speakers, AI could serve as a bridge between generations, helping young people learn languages their grandparents spoke fluently.

Economic Lift-off: Local-language fintech tools can expand credit access to informal traders who were previously mislabeled as "high risk" by English-only models. When AI systems can understand local business practices, cultural contexts, and community relationships, they can make more accurate assessments of creditworthiness and economic potential.

Educational Revolution: AI tutors that speak students' native languages can provide personalized education at scale. In regions where qualified teachers are scarce, multilingual AI could help bridge educational gaps while preserving cultural and linguistic diversity.

Public Health Reach: Medical screening pilots have shown that allowing patients to describe symptoms in their own words leads to fewer misdiagnoses. AI systems that understand cultural concepts of health and illness can provide more effective healthcare support, particularly in underserved communities.

Democratic Participation: When government services and civic information are available in citizens' native languages, democratic participation increases. AI-powered translation and communication tools can help ensure that language barriers don't prevent people from engaging with their governments.

What Could Go Wrong

Funding Gap: Creating datasets for languages like Yoruba or Quechua is expensive and often relies on sporadic grants and volunteer work. Unlike English, which benefits from massive commercial investment, minority languages struggle to attract sustained funding for AI development. This creates a vicious cycle where lack of resources leads to poor AI performance, which in turn discourages further investment.

Persistent Bias: Even multilingual models can perpetuate harmful stereotypes, with one Stanford study finding that models associated negative traits with "African-sounding" names. Cultural biases embedded in training data can be amplified by AI systems, potentially reinforcing discrimination and prejudice.

Quality Degradation: Rushed efforts to create multilingual AI without proper community involvement can result in systems that are technically functional but culturally tone-deaf. Poor-quality AI tools might do more harm than good by providing inaccurate information or misrepresenting cultural concepts.

Hidden Trauma: The work of content moderation in many languages has led to psychological distress for workers, as seen in a lawsuit filed by moderators in Nairobi. Creating safe AI systems requires human oversight, but this work often falls disproportionately on workers from the Global South who may lack adequate mental health support.

Data Extractivism: Without strong legal safeguards, communities risk losing control over their linguistic data in a new form of "digital colonialism." Tech companies might harvest linguistic data from communities without providing fair compensation or ensuring that the resulting AI systems benefit those communities.

Homogenization Risk: Poorly designed multilingual AI might inadvertently contribute to language standardization, potentially erasing regional dialects and linguistic variations that are crucial to cultural identity.

Technical Challenges and Innovations

Building effective multilingual AI requires overcoming significant technical hurdles. Traditional machine learning approaches assume large, clean datasets – something that simply doesn't exist for most of the world's languages. Innovative approaches are emerging to address these challenges:

Transfer Learning: Researchers are developing techniques to transfer knowledge from high-resource languages to low-resource ones. This allows AI systems to leverage patterns learned from English or Mandarin to better understand languages with limited training data.

Multilingual Embeddings: New mathematical representations can capture similarities between languages, allowing AI systems to understand that concepts expressed in different languages might be related. This helps systems make intelligent guesses about unfamiliar languages based on their knowledge of similar ones.

Community-Driven Data Collection: Rather than relying on web scraping, many projects now work directly with communities to collect high-quality linguistic data. This approach produces better datasets while ensuring that communities maintain control over their linguistic heritage.

Federated Learning: This technique allows AI models to be trained across multiple devices and locations without centralizing sensitive data. For Indigenous communities concerned about data sovereignty, federated learning offers a way to contribute to AI development while maintaining control over their information.
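To make the federated learning idea above concrete, here is a minimal sketch of federated averaging in plain Python. It is a toy under stated assumptions, not any project's actual pipeline: the one-parameter model y ≈ w * x, the function names `local_update` and `federated_average`, the learning rate, and the two small "community" datasets are all hypothetical and chosen only for illustration. The point it shows is that each participant trains on its own data and shares only the updated weight, never the raw records.

```python
from typing import List, Tuple

# One participant's private data: (x, y) pairs that never leave its device.
Dataset = List[Tuple[float, float]]


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps on one participant's data for y ≈ w * x."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w_global: float, clients: List[Dataset], rounds: int = 50) -> float:
    """Average locally trained weights, weighted by each participant's data size."""
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        # Each participant starts from the shared weight and trains locally.
        local_weights = [local_update(w_global, data) for data in clients]
        # The coordinator sees only the weights, not the underlying data.
        w_global = sum(w * len(data) for w, data in zip(local_weights, clients)) / total
    return w_global


# Hypothetical datasets held by two separate communities (roughly y = 2x).
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2), (5.0, 9.8)],
]

w = federated_average(0.0, clients)
print(f"shared weight after training: {w:.3f}")
```

With these toy inputs the shared weight should settle close to 2. Real deployments typically add protections this sketch omits, such as secure aggregation or noise added to the updates, because shared weights can still leak information about the data behind them.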

Pathways Forward

To build a more equitable AI future, several key strategies have emerged from these community-led projects:

Community Ownership First: Models like Te Hiku demonstrate that data accuracy and sovereignty can coexist when licenses restrict commercial reuse. Communities should have the right to determine how their linguistic data is used and who benefits from AI systems trained on that data.

Open-Source by Design: Projects such as Masakhane and Bhashini allow anyone to audit and improve their code and data. Transparency is crucial for building trust and ensuring that AI systems serve community needs rather than corporate interests.

Targeted Philanthropy: Mission-driven funding, like Nilekani's support for AI4Bharat, can underwrite vital but unprofitable data collection efforts. Philanthropic organizations and impact investors have a crucial role to play in supporting multilingual AI development.

Bias Audits: Regular checks on multilingual models are essential to prevent stereotype leakage, especially in critical sectors like healthcare and finance. These audits should involve community members who can identify cultural biases that technical experts might miss.

Legal Frameworks: New intellectual property and data protection laws may be needed to protect linguistic communities from exploitation. Just as genetic resources are increasingly protected by international law, linguistic resources may require similar safeguards.

Capacity Building: Training local developers and researchers is essential for sustainable multilingual AI development. Rather than creating dependency on external expertise, successful projects invest in building local technical capacity.

Cross-Cultural Collaboration: The most successful multilingual AI projects involve genuine partnerships between technologists and communities. This requires patience, cultural sensitivity, and a willingness to prioritize community needs over technical convenience.
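The bias audits recommended above can start very simply. The sketch below shows one possible check, assuming a model exposed as a function that maps text to a score: it fills the same sentence template with names from different groups and flags the model if average scores differ by more than a chosen threshold. Everything here is hypothetical and for illustration only, including the `audit_name_bias` helper, the template, the name lists, the 0.05 threshold, and the two toy scorers standing in for a real model.

```python
from statistics import mean
from typing import Callable, Dict, List


def audit_name_bias(
    score: Callable[[str], float],
    template: str,
    name_groups: Dict[str, List[str]],
    max_gap: float = 0.05,
) -> Dict[str, object]:
    """Compare average model scores when only the name in a sentence changes."""
    group_means = {
        group: mean(score(template.format(name=name)) for name in names)
        for group, names in name_groups.items()
    }
    gap = max(group_means.values()) - min(group_means.values())
    return {"group_means": group_means, "gap": gap, "flagged": gap > max_gap}


# Hypothetical audit inputs; a real audit would use lists built with community members.
TEMPLATE = "{name} has applied for a small business loan."
NAME_GROUPS = {
    "group_a": ["Adebayo", "Chinwe"],
    "group_b": ["Emily", "John"],
}


def toy_biased_scorer(text: str) -> float:
    """Stand-in for a model that scores some names lower for no valid reason."""
    return 0.4 if ("Adebayo" in text or "Chinwe" in text) else 0.7


def toy_fair_scorer(text: str) -> float:
    """Stand-in for a model whose score ignores the name entirely."""
    return 0.7


biased_report = audit_name_bias(toy_biased_scorer, TEMPLATE, NAME_GROUPS)
fair_report = audit_name_bias(toy_fair_scorer, TEMPLATE, NAME_GROUPS)
print(biased_report["flagged"], fair_report["flagged"])
```

A check like this only catches gaps for the templates and names it is given, which is one reason the audits benefit from community members who know which names, dialects, and phrasings are likely to trigger stereotypes.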

The future of AI doesn't have to be monolingual. With thoughtful investment, community engagement, and technical innovation, we can build AI systems that celebrate linguistic diversity rather than erasing it. The question isn't whether we can create truly multilingual AI – it's whether we have the will to do so equitably and sustainably.

Grow your Career and the Newsletter

We are growing rapidly, and if you like the newsletter, please help us scale further. Use our easy referral tool to earn benefits when you help others subscribe. It only takes a few seconds to help us and grow your career.

What are your thoughts on AI for Good? Hit reply to share your thoughts or favorite resources, or fill out the quick survey below.

Got tips, news, or feedback? Drop us a line at [email protected] or simply respond to this message or take 15 seconds to fill out the survey below. Your input makes this newsletter better every week.

Share your feedback on the AI for Impact Newsletter

AI for Impact Opportunities

Make your Inbox and Career Awesome for less than the cost of a cup of coffee a month.

Consider Subscribing to the PCDN Career Digest.

200+ job, funding, fellowship, social enterprise, and upskilling opportunities and more each month.

Learn more at https://pcdnglobal.beehiiv.com/c/career-campus

News & Resources

😄 Joke of the Day
Why did the AI enroll in art school? Because it wanted to draw its own conclusions!

🌍 News

  • 🇺🇸 California police using AI cameras to monitor immigration protests
    Authorities are deploying surveillance systems powered by machine learning to track and analyze protest activity, raising concerns around civil liberties and bias.
    🔗 Read more on 404 Media

  • GenAI’s role in democratic elections
    A global study of 50 competitive elections in 2024 found that 80% involved GenAI incidents, mostly deepfakes in audio, video, and social media posts. Nearly half could not be attributed to any source; 25% were attributed to candidates and 20% to foreign actors.
    🔗 Read the IPIE report

  • Will AI create more jobs?
    The NYT Magazine argues that while AI will automate many roles, it also catalyzes new jobs—like AI trainers, ethics specialists, and prompt engineers—though the transition may be uneven.
    🔗 Read more on NYT

  • Samsung’s semiconductor division in turmoil
    Rest of World exposes internal dysfunction at Samsung's chip wing, with mismanagement and bureaucracy slowing its AI chip innovation just as global rivals race ahead.
    🔗 Read more on Rest of World

🎓 Career Resource
Probably Good Jobs — A curated global job board for impact-driven roles, including AI safety and governance opportunities. Offers job listings, career guides, and free coaching for high-impact careers.
🔗 Visit Probably Good Jobs

💼 Organization to Watch
IPIE (International Panel on the Information Environment) — A global research body analyzing AI’s impact on democratic processes. Their new report examines AI-generated content in 50 elections worldwide.
🔗 Explore IPIE

🔗 LinkedIn Connection
Daniel Susskind — Economist and author specializing in AI and the future of work. His writing and talks provide essential insights into how automation is transforming labor.
🔗 Connect with Daniel on LinkedIn