Open to Work

Mohamed Aymane Farhi

ayymen

AI & ML interests

NLP

Recent Activity

updated a collection 5 days ago

Bitext Datasets

upvoted a collection 8 days ago

Simba Speech Series

liked a model 8 days ago

UBC-NLP/Simba-S

upvoted a collection 8 days ago

Simba Speech Series

Simba bridges the digital divide with a unified suite for African AI: the largest open-source speech benchmark and models covering 61 languages • 13 items • Updated Feb 12 • 1

upvoted a paper 10 days ago

Omnilingual MT: Machine Translation for 1,600 Languages

Paper • 2603.16309 • Published 17 days ago • 20

upvoted a collection 19 days ago

Paza

Paza is a collection of speech models & benchmarks for low resource languages by the Microsoft Research Africa - Nairobi Lab • 3 items • Updated Mar 2 • 3

upvoted a paper about 2 months ago

HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Paper • 2511.01066 • Published Nov 2, 2025 • 1

upvoted a collection 3 months ago

Tamazight

https://huggingface.co/Tamazight-NLP • 2 items • Updated Dec 26, 2025 • 1

upvoted 2 papers 4 months ago

Faithful Persona-based Conversational Dataset Generation with Large Language Models

Paper • 2312.10007 • Published Dec 15, 2023 • 11

Awal -- Community-Powered Language Technology for Tamazight

Paper • 2510.27407 • Published Oct 31, 2025 • 1

upvoted a paper 5 months ago

M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks

Paper • 2407.03791 • Published Jul 4, 2024 • 2

upvoted 2 papers 6 months ago

Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Paper • 2506.00469 • Published May 31, 2025 • 4

Less is More: Recursive Reasoning with Tiny Networks

Paper • 2510.04871 • Published Oct 6, 2025 • 513

upvoted a collection 6 months ago

OLDI and friends

This collection groups the datasets that have been featured as part of WMT’s Open Language Data Initiative shared task. • 5 items • Updated 9 days ago • 5

upvoted an article 6 months ago

There is no such thing as a tokenizer-free lunch

Article • Published Sep 25, 2025 • 95

upvoted a paper 7 months ago

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Paper • 2403.10691 • Published Mar 15, 2024 • 1

upvoted an article 8 months ago

Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research

Article • Published Jul 19, 2025 • 7

upvoted a paper 8 months ago

Synthetic Voice Data for Automatic Speech Recognition in African Languages

Paper • 2507.17578 • Published Jul 23, 2025 • 2

upvoted a collection 9 months ago

T5Gemma

32 items • Updated 22 days ago • 81

upvoted 2 papers 9 months ago

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Paper • 2505.20564 • Published May 26, 2025 • 1

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26, 2025 • 78

upvoted a collection 11 months ago

MT Quality Estimation

Models for reference-free quality estimation of machine translation • 10 items • Updated Jan 29, 2025 • 4

upvoted a paper 11 months ago

Domain-Specific Translation with Open-Source Large Language Models: Resource-Oriented Analysis

Paper • 2412.05862 • Published Dec 8, 2024 • 1