

Denamo Markos
End-to-End Speech-to-Text Data Collection with Kafka, Airflow, and Spark
Introduction
In weeks 4 and 5 we built a speech-to-text model for the Amharic language, and the biggest obstacle was a lack of quality data. With a larger and more diverse training set, the model could have performed far better, and it could have transformed the lives of many.
Info: Recognizing the value of large datasets for speech-to-text modeling, and seeing the opportunity in existing Amharic text corpora, we set out to design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster.
Objective
The objective of this project is to provide a tool that can be used to:
- Publish text and audio files to, and receive them from, a data lake
- Perform distributed transformations
- Load data into a warehouse in a format suitable for training speech-to-text models
Technologies and Tools
Core Technologies
Technology | Purpose | Key Features |
---|---|---|
Kafka | Message Queue | Fault-tolerant, high-throughput, scalable |
Airflow | Workflow Management | Job scheduling, monitoring, orchestration |
Spark | Data Processing | Distributed processing, in-memory caching |
React | Frontend | User interface for data collection |
Flask | Backend | API endpoints and Kafka integration |
Exploratory Data Analysis and Insights
The Amharic news text classification dataset contains:
- News data from various sources
- Six columns: headline, category, date, views, article, link
- Six text categories: sport, politics, business, entertainment, national, international
Info: We focused on the category and article fields, as they provided the most appropriate sentences for audio recording after segmentation.
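As a minimal sketch of that segmentation step (the CSV filename is an assumption, and splitting on the Amharic full stop '።' is one simple approach, not necessarily the one we shipped):

```python
# Sketch: load the news dataset and segment articles into recordable
# sentences ("amharic_news.csv" is a hypothetical filename).
import pandas as pd

def segment_article(article):
    # Split on the Amharic full stop ('።') and keep non-empty fragments
    return [s.strip() + '።' for s in article.split('።') if s.strip()]

df = pd.read_csv('amharic_news.csv')
sentences = (
    df[['category', 'article']]
    .dropna()
    .assign(sentence=lambda d: d['article'].map(segment_article))
    .explode('sentence')
)
print(sentences[['category', 'sentence']].head())
```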
Implementation
End-to-End Data Pipeline
```python
# Kafka producer configuration
from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Publish a sentence to a Kafka topic
def publish_text(text, topic='amharic_text'):
    producer.send(topic, value={
        'text': text,
        'timestamp': time.time()
    })
```
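The producer above publishes sentences; the matching consumer reads them back from the topic. Here is a minimal sketch (the group id is an assumption):

```python
# Consumer counterpart: read published sentences back from the topic.
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'amharic_text',
    bootstrap_servers=['localhost:9092'],
    group_id='stt-collectors',  # hypothetical consumer group
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    print(message.value['text'])
```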
System Components
- Frontend (React)
  - Fetch sentences
  - Record audio
  - Preview and validate recordings
  - Submit text-audio pairs
- Backend (Flask), sketched after this list
  - API endpoints for text retrieval
  - Kafka integration
  - Audio processing middleware
- Kafka Cluster
  - Two-broker configuration
  - Two ZooKeeper instances
  - Kafdrop for topic visualization
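To make the backend's role concrete, here is a minimal sketch of the two Flask endpoints (route names, fields, and the save path are assumptions, not the project's actual API):

```python
# Hypothetical Flask endpoints: serve a sentence to record and accept
# the submitted text-audio pair.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/text', methods=['GET'])
def get_text():
    # In the real backend this sentence would come from the Kafka topic
    return jsonify({'id': 42, 'text': 'ሰላም ለዓለም።'})

@app.route('/api/audio', methods=['POST'])
def submit_audio():
    audio = request.files['audio']      # recorded clip from the React app
    text_id = request.form['text_id']   # id of the sentence that was read
    audio.save(f'/tmp/recordings/{text_id}.wav')
    return jsonify({'status': 'received'})
```

On the processing side, an Airflow DAG orchestrates the Spark job: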
```python
# Airflow DAG for Spark processing
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from pyspark.sql import SparkSession

def process_audio():
    # Initialize a Spark session
    spark = SparkSession.builder \
        .appName("AudioProcessing") \
        .getOrCreate()

    # Load the recorded audio (assumes a data source exposing an "audio" format)
    df = spark.read.format("audio") \
        .load("path/to/audio")

    # Clean and transform with custom transformations (sketched below)
    cleaned_df = df.transform(remove_silence) \
        .transform(remove_noise)

    # Save the cleaned data to S3 as Parquet
    cleaned_df.write \
        .format("parquet") \
        .save("s3://bucket/cleaned_audio")

with DAG('audio_processing',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    process_task = PythonOperator(
        task_id='process_audio',
        python_callable=process_audio
    )
```
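The DAG references remove_silence and remove_noise without defining them. Here is a minimal sketch of what such transformations could look like, assuming the loaded DataFrame exposes mean_amplitude and snr_db columns (hypothetical names; the real schema depends on the audio data source):

```python
# Sketch of the custom transformations referenced in the DAG.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def remove_silence(df: DataFrame, threshold: float = 0.01) -> DataFrame:
    # Drop clips whose average amplitude suggests they are mostly silence
    return df.filter(F.col('mean_amplitude') > threshold)

def remove_noise(df: DataFrame, min_snr_db: float = 10.0) -> DataFrame:
    # Drop clips whose signal-to-noise ratio is too low to be usable
    return df.filter(F.col('snr_db') >= min_snr_db)
```

Functions with this shape plug directly into DataFrame.transform, which keeps the DAG's cleaning steps chainable.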
Data Processing Flow
Stage | Component | Action |
---|---|---|
Input | React Frontend | Text display and audio recording |
Transport | Kafka | Message queuing and delivery |
Processing | Spark | Audio cleaning and validation |
Storage | S3 | Final data warehouse |
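Once the cleaned Parquet files land in S3, a quick read-back (same bucket path as in the DAG above) confirms the warehouse is training-ready:

```python
# Sanity check: read the cleaned audio data back from the warehouse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WarehouseCheck').getOrCreate()
cleaned = spark.read.parquet('s3://bucket/cleaned_audio')
cleaned.printSchema()
print(cleaned.count(), 'cleaned recordings available for training')
```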
Lessons Learned
Key learnings from the project:
- Setting up and managing Kafka clusters
- Creating and orchestrating Airflow DAGs
- Implementing Spark transformations for audio data
- Building end-to-end data pipelines
Future Plans
- Data Collection Enhancement
  - Gather more Amharic text files
  - Recruit volunteers for audio recording
  - Improve data quality validation
- Technical Improvements
  - Implement random text access from Kafka (sketched below)
  - Enhance fault tolerance
  - Optimize the processing pipeline
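For the planned random text access, one option is to seek a consumer to a random offset with kafka-python's seek API (a sketch; the topic name and single partition are assumptions):

```python
# Sketch of random access: seek to a random offset in a topic partition.
import random
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
tp = TopicPartition('amharic_text', 0)
consumer.assign([tp])

start = consumer.beginning_offsets([tp])[tp]
end = consumer.end_offsets([tp])[tp]
consumer.seek(tp, random.randrange(start, end))  # assumes a non-empty topic
print(next(consumer).value)
```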
Info: The project aims to address the scarcity of Amharic speech recognition data by creating a scalable collection pipeline.