

Denamo Markos
End-to-End Speech-to-Text Data Collection with Kafka, Airflow, and Spark
Introduction
In weeks 4 and 5 we built a speech-to-text model for the Amharic language, and the biggest obstacle was a lack of quality data. With a larger and more diverse training set, the model could have performed far better, and it could have transformed the lives of many.
Info: Recognizing the value of large datasets for speech-to-text modeling, and seeing the opportunity in existing Amharic text corpora, we set out to design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster.
Objective
The objective of this project is to provide a tool that can be used to:
- Publish text and audio files to, and receive them from, a data lake
- Perform distributed transformations
- Load data into a warehouse in a format suitable for training speech-to-text models
Technologies and Tools
Core Technologies
Technology | Purpose | Key Features |
---|---|---|
Kafka | Message Queue | Fault-tolerant, high-throughput, scalable |
Airflow | Workflow Management | Job scheduling, monitoring, orchestration |
Spark | Data Processing | Distributed processing, in-memory caching |
React | Frontend | User interface for data collection |
Flask | Backend | API endpoints and Kafka integration |
Exploratory Data Analysis and Insights
The Amharic news text classification dataset contains:
- News data from various sources
- Six columns: headline, category, date, views, article, link
- Six text categories: sport, politics, business, entertainment, national, international
Info: We focused on the category and article fields, as they provided the most appropriate sentences for audio recording after segmentation.
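As a minimal sketch of that segmentation step (the CSV filename is an assumption, and splitting on the Amharic full stop '።' is one simple approach, not necessarily the one we shipped):

```python
# Sketch: load the news dataset and segment articles into recordable
# sentences ("amharic_news.csv" is a hypothetical filename).
import pandas as pd

def segment_article(article):
    # Split on the Amharic full stop ('።') and keep non-empty fragments
    return [s.strip() + '።' for s in article.split('።') if s.strip()]

df = pd.read_csv('amharic_news.csv')
sentences = (
    df[['category', 'article']]
    .dropna()
    .assign(sentence=lambda d: d['article'].map(segment_article))
    .explode('sentence')
)
print(sentences[['category', 'sentence']].head())
```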
Implementation
End-to-End Data Pipeline
```python
# Kafka producer configuration
from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Publish a sentence to a Kafka topic
def publish_text(text, topic='amharic_text'):
    producer.send(topic, value={
        'text': text,
        'timestamp': time.time()
    })
```
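The producer above publishes sentences; the matching consumer reads them back from the topic. Here is a minimal sketch (the group id is an assumption):

```python
# Consumer counterpart: read published sentences back from the topic.
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'amharic_text',
    bootstrap_servers=['localhost:9092'],
    group_id='stt-collectors',  # hypothetical consumer group
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    print(message.value['text'])
```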
System Components
- Frontend (React)
  - Fetch sentences
  - Record audio
  - Preview and validate recordings
  - Submit text-audio pairs
- Backend (Flask), sketched after this list
  - API endpoints for text retrieval
  - Kafka integration
  - Audio processing middleware
- Kafka Cluster
  - Two-broker configuration
  - Two ZooKeeper instances
  - Kafdrop for topic visualization
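To make the backend's role concrete, here is a minimal sketch of the two Flask endpoints (route names, fields, and the save path are assumptions, not the project's actual API):

```python
# Hypothetical Flask endpoints: serve a sentence to record and accept
# the submitted text-audio pair.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/text', methods=['GET'])
def get_text():
    # In the real backend this sentence would come from the Kafka topic
    return jsonify({'id': 42, 'text': 'ሰላም ለዓለም።'})

@app.route('/api/audio', methods=['POST'])
def submit_audio():
    audio = request.files['audio']      # recorded clip from the React app
    text_id = request.form['text_id']   # id of the sentence that was read
    audio.save(f'/tmp/recordings/{text_id}.wav')
    return jsonify({'status': 'received'})
```

On the processing side, an Airflow DAG orchestrates the Spark job: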
```python
# Airflow DAG for Spark processing
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from pyspark.sql import SparkSession

def process_audio():
    # Initialize a Spark session
    spark = SparkSession.builder \
        .appName("AudioProcessing") \
        .getOrCreate()

    # Load the recorded audio (assumes a data source exposing an "audio" format)
    df = spark.read.format("audio") \
        .load("path/to/audio")

    # Clean and transform with custom transformations (sketched below)
    cleaned_df = df.transform(remove_silence) \
        .transform(remove_noise)

    # Save the cleaned data to S3 as Parquet
    cleaned_df.write \
        .format("parquet") \
        .save("s3://bucket/cleaned_audio")

with DAG('audio_processing',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    process_task = PythonOperator(
        task_id='process_audio',
        python_callable=process_audio
    )
```
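The DAG references remove_silence and remove_noise without defining them. Here is a minimal sketch of what such transformations could look like, assuming the loaded DataFrame exposes mean_amplitude and snr_db columns (hypothetical names; the real schema depends on the audio data source):

```python
# Sketch of the custom transformations referenced in the DAG.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def remove_silence(df: DataFrame, threshold: float = 0.01) -> DataFrame:
    # Drop clips whose average amplitude suggests they are mostly silence
    return df.filter(F.col('mean_amplitude') > threshold)

def remove_noise(df: DataFrame, min_snr_db: float = 10.0) -> DataFrame:
    # Drop clips whose signal-to-noise ratio is too low to be usable
    return df.filter(F.col('snr_db') >= min_snr_db)
```

Functions with this shape plug directly into DataFrame.transform, which keeps the DAG's cleaning steps chainable.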
Data Processing Flow
Stage | Component | Action |
---|---|---|
Input | React Frontend | Text display and audio recording |
Transport | Kafka | Message queuing and delivery |
Processing | Spark | Audio cleaning and validation |
Storage | S3 | Final data warehouse |
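Once the cleaned Parquet files land in S3, a quick read-back (same bucket path as in the DAG above) confirms the warehouse is training-ready:

```python
# Sanity check: read the cleaned audio data back from the warehouse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WarehouseCheck').getOrCreate()
cleaned = spark.read.parquet('s3://bucket/cleaned_audio')
cleaned.printSchema()
print(cleaned.count(), 'cleaned recordings available for training')
```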
Lessons Learned
Key learnings from the project:
- Setting up and managing Kafka clusters
- Creating and orchestrating Airflow DAGs
- Implementing Spark transformations for audio data
- Building end-to-end data pipelines
Future Plans
- Data Collection Enhancement
  - Gather more Amharic text files
  - Recruit volunteers for audio recording
  - Improve data quality validation
- Technical Improvements
  - Implement random text access from Kafka (sketched below)
  - Enhance fault tolerance
  - Optimize the processing pipeline
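For the planned random text access, one option is to seek a consumer to a random offset with kafka-python's seek API (a sketch; the topic name and single partition are assumptions):

```python
# Sketch of random access: seek to a random offset in a topic partition.
import random
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
tp = TopicPartition('amharic_text', 0)
consumer.assign([tp])

start = consumer.beginning_offsets([tp])[tp]
end = consumer.end_offsets([tp])[tp]
consumer.seek(tp, random.randrange(start, end))  # assumes a non-empty topic
print(next(consumer).value)
```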
Info: The project aims to address the scarcity of Amharic speech recognition data by creating a scalable collection pipeline.