End-to-End Speech-to-Text Data Collection with Kafka, Airflow, and Spark
Denamo Markos

Jul 09, 2022

Data Engineering · #Kafka · #Airflow · #Spark · #Data Pipeline · #Speech Recognition

Introduction

In weeks 4 and 5 we built an AI model for the Amharic language and ran up against a shortage of quality data. With a larger and more diverse training set, the model could have improved substantially and transformed the lives of many more users.

Info: Recognizing the value of large datasets for speech-to-text modeling, and seeing an opportunity in existing Amharic text corpora, we set out to design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster.

Objective

The objective of this project is to provide a tool that can be used to:

  • Publish text and audio files to, and receive them from, a data lake
  • Perform distributed transformations
  • Load data into a warehouse in a format suitable for training speech-to-text models

Technologies and Tools

Core Technologies

| Technology | Purpose | Key Features |
| --- | --- | --- |
| Kafka | Message Queue | Fault-tolerant, high-throughput, scalable |
| Airflow | Workflow Management | Job scheduling, monitoring, orchestration |
| Spark | Data Processing | Distributed processing, in-memory caching |
| React | Frontend | User interface for data collection |
| Flask | Backend | API endpoints and Kafka integration |

Exploratory Data Analysis and Insights

The Amharic news text classification dataset contains:

  • News data from various sources
  • Six columns: headline, category, date, views, article, link
  • Six text categories: sport, politics, business, entertainment, national, international

Info: We focused on text category and article fields, as they provided the most appropriate sentences for audio recording after segmentation.
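
As a quick illustration, here is a minimal pandas sketch of that selection and segmentation step; the file name and exact column labels are assumptions based on the dataset description above. Amharic sentences end with the Ethiopic full stop ("።"), which makes a simple split workable.

```python
# Sketch: select the category and article fields and segment articles into
# sentences (file name and column labels are assumptions).
import pandas as pd

df = pd.read_csv("amharic_news.csv")

# Keep only the fields used to build recording prompts
prompts = df[["category", "article"]].dropna()

# Split each article into sentences on the Ethiopic full stop ("።")
def segment(article):
    return [s.strip() + "።" for s in article.split("።") if s.strip()]

prompts["sentences"] = prompts["article"].apply(segment)
```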

Implementation

End-to-End Data Pipeline

```python
# Kafka producer configuration
from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Publish a sentence to a Kafka topic
def publish_text(text, topic='amharic_text'):
    producer.send(topic, value={
        'text': text,
        'timestamp': time.time()
    })
```
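
The producer's counterpart is a consumer that reads published sentences back from the topic. The snippet below is a minimal sketch (not from the original post) using the same kafka-python client, mirroring the topic name and serialization of the producer above.

```python
# Minimal consumer sketch: reads sentence records back from the topic
# for downstream processing (assumes the producer configuration above).
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'amharic_text',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest'
)

for message in consumer:
    record = message.value
    print(record['text'], record['timestamp'])
```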

System Components

  1. Frontend (React)
  • Fetch sentences
  • Record audio
  • Preview and validate recordings
  • Submit text-audio pairs
  2. Backend (Flask)
  • API endpoints for text retrieval (a sketch of these endpoints follows the DAG below)
  • Kafka integration
  • Audio processing middleware
  3. Kafka Cluster (see the topic-creation sketch after this list)
  • Two-broker configuration
  • Two ZooKeeper instances
  • Kafdrop for visualization
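
To show how the two-broker setup provides fault tolerance, here is a hedged sketch of creating a replicated topic with kafka-python's admin client; the broker addresses and partition count are assumptions, not the project's actual configuration.

```python
# Sketch: create a replicated topic on a two-broker cluster.
# With replication_factor=2, each partition is copied to both brokers,
# so the topic survives the loss of either one.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers=['localhost:9092', 'localhost:9093']  # assumed ports
)
admin.create_topics([
    NewTopic(name='amharic_text', num_partitions=3, replication_factor=2)
])
```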

The Airflow DAG below schedules the daily Spark job that cleans recorded audio and writes the result to S3:

```python
# Airflow DAG for Spark processing
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from pyspark.sql import SparkSession

def process_audio():
    # Initialize the Spark session
    spark = SparkSession.builder \
        .appName("AudioProcessing") \
        .getOrCreate()

    # Load raw audio (assumes a custom "audio" data source is registered)
    df = spark.read.format("audio") \
        .load("path/to/audio")

    # Clean and transform (remove_silence and remove_noise are
    # project-defined transformation functions)
    cleaned_df = df.transform(remove_silence) \
        .transform(remove_noise)

    # Save the cleaned audio to S3 as Parquet
    cleaned_df.write \
        .format("parquet") \
        .save("s3://bucket/cleaned_audio")

with DAG('audio_processing',
         start_date=datetime(2022, 7, 1),  # start_date is required by Airflow
         schedule_interval='@daily') as dag:
    process_task = PythonOperator(
        task_id='process_audio',
        python_callable=process_audio
    )
```
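
For the Flask backend listed above, a minimal sketch of the two core endpoints might look like the following; route names and payload shapes are illustrative assumptions rather than the project's exact API.

```python
# Hypothetical Flask endpoints: serve sentences to the React frontend and
# accept recorded text-audio pairs (routes and payloads are assumptions).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/text', methods=['GET'])
def get_text():
    # In the real pipeline this would read a sentence from Kafka;
    # a fixed placeholder stands in here.
    return jsonify({'text': 'placeholder sentence'})

@app.route('/api/audio', methods=['POST'])
def submit_audio():
    # Receive a text-audio pair from the frontend
    audio_bytes = request.files['audio'].read()
    text = request.form['text']
    # Forward the text to Kafka (publish_text from the producer snippet);
    # the audio bytes would be written to the data lake.
    publish_text(text)
    return jsonify({'status': 'queued', 'bytes': len(audio_bytes)})

if __name__ == '__main__':
    app.run(port=5000)
```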

Data Processing Flow

| Stage | Component | Action |
| --- | --- | --- |
| Input | React Frontend | Text display and audio recording |
| Transport | Kafka | Message queuing and delivery |
| Processing | Spark | Audio cleaning and validation |
| Storage | S3 | Final data warehouse |
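
As a usage example, the cleaned Parquet output in the storage stage can be read straight back into Spark for model training; the bucket path below mirrors the DAG sketch and is an assumption.

```python
# Read the cleaned audio dataset back from S3 for training
# (bucket path mirrors the DAG above and is an assumption).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TrainingPrep").getOrCreate()
training_df = spark.read.parquet("s3://bucket/cleaned_audio")
training_df.printSchema()
```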

Lessons Learned

Key learnings from the project:

  1. Setting up and managing Kafka clusters
  2. Creating and orchestrating Airflow DAGs
  3. Implementing Spark transformations for audio data
  4. Building end-to-end data pipelines

Future Plans

  1. Data Collection Enhancement
  • Gather more Amharic text files
  • Recruit volunteers for audio recording
  • Improve data quality validation
  2. Technical Improvements
  • Implement random text access from Kafka
  • Enhance fault tolerance
  • Optimize the processing pipeline

Info: The project aims to address the scarcity of Amharic speech recognition data by creating a scalable collection pipeline.
