We are looking for a skilled Data Engineer to join our team (one of Alibaba Group's businesses) and play a key role in building the data infrastructure for our AI-powered Legal Assistant. The ideal candidate will have strong expertise in building data pipelines, preprocessing data, and preparing high-quality training datasets for large language models (LLMs).
You will work closely with AI engineers, MLOps, and product teams to ensure that our legal AI models are trained on clean, structured, and reliable data.
Responsibilities:
- Design, build, and maintain scalable ETL/ELT pipelines for legal and judicial data.
- Collect, clean, normalize, and preprocess large volumes of unstructured text data.
- Prepare and manage training datasets for NLP models and LLMs.
- Collaborate with AI Engineers to fine-tune models using annotated datasets.
- Implement automated data quality checks and validation processes.
- Manage databases and data storage, and optimize data access for ML pipelines.
- Ensure compliance with data privacy and security standards.
Requirements:
- Strong programming skills in Python (Pandas, PySpark, etc.).
- Experience with data pipeline frameworks (Airflow, Luigi, Prefect).
- Hands-on experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Elasticsearch).
- Familiarity with Spark NLP or other large-scale NLP frameworks.
- Solid understanding of text data preprocessing (tokenization, normalization, cleaning).
- Familiarity with NLP datasets and annotation workflows.
- Knowledge of data security and handling sensitive information.
Nice to Have:
- Experience with legal or domain-specific datasets.
- Familiarity with NLP models that support Persian (Farsi), e.g., ParsBERT, HooshvareLab BERT, mBERT, XLM-R, for NER tasks.
- Understanding of MLOps workflows and integration with AI models.
Benefits:
- Be part of an innovative and impactful project in the legal-tech domain.
- Work in a collaborative environment that encourages rapid learning.
- Grow professionally through exposure to cutting-edge AI technologies.