Data Scientist
fa-ewjt-saasfaprod1
Bengaluru, Karnataka, India (posted 1 month ago)
Job Type
Full-time
Description
- Design and implement entity resolution and record linkage pipelines across multiple data sources
- Build and evaluate matching algorithms using classical ML, statistical scoring, and fuzzy string-matching techniques
- Develop attribute fusion logic to construct canonical golden records from conflicting multi-source data
- Analyze data quality issues, document findings, and propose remediation strategies
Data Source Evaluation
- Assess new external data sources (open and commercial) for coverage, quality, and applicability to Customer Master use cases
- Apply existing evaluation criteria and contribute additional quality metrics where relevant
- Produce structured evaluation reports with recommendations for adoption or rejection
Analytics & Reporting
- Profile source datasets and track match quality metrics (precision, recall, F1, coverage)
- Build dashboards and analytical summaries to communicate pipeline performance to stakeholders
- Document data lineage, matching logic, and provenance for audit and reproducibility
- Python: Pandas, NumPy, scikit-learn, rapidfuzz / jellyfish
- SQL: Complex queries, window functions, aggregations; Hadoop/Hive or Presto/Trino
- Classical ML & Statistics: Supervised/unsupervised models, probabilistic scoring, clustering, feature engineering
- String matching & NLP: Fuzzy matching (Jaro-Winkler, Levenshtein, TF-IDF), text normalization, tokenization
- Entity Resolution: Record linkage concepts such as blocking, scoring, deduplication, and cluster evaluation
- Data Quality Assessment: Completeness, consistency, coverage metrics; source profiling
- Data Analysis: Exploratory analysis, hypothesis testing, statistical reasoning
Skills
algorithms, analytics, dashboards, python, pandas, numpy, scikit-learn, sql, hadoop, hive, presto, trino, statistics, nlp, data analysis, data quality, feature engineering, hypothesis testing, data lineage