Baking Quality into Your Data Pipeline - Ali Khalid

Description:

Ensuring data quality is widely regarded as one of the most challenging problems in Big Data. It starts with identifying the scope of the data pipeline at each junction. The next step is to pick the data quality dimensions appropriate to the business criticality of the data, and to build automated checks that provide insight into the quality of that data.

The session is designed to give participants an introduction to how a big data project is structured, how data flows, what quality checks are generally used and how to automate them. The main sections in the talk are:

  • Difference between Big Data and conventional data usage
  • Sample technology stack for a big data project
  • Introduction to a data pipeline
  • The kind of tests and automation needed
  • Data quality dimensions (why data quality is important, an explanation of the 6 dimensions, and a demo of how to test them in our sample pipeline)
  • Automating data quality checks
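
To make the idea of automated data quality checks more concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the abstract does not name the six dimensions, so the three scored here (completeness, uniqueness, validity) are assumed from commonly cited lists, and the sample records, field names, and helper functions are all hypothetical.

```python
import re

# Hypothetical sample records; field names are illustrative only.
RECORDS = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 25.5},
    {"id": 2, "email": "not-an-email", "amount": -3.0},
]

# Deliberately loose email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Fraction of rows where `field` is present and not None."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)


def uniqueness(rows, field):
    """Number of distinct values of `field` divided by the number of rows."""
    if not rows:
        return 1.0
    values = [r.get(field) for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, is_valid):
    """Fraction of non-null values of `field` that satisfy `is_valid`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def run_checks(rows):
    """Return a mapping of check name to a score between 0 and 1."""
    return {
        "completeness(email)": completeness(rows, "email"),
        "uniqueness(id)": uniqueness(rows, "id"),
        "validity(email)": validity(
            rows, "email", lambda v: bool(EMAIL_RE.match(v))
        ),
        "validity(amount)": validity(rows, "amount", lambda v: v >= 0),
    }


def failed_checks(scores, threshold=1.0):
    """Names of checks whose score falls below `threshold`, sorted."""
    return sorted(name for name, score in scores.items() if score < threshold)


if __name__ == "__main__":
    scores = run_checks(RECORDS)
    for name, score in scores.items():
        print(f"{name}: {score:.2f}")
    print("failed:", failed_checks(scores))
```

In a real pipeline, checks like these would typically run at each junction, and a failed threshold would block or flag the batch. The thresholds themselves are a business decision, which is why the sketch leaves them as a parameter.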