Twitter has built a Data Quality Platform (DQP) to improve the quality and reliability of its data. The platform addresses the challenge of managing and validating the thousands of datasets the company ingests daily.
The Need for Data Quality
Twitter’s data infrastructure processes an enormous volume of information, with employees running over 10 million queries monthly on nearly an exabyte of data in BigQuery. This massive scale necessitates a robust system to maintain data quality, which is crucial for:
- Confidence: Enhancing trust in data outputs and reducing risk in outcomes
- Productivity: Allowing teams to focus on core solutions rather than data validation
- Revenue Protection: Preventing potential revenue loss due to poor data-driven decisions
The Data Quality Platform Solution
Twitter’s DQP is a managed, config-driven, workflow-based solution designed to:
- Build and collect standard and custom quality metrics
- Alert on data validations
- Monitor metrics and statistics within Google Cloud Platform (GCP)
The platform leverages several key technologies:
- Open-source Great Expectations: For generating query logic
- Custom Stats Collector Library: Provides the operators that query each resource
- Apache Airflow: For workflow and state management
- Google Dataflow: For data transportation into BigQuery
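To make "generating query logic" concrete, here is a minimal sketch of how a declarative check might be rendered as a single BigQuery-style aggregate query. It does not use the Great Expectations API and is not Twitter's implementation; the `NullRateExpectation` class, its fields, and the table name in the usage note are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullRateExpectation:
    """Hypothetical declarative check: at most `max_null_fraction` of the
    values in `column` may be NULL."""
    table: str
    column: str
    max_null_fraction: float


def build_query(exp: NullRateExpectation) -> str:
    """Render the expectation as one aggregate query (BigQuery-style SQL).

    SAFE_DIVIDE returns NULL instead of raising an error on an empty table.
    """
    return (
        f"SELECT SAFE_DIVIDE(COUNTIF({exp.column} IS NULL), COUNT(*)) "
        f"AS null_fraction FROM `{exp.table}`"
    )


def passes(exp: NullRateExpectation, observed_null_fraction: float) -> bool:
    """Compare the value the query returned with the configured threshold."""
    return observed_null_fraction <= exp.max_null_fraction
```

For example, `build_query(NullRateExpectation("project.dataset.events", "user_id", 0.01))` yields a query string that a workflow operator could submit, and `passes` would then be applied to the returned fraction.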
Technical Architecture
A test moves through the DQP in the following order:
- YAML configurations are uploaded to Google Cloud Storage (GCS) via CI/CD workflow
- Airflow workers initiate tests based on resource and cadence specifications
- Test results are sent to a PubSub queue
- Dataflow jobs transfer data from the queue to BigQuery tables
- Results are visualized in Looker for debugging and trend analysis
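The flow above can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the platform's code: the `CONFIG` dictionary plays the role of a parsed YAML file, a `queue.Queue` stands in for the Pub/Sub queue, a plain list stands in for the BigQuery table, and every field and function name is hypothetical.

```python
import json
import queue
from datetime import datetime, timezone

# Hypothetical parsed form of one YAML test definition.
CONFIG = {
    "resource": "project.dataset.served_impressions",
    "cadence": "daily",
    "tests": [{"metric": "row_count", "min_value": 1000}],
}


def run_tests(config, collect_metric):
    """Evaluate each configured test and yield one result record per test.

    `collect_metric(resource, metric)` is a caller-supplied function that
    returns the measured value (a stand-in for a stats collector operator).
    """
    for test in config["tests"]:
        value = collect_metric(config["resource"], test["metric"])
        yield {
            "resource": config["resource"],
            "metric": test["metric"],
            "value": value,
            "passed": value >= test["min_value"],
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }


def publish(results, message_queue):
    """Serialize each result as JSON bytes and put it on the queue."""
    for result in results:
        message_queue.put(json.dumps(result).encode("utf-8"))


def drain(message_queue, sink):
    """Move every queued message into the sink (the 'table' in this sketch)."""
    while not message_queue.empty():
        sink.append(json.loads(message_queue.get()))


if __name__ == "__main__":
    q = queue.Queue()
    table = []
    publish(run_tests(CONFIG, lambda resource, metric: 1500), q)
    drain(q, table)
    print(table)
```

Keeping the test runner, the queue, and the sink as separate steps mirrors the decoupling in the list above: a runner only needs to emit a message, and the transfer into storage can scale or fail independently.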
Impact on Key Work Streams
Revenue Analytics Platform
The implementation of DQP has resulted in:
- 20% reduction in roll-out time for new processing features
- Increased confidence in data delivered to advertisers through continuous measurement
Core Served Impressions Dataset
DQP has provided:
- Automated visibility into deviations between upstream and downstream datasets
- Alignment metrics for over 400 internal customers
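One simple way to express an upstream/downstream alignment metric is the relative difference in row counts. The sketch below assumes that definition and a 1% default tolerance purely for illustration; the source does not say how DQP computes alignment, and the function names are hypothetical.

```python
def relative_deviation(upstream_count: int, downstream_count: int) -> float:
    """Absolute difference in row counts, as a fraction of the upstream count."""
    if upstream_count == 0:
        # Nothing upstream: aligned only if downstream is empty as well.
        return 0.0 if downstream_count == 0 else float("inf")
    return abs(upstream_count - downstream_count) / upstream_count


def is_aligned(upstream_count: int, downstream_count: int,
               tolerance: float = 0.01) -> bool:
    """True when the deviation is within the (assumed) tolerance."""
    return relative_deviation(upstream_count, downstream_count) <= tolerance
```

A scheduled job could record `relative_deviation` for each dataset pair over time and alert when `is_aligned` returns `False`.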
Conclusion
Twitter’s Data Quality Platform combines open-source libraries and GCP services into an end-to-end automated solution. It checks the accuracy and reliability of the thousands of datasets ingested daily, increasing confidence in the data delivered to advertisers and internal stakeholders alike.