ATP Tennis Match Analysis and Anomaly Detection
This project analyzes ATP tennis match data with a deep learning model built on joint embeddings. The goal is to detect anomalies in professional men's tennis tournament draws. It uses PyTorch to build and train the neural network, Optuna for hyperparameter optimization, and DBSCAN for anomaly detection.
Table of Contents
- Overview
- Features
- Setup
- Usage
- Model Architecture
- Hyperparameter Optimization
- Anomaly Detection
- Results
- Contributing
- License
Overview
The project aims to identify irregularities in tennis matches by examining patterns and discrepancies in player rankings, ages, and other match-related features. This analysis can help detect potential biases or unusual outcomes in tournament draws.
Features
- Data Loading and Preprocessing: Handles ATP match data from multiple years, with preprocessing steps including encoding categorical features and handling missing values.
- Feature Engineering: Creates new features such as age difference and rank difference between players.
- Joint Embedding Neural Network: A PyTorch-based model that combines categorical and numerical features for robust prediction of match outcomes.
- Hyperparameter Tuning: Uses Optuna for efficient optimization of model hyperparameters.
- Anomaly Detection: Applies DBSCAN clustering to the embeddings generated by the model to identify anomalies in player performance.
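As a rough illustration of the feature engineering step, the sketch below derives the age and rank differences with pandas. The column names (`winner_age`, `loser_age`, `winner_rank`, `loser_rank`) and the helper `add_difference_features` are assumptions made for this example; the project's actual schema and code may differ.

```python
import pandas as pd


def add_difference_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `matches` with age_diff and rank_diff columns added."""
    out = matches.copy()
    # Hypothetical column names; adjust them to the actual CSV schema.
    out["age_diff"] = out["winner_age"] - out["loser_age"]
    out["rank_diff"] = out["winner_rank"] - out["loser_rank"]
    return out


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame(
    {
        "winner_age": [24.5, 31.0],
        "loser_age": [28.0, 22.5],
        "winner_rank": [5, 40],
        "loser_rank": [12, 3],
    }
)
features = add_difference_features(sample)
print(features[["age_diff", "rank_diff"]])
```

Working on a copy leaves the loaded data untouched, which makes it easier to rerun the step with different feature definitions.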
Setup
Prerequisites
- Python 3.8 or later
- PyTorch
- Optuna
- Scikit-learn
- Matplotlib
- Pandas
- NumPy
Installation
Clone the repository:
git clone https://github.com/yourusername/atp-tennis-analysis.git
cd atp-tennis-analysis
Install the dependencies:
pip install -r requirements.txt
Download the ATP match data files and place them in the project directory. Ensure the files are named in the format atp_matches_.csv (e.g., atp_matches_2000.csv).
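One possible way to read the yearly files described above into a single table is sketched below. The function name `load_matches` and the glob pattern `atp_matches_*.csv` are assumptions based on the example file name, not the project's confirmed loader.

```python
from pathlib import Path

import pandas as pd


def load_matches(data_dir: str = ".") -> pd.DataFrame:
    """Concatenate every yearly match file found in `data_dir`."""
    # Assumed naming pattern, inferred from the example atp_matches_2000.csv.
    files = sorted(Path(data_dir).glob("atp_matches_*.csv"))
    if not files:
        raise FileNotFoundError(f"no atp_matches_*.csv files found in {data_dir!r}")
    frames = [pd.read_csv(path) for path in files]
    # ignore_index gives the combined table one continuous row index.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_matches(".")` from the project directory would then return one DataFrame covering all downloaded years.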
Usage
Run the main script to load the data, preprocess it, and train the model:
python main.py
Model Training
The script trains the model using the preprocessed data, optimizing hyperparameters with Optuna, and saves the best-performing model.
Anomaly Detection
The model’s predictions are used to perform anomaly detection, identifying unusual matches or player performances.
View Results
Results, including anomaly plots and metrics, will be saved in the output directory. CSV files summarizing the anomalies per player, year, and tournament will also be generated.
Model Architecture
The JointEmbeddedModel consists of:
- Embeddings for Categorical Features: Each categorical variable (e.g., player IDs, tournament IDs) is embedded into a dense vector.
- Fully Connected Layers: These layers combine embeddings and numerical features to predict match outcomes.
- Dropout Layers: Used to prevent overfitting and improve model generalization.
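The three components above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's actual JointEmbeddedModel: the class name `JointEmbeddingSketch`, the layer sizes, the single scalar output, and the choice to return the hidden representation alongside the prediction are all assumptions.

```python
import torch
import torch.nn as nn


class JointEmbeddingSketch(nn.Module):
    """Hypothetical joint-embedding network: embeddings + numeric features -> MLP."""

    def __init__(self, cardinalities, num_numeric, embed_dim=8, hidden_dim=32, dropout=0.2):
        super().__init__()
        # One embedding table per categorical feature (e.g. player ID, tournament ID).
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, embed_dim) for n in cardinalities]
        )
        in_dim = embed_dim * len(cardinalities) + num_numeric
        self.hidden = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),  # regularization against overfitting
        )
        self.head = nn.Linear(hidden_dim, 1)  # assumed single scalar output

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_categorical) integer indices; x_num: (batch, num_numeric) floats.
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        joint = torch.cat(embedded + [x_num], dim=1)
        hidden = self.hidden(joint)
        return self.head(hidden), hidden


model = JointEmbeddingSketch(cardinalities=[100, 20], num_numeric=2)
x_cat = torch.randint(0, 20, (4, 2))  # random indices valid for both tables
x_num = torch.randn(4, 2)
prediction, embedding = model(x_cat, x_num)
print(prediction.shape, embedding.shape)
```

Returning the hidden representation is one way to make per-match embeddings available to a later clustering step.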
Hyperparameter Optimization
The project uses Optuna to automatically search for the best combination of model parameters, including:
- Embedding dimension
- Hidden layer size
- Learning rate
- Batch size
- Dropout rate
Anomaly Detection
Anomalies are detected by comparing expected and actual rank differences in matches using DBSCAN clustering. Anomalies can indicate unexpected match outcomes, potential biases, or errors in player rankings.
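To show how DBSCAN can surface outliers, the sketch below clusters synthetic vectors and treats the points that scikit-learn labels as noise (`-1`) as anomalies. The synthetic data, the `eps` and `min_samples` values, and the noise-equals-anomaly rule are assumptions for illustration; the project may define anomalies differently.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Stand-in for model embeddings: one dense group plus three far-away points.
dense = rng.normal(0.0, 0.3, size=(200, 8))
far = rng.normal(0.0, 0.3, size=(3, 8)) + 10.0
points = np.vstack([dense, far])

# Assumed settings; eps and min_samples need tuning on real embeddings.
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(points)

# DBSCAN assigns the label -1 to points that belong to no cluster.
anomaly_idx = np.flatnonzero(labels == -1)
print("anomalous rows:", anomaly_idx)
```

Because three points cannot satisfy `min_samples=5`, the far-away group never forms its own cluster and is reported as noise.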
Results
- Positive Anomalies: Matches where the predicted rank difference was significantly lower than expected.
- Negative Anomalies: Matches where the predicted rank difference was significantly higher than expected.
The results are visualized using t-SNE plots and saved as images and CSV files.
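A t-SNE plot of this kind could be produced along the following lines. The random vectors, the anomaly flags, the perplexity value, and the output file name `tsne_anomalies.png` are placeholders chosen for this sketch, not the project's actual data or settings.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 8))  # stand-in for model embeddings
is_anomaly = np.zeros(60, dtype=bool)
is_anomaly[:5] = True  # stand-in anomaly flags

# Project the 8-dimensional vectors to 2-D; perplexity must be below the sample count.
projection = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

fig, ax = plt.subplots()
ax.scatter(projection[~is_anomaly, 0], projection[~is_anomaly, 1], s=10, label="normal")
ax.scatter(projection[is_anomaly, 0], projection[is_anomaly, 1], s=30, marker="x", label="anomaly")
ax.legend()
fig.savefig("tsne_anomalies.png")  # hypothetical output file name
```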
Contributing
Contributions are welcome! Feel free to submit a pull request or open an issue for any improvements or bugs you encounter.
License
This project is licensed under the MIT License. See the LICENSE file for more details.