Fantasy Tour de France

A fun project that scrapes past results to try to build the historically best Fantasy Tour de France team.

Presentation

github src

Rules

Pick (exactly) 9 riders
Each rider costs a fixed number of points
Team must cost <= 100 points
Must pick 2 All-Rounders, 2 Climbers, 1 Sprinter, 3 Unclassed Riders and 1 Wild Card
No team changes/substitutions after Tour starts
Scoring is based on stages and various other points each rider can collect along the course of the race
The winning (fantasy) team is not necessarily composed of the riders who came 1st-9th.
- https://www.velogames.com/tour-de-france/2017/rules.php
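
The rules above can be sketched as a small validity check. This is only an illustration, not code from the project: the category labels, the tuple layout and the idea that the wild card may come from any category are assumptions, and every rider other than Froome and Porte (and every cost other than theirs) is a made-up placeholder.

```python
from collections import Counter

BUDGET = 100
TEAM_SIZE = 9
# Minimum riders per category; the ninth pick is the wild card
# (assumed here to be allowed from any category).
REQUIRED = {"All Rounder": 2, "Climber": 2, "Sprinter": 1, "Unclassed": 3}


def is_valid_team(team):
    """team is a list of (name, category, cost) tuples."""
    names = [name for name, _, _ in team]
    if len(team) != TEAM_SIZE or len(set(names)) != TEAM_SIZE:
        return False
    if sum(cost for _, _, cost in team) > BUDGET:
        return False
    counts = Counter(category for _, category, _ in team)
    # With 9 riders and 8 required slots, one rider is left over
    # to act as the wild card.
    return all(counts[cat] >= need for cat, need in REQUIRED.items())


# Hypothetical team: placeholder riders and costs, for illustration only.
team = [
    ("Christopher Froome", "All Rounder", 26),
    ("Richie Porte", "All Rounder", 22),
    ("Climber A", "Climber", 10),
    ("Climber B", "Climber", 8),
    ("Sprinter A", "Sprinter", 8),
    ("Unclassed A", "Unclassed", 6),
    ("Unclassed B", "Unclassed", 6),
    ("Unclassed C", "Unclassed", 4),
    ("Wild Card A", "Climber", 6),
]
print(is_valid_team(team))
```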

Goals

Come up with 3-4 team suggestions
Secretly enter the fantasy league
…..
Bragging rights

Collecting the Data

Gather Rider Data

Data to Collect
- For every available rider choice in 2017:
  - Previous year(s) costs, scores and categories
  - 2017 cost and category

Riders Breakdown

Christopher Froome | Team Sky        | 26 Points
Richie Porte       | BMC Racing Team | 22 Points
….
191 Total Riders

Velobets

Data Gathering Complications

Each rider’s cost and past performance are on separate web pages
Previous years' web pages use different formats
Some pages are behind a login (cookie/auth issues)

Data Gathering Solution

import mechanize
from bs4 import BeautifulSoup

# all_players, player_base and year (a string) are set elsewhere in the script

br = mechanize.Browser()
# ignore robots.txt
br.set_handle_robots(False)
# pretend to be mozilla
br.addheaders = [("User-agent", "Mozilla/5.0")]

# get the markup for the all players page
response = br.open(all_players)

assert response.code == 200

# get the link to each player's page (skip anchors without an href)
soup = BeautifulSoup(response.read(), 'html.parser')
player_links = [(link['href'], link.getText())
                for link in soup.find_all('a', href=True)
                if 'riderprofile.php' in link['href']]

# go to each player's page and retrieve all the stage stats
for (player_link, player_name) in player_links:
    response = br.open(player_base + player_link)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for tr in soup.find_all('tr'):
        # one CSV row per table row: name, year, then every cell
        row = player_name + ',' + year + ','
        for td in tr.find_all('td'):
            row += td.getText().strip('\r\n') + ','
        # drop non-ASCII characters so the output is plain ASCII
        print(row.encode('ascii', 'ignore').decode('ascii'))

Final Raw Datasets

velobet (master) $ head player_2017.csv.bak | column -s ',' -t

Category       Cost  Name                Team
All Rounder 1  26    Christopher Froome  Team Sky
All Rounder 1  22    Richie Porte        BMC Racing Team
All Rounder 1  16    Alberto Contador    Trek - Segafredo
All Rounder 1  16    Alejandro Valverde  Movistar Team
All Rounder 1  14    Thibaut Pinot       FDJ
All Rounder 1  12    Geraint Thomas      Team Sky
All Rounder 1  10    Andrey Amador       Movistar Team
All Rounder 1  10    Ion Izagirre        Bahrain Merida Pro Cycling Team
All Rounder 1  10    Diego Ulissi        UAE Team Emirates
...

velobet (master) $ head player_2014.csv.bak | column -s ',' -t

PlayerName       Year  Stage    STG  GC  PC  KOM  SPR  SUM  BKY  ASS  Total
Vincenzo Nibali  2014  Stage 1  -    -   -   -    -    -    -    -    0
Vincenzo Nibali  2014  Stage 2  150  25  2   -    -    -    -    4    181
Vincenzo Nibali  2014  Stage 3  -    25  -   -    -    -    -    4    29
Vincenzo Nibali  2014  Stage 4  -    25  -   -    -    -    -    4    29
Vincenzo Nibali  2014  Stage 5  100  25  2   -    -    -    -    14   141
Vincenzo Nibali  2014  Stage 6  -    25  -   -    -    -    -    10   35
Vincenzo Nibali  2014  Stage 7  -    25  -   -    -    -    -    10   35
Vincenzo Nibali  2014  Stage 8  100  25  -   -    -    -    -    10   135
Vincenzo Nibali  2014  Stage 9  -    22  -   -    -    -    -    6    28
...
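
As a rough sketch of how the per-stage rows could be rolled up into a season score per rider, the snippet below sums the Total column. It assumes the CSV is comma-separated with the header row shown above; the function name season_totals is hypothetical, and the three sample rows are copied from the 2014 listing.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the 2014 sample above, written back as CSV.
SAMPLE = """PlayerName,Year,Stage,STG,GC,PC,KOM,SPR,SUM,BKY,ASS,Total
Vincenzo Nibali,2014,Stage 1,-,-,-,-,-,-,-,-,0
Vincenzo Nibali,2014,Stage 2,150,25,2,-,-,-,-,4,181
Vincenzo Nibali,2014,Stage 3,-,25,-,-,-,-,-,4,29
"""


def season_totals(csv_file):
    """Sum the Total column per (rider, year). Hypothetical helper."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[(row["PlayerName"], row["Year"])] += int(row["Total"])
    return dict(totals)


totals = season_totals(io.StringIO(SAMPLE))
print(totals)  # {('Vincenzo Nibali', '2014'): 210}
```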

Build the Model

Look at all the possible team combinations in the past that
- cost <= 100 points
- satisfy the category requirements
For each team_combination
- compute the final score the team would have scored in that year
Choose the historically highest scoring teams for each year
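
A minimal brute-force sketch of the idea above, on a toy pool. The function name best_team, the tuple layout and all riders, costs and scores are made up for illustration; the toy uses a team of 3 and a budget of 20 rather than the real 9 and 100. Exhaustive enumeration like this is only practical for tiny pools, which is what motivates the cost-combination approach further down.

```python
from collections import Counter
from itertools import combinations


def best_team(riders, team_size, budget, required):
    """Exhaustively score every legal team and return the best one.

    riders: list of (name, category, cost, score) tuples.
    required: minimum number of riders per category.
    """
    best, best_score = None, -1
    for team in combinations(riders, team_size):
        if sum(r[2] for r in team) > budget:
            continue  # over budget
        counts = Counter(r[1] for r in team)
        if any(counts[cat] < need for cat, need in required.items()):
            continue  # category requirements not met
        score = sum(r[3] for r in team)
        if score > best_score:
            best, best_score = team, score
    return best, best_score


# Toy pool: names, costs and scores are placeholders.
pool = [
    ("A", "Climber", 10, 300),
    ("B", "Climber", 6, 250),
    ("C", "Sprinter", 8, 200),
    ("D", "Sprinter", 4, 90),
    ("E", "Unclassed", 4, 120),
]
team, score = best_team(pool, team_size=3, budget=20,
                        required={"Climber": 1, "Sprinter": 1})
print([r[0] for r in team], score)  # ['A', 'B', 'D'] 640
```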

Dataset Size

wc -l cost_combinations.csv
1530652 cost_combinations.csv

There are ~1.5 million combinations of costs that total <= 100 points. Combined with 191 players, this is ~400 billion potential team combinations to sort through.

Model v2

The same model idea, but cut down the number of teams to sort through by reducing the total number of combinations.

Start with a cost combo

26,16,4,6,4,4,8,8,4,80

This is a combination of costs that totals <= 100 points and satisfies the category requirements. Each of the first nine columns is the cost for one required category slot; the last column is the total of the columns to its left.
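
One way rows like this could be generated is to take the distinct costs on offer for each slot and keep every combination whose total fits the budget. This is a guess at the approach, not the project's actual generator: the function name cost_combinations, the slot order and the toy costs and budget below are assumptions.

```python
from itertools import product

BUDGET = 100


def cost_combinations(costs_by_slot, budget=BUDGET):
    """Yield one cost per slot plus the total, like a row of the CSV.

    costs_by_slot: list of iterables, the distinct costs on offer per slot.
    """
    for combo in product(*costs_by_slot):
        total = sum(combo)
        if total <= budget:
            yield combo + (total,)


# Toy example with three slots and a budget of 46 (illustrative numbers).
rows = list(cost_combinations([[26, 22], [16, 14], [4, 6]], budget=46))
for row in rows:
    print(",".join(str(c) for c in row))
```

With the real data the same filter would run over nine slots, which is where a row count in the millions comes from.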

Paring down

With the current cost combo (26,16,4,6,4,4,8,8,4,80)
- Produce all player combinations that match that cost
  - Multiple players can have the same cost, so use Python set()s to collect unique combinations
  - Use Python's yield construct (generators) for faster performance
- From the given potential players, choose the highest scoring players - create a team with these players.

    # inside the generator that enumerates candidate teams:
    # skip any combination that uses the same player twice
    for combo in bucket_combo(best_avail_players):
        if len(set(combo)) != len(combo):
            continue
        yield combo

def bucket_combo(l, depth=0):
    ''' return all combinations of items in buckets (one item per bucket) '''
    for item in l[0]:
        if len(l) > 1:
            for result in bucket_combo(l[1:], depth + 1):
                yield [item] + result
        else:
            yield [item]
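
To show how the pieces might fit together for a single cost combo, here is a self-contained sketch. It uses itertools.product, which yields one item per bucket in the same order as bucket_combo above. The function name best_team_for_costs, the riders dict layout, the use of None for a wild-card slot and all riders and scores are assumptions for illustration.

```python
from itertools import product


def best_team_for_costs(cost_combo, slot_categories, riders):
    """Return (score, names) of the best team with one exact cost per slot.

    cost_combo: cost required in each slot (without the trailing total).
    slot_categories: category required in each slot; None means any.
    riders: dict name -> (category, cost, score).
    """
    buckets = []
    for category, cost in zip(slot_categories, cost_combo):
        # All riders that could fill this slot at exactly this cost.
        bucket = [name for name, (cat, c, _) in riders.items()
                  if (category is None or cat == category) and c == cost]
        if not bucket:
            return None  # no rider fits this slot
        buckets.append(bucket)

    best = None
    for combo in product(*buckets):
        if len(set(combo)) != len(combo):
            continue  # same rider picked for two slots
        score = sum(riders[name][2] for name in combo)
        if best is None or score > best[0]:
            best = (score, combo)
    return best


# Toy data: placeholder riders, costs and scores.
riders = {
    "A": ("Climber", 8, 300),
    "B": ("Climber", 8, 250),
    "C": ("Climber", 8, 100),
    "D": ("Sprinter", 4, 90),
    "E": ("Sprinter", 4, 120),
}
best = best_team_for_costs((8, 8, 4), ("Climber", "Climber", "Sprinter"), riders)
print(best)  # (670, ('A', 'B', 'E'))
```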

Model Optimization Results

Tweaking the algorithm results in
- Load all 1.5M cost combinations into an indexed map: ~30s
- Generate the best team for each cost combination for each year individually, and average for all 3 years:
  - ~180s*
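
The per-year averaging step could look roughly like the sketch below. It assumes the best team score for each cost combination has already been computed per year; the function name rank_by_average, the year labels and all scores are placeholders.

```python
def rank_by_average(best_by_year):
    """Rank cost combos by their average best-team score across years.

    best_by_year: dict year -> dict cost_combo -> best team score.
    """
    years = list(best_by_year)
    # Only combos that produced a team in every year are averaged.
    common = set.intersection(*(set(best_by_year[y]) for y in years))
    averages = {
        combo: sum(best_by_year[y][combo] for y in years) / len(years)
        for combo in common
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder years and scores, for illustration only.
best_by_year = {
    "y1": {(26, 16, 4): 900, (22, 14, 6): 1000},
    "y2": {(26, 16, 4): 1200, (22, 14, 6): 800},
    "y3": {(26, 16, 4): 900, (22, 14, 6): 900, (22, 16, 4): 500},
}
ranking = rank_by_average(best_by_year)
print(ranking[0])  # ((26, 16, 4), 1000.0)
```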

Velobets

Actual Results Day 1

Velobets

Actual Results Day 3

Velobets

Actual Results Day 17

Velobets

Final Results

15th place
Out of 19.