# Web Scraping with Python Answer Key
For the 2024 CSHL Advanced Sequencing Technologies & Bioinformatics Analysis course

## Exercise 1: use `requests` to extract data from an API

In [1]:
import requests

We want to use `requests` to query the [Ensembl API](https://rest.ensembl.org/documentation/info/symbol_lookup) and retrieve the record for the TCERG1 gene. How would you build and send this query? Type your answer in the cell below.

In [2]:
gene_symbol = "TCERG1"  # The gene symbol to query
ensembl_gene_url = f"https://rest.ensembl.org/lookup/symbol/homo_sapiens/{gene_symbol}?content-type=application/json"  # The lookup URL for that symbol
ensembl_gene_response = requests.get(ensembl_gene_url)  # Send the GET request with requests
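
The query above assumes the request succeeds. As a minimal sketch (not part of the original answer), the same lookup can be wrapped in a function that sets a timeout and raises an error on a failed reply. The helper names `build_lookup_url` and `fetch_gene_record` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

ENSEMBL_REST = "https://rest.ensembl.org"  # base URL taken from the query above


def build_lookup_url(symbol, species="homo_sapiens"):
    """Return the symbol-lookup URL used in the answer above."""
    return f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}?content-type=application/json"


def fetch_gene_record(symbol, timeout=10):
    """Hypothetical helper: query Ensembl and return the parsed JSON record."""
    response = requests.get(build_lookup_url(symbol), timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx reply
    return response.json()
```

Calling `fetch_gene_record("TCERG1")` should return the same dictionary that `ensembl_gene_response.json()` produces, while a misspelled symbol would raise an exception instead of silently returning an error payload.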

#### 2. How would you view the JSON output for this request?

In [3]:
ensembl_gene_response.json()

{'canonical_transcript': 'ENST00000679501.2',
 'species': 'homo_sapiens',
 'start': 146447311,
 'logic_name': 'ensembl_havana_gene_homo_sapiens',
 'end': 146511961,
 'db_type': 'core',
 'version': 14,
 'strand': 1,
 'assembly_name': 'GRCh38',
 'seq_region_name': '5',
 'id': 'ENSG00000113649',
 'display_name': 'TCERG1',
 'biotype': 'protein_coding',
 'source': 'ensembl_havana',
 'object_type': 'Gene',
 'description': 'transcription elongation regulator 1 [Source:HGNC Symbol;Acc:HGNC:15630]'}
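
Because `.json()` returns an ordinary Python dictionary, individual fields can be read by key. The sketch below uses an abbreviated copy of the record shown above so that it runs without a network connection; the variable names are illustrative only.

```python
# Abbreviated copy of the record shown above, so the example runs offline.
record = {
    "id": "ENSG00000113649",
    "display_name": "TCERG1",
    "seq_region_name": "5",
    "start": 146447311,
    "end": 146511961,
    "strand": 1,
}

# Ensembl coordinates are 1-based and inclusive, hence the +1.
gene_length = record["end"] - record["start"] + 1
strand_symbol = "+" if record["strand"] == 1 else "-"
location = f"chr{record['seq_region_name']}:{record['start']}-{record['end']} ({strand_symbol})"

print(record["id"], location, f"{gene_length} bp")
```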

## Introduction to Web Scraping using `beautifulsoup4` 

#### We're now going to be using `beautifulsoup4` to practice web scraping from the course website: https://meetings.cshl.edu/courses.aspx?course=C-SEQTEC

In [4]:
from bs4 import BeautifulSoup

#### 1. How would you get the list of invited speakers for the course? Type your code in the cell below:

In [5]:
url = 'https://meetings.cshl.edu/courses.aspx?course=C-SEQTEC'
response = requests.get(url)
html_content = response.text
cshl_webpage = BeautifulSoup(html_content, "html.parser")
cshl_webpage.find('div', class_='cspeakers16')

<div class="cspeakers16">
<div class="cspeakers16">
<p class="MsoNormal"><b>Katie Campbell, </b>University of California, Los Angles, Los Angles, CA<br/>
<b>Bimal Chaudhary, </b>Nationwide Children's, Powell, OH<br/>
<b>Justin Kinney, </b>Cold Spring Harbor Laboratory, Cold Spring Harbor, NY<br/>
<b>Yang Li, </b>Washington University in St. Louis, Saint Louis, MO<br/>
<b>Zachary Lippman, </b>CSHL/HHMI, Cold Spring Harbor, NY<br/>
<b>Jessica Mozersky, </b>Washington University in St Louis, St Louis, MO<br/>
<b>Adam Phillippy, </b>National Human Genome Research Institute, Bethesda, MA<br/>
<b>Alex Wagner,</b> Nationwide Children's Hospital, Dublin, OH<br/>
<span style="font-weight: bold;">Jason Williams</span>, <span style="font-size: 1rem;">Cold Spring Harbor Laboratory, Cold Spring Harbor, NY</span></p></div><br/>
</div>

#### 2. How could you convert this to a human-readable form?

In [6]:
instructors = cshl_webpage.find('div', class_='cspeakers16').find("div", class_="cspeakers16")
instructors = instructors.get_text().replace("\xa0", " ").strip()
print(instructors)

Katie Campbell, University of California, Los Angles, Los Angles, CA
Bimal Chaudhary, Nationwide Children's, Powell, OH
Justin Kinney, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
Yang Li, Washington University in St. Louis, Saint Louis, MO
Zachary Lippman, CSHL/HHMI, Cold Spring Harbor, NY
Jessica Mozersky, Washington University in St Louis, St Louis, MO
Adam Phillippy, National Human Genome Research Institute, Bethesda, MA
Alex Wagner, Nationwide Children's Hospital, Dublin, OH
Jason Williams, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
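
Once the text is cleaned, each line can be split into a name and an affiliation for further analysis. This is a rough sketch that assumes every line starts with the speaker's name followed by a comma, as in the output above; three lines are copied in directly so the example runs offline.

```python
# Three lines copied from the printed output above, so the example runs offline.
instructors_text = """Katie Campbell, University of California, Los Angles, Los Angles, CA
Zachary Lippman, CSHL/HHMI, Cold Spring Harbor, NY
Jason Williams, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY"""

speakers = []
for line in instructors_text.splitlines():
    # Split on the first comma only: the name comes first, the rest is the affiliation.
    name, affiliation = line.split(",", 1)
    speakers.append({"name": name.strip(), "affiliation": affiliation.strip()})

for speaker in speakers:
    print(f"{speaker['name']}: {speaker['affiliation']}")
```

Splitting on only the first comma keeps affiliations that themselves contain commas in one piece. A line without any comma would raise a `ValueError` here, so real pages may need an extra check.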


#### 3. Suppose we want to extract the dates for the course, and we know that the dates are in a `div` with the class `cdate16`. Write a query that outputs the dates using the `get_text()` method.

In [7]:
dates = cshl_webpage.find('div', class_='cdate16')
dates.get_text()

'November 10 - 23, 2024'
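
The scraped value is still a plain string. As one possible follow-up, it could be converted into `date` objects with the standard library. This sketch assumes the exact pattern shown above (month name, start day, end day, year, all within a single month) and an English locale for the month name; other course pages may format dates differently.

```python
import re
from datetime import datetime

date_text = "November 10 - 23, 2024"  # string returned by get_text() above

# Assumes the pattern "<Month> <start day> - <end day>, <year>" within a single month.
match = re.fullmatch(r"(\w+) (\d{1,2}) - (\d{1,2}), (\d{4})", date_text.strip())
if match is None:
    raise ValueError(f"Unexpected date format: {date_text!r}")

month, start_day, end_day, year = match.groups()
start_date = datetime.strptime(f"{month} {start_day} {year}", "%B %d %Y").date()
end_date = datetime.strptime(f"{month} {end_day} {year}", "%B %d %Y").date()

# Print the first day, the last day, and the inclusive number of course days.
print(start_date, end_date, (end_date - start_date).days + 1)
```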