# Webscraping with Python Exercises 
For the 2024 CSHL Advanced Sequencing Technologies & Bioinformatics Analysis course

## Exercise 1: use `requests` to extract data from an API

In [1]:
import requests

We are interested in using `requests` to query the [Ensembl API](https://rest.ensembl.org/documentation/info/symbol_lookup) to get back a record for the TCERG1 gene. How would you build this query? Type your answer in the cell below.

In [12]:
gene_symbol = "TCERG1"             # The symbol you want to query
ensembl_gene_url = f"https://rest.ensembl.org/lookup/symbol/homo_sapiens/{gene_symbol}?content-type=application/json"        # The URL you will use to query
ensembl_gene_response = requests.get(ensembl_gene_url)  # Make the query with requests

How would you check if your query was successful?

In [13]:
ensembl_gene_response.status_code

200
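As a minimal sketch of the same check done defensively: instead of reading `status_code` by eye, `raise_for_status()` raises an exception for any 4xx or 5xx response. The helper names `build_lookup_url` and `fetch_gene_record`, the `ENSEMBL_SERVER` constant, and the 10-second timeout are illustrative assumptions, not part of the course material.

```python
import requests

ENSEMBL_SERVER = "https://rest.ensembl.org"  # assumed constant, for illustration


def build_lookup_url(gene_symbol, species="homo_sapiens"):
    """Return the symbol-lookup URL in the same form used in the cell above."""
    return (
        f"{ENSEMBL_SERVER}/lookup/symbol/{species}/{gene_symbol}"
        "?content-type=application/json"
    )


def fetch_gene_record(gene_symbol, species="homo_sapiens", timeout=10):
    """Query Ensembl and return the parsed JSON record, or raise on failure."""
    response = requests.get(build_lookup_url(gene_symbol, species), timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently
    response.raise_for_status()
    return response.json()
```

Calling `fetch_gene_record("TCERG1")` needs network access and should return the same kind of record as the query above. A misspelled symbol would raise an error rather than hand back a response that then has to be inspected manually.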

How would you view the response to your query in a JSON format?

In [14]:
ensembl_gene_response.json()

{'description': 'transcription elongation regulator 1 [Source:HGNC Symbol;Acc:HGNC:15630]',
 'biotype': 'protein_coding',
 'strand': 1,
 'start': 146447311,
 'assembly_name': 'GRCh38',
 'db_type': 'core',
 'species': 'homo_sapiens',
 'seq_region_name': '5',
 'source': 'ensembl_havana',
 'object_type': 'Gene',
 'display_name': 'TCERG1',
 'canonical_transcript': 'ENST00000679501.2',
 'end': 146511961,
 'logic_name': 'ensembl_havana_gene_homo_sapiens',
 'version': 14,
 'id': 'ENSG00000113649'}
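Because `.json()` returns an ordinary Python dictionary, individual fields can be pulled out by key. The sketch below copies a few values from the output above into a literal dictionary so it runs without a network connection. The variable names and the summary format are illustrative choices, and the length calculation assumes Ensembl's 1-based, inclusive coordinates.

```python
# A subset of the record shown above, copied into a plain dictionary
gene_record = {
    "display_name": "TCERG1",
    "id": "ENSG00000113649",
    "seq_region_name": "5",
    "start": 146447311,
    "end": 146511961,
    "strand": 1,
    "biotype": "protein_coding",
}

# Both ends are included in the interval, hence the + 1
gene_length = gene_record["end"] - gene_record["start"] + 1
strand_symbol = "+" if gene_record["strand"] == 1 else "-"

summary = (
    f"{gene_record['display_name']} ({gene_record['id']}): "
    f"chr{gene_record['seq_region_name']}:"
    f"{gene_record['start']}-{gene_record['end']} "
    f"({strand_symbol}), {gene_length} bp"
)
print(summary)

# .get() returns a default instead of raising KeyError for an absent field
synonyms = gene_record.get("synonyms", [])
```

With a live query, `gene_record` would simply be `ensembl_gene_response.json()`.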

## Introduction to Web Scraping using `beautifulsoup4` 

#### We're now going to be using `beautifulsoup4` to practice web scraping from the course website: https://meetings.cshl.edu/courses.aspx?course=C-SEQTEC

In [15]:
from bs4 import BeautifulSoup

#### 1. How would you get the list of invited speakers for the course? What is `div` doing in this command?

In [16]:
url = 'https://meetings.cshl.edu/courses.aspx?course=C-SEQTEC'
response = requests.get(url)
html_content = response.text
cshl_webpage = BeautifulSoup(html_content, "html.parser")
cshl_webpage
#cshl_webpage.find('div', class_='fill in flag here')
#cshl_webpage


<!DOCTYPE html>

<html lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-30723914-1"></script>
<script>
        window.dataLayer = window.dataLayer || [];
        function gtag(){dataLayer.push(arguments);}
        gtag('js', new Date());

        gtag('config', 'UA-30723914-1');
        gtag('config', 'G-85036B76HX');

    </script>
<title>
	Advanced Sequencing Technologies &amp; Bioinformatics Analysis 2024 | CSHL
</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="Cold Spring Harbor Meetings and Courses - Long Island, New York. Scientific Conferences and Courses For Research and Education" name="description"/><link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/><link href="https://cdn.jsdelivr.net/jquery.jssocials/1.4.0/jssocials.css" rel="stylesheet" type="text/css"/><link href="https://cdn.jsdelivr.

In [25]:
cshl_webpage.find('div', class_='fill in flag here')
#cshl_webpage

#### The following cell can be run to convert this to a human-readable form:

In [27]:
instructors = cshl_webpage.find('div', class_='cspeakers16').find("div", class_="cspeakers16")
instructors = instructors.get_text().replace("\xa0", " ").strip()
print(instructors)

Katie Campbell, University of California, Los Angles, Los Angles, CA
Bimal Chaudhary, Nationwide Children's, Powell, OH
Justin Kinney, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
Yang Li, Washington University in St. Louis, Saint Louis, MO
Zachary Lippman, CSHL/HHMI, Cold Spring Harbor, NY
Jessica Mozersky, Washington University in St Louis, St Louis, MO
Adam Phillippy, National Human Genome Research Institute, Bethesda, MA
Alex Wagner, Nationwide Children's Hospital, Dublin, OH
Jason Williams, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
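To see what the chained `find` calls are doing, it can help to run them on a small inline HTML string. The markup below is an invented stand-in: the class name `speakers`, the heading, and the two names are hypothetical and do not mirror the real CSHL page. The first `find` returns the outer `div`, and the second searches only inside it and returns the nested `div` of the same class.

```python
from bs4 import BeautifulSoup

# Toy HTML (hypothetical structure and names, for illustration only)
toy_html = """
<div class="speakers">
  <h3>Invited speakers</h3>
  <div class="speakers">
    Ada Example, Example University<br/>
    Bo Sample,&nbsp;Sample Institute
  </div>
</div>
"""

toy_page = BeautifulSoup(toy_html, "html.parser")

# First match: the outer <div class="speakers">, heading included
outer = toy_page.find("div", class_="speakers")
# Searching within the outer div returns the nested div, without the heading
inner = outer.find("div", class_="speakers")

# Replace non-breaking spaces, then keep one cleaned entry per non-empty line
text = inner.get_text().replace("\xa0", " ")
names = [line.strip() for line in text.splitlines() if line.strip()]
print(names)
```

Splitting into a list, rather than printing one block of text, makes it easier to loop over the entries or count them later.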


#### 3. Suppose we want to extract the dates for the course, and we know that they are in a `div` whose class is `cdate16`. Write a query that prints the dates using the `get_text()` method.

In [31]:
# Find the div that holds the course dates and print its text
dates = cshl_webpage.find('div', class_='cdate16')
dates = dates.get_text()
print(dates)

November 10 - 23, 2024
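One caveat with the pattern above: `find` returns `None` when nothing matches, so calling `get_text()` on the result fails with an `AttributeError` if the class name is mistyped or the page layout changes. A small sketch of a guard follows. The helper name `text_of` and the inline HTML are assumptions for illustration, and only the date string is taken from the output above.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the relevant part of the page (assumed structure)
html = '<div class="cdate16">\n  November 10 - 23, 2024\n</div>'
soup = BeautifulSoup(html, "html.parser")


def text_of(page, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    element = page.find(tag, class_=css_class)
    if element is None:
        return None
    # strip=True trims the whitespace surrounding the text
    return element.get_text(strip=True)


dates = text_of(soup, "div", "cdate16")
missing = text_of(soup, "div", "no-such-class")
print(dates, missing)
```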
