If I have a DataFrame:
import pandas as pd

myDF = pd.DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns = ['A','B'])

This gives the following DataFrame (I'm starting out on Stack Overflow and don't have enough reputation to post an image of it):

    A   B
0  11  11
1  22  2A
2  33  33
If I want to convert column B to int values and drop the values that can't be converted, I have to do:
def convertToInt(cell):
    try:
        return int(cell)
    except:
        return None

myDF['B'] = myDF['B'].apply(convertToInt)
If I only do:
myDF['B'].apply(int)
the error obviously is:
C:\WinPython-32bit-2.7.5.3\python-2.7.5\lib\site-packages\pandas\lib.pyd in pandas.lib.map_infer (pandas\lib.c:42840)()
ValueError: invalid literal for int() with base 10: '2A'
Is there a way to add exception handling to myDF['B'].apply()?
Thank you in advance!
asked Apr 3, 2014 at 19:36 by RukTech
I had the same question, but for a more general case where it was hard to tell if the function would generate an exception (i.e. you couldn't explicitly check this condition with something as straightforward as isdigit).
After thinking about it for a while, I came up with the solution of embedding the try/except syntax in a separate function. I'm posting a toy example in case it helps anyone.
import pandas as pd
import numpy as np

x = pd.DataFrame(np.array([['a','a'], [1,2]]))

def augment(x):
    try:
        return int(x) + 1
    except:
        return 'error:' + str(x)

x[0].apply(lambda x: augment(x))
answered Feb 22, 2017 at 15:39 by atkat12
A way to achieve that with a lambda:
myDF['B'].apply(lambda x: int(x) if str(x).isdigit() else None)
For your input:
>>> myDF
A B
0 11 11
1 22 2A
2 33 33
[3 rows x 2 columns]
>>> myDF['B'].apply(lambda x: int(x) if str(x).isdigit() else None)
0 11
1 NaN
2 33
Name: B, dtype: float64
answered Apr 3, 2014 at 19:54 by Amit
much better/faster to do:
In [1]: myDF = DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns = ['A','B'])
In [2]: myDF.convert_objects(convert_numeric=True)
Out[2]:
A B
0 11 11
1 22 NaN
2 33 33
[3 rows x 2 columns]
In [3]: myDF.convert_objects(convert_numeric=True).dtypes
Out[3]:
A int64
B float64
dtype: object
This is a vectorized method of doing just this. The convert_numeric=True flag says to coerce to NaN anything that cannot be converted to numeric.
You can of course do this to a single column if you'd like.
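Note that convert_objects was deprecated and later removed in newer pandas; on a current version, pd.to_numeric with errors="coerce" is the vectorized equivalent. A minimal sketch of the same conversion:

import pandas as pd

myDF = pd.DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns=['A','B'])
myDF['B'] = pd.to_numeric(myDF['B'], errors='coerce')  # anything unparseable becomes NaN
myDF.dtypes  # A int64, B float64 (float because of the NaN)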
answered Apr 3, 2014 at 20:20 by Jeff
Error messages
- It's always better to be more specific about the cause of an error:
x = -1
if not isinstance(x,str): ## check if x is a str
    errstr = "x is of type " + type(x).__name__ + ", should be str"
    raise TypeError(errstr)
TypeError: x is of type int, should be str
f-strings are a convenient way to construct error messages: anything inside curly brackets is interpreted as a Python expression. e.g.
x=1
print(f"x is of type {type(x).__name__}, should be str")
## x is of type int, should be str
So we could use
if not isinstance(x,str): ## check if x is a str
    raise TypeError(f"x is of type {type(x).__name__}, should be str")

x = -1
if x<0:
    raise ValueError(f"x should be non-negative, but it equals {x}")
ValueError: x should be non-negative, but it equals -1
Handling errors
Now suppose you are getting an error and you don't want your program to stop. "Wrapping" your code in a try: clause will allow you to specify what to do in this case. pass is a special Python statement called a "null operation" or a "no-op"; it does nothing except allow execution to keep going.
import math

try:
    x = math.sqrt(-1)
except:
    pass
## keep going (but x will not be set)
You can specify something you want to do with only a particular set of errors:
try:
    x = math.sqrt(-1)
except ValueError:
    print("a ValueError occurred")
except:
    print("some other error occurred")
## keep going (but x will not be set)
## a ValueError occurred
If the error isn't caught because it isn't the right type, it will act as it normally would (without the try:):
try:
    z += 5 ## not defined yet
except ValueError:
    print("a ValueError occurred")
NameError: name 'z' is not defined
We could catch this with a general-purpose except:
try:
    z += 5 ## not defined yet
except ValueError:
    print("a ValueError occurred")
except:
    print("some other error occurred")
## some other error occurred
Or add another clause to catch it:
try:
    z += 5 ## not defined yet
except ValueError:
    print("a ValueError occurred")
except NameError:
    print("a NameError occurred")
except:
    print("some other error occurred")
## a NameError occurred
Data frames
- rectangular data structure, looks a lot like an array.
- each column is a Series; each column can be of a different type
- rows and columns act differently
- can index by (column) labels as well as positions
- handles missing data (NaN)
- convenient plotting
- fast operations with keys
- lots of facilities for input/output
import pandas as pd ## standard abbreviation
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
## initialize DataFrame with a *dictionary*
p = pd.DataFrame({'Name': names, 'Count': births})
print(p)
## Name Count
## 0 Bob 968
## 1 Jessica 155
## 2 Mary 77
## 3 John 578
## 4 Mel 973
What can we do with it?
- “Simple” indexing
- Indexing (a single value) selects a column by its key
- key could be a number, if column names weren’t given when setting up the data frame
- Slicing selects rows by number
- indexing with a list gives multiple columns
- .iloc gives row/column indices (like an array)
p["Count"] ## extract a column = Series (by *name*)
p[2:3] ## slice one row (3-2 = 1)
p[2:5] ## slice multiple rows
p[["Name","Count"]] ## extract multiple columns (data frame)
p.iloc[1,1] ## index with row/column integers like an array
p.iloc[0:5,:] ## can also slice
Indexing by name
p["Name"][4] ## 5th element of Name
p.Name ## attribute!
p.loc[1:2,"Name"] ## index by *label*, _inclusive_
Measles data
Download US measles data from Project Tycho.
- read_csv reads a CSV file as a data frame; it automatically interprets the first row as headings
- df.iloc[] indexes the result as though it were an array
- df.head() shows just the beginning; df.tail() shows just the end
Let’s look at the first few rows of a data set on measles in US states:
## "Weekly Measles Cases, 1909-2001"
## ...
## "Data provided by Project Tycho, Data Version 1.0.0, released 28 Novem...
## "YEAR","WEEK","ALABAMA","ALASKA","AMERICAN SAMOA","ARIZONA","ARKANSAS"...
## 1909,1,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,2,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,3,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
fn = "../data/MEASLES_Cases_1909-2001_20150322001618.csv"
p = pd.read_csv(fn,skiprows=2,na_values=["-"]) ## read in data
p.head() ## look at the first little bit
## YEAR WEEK ALABAMA ALASKA ... WEST VIRGINIA WISCONSIN WYOMING Unnamed: 61
## 0 1909 1 NaN NaN ... NaN NaN NaN NaN
## 1 1909 2 NaN NaN ... NaN NaN NaN NaN
## 2 1909 3 NaN NaN ... NaN NaN NaN NaN
## 3 1909 4 NaN NaN ... NaN NaN NaN NaN
## 4 1909 5 NaN NaN ... NaN NaN NaN NaN
##
## [5 rows x 62 columns]
Mostly NaN values at the beginning! (NaN = "not a number": similar to nan from math or numpy)
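A quick sketch of how these missing values behave (NaN never compares equal, even to itself, so use pd.isna to test for it):

import numpy as np
import pandas as pd

x = float("nan")        ## the same kind of value as np.nan or math.nan
x == np.nan             ## False: NaN is not equal to anything, including itself
pd.isna(x)              ## True: the reliable way to test for a missing value
pd.Series([1.0, np.nan, 3.0]).isna().sum()  ## count missing entries: 1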
Selecting
- Like numpy array indexing, but a little different … (Pandas doc: indexing and selecting)
- extract by name: df.loc[:,"MASSACHUSETTS":"NEVADA"] (index by label; includes the endpoint)
- extract by integer index: the iloc method, df.iloc[:,range] (index by integer; doesn't include the endpoint)
Extract by name:
p.loc[:,"MASSACHUSETTS":"NEVADA"]
## MASSACHUSETTS MICHIGAN MINNESOTA ... MONTANA NEBRASKA NEVADA
## 0 NaN NaN NaN ... NaN NaN NaN
## 1 NaN NaN NaN ... NaN NaN NaN
## 2 NaN NaN NaN ... NaN NaN NaN
## 3 NaN NaN NaN ... NaN NaN NaN
## 4 NaN NaN NaN ... NaN NaN NaN
## ... ... ... ... ... ... ... ...
## 4856 NaN NaN NaN ... NaN NaN NaN
## 4857 NaN NaN NaN ... NaN NaN NaN
## 4858 NaN NaN NaN ... NaN NaN NaN
## 4859 NaN NaN NaN ... NaN NaN NaN
## 4860 NaN NaN NaN ... NaN NaN NaN
##
## [4861 rows x 8 columns]
This is the same:
pc = list(p.columns) ## list of column names
print(pc[:5])
## ['YEAR', 'WEEK', 'ALABAMA', 'ALASKA', 'AMERICAN SAMOA']
## find the locations of these two state names
mass_ind = pc.index("MASSACHUSETTS")
neva_ind = pc.index("NEVADA")
## index using `.iloc` (with the endpoint extended by one)
p.iloc[:,mass_ind:neva_ind+1]
## MASSACHUSETTS MICHIGAN MINNESOTA ... MONTANA NEBRASKA NEVADA
## 0 NaN NaN NaN ... NaN NaN NaN
## 1 NaN NaN NaN ... NaN NaN NaN
## 2 NaN NaN NaN ... NaN NaN NaN
## 3 NaN NaN NaN ... NaN NaN NaN
## 4 NaN NaN NaN ... NaN NaN NaN
## ... ... ... ... ... ... ... ...
## 4856 NaN NaN NaN ... NaN NaN NaN
## 4857 NaN NaN NaN ... NaN NaN NaN
## 4858 NaN NaN NaN ... NaN NaN NaN
## 4859 NaN NaN NaN ... NaN NaN NaN
## 4860 NaN NaN NaN ... NaN NaN NaN
##
## [4861 rows x 8 columns]
More examples
You can also refer to individual columns as attributes (i.e. just p.<name>):
p.ARIZONA[:5]
## 0 NaN
## 1 NaN
## 2 NaN
## 3 NaN
## 4 NaN
## Name: ARIZONA, dtype: float64
p.ARIZONA.head()
## 0 NaN
## 1 NaN
## 2 NaN
## 3 NaN
## 4 NaN
## Name: ARIZONA, dtype: float64
.drop() gets rid of elements:
pp = p.drop(["YEAR","WEEK"],axis=1)
## equivalent, by integer position (drop the first two columns):
pp2 = p.iloc[:,2:]
## a single column selected by label:
pp3 = p.loc[:,"ARIZONA"]
Always use name-indexing whenever you can!
.index is a special attribute of data frames that governs searching, plotting, etc. Here we'll set it to a decimal date value:
pp.index = p.YEAR+(p.WEEK-1)/52
Filtering
Choosing specific rows of a data frame: &, |, ~ correspond to and, or, not (individual conditions must be wrapped in parentheses)
ariz = p.ARIZONA ## pull out a column (attribute)
ariz[(p.YEAR==1970) & (ariz>50)] ## *must* use parentheses!
## 3196 69.0
## 3197 57.0
## 3198 62.0
## 3200 56.0
## 3203 73.0
## 3205 54.0
## 3209 55.0
## Name: ARIZONA, dtype: float64
Basic plotting
pandas will automatically plot data frames in a (reasonably) sensible way
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
## pp.plot()
pp.plot(legend=False,logy=True) ## plot method (non-Pythonic)
plt.savefig("pix/measles1.png")
Or we can create our own (less complex) plots
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(pp.index,np.log10(pp.ARIZONA))
Column and row manipulations
- totals by week
ptot = pp.sum(axis=1)
df.min, df.max, df.mean all work too … (see the sketch below)
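For example (a small sketch using the pp and ptot objects defined above):

pp.mean(axis=1).head()   ## mean weekly count across states
pp.max(axis=1).head()    ## largest single-state count in each week
ptot.min()               ## smallest weekly total in the whole series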
Aggregation
ptotweek = ptot.groupby(p.WEEK)
ptotweekmean = ptotweek.aggregate(np.mean)
ptotweekmean.plot()
Dates and times
reference
- (Another) complex subject.
- Lots of possible date formats
- Basic idea: a format string like %Y-%m-%d; separators just match whatever's in your data (usually "/" or "-"). Results need to be unambiguous, and ambiguity is dangerous (how is the day of month specified? lower case or capital? etc.). pandas tries to guess, but you shouldn't let it.
print(pd.to_datetime("05-01-2004"))
## 2004-05-01 00:00:00
print(pd.to_datetime("05-01-2004",format="%m-%d-%Y"))
## 2004-05-01 00:00:00
- Time zones and daylight savings time can be a nightmare
- May need to have the right number of digits, especially in the absence of separators:
import pandas as pd
print(pd.to_datetime("1212004",format="%m%d%Y"))
## 2004-12-01 00:00:00
print(pd.to_datetime("12012004",format="%m%d%Y"))
## 2004-12-01 00:00:00
For our measles data we have week of year, so things get a little complicated
yearstr = p.YEAR.apply(format)
weekstr = p.WEEK.apply(format,args=["02"])
datestr = p.YEAR.astype(str)+"-"+weekstr+"-0"
dateindex = pd.to_datetime(datestr,format="%Y-%U-%w")
Binning results
- turn a quantitative variable into categories
- pd.cut(x, bins=...): you decide on the bins
- pd.qcut(x, n): you decide on the number of bins (equal occupancy); see the sketch below
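A minimal sketch of both, on made-up data (the variable names here are just for illustration):

import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).normal(size=100))
equal_width = pd.cut(x, bins=4)    ## you choose the bins (or their number); equal-width edges
equal_occupancy = pd.qcut(x, 4)    ## you choose the number of bins; pandas picks edges for equal occupancy
equal_width.value_counts()
equal_occupancy.value_counts()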
Weather data
## fancy stuff: automatically look for index and convert it to a date/time
p = pd.read_csv("../data/eng2.csv",skiprows=14,encoding="latin1",index_col="Date/Time",parse_dates=True)
## rename columns
p.columns = [
'Year', 'Month', 'Day', 'Time', 'Data Quality', 'Temp (C)',
'Temp Flag', 'Dew Point Temp (C)', 'Dew Point Temp Flag',
'Rel Hum (%)', 'Rel Hum Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag',
'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)', 'Visibility Flag',
'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx', 'Hmdx Flag', 'Wind Chill',
'Wind Chill Flag', 'Weather']
## drop columns that are *all* NA
p = p.dropna(axis=1,how='all')
p["Temp (C)"].plot()
## get rid of columns (axis=1) we don't want
p = p.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
Now pull out the temperature and take the median by hour:
temp = p[['Temp (C)']]
temp["Hour"] = temp.index.hour
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
temphr = temp.groupby('Hour')
medtmp = temphr.aggregate(np.median)
maxtmp = temphr.aggregate(np.max)
mintmp = temphr.aggregate(np.min)
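The SettingWithCopyWarning above appears because temp is a slice of p; one common way to avoid it (a sketch, using the same data as above) is to take an explicit copy before adding the new column:

temp = p[['Temp (C)']].copy()      ## explicit copy, so the assignment below is unambiguous
temp["Hour"] = temp.index.hour     ## no warning this time
medtmp = temp.groupby("Hour").median()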
Now plot these …
With the Pandas library it is very convenient to load data from different sources, for example from files using the read_csv function. Everything works out of the box, with plenty of options. But if an error has crept into the data you need to load, you're out of luck. Of course, for a rough analysis you can discard part of the data at load time, i.e. run the function with error_bad_lines=False; then every row containing an error is simply ignored. That approach is fine for a quick analysis, or when the number of rows with errors is negligible compared to the size of the data. But for an exact analysis you need to load all of the data, i.e. handle the faulty rows and get them into the DataFrame.
One of the possible errors looks like this:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 20 fields in line 47773, saw 22
which tells us that the number of columns in that row does not match the number pandas took as its baseline (from the first row).
I really liked a solution I found on stackoverflow.com.
The idea is that every error pandas stumbles over is caught as an exception, handled, and the number of the row where the failure occurred is recorded. After pandas has worked through the whole file, all the rows with errors are collected and processed however you like.
What is good about this implementation is that, first, as much of the work as possible is done by the built-in read_csv function; second, the parsing can be tuned to specific errors; and third, you can plug in your own function for parsing the data.
The error handling is done head-on: the file is opened, only the rows with errors are read, they are parsed by your own parser function, and everything is returned as an array. Afterwards the two DataFrames are concatenated.
Keep in mind that Python 3 has no clean way to read a specific line number from a file, so it is all done simply by opening the file and reading lines with readline.
# python3 extended function to read csv in Pandas
import pandas as pd

def pandasReadCsvExtended(files, sep=',', func_parser='', func_afterparser=''):
    dataRez = pd.DataFrame()
    for file in files:
        line = []
        dataFile = pd.DataFrame()
        cont = True
        while cont == True:
            try:
                dataFile = pd.read_csv(file, sep=sep, skiprows=line)
                cont = False
            except Exception as e:
                errormsg = e.args[0]
                errortype = errormsg.split('.')[0].strip()
                if errortype == 'Error tokenizing data':
                    cerror = errormsg.split(':')[1].strip().replace(',', '')
                    nums = [n for n in cerror.split(' ') if str.isdigit(n)]
                    line.append(int(nums[1]) - 1)
                else:
                    print('Unknown Error: {}'.format(errormsg))
        if line != [] and callable(func_parser):
            fileIO = open(file, 'r')
            toDataFrame = []
            ln_prev = 0
            # count of columns in dataFrame
            header_count = dataFile.shape[1]
            for ln in line:
                lineSteps = ln - ln_prev
                tmp = ln_prev
                for i in range(lineSteps):
                    tmp += 1
                    fileIO.readline()
                parts = func_parser(fileIO.readline().rstrip(), header_count)
                toDataFrame.append(parts)
                ln_prev = tmp + 1
            dataErrors = pd.DataFrame.from_records(toDataFrame, columns=dataFile.columns.values.tolist())
            dataFile = pd.concat([dataFile, dataErrors], ignore_index=True)
        if callable(func_afterparser):
            func_afterparser(dataFile, file)
        dataRez = pd.concat([dataRez, dataFile], ignore_index=True)
    return dataRez
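A hypothetical usage sketch (the file name and the parser below are invented for illustration): func_parser receives the raw text of a bad line plus the expected column count and must return one record of that width.

def simple_parser(raw_line, n_columns):
    # naive parser: split on commas, then pad or trim to the expected number of columns
    parts = raw_line.split(',')
    return parts[:n_columns] + [''] * (n_columns - len(parts))

df = pandasReadCsvExtended(["data.csv"], sep=',', func_parser=simple_parser)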
Please bear with me, I'm new to Pandas/Python and I don't really know what I'm doing.
I’m working with CSV files and I basically filter currencies.
Every other day, the exported CSV file may contain or not contain certain currencies.
I have several such cells of code:
AUDdf = df.loc[df['Currency'] == 'AUD']
AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
AUDdesc = AUDdf['Amount'].describe()
When the CSV doesn’t contain AUD, I get ValueError: cannot set a frame with no defined columns.
What I’d like to produce is a function or an if statement or a loop that checks if the column contains AUD, and if it does, it runs the above code, and if it doesn’t, it simply skips it and proceeds to the next line of code for the next currency.
Any idea how I can accomplish this?
Thanks in advance.
asked Mar 22, 2021 at 6:07
This can be done in 2 ways:
- You can create a try and except statement, this will try and look for the given currency and if a ValueError occurs it will skip and move on:
try:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
except ValueError:
    pass
- You can create an if statement which looks for the currencies presence first:
currency_set = set(df['Currency'].values)
if 'AUD' in currency_set:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
answered Mar 22, 2021 at 6:19
1. Worst way to skip over the error/exception:
try:
    <Your Code>
except:
    pass
The above is probably the worst way, because you want to know when an exception occurs. Using generic except statements is bad practice because you want to avoid "catch 'em all" code. You want to be catching exceptions that you know how to handle. You want to know what specific exception occurred, and you need to handle exceptions on an exception-by-exception basis. Writing generic except statements leads to missed bugs and tends to mislead while running the code to test.
2. Slightly better way to handle the exception:
try:
    <Your Code>
except Exception as e:
    <Some code to handle an exception>
Still not optimal, as it is still generic handling.
3. Average way to handle it for your case:
try:
    <Your Code>
except ValueError:
    <Some code to handle this exception>
Other suggestion: much better ways to deal with this:
1. You can get a set of the available currencies at run time and aggregate based on whether 'AUD' is in that set (see the sketch below).
2. Clean your data set.
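A sketch of the first suggestion (it assumes the df, Currency, Username and Amount columns from the question; the results dictionary is just one way of collecting the per-currency output). Instead of hard-coding each currency, loop over whatever currencies are actually present:

import numpy as np
import pandas as pd

results = {}
for currency in df['Currency'].unique():
    cdf = df.loc[df['Currency'] == currency]
    table = pd.pivot_table(cdf, index=["Username"], values=["Amount"], aggfunc=np.sum)
    table.loc[currency + ' Amounts Rejected Grand Total'] = cdf['Amount'].sum()
    results[currency] = (table, cdf['Amount'].describe())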
answered Mar 22, 2021 at 6:29 by Sandeep_Rao
You can use try and except, where:
try:
    # your code here
except:
    # some print statement
    pass
answered Mar 22, 2021 at 6:17
Pandas is a powerful library for data analysis in Python. It offers a variety of functions to perform various operations on data. One of the most useful operations is the apply() function. It is used to apply a function to each row or column of a DataFrame. However, sometimes the data may contain errors or inconsistencies, which can cause the apply() function to fail. To handle such errors, we can use the try-except block within the apply() function.
Understanding the Try-Except Block
The try-except block is used to handle exceptions in Python. Exceptions are errors that occur during the execution of a program. When an exception occurs, the program stops executing and raises an error message. To prevent the program from crashing, we can use a try-except block to catch the exception and handle it gracefully.
The basic syntax of the try-except block is as follows:
try:
    # Code block that may raise an exception
except ExceptionType:
    # Code block to handle the exception
In the above syntax, we try to execute a code block that may raise an exception. If an exception occurs, the code block within the except block is executed to handle the exception.
Embedding Try-Except in Pandas Apply Operation
To embed the try-except block in Pandas apply() function, we can define a function that contains the try-except block and then pass this function to the apply() function. Here’s an example:
import pandas as pd

def my_function(x):
    try:
        result = x + 1          # code block that may raise an exception
    except:
        result = None           # code block to handle the exception
    return result

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.apply(my_function)
In the above code, we define a function called my_function that contains the try-except block. We then pass this function to the apply() function of the DataFrame. The apply() function applies the function to each row or column of the DataFrame. If an exception occurs in any row or column, the try-except block within the function handles it gracefully.
Conclusion
In conclusion, the try-except block in Pandas apply() function is a powerful tool to handle exceptions that occur during data analysis. By embedding the try-except block within the apply() function, we can handle errors and inconsistencies in the data, and prevent the program from crashing. By following the guidelines in this beginner’s guide, you can easily embed the try-except block in your Pandas code and take full advantage of the power of the apply() function.